* [PATCH -mm 00/25] VM pageout scalability improvements (V10)
@ 2008-06-06 20:28 Rik van Riel
2008-06-06 20:28 ` [PATCH -mm 01/25] move isolate_lru_page() to vmscan.c Rik van Riel
` (25 more replies)
0 siblings, 26 replies; 102+ messages in thread
From: Rik van Riel @ 2008-06-06 20:28 UTC (permalink / raw)
To: linux-kernel; +Cc: Andrew Morton, Lee Schermerhorn, Kosaki Motohiro
On large memory systems, the VM can spend way too much time scanning
through pages that it cannot (or should not) evict from memory. Not
only does it use up CPU time, but it also provokes lock contention
and can leave large systems under memory pressure in a catatonic state.
Against 2.6.26-rc2-mm1
This patch series improves VM scalability by:
1) putting filesystem-backed, swap-backed and non-reclaimable pages
onto their own LRUs, so the system only scans the pages that it
can/should evict from memory
2) switching to two-handed clock replacement for the anonymous LRUs,
so the number of pages that need to be scanned when the system
starts swapping is bounded by a reasonable number
3) keeping non-reclaimable pages off the LRU completely, so the
VM does not waste CPU time scanning them. Currently only
ramfs and SHM_LOCKED pages are kept on the noreclaim list,
mlock()ed VMAs will be added later
More info on the overall design can be found at:
http://linux-mm.org/PageReplacementDesign
An all-in-one patch can be found at:
http://people.redhat.com/riel/splitvm/
Changelog:
- merge in all of Lee's mlock changes
- fix shrink_list memcgroup balancing (KOSAKI Motohiro)
- fix balancing stats in shrink_active_list (Daisuke Nishimura)
- make sure previously active pagecache pages get reactivated
on the first access (Rik van Riel)
- compile fix when !CONFIG_SWAP (MinChan Kim)
- clean up page-flags.h defines when !CONFIG_NORECLAIM_LRU
(Lee Schermerhorn)
- fix some race conditions around moving pages to and from
the noreclaim list (Lee Schermerhorn, KOSAKI Motohiro)
- use putback_lru_page() for page migration (Lee Schermerhorn)
- fix potential SHM_UNLOCK race in scan_mapping_noreclaim_pages()
(Lee Schermerhorn, KOSAKI Motohiro)
- improve swap space freeing to deal with COW shared space
(Lee Schermerhorn, Daisuke Nishimura & Minchan Kim)
- clean up PG_swapbacked setting in swapin path (Minchan Kim)
- properly invoke shrink_active_list for background aging (Minchan Kim)
- add authorship info to all patches (Rik van Riel)
- clean up (or move below ---) the comments for the commit logs (Rik van Riel)
- after some tests, reduce default swappiness to 20 for now (Rik van Riel)
- several code cleanups (Minchan Kim)
- noreclaim patch refactoring and improvements (Lee Schermerhorn)
- several PROT_NONE and vma merging fixes (KOSAKI Motohiro)
- SMP bugfixes and efficiency improvements (Rik van Riel, Lee Schermerhorn)
- fix NUMA node stats printing (Lee Schermerhorn)
- remove the mlocked-VMA-noreclaim code for now, it still has
bugs on IA64 and is holding up the merge (Rik van Riel)
- make page_alloc.c compile without CONFIG_NORECLAIM_MLOCK (Minchan Kim)
- BUG() does not take an argument (Minchan Kim)
- clean up is_active_lru and is_file_lru (Andy Whitcroft)
- clean up shrink_active_list temp list names (KOSAKI Motohiro)
- add active & inactive memory totals for vmstat -a (KOSAKI Motohiro)
- only try global anon page aging on global lru scans (KOSAKI Motohiro)
- make function descriptions follow the kernel-doc format (Rik van Riel)
- simplify mlock_vma_pages_range and munlock_vma_pages_range (Lee Schermerhorn)
- remove some more arguments, rename to mlock_vma_pages_all (Lee Schermerhorn)
- many code cleanups (Lee Schermerhorn)
- pass correct vma arg to mlock_vma_pages_range from do_brk (Rik van Riel)
- port to 2.6.25-rc3-mm1
- pull the memcontrol lru arrayification earlier into the patch series
- use a pagevec array similar to the lru array
- clean up the code in various places
- improved pageout balancing and reduced pageout cpu use
- fix compilation on PPC and without memcontrol
- make page_is_pagecache more readable
- replace get_scan_ratio with correct version
- merge memcontroller split LRU code into the main split LRU patch,
since it is not functionally different (it was split up only to help
people who had seen the last version of the patch series review it)
- drop the page_file_cache debugging patch, since it never triggered
- reintroduce code to not scan anon list if swap is full
- add code to scan anon list if page cache is very small already
- use lumpy reclaim more aggressively for smaller order > 1 allocations
--
All Rights Reversed
* [PATCH -mm 01/25] move isolate_lru_page() to vmscan.c
2008-06-06 20:28 [PATCH -mm 00/25] VM pageout scalability improvements (V10) Rik van Riel
@ 2008-06-06 20:28 ` Rik van Riel
2008-06-06 20:28 ` [PATCH -mm 02/25] Use an indexed array for LRU variables Rik van Riel
` (24 subsequent siblings)
25 siblings, 0 replies; 102+ messages in thread
From: Rik van Riel @ 2008-06-06 20:28 UTC (permalink / raw)
To: linux-kernel
Cc: Andrew Morton, Lee Schermerhorn, Kosaki Motohiro, Nick Piggin,
Lee Schermerhorn
[-- Attachment #1: np-01-move-and-rework-isolate_lru_page-v2.patch --]
[-- Type: text/plain, Size: 7400 bytes --]
From: Nick Piggin <npiggin@suse.de>
isolate_lru_page() logically belongs in vmscan.c rather than migrate.c.
It is a close call, because the function is not needed without memory
migration, so there is a valid argument for keeping it in migrate.c.
However, a subsequent patch needs to use it from the core mm, so we can
happily move it to vmscan.c.
Also, make the function a little more generic by not requiring that it
add an isolated page to a given list. Callers can do that.
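For illustration, the new calling convention looks like this (a minimal
sketch mirroring the do_move_pages() hunk below):
	LIST_HEAD(pagelist);
	int err;
	/*
	 * isolate_lru_page() now returns 0 on success and -EBUSY if the
	 * page was not on an LRU list; the caller puts the isolated page
	 * on whatever list it wants.
	 */
	err = isolate_lru_page(page);
	if (!err)
		list_add_tail(&page->lru, &pagelist);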
Note that we now have '__isolate_lru_page()', which does
something quite different and is visible outside of vmscan.c
for use with the memory controller. Methinks we need to
rationalize these names/purposes. --lts
Signed-off-by: Nick Piggin <npiggin@suse.de>
Signed-off-by: Rik van Riel <riel@redhat.com>
Signed-off-by: Lee Schermerhorn <Lee.Schermerhorn@hp.com>
---
V1 -> V2 [lts]:
+ fix botched merge -- add back "get_page_unless_zero()"
From: Nick Piggin <npiggin@suse.de>
To: Linux Memory Management <linux-mm@kvack.org>
Subject: [patch 1/4] mm: move and rework isolate_lru_page
Date: Mon, 12 Mar 2007 07:38:44 +0100 (CET)
include/linux/migrate.h | 3 ---
mm/internal.h | 2 ++
mm/mempolicy.c | 9 +++++++--
mm/migrate.c | 34 +++-------------------------------
mm/vmscan.c | 45 +++++++++++++++++++++++++++++++++++++++++++++
5 files changed, 57 insertions(+), 36 deletions(-)
Index: linux-2.6.26-rc2-mm1/include/linux/migrate.h
===================================================================
--- linux-2.6.26-rc2-mm1.orig/include/linux/migrate.h 2008-05-23 14:21:22.000000000 -0400
+++ linux-2.6.26-rc2-mm1/include/linux/migrate.h 2008-05-23 14:21:32.000000000 -0400
@@ -25,7 +25,6 @@ static inline int vma_migratable(struct
return 1;
}
-extern int isolate_lru_page(struct page *p, struct list_head *pagelist);
extern int putback_lru_pages(struct list_head *l);
extern int migrate_page(struct address_space *,
struct page *, struct page *);
@@ -42,8 +41,6 @@ extern int migrate_vmas(struct mm_struct
static inline int vma_migratable(struct vm_area_struct *vma)
{ return 0; }
-static inline int isolate_lru_page(struct page *p, struct list_head *list)
- { return -ENOSYS; }
static inline int putback_lru_pages(struct list_head *l) { return 0; }
static inline int migrate_pages(struct list_head *l, new_page_t x,
unsigned long private) { return -ENOSYS; }
Index: linux-2.6.26-rc2-mm1/mm/internal.h
===================================================================
--- linux-2.6.26-rc2-mm1.orig/mm/internal.h 2008-05-23 14:21:22.000000000 -0400
+++ linux-2.6.26-rc2-mm1/mm/internal.h 2008-05-23 14:21:32.000000000 -0400
@@ -34,6 +34,8 @@ static inline void __put_page(struct pag
atomic_dec(&page->_count);
}
+extern int isolate_lru_page(struct page *page);
+
extern void __free_pages_bootmem(struct page *page, unsigned int order);
/*
Index: linux-2.6.26-rc2-mm1/mm/migrate.c
===================================================================
--- linux-2.6.26-rc2-mm1.orig/mm/migrate.c 2008-05-23 14:21:22.000000000 -0400
+++ linux-2.6.26-rc2-mm1/mm/migrate.c 2008-05-23 14:21:32.000000000 -0400
@@ -37,36 +37,6 @@
#define lru_to_page(_head) (list_entry((_head)->prev, struct page, lru))
/*
- * Isolate one page from the LRU lists. If successful put it onto
- * the indicated list with elevated page count.
- *
- * Result:
- * -EBUSY: page not on LRU list
- * 0: page removed from LRU list and added to the specified list.
- */
-int isolate_lru_page(struct page *page, struct list_head *pagelist)
-{
- int ret = -EBUSY;
-
- if (PageLRU(page)) {
- struct zone *zone = page_zone(page);
-
- spin_lock_irq(&zone->lru_lock);
- if (PageLRU(page) && get_page_unless_zero(page)) {
- ret = 0;
- ClearPageLRU(page);
- if (PageActive(page))
- del_page_from_active_list(zone, page);
- else
- del_page_from_inactive_list(zone, page);
- list_add_tail(&page->lru, pagelist);
- }
- spin_unlock_irq(&zone->lru_lock);
- }
- return ret;
-}
-
-/*
* migrate_prep() needs to be called before we start compiling a list of pages
* to be migrated using isolate_lru_page().
*/
@@ -895,7 +865,9 @@ static int do_move_pages(struct mm_struc
!migrate_all)
goto put_and_set;
- err = isolate_lru_page(page, &pagelist);
+ err = isolate_lru_page(page);
+ if (!err)
+ list_add_tail(&page->lru, &pagelist);
put_and_set:
/*
* Either remove the duplicate refcount from
Index: linux-2.6.26-rc2-mm1/mm/vmscan.c
===================================================================
--- linux-2.6.26-rc2-mm1.orig/mm/vmscan.c 2008-05-23 14:21:22.000000000 -0400
+++ linux-2.6.26-rc2-mm1/mm/vmscan.c 2008-05-23 14:21:32.000000000 -0400
@@ -816,6 +816,51 @@ static unsigned long clear_active_flags(
return nr_active;
}
+/**
+ * isolate_lru_page - tries to isolate a page from its LRU list
+ * @page: page to isolate from its LRU list
+ *
+ * Isolates a @page from an LRU list, clears PageLRU and adjusts the
+ * vmstat statistic corresponding to whatever LRU list the page was on.
+ *
+ * Returns 0 if the page was removed from an LRU list.
+ * Returns -EBUSY if the page was not on an LRU list.
+ *
+ * The returned page will have PageLRU() cleared. If it was found on
+ * the active list, it will have PageActive set. That flag may need
+ * to be cleared by the caller before letting the page go.
+ *
+ * The vmstat statistic corresponding to the list on which the page was
+ * found will be decremented.
+ *
+ * Restrictions:
+ * (1) Must be called with an elevated refcount on the page. This is a
+ * fundamental difference from isolate_lru_pages (which is called
+ * without a stable reference).
+ * (2) the lru_lock must not be held.
+ * (3) interrupts must be enabled.
+ */
+int isolate_lru_page(struct page *page)
+{
+ int ret = -EBUSY;
+
+ if (PageLRU(page)) {
+ struct zone *zone = page_zone(page);
+
+ spin_lock_irq(&zone->lru_lock);
+ if (PageLRU(page) && get_page_unless_zero(page)) {
+ ret = 0;
+ ClearPageLRU(page);
+ if (PageActive(page))
+ del_page_from_active_list(zone, page);
+ else
+ del_page_from_inactive_list(zone, page);
+ }
+ spin_unlock_irq(&zone->lru_lock);
+ }
+ return ret;
+}
+
/*
* shrink_inactive_list() is a helper for shrink_zone(). It returns the number
* of reclaimed pages
Index: linux-2.6.26-rc2-mm1/mm/mempolicy.c
===================================================================
--- linux-2.6.26-rc2-mm1.orig/mm/mempolicy.c 2008-05-23 14:21:22.000000000 -0400
+++ linux-2.6.26-rc2-mm1/mm/mempolicy.c 2008-05-23 14:21:32.000000000 -0400
@@ -93,6 +93,8 @@
#include <asm/tlbflush.h>
#include <asm/uaccess.h>
+#include "internal.h"
+
/* Internal flags */
#define MPOL_MF_DISCONTIG_OK (MPOL_MF_INTERNAL << 0) /* Skip checks for continuous vmas */
#define MPOL_MF_INVERT (MPOL_MF_INTERNAL << 1) /* Invert check for nodemask */
@@ -758,8 +760,11 @@ static void migrate_page_add(struct page
/*
* Avoid migrating a page that is shared with others.
*/
- if ((flags & MPOL_MF_MOVE_ALL) || page_mapcount(page) == 1)
- isolate_lru_page(page, pagelist);
+ if ((flags & MPOL_MF_MOVE_ALL) || page_mapcount(page) == 1) {
+ if (!isolate_lru_page(page)) {
+ list_add_tail(&page->lru, pagelist);
+ }
+ }
}
static struct page *new_node_page(struct page *page, unsigned long node, int **x)
--
All Rights Reversed
* [PATCH -mm 02/25] Use an indexed array for LRU variables
2008-06-06 20:28 [PATCH -mm 00/25] VM pageout scalability improvements (V10) Rik van Riel
2008-06-06 20:28 ` [PATCH -mm 01/25] move isolate_lru_page() to vmscan.c Rik van Riel
@ 2008-06-06 20:28 ` Rik van Riel
2008-06-07 1:04 ` Andrew Morton
2008-06-06 20:28 ` [PATCH -mm 03/25] use an array for the LRU pagevecs Rik van Riel
` (23 subsequent siblings)
25 siblings, 1 reply; 102+ messages in thread
From: Rik van Riel @ 2008-06-06 20:28 UTC (permalink / raw)
To: linux-kernel
Cc: Andrew Morton, Lee Schermerhorn, Kosaki Motohiro,
Christoph Lameter
[-- Attachment #1: cl-use-indexed-array-of-lru-lists.patch --]
[-- Type: text/plain, Size: 22261 bytes --]
From: Christoph Lameter <clameter@sgi.com>
Currently we are defining explicit variables for the inactive
and active lists. An indexed array can be more generic and avoid
repeating similar code in several places in the reclaim code.
We are saving a few bytes in terms of code size:
Before:
text data bss dec hex filename
4097753 573120 4092484 8763357 85b7dd vmlinux
After:
text data bss dec hex filename
4097729 573120 4092484 8763333 85b7c5 vmlinux
Having an easy way to add new lru lists may ease future work on
the reclaim code.
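As a sketch of the pattern (taken from the shrink_zone() hunk below),
per-list code collapses into a loop over the indexed array:
	unsigned long nr[NR_LRU_LISTS];
	enum lru_list l;
	for_each_lru(l) {
		/* Identical logic now covers both LRU lists. */
		zone->nr_scan[l] +=
			(zone_page_state(zone, NR_INACTIVE + l) >> priority) + 1;
		nr[l] = zone->nr_scan[l];
	}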
Signed-off-by: Rik van Riel <riel@redhat.com>
Signed-off-by: Lee Schermerhorn <lee.schermerhorn@hp.com>
Signed-off-by: Christoph Lameter <clameter@sgi.com>
---
V3 [riel]: memcontrol LRU arrayification
V1 -> V2 [lts]:
+ Remove extraneous __dec_zone_state(zone, NR_ACTIVE) pointed
out by Mel G.
include/linux/memcontrol.h | 17 +-----
include/linux/mm_inline.h | 33 ++++++++----
include/linux/mmzone.h | 17 ++++--
mm/memcontrol.c | 115 ++++++++++++++++---------------------------
mm/page_alloc.c | 9 +--
mm/swap.c | 2
mm/vmscan.c | 120 +++++++++++++++++++++------------------------
mm/vmstat.c | 3 -
8 files changed, 146 insertions(+), 170 deletions(-)
Index: linux-2.6.26-rc2-mm1/include/linux/mmzone.h
===================================================================
--- linux-2.6.26-rc2-mm1.orig/include/linux/mmzone.h 2008-05-23 14:21:21.000000000 -0400
+++ linux-2.6.26-rc2-mm1/include/linux/mmzone.h 2008-05-23 14:21:33.000000000 -0400
@@ -81,8 +81,8 @@ struct zone_padding {
enum zone_stat_item {
/* First 128 byte cacheline (assuming 64 bit words) */
NR_FREE_PAGES,
- NR_INACTIVE,
- NR_ACTIVE,
+ NR_INACTIVE, /* must match order of LRU_[IN]ACTIVE */
+ NR_ACTIVE, /* " " " " " */
NR_ANON_PAGES, /* Mapped anonymous pages */
NR_FILE_MAPPED, /* pagecache pages mapped into pagetables.
only modified from process context */
@@ -107,6 +107,13 @@ enum zone_stat_item {
#endif
NR_VM_ZONE_STAT_ITEMS };
+enum lru_list {
+ LRU_INACTIVE, /* must match order of NR_[IN]ACTIVE */
+ LRU_ACTIVE, /* " " " " " */
+ NR_LRU_LISTS };
+
+#define for_each_lru(l) for (l = 0; l < NR_LRU_LISTS; l++)
+
struct per_cpu_pages {
int count; /* number of pages in the list */
int high; /* high watermark, emptying needed */
@@ -251,10 +258,8 @@ struct zone {
/* Fields commonly accessed by the page reclaim scanner */
spinlock_t lru_lock;
- struct list_head active_list;
- struct list_head inactive_list;
- unsigned long nr_scan_active;
- unsigned long nr_scan_inactive;
+ struct list_head list[NR_LRU_LISTS];
+ unsigned long nr_scan[NR_LRU_LISTS];
unsigned long pages_scanned; /* since last reclaim */
unsigned long flags; /* zone flags, see below */
Index: linux-2.6.26-rc2-mm1/include/linux/mm_inline.h
===================================================================
--- linux-2.6.26-rc2-mm1.orig/include/linux/mm_inline.h 2008-05-23 14:21:21.000000000 -0400
+++ linux-2.6.26-rc2-mm1/include/linux/mm_inline.h 2008-05-23 14:21:33.000000000 -0400
@@ -1,40 +1,51 @@
static inline void
+add_page_to_lru_list(struct zone *zone, struct page *page, enum lru_list l)
+{
+ list_add(&page->lru, &zone->list[l]);
+ __inc_zone_state(zone, NR_INACTIVE + l);
+}
+
+static inline void
+del_page_from_lru_list(struct zone *zone, struct page *page, enum lru_list l)
+{
+ list_del(&page->lru);
+ __dec_zone_state(zone, NR_INACTIVE + l);
+}
+
+static inline void
add_page_to_active_list(struct zone *zone, struct page *page)
{
- list_add(&page->lru, &zone->active_list);
- __inc_zone_state(zone, NR_ACTIVE);
+ add_page_to_lru_list(zone, page, LRU_ACTIVE);
}
static inline void
add_page_to_inactive_list(struct zone *zone, struct page *page)
{
- list_add(&page->lru, &zone->inactive_list);
- __inc_zone_state(zone, NR_INACTIVE);
+ add_page_to_lru_list(zone, page, LRU_INACTIVE);
}
static inline void
del_page_from_active_list(struct zone *zone, struct page *page)
{
- list_del(&page->lru);
- __dec_zone_state(zone, NR_ACTIVE);
+ del_page_from_lru_list(zone, page, LRU_ACTIVE);
}
static inline void
del_page_from_inactive_list(struct zone *zone, struct page *page)
{
- list_del(&page->lru);
- __dec_zone_state(zone, NR_INACTIVE);
+ del_page_from_lru_list(zone, page, LRU_INACTIVE);
}
static inline void
del_page_from_lru(struct zone *zone, struct page *page)
{
+ enum lru_list l = LRU_INACTIVE;
+
list_del(&page->lru);
if (PageActive(page)) {
__ClearPageActive(page);
- __dec_zone_state(zone, NR_ACTIVE);
- } else {
- __dec_zone_state(zone, NR_INACTIVE);
+ l = LRU_ACTIVE;
}
+ __dec_zone_state(zone, NR_INACTIVE + l);
}
Index: linux-2.6.26-rc2-mm1/mm/page_alloc.c
===================================================================
--- linux-2.6.26-rc2-mm1.orig/mm/page_alloc.c 2008-05-23 14:21:21.000000000 -0400
+++ linux-2.6.26-rc2-mm1/mm/page_alloc.c 2008-05-23 14:21:33.000000000 -0400
@@ -3444,6 +3444,7 @@ static void __paginginit free_area_init_
for (j = 0; j < MAX_NR_ZONES; j++) {
struct zone *zone = pgdat->node_zones + j;
unsigned long size, realsize, memmap_pages;
+ enum lru_list l;
size = zone_spanned_pages_in_node(nid, j, zones_size);
realsize = size - zone_absent_pages_in_node(nid, j,
@@ -3494,10 +3495,10 @@ static void __paginginit free_area_init_
zone->prev_priority = DEF_PRIORITY;
zone_pcp_init(zone);
- INIT_LIST_HEAD(&zone->active_list);
- INIT_LIST_HEAD(&zone->inactive_list);
- zone->nr_scan_active = 0;
- zone->nr_scan_inactive = 0;
+ for_each_lru(l) {
+ INIT_LIST_HEAD(&zone->list[l]);
+ zone->nr_scan[l] = 0;
+ }
zap_zone_vm_stats(zone);
zone->flags = 0;
if (!size)
Index: linux-2.6.26-rc2-mm1/mm/swap.c
===================================================================
--- linux-2.6.26-rc2-mm1.orig/mm/swap.c 2008-05-23 14:21:21.000000000 -0400
+++ linux-2.6.26-rc2-mm1/mm/swap.c 2008-05-23 14:21:33.000000000 -0400
@@ -117,7 +117,7 @@ static void pagevec_move_tail(struct pag
spin_lock(&zone->lru_lock);
}
if (PageLRU(page) && !PageActive(page)) {
- list_move_tail(&page->lru, &zone->inactive_list);
+ list_move_tail(&page->lru, &zone->list[LRU_INACTIVE]);
pgmoved++;
}
}
Index: linux-2.6.26-rc2-mm1/mm/vmscan.c
===================================================================
--- linux-2.6.26-rc2-mm1.orig/mm/vmscan.c 2008-05-23 14:21:32.000000000 -0400
+++ linux-2.6.26-rc2-mm1/mm/vmscan.c 2008-05-23 14:21:33.000000000 -0400
@@ -791,10 +791,10 @@ static unsigned long isolate_pages_globa
int active)
{
if (active)
- return isolate_lru_pages(nr, &z->active_list, dst,
+ return isolate_lru_pages(nr, &z->list[LRU_ACTIVE], dst,
scanned, order, mode);
else
- return isolate_lru_pages(nr, &z->inactive_list, dst,
+ return isolate_lru_pages(nr, &z->list[LRU_INACTIVE], dst,
scanned, order, mode);
}
@@ -945,10 +945,7 @@ static unsigned long shrink_inactive_lis
VM_BUG_ON(PageLRU(page));
SetPageLRU(page);
list_del(&page->lru);
- if (PageActive(page))
- add_page_to_active_list(zone, page);
- else
- add_page_to_inactive_list(zone, page);
+ add_page_to_lru_list(zone, page, PageActive(page));
if (!pagevec_add(&pvec, page)) {
spin_unlock_irq(&zone->lru_lock);
__pagevec_release(&pvec);
@@ -1116,8 +1113,8 @@ static void shrink_active_list(unsigned
int pgdeactivate = 0;
unsigned long pgscanned;
LIST_HEAD(l_hold); /* The pages which were snipped off */
- LIST_HEAD(l_inactive); /* Pages to go onto the inactive_list */
- LIST_HEAD(l_active); /* Pages to go onto the active_list */
+ LIST_HEAD(l_active);
+ LIST_HEAD(l_inactive);
struct page *page;
struct pagevec pvec;
int reclaim_mapped = 0;
@@ -1169,7 +1166,7 @@ static void shrink_active_list(unsigned
VM_BUG_ON(!PageActive(page));
ClearPageActive(page);
- list_move(&page->lru, &zone->inactive_list);
+ list_move(&page->lru, &zone->list[LRU_INACTIVE]);
mem_cgroup_move_lists(page, false);
pgmoved++;
if (!pagevec_add(&pvec, page)) {
@@ -1199,7 +1196,7 @@ static void shrink_active_list(unsigned
SetPageLRU(page);
VM_BUG_ON(!PageActive(page));
- list_move(&page->lru, &zone->active_list);
+ list_move(&page->lru, &zone->list[LRU_ACTIVE]);
mem_cgroup_move_lists(page, true);
pgmoved++;
if (!pagevec_add(&pvec, page)) {
@@ -1219,65 +1216,64 @@ static void shrink_active_list(unsigned
pagevec_release(&pvec);
}
+static unsigned long shrink_list(enum lru_list l, unsigned long nr_to_scan,
+ struct zone *zone, struct scan_control *sc, int priority)
+{
+ if (l == LRU_ACTIVE) {
+ shrink_active_list(nr_to_scan, zone, sc, priority);
+ return 0;
+ }
+ return shrink_inactive_list(nr_to_scan, zone, sc);
+}
+
/*
* This is a basic per-zone page freer. Used by both kswapd and direct reclaim.
*/
static unsigned long shrink_zone(int priority, struct zone *zone,
struct scan_control *sc)
{
- unsigned long nr_active;
- unsigned long nr_inactive;
+ unsigned long nr[NR_LRU_LISTS];
unsigned long nr_to_scan;
unsigned long nr_reclaimed = 0;
+ enum lru_list l;
if (scan_global_lru(sc)) {
/*
* Add one to nr_to_scan just to make sure that the kernel
* will slowly sift through the active list.
*/
- zone->nr_scan_active +=
- (zone_page_state(zone, NR_ACTIVE) >> priority) + 1;
- nr_active = zone->nr_scan_active;
- zone->nr_scan_inactive +=
- (zone_page_state(zone, NR_INACTIVE) >> priority) + 1;
- nr_inactive = zone->nr_scan_inactive;
- if (nr_inactive >= sc->swap_cluster_max)
- zone->nr_scan_inactive = 0;
- else
- nr_inactive = 0;
-
- if (nr_active >= sc->swap_cluster_max)
- zone->nr_scan_active = 0;
- else
- nr_active = 0;
+ for_each_lru(l) {
+ zone->nr_scan[l] += (zone_page_state(zone,
+ NR_INACTIVE + l) >> priority) + 1;
+ nr[l] = zone->nr_scan[l];
+ if (nr[l] >= sc->swap_cluster_max)
+ zone->nr_scan[l] = 0;
+ else
+ nr[l] = 0;
+ }
} else {
/*
* This reclaim occurs not because zone memory shortage but
* because memory controller hits its limit.
* Then, don't modify zone reclaim related data.
*/
- nr_active = mem_cgroup_calc_reclaim_active(sc->mem_cgroup,
- zone, priority);
+ nr[LRU_ACTIVE] = mem_cgroup_calc_reclaim(sc->mem_cgroup,
+ zone, priority, LRU_ACTIVE);
- nr_inactive = mem_cgroup_calc_reclaim_inactive(sc->mem_cgroup,
- zone, priority);
+ nr[LRU_INACTIVE] = mem_cgroup_calc_reclaim(sc->mem_cgroup,
+ zone, priority, LRU_INACTIVE);
}
-
- while (nr_active || nr_inactive) {
- if (nr_active) {
- nr_to_scan = min(nr_active,
+ while (nr[LRU_ACTIVE] || nr[LRU_INACTIVE]) {
+ for_each_lru(l) {
+ if (nr[l]) {
+ nr_to_scan = min(nr[l],
(unsigned long)sc->swap_cluster_max);
- nr_active -= nr_to_scan;
- shrink_active_list(nr_to_scan, zone, sc, priority);
- }
+ nr[l] -= nr_to_scan;
- if (nr_inactive) {
- nr_to_scan = min(nr_inactive,
- (unsigned long)sc->swap_cluster_max);
- nr_inactive -= nr_to_scan;
- nr_reclaimed += shrink_inactive_list(nr_to_scan, zone,
- sc);
+ nr_reclaimed += shrink_list(l, nr_to_scan,
+ zone, sc, priority);
+ }
}
}
@@ -1790,6 +1786,7 @@ static unsigned long shrink_all_zones(un
{
struct zone *zone;
unsigned long nr_to_scan, ret = 0;
+ enum lru_list l;
for_each_zone(zone) {
@@ -1799,28 +1796,25 @@ static unsigned long shrink_all_zones(un
if (zone_is_all_unreclaimable(zone) && prio != DEF_PRIORITY)
continue;
- /* For pass = 0 we don't shrink the active list */
- if (pass > 0) {
- zone->nr_scan_active +=
- (zone_page_state(zone, NR_ACTIVE) >> prio) + 1;
- if (zone->nr_scan_active >= nr_pages || pass > 3) {
- zone->nr_scan_active = 0;
+ for_each_lru(l) {
+ /* For pass = 0 we don't shrink the active list */
+ if (pass == 0 && l == LRU_ACTIVE)
+ continue;
+
+ zone->nr_scan[l] +=
+ (zone_page_state(zone, NR_INACTIVE + l)
+ >> prio) + 1;
+ if (zone->nr_scan[l] >= nr_pages || pass > 3) {
+ zone->nr_scan[l] = 0;
nr_to_scan = min(nr_pages,
- zone_page_state(zone, NR_ACTIVE));
- shrink_active_list(nr_to_scan, zone, sc, prio);
+ zone_page_state(zone,
+ NR_INACTIVE + l));
+ ret += shrink_list(l, nr_to_scan, zone,
+ sc, prio);
+ if (ret >= nr_pages)
+ return ret;
}
}
-
- zone->nr_scan_inactive +=
- (zone_page_state(zone, NR_INACTIVE) >> prio) + 1;
- if (zone->nr_scan_inactive >= nr_pages || pass > 3) {
- zone->nr_scan_inactive = 0;
- nr_to_scan = min(nr_pages,
- zone_page_state(zone, NR_INACTIVE));
- ret += shrink_inactive_list(nr_to_scan, zone, sc);
- if (ret >= nr_pages)
- return ret;
- }
}
return ret;
Index: linux-2.6.26-rc2-mm1/mm/vmstat.c
===================================================================
--- linux-2.6.26-rc2-mm1.orig/mm/vmstat.c 2008-05-23 14:21:21.000000000 -0400
+++ linux-2.6.26-rc2-mm1/mm/vmstat.c 2008-05-23 14:21:33.000000000 -0400
@@ -772,7 +772,8 @@ static void zoneinfo_show_print(struct s
zone->pages_low,
zone->pages_high,
zone->pages_scanned,
- zone->nr_scan_active, zone->nr_scan_inactive,
+ zone->nr_scan[LRU_ACTIVE],
+ zone->nr_scan[LRU_INACTIVE],
zone->spanned_pages,
zone->present_pages);
Index: linux-2.6.26-rc2-mm1/include/linux/memcontrol.h
===================================================================
--- linux-2.6.26-rc2-mm1.orig/include/linux/memcontrol.h 2008-05-23 14:21:21.000000000 -0400
+++ linux-2.6.26-rc2-mm1/include/linux/memcontrol.h 2008-05-23 14:21:33.000000000 -0400
@@ -67,10 +67,8 @@ extern void mem_cgroup_note_reclaim_prio
extern void mem_cgroup_record_reclaim_priority(struct mem_cgroup *mem,
int priority);
-extern long mem_cgroup_calc_reclaim_active(struct mem_cgroup *mem,
- struct zone *zone, int priority);
-extern long mem_cgroup_calc_reclaim_inactive(struct mem_cgroup *mem,
- struct zone *zone, int priority);
+extern long mem_cgroup_calc_reclaim(struct mem_cgroup *mem, struct zone *zone,
+ int priority, enum lru_list lru);
#else /* CONFIG_CGROUP_MEM_RES_CTLR */
static inline void page_reset_bad_cgroup(struct page *page)
@@ -152,14 +150,9 @@ static inline void mem_cgroup_record_rec
{
}
-static inline long mem_cgroup_calc_reclaim_active(struct mem_cgroup *mem,
- struct zone *zone, int priority)
-{
- return 0;
-}
-
-static inline long mem_cgroup_calc_reclaim_inactive(struct mem_cgroup *mem,
- struct zone *zone, int priority)
+static inline long mem_cgroup_calc_reclaim(struct mem_cgroup *mem,
+ struct zone *zone, int priority,
+ enum lru_list lru)
{
return 0;
}
Index: linux-2.6.26-rc2-mm1/mm/memcontrol.c
===================================================================
--- linux-2.6.26-rc2-mm1.orig/mm/memcontrol.c 2008-05-23 14:21:21.000000000 -0400
+++ linux-2.6.26-rc2-mm1/mm/memcontrol.c 2008-05-23 14:21:33.000000000 -0400
@@ -32,6 +32,7 @@
#include <linux/fs.h>
#include <linux/seq_file.h>
#include <linux/vmalloc.h>
+#include <linux/mm_inline.h>
#include <asm/uaccess.h>
@@ -85,22 +86,13 @@ static s64 mem_cgroup_read_stat(struct m
/*
* per-zone information in memory controller.
*/
-
-enum mem_cgroup_zstat_index {
- MEM_CGROUP_ZSTAT_ACTIVE,
- MEM_CGROUP_ZSTAT_INACTIVE,
-
- NR_MEM_CGROUP_ZSTAT,
-};
-
struct mem_cgroup_per_zone {
/*
* spin_lock to protect the per cgroup LRU
*/
spinlock_t lru_lock;
- struct list_head active_list;
- struct list_head inactive_list;
- unsigned long count[NR_MEM_CGROUP_ZSTAT];
+ struct list_head lists[NR_LRU_LISTS];
+ unsigned long count[NR_LRU_LISTS];
};
/* Macro for accessing counter */
#define MEM_CGROUP_ZSTAT(mz, idx) ((mz)->count[(idx)])
@@ -227,7 +219,7 @@ page_cgroup_zoneinfo(struct page_cgroup
}
static unsigned long mem_cgroup_get_all_zonestat(struct mem_cgroup *mem,
- enum mem_cgroup_zstat_index idx)
+ enum lru_list idx)
{
int nid, zid;
struct mem_cgroup_per_zone *mz;
@@ -289,11 +281,9 @@ static void __mem_cgroup_remove_list(str
struct page_cgroup *pc)
{
int from = pc->flags & PAGE_CGROUP_FLAG_ACTIVE;
+ int lru = !!from;
- if (from)
- MEM_CGROUP_ZSTAT(mz, MEM_CGROUP_ZSTAT_ACTIVE) -= 1;
- else
- MEM_CGROUP_ZSTAT(mz, MEM_CGROUP_ZSTAT_INACTIVE) -= 1;
+ MEM_CGROUP_ZSTAT(mz, lru) -= 1;
mem_cgroup_charge_statistics(pc->mem_cgroup, pc->flags, false);
list_del(&pc->lru);
@@ -302,37 +292,35 @@ static void __mem_cgroup_remove_list(str
static void __mem_cgroup_add_list(struct mem_cgroup_per_zone *mz,
struct page_cgroup *pc)
{
- int to = pc->flags & PAGE_CGROUP_FLAG_ACTIVE;
+ int lru = LRU_INACTIVE;
+
+ if (pc->flags & PAGE_CGROUP_FLAG_ACTIVE)
+ lru += LRU_ACTIVE;
+
+ MEM_CGROUP_ZSTAT(mz, lru) += 1;
+ list_add(&pc->lru, &mz->lists[lru]);
- if (!to) {
- MEM_CGROUP_ZSTAT(mz, MEM_CGROUP_ZSTAT_INACTIVE) += 1;
- list_add(&pc->lru, &mz->inactive_list);
- } else {
- MEM_CGROUP_ZSTAT(mz, MEM_CGROUP_ZSTAT_ACTIVE) += 1;
- list_add(&pc->lru, &mz->active_list);
- }
mem_cgroup_charge_statistics(pc->mem_cgroup, pc->flags, true);
}
static void __mem_cgroup_move_lists(struct page_cgroup *pc, bool active)
{
- int from = pc->flags & PAGE_CGROUP_FLAG_ACTIVE;
struct mem_cgroup_per_zone *mz = page_cgroup_zoneinfo(pc);
+ int lru = LRU_INACTIVE;
- if (from)
- MEM_CGROUP_ZSTAT(mz, MEM_CGROUP_ZSTAT_ACTIVE) -= 1;
- else
- MEM_CGROUP_ZSTAT(mz, MEM_CGROUP_ZSTAT_INACTIVE) -= 1;
+ if (pc->flags & PAGE_CGROUP_FLAG_ACTIVE)
+ lru += LRU_ACTIVE;
- if (active) {
- MEM_CGROUP_ZSTAT(mz, MEM_CGROUP_ZSTAT_ACTIVE) += 1;
+ MEM_CGROUP_ZSTAT(mz, lru) -= 1;
+
+ if (active)
pc->flags |= PAGE_CGROUP_FLAG_ACTIVE;
- list_move(&pc->lru, &mz->active_list);
- } else {
- MEM_CGROUP_ZSTAT(mz, MEM_CGROUP_ZSTAT_INACTIVE) += 1;
+ else
pc->flags &= ~PAGE_CGROUP_FLAG_ACTIVE;
- list_move(&pc->lru, &mz->inactive_list);
- }
+
+ lru = !!active;
+ MEM_CGROUP_ZSTAT(mz, lru) += 1;
+ list_move(&pc->lru, &mz->lists[lru]);
}
int task_in_mem_cgroup(struct task_struct *task, const struct mem_cgroup *mem)
@@ -401,8 +389,8 @@ long mem_cgroup_reclaim_imbalance(struct
{
unsigned long active, inactive;
/* active and inactive are the number of pages. 'long' is ok.*/
- active = mem_cgroup_get_all_zonestat(mem, MEM_CGROUP_ZSTAT_ACTIVE);
- inactive = mem_cgroup_get_all_zonestat(mem, MEM_CGROUP_ZSTAT_INACTIVE);
+ active = mem_cgroup_get_all_zonestat(mem, LRU_ACTIVE);
+ inactive = mem_cgroup_get_all_zonestat(mem, LRU_INACTIVE);
return (long) (active / (inactive + 1));
}
@@ -433,28 +421,17 @@ void mem_cgroup_record_reclaim_priority(
* (see include/linux/mmzone.h)
*/
-long mem_cgroup_calc_reclaim_active(struct mem_cgroup *mem,
- struct zone *zone, int priority)
+long mem_cgroup_calc_reclaim(struct mem_cgroup *mem, struct zone *zone,
+ int priority, enum lru_list lru)
{
- long nr_active;
+ long nr_pages;
int nid = zone->zone_pgdat->node_id;
int zid = zone_idx(zone);
struct mem_cgroup_per_zone *mz = mem_cgroup_zoneinfo(mem, nid, zid);
- nr_active = MEM_CGROUP_ZSTAT(mz, MEM_CGROUP_ZSTAT_ACTIVE);
- return (nr_active >> priority);
-}
-
-long mem_cgroup_calc_reclaim_inactive(struct mem_cgroup *mem,
- struct zone *zone, int priority)
-{
- long nr_inactive;
- int nid = zone->zone_pgdat->node_id;
- int zid = zone_idx(zone);
- struct mem_cgroup_per_zone *mz = mem_cgroup_zoneinfo(mem, nid, zid);
+ nr_pages = MEM_CGROUP_ZSTAT(mz, lru);
- nr_inactive = MEM_CGROUP_ZSTAT(mz, MEM_CGROUP_ZSTAT_INACTIVE);
- return (nr_inactive >> priority);
+ return (nr_pages >> priority);
}
unsigned long mem_cgroup_isolate_pages(unsigned long nr_to_scan,
@@ -473,14 +450,11 @@ unsigned long mem_cgroup_isolate_pages(u
int nid = z->zone_pgdat->node_id;
int zid = zone_idx(z);
struct mem_cgroup_per_zone *mz;
+ int lru = !!active;
BUG_ON(!mem_cont);
mz = mem_cgroup_zoneinfo(mem_cont, nid, zid);
- if (active)
- src = &mz->active_list;
- else
- src = &mz->inactive_list;
-
+ src = &mz->lists[lru];
spin_lock(&mz->lru_lock);
scan = 0;
@@ -771,7 +745,7 @@ void mem_cgroup_end_migration(struct pag
#define FORCE_UNCHARGE_BATCH (128)
static void mem_cgroup_force_empty_list(struct mem_cgroup *mem,
struct mem_cgroup_per_zone *mz,
- int active)
+ enum lru_list lru)
{
struct page_cgroup *pc;
struct page *page;
@@ -779,10 +753,7 @@ static void mem_cgroup_force_empty_list(
unsigned long flags;
struct list_head *list;
- if (active)
- list = &mz->active_list;
- else
- list = &mz->inactive_list;
+ list = &mz->lists[lru];
spin_lock_irqsave(&mz->lru_lock, flags);
while (!list_empty(list)) {
@@ -832,11 +803,10 @@ static int mem_cgroup_force_empty(struct
for_each_node_state(node, N_POSSIBLE)
for (zid = 0; zid < MAX_NR_ZONES; zid++) {
struct mem_cgroup_per_zone *mz;
+ enum lru_list l;
mz = mem_cgroup_zoneinfo(mem, node, zid);
- /* drop all page_cgroup in active_list */
- mem_cgroup_force_empty_list(mem, mz, 1);
- /* drop all page_cgroup in inactive_list */
- mem_cgroup_force_empty_list(mem, mz, 0);
+ for_each_lru(l)
+ mem_cgroup_force_empty_list(mem, mz, l);
}
}
ret = 0;
@@ -923,9 +893,9 @@ static int mem_control_stat_show(struct
unsigned long active, inactive;
inactive = mem_cgroup_get_all_zonestat(mem_cont,
- MEM_CGROUP_ZSTAT_INACTIVE);
+ LRU_INACTIVE);
active = mem_cgroup_get_all_zonestat(mem_cont,
- MEM_CGROUP_ZSTAT_ACTIVE);
+ LRU_ACTIVE);
cb->fill(cb, "active", (active) * PAGE_SIZE);
cb->fill(cb, "inactive", (inactive) * PAGE_SIZE);
}
@@ -970,6 +940,7 @@ static int alloc_mem_cgroup_per_zone_inf
{
struct mem_cgroup_per_node *pn;
struct mem_cgroup_per_zone *mz;
+ enum lru_list l;
int zone, tmp = node;
/*
* This routine is called against possible nodes.
@@ -990,9 +961,9 @@ static int alloc_mem_cgroup_per_zone_inf
for (zone = 0; zone < MAX_NR_ZONES; zone++) {
mz = &pn->zoneinfo[zone];
- INIT_LIST_HEAD(&mz->active_list);
- INIT_LIST_HEAD(&mz->inactive_list);
spin_lock_init(&mz->lru_lock);
+ for_each_lru(l)
+ INIT_LIST_HEAD(&mz->lists[l]);
}
return 0;
}
--
All Rights Reversed
* [PATCH -mm 03/25] use an array for the LRU pagevecs
2008-06-06 20:28 [PATCH -mm 00/25] VM pageout scalability improvements (V10) Rik van Riel
2008-06-06 20:28 ` [PATCH -mm 01/25] move isolate_lru_page() to vmscan.c Rik van Riel
2008-06-06 20:28 ` [PATCH -mm 02/25] Use an indexed array for LRU variables Rik van Riel
@ 2008-06-06 20:28 ` Rik van Riel
2008-06-06 20:28 ` [PATCH -mm 04/25] free swap space on swap-in/activation Rik van Riel
` (22 subsequent siblings)
25 siblings, 0 replies; 102+ messages in thread
From: Rik van Riel @ 2008-06-06 20:28 UTC (permalink / raw)
To: linux-kernel; +Cc: Andrew Morton, Lee Schermerhorn, Kosaki Motohiro
[-- Attachment #1: pagevec-array.patch --]
[-- Type: text/plain, Size: 9283 bytes --]
From: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Turn the pagevecs into an array just like the LRUs. This significantly
cleans up the source code and reduces the size of the kernel by about
13kB after all the LRU lists have been created further down in the split
VM patch series.
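The core of the change (as in the mm/swap.c hunks below): the per-CPU
pagevecs become an array indexed by LRU list, so a single helper can
add a page to any list:
	static DEFINE_PER_CPU(struct pagevec[NR_LRU_LISTS], lru_add_pvecs);
	void __lru_cache_add(struct page *page, enum lru_list lru)
	{
		/* Pick the per-CPU pagevec that feeds the target LRU. */
		struct pagevec *pvec = &get_cpu_var(lru_add_pvecs)[lru];
		page_cache_get(page);
		if (!pagevec_add(pvec, page))
			____pagevec_lru_add(pvec, lru);
		put_cpu_var(lru_add_pvecs);
	}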
Signed-off-by: Rik van Riel <riel@redhat.com>
Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
---
include/linux/mmzone.h | 13 +++++-
include/linux/pagevec.h | 13 +++++-
include/linux/swap.h | 18 ++++++++-
mm/migrate.c | 11 -----
mm/swap.c | 96 ++++++++++++++++++++++--------------------------
5 files changed, 83 insertions(+), 68 deletions(-)
Index: linux-2.6.26-rc2-mm1/include/linux/mmzone.h
===================================================================
--- linux-2.6.26-rc2-mm1.orig/include/linux/mmzone.h 2008-05-23 14:21:33.000000000 -0400
+++ linux-2.6.26-rc2-mm1/include/linux/mmzone.h 2008-05-23 14:21:33.000000000 -0400
@@ -107,13 +107,22 @@ enum zone_stat_item {
#endif
NR_VM_ZONE_STAT_ITEMS };
+#define LRU_BASE 0
+
enum lru_list {
- LRU_INACTIVE, /* must match order of NR_[IN]ACTIVE */
- LRU_ACTIVE, /* " " " " " */
+ LRU_INACTIVE = LRU_BASE, /* must match order of NR_[IN]ACTIVE */
+ LRU_ACTIVE, /* " " " " " */
NR_LRU_LISTS };
#define for_each_lru(l) for (l = 0; l < NR_LRU_LISTS; l++)
+static inline int is_active_lru(enum lru_list l)
+{
+ return (l == LRU_ACTIVE);
+}
+
+enum lru_list page_lru(struct page *page);
+
struct per_cpu_pages {
int count; /* number of pages in the list */
int high; /* high watermark, emptying needed */
Index: linux-2.6.26-rc2-mm1/mm/swap.c
===================================================================
--- linux-2.6.26-rc2-mm1.orig/mm/swap.c 2008-05-23 14:21:33.000000000 -0400
+++ linux-2.6.26-rc2-mm1/mm/swap.c 2008-05-23 14:21:33.000000000 -0400
@@ -34,8 +34,7 @@
/* How many pages do we try to swap or page in/out together? */
int page_cluster;
-static DEFINE_PER_CPU(struct pagevec, lru_add_pvecs) = { 0, };
-static DEFINE_PER_CPU(struct pagevec, lru_add_active_pvecs) = { 0, };
+static DEFINE_PER_CPU(struct pagevec[NR_LRU_LISTS], lru_add_pvecs) = { {0,}, };
static DEFINE_PER_CPU(struct pagevec, lru_rotate_pvecs) = { 0, };
/*
@@ -96,6 +95,23 @@ void put_pages_list(struct list_head *pa
}
EXPORT_SYMBOL(put_pages_list);
+/**
+ * page_lru - which LRU list should a page be on?
+ * @page: the page to test
+ *
+ * Returns the LRU list a page should be on, as an index
+ * into the array of LRU lists.
+ */
+enum lru_list page_lru(struct page *page)
+{
+ enum lru_list lru = LRU_BASE;
+
+ if (PageActive(page))
+ lru += LRU_ACTIVE;
+
+ return lru;
+}
+
/*
* pagevec_move_tail() must be called with IRQ disabled.
* Otherwise this may cause nasty races.
@@ -186,28 +202,29 @@ void mark_page_accessed(struct page *pag
EXPORT_SYMBOL(mark_page_accessed);
-/**
- * lru_cache_add: add a page to the page lists
- * @page: the page to add
- */
-void lru_cache_add(struct page *page)
+void __lru_cache_add(struct page *page, enum lru_list lru)
{
- struct pagevec *pvec = &get_cpu_var(lru_add_pvecs);
+ struct pagevec *pvec = &get_cpu_var(lru_add_pvecs)[lru];
page_cache_get(page);
if (!pagevec_add(pvec, page))
- __pagevec_lru_add(pvec);
+ ____pagevec_lru_add(pvec, lru);
put_cpu_var(lru_add_pvecs);
}
-void lru_cache_add_active(struct page *page)
+/**
+ * lru_cache_add_lru - add a page to a page list
+ * @page: the page to be added to the LRU.
+ * @lru: the LRU list to which the page is added.
+ */
+void lru_cache_add_lru(struct page *page, enum lru_list lru)
{
- struct pagevec *pvec = &get_cpu_var(lru_add_active_pvecs);
+ if (PageActive(page)) {
+ ClearPageActive(page);
+ }
- page_cache_get(page);
- if (!pagevec_add(pvec, page))
- __pagevec_lru_add_active(pvec);
- put_cpu_var(lru_add_active_pvecs);
+ VM_BUG_ON(PageLRU(page) || PageActive(page));
+ __lru_cache_add(page, lru);
}
/*
@@ -217,15 +234,15 @@ void lru_cache_add_active(struct page *p
*/
static void drain_cpu_pagevecs(int cpu)
{
+ struct pagevec *pvecs = per_cpu(lru_add_pvecs, cpu);
struct pagevec *pvec;
+ int lru;
- pvec = &per_cpu(lru_add_pvecs, cpu);
- if (pagevec_count(pvec))
- __pagevec_lru_add(pvec);
-
- pvec = &per_cpu(lru_add_active_pvecs, cpu);
- if (pagevec_count(pvec))
- __pagevec_lru_add_active(pvec);
+ for_each_lru(lru) {
+ pvec = &pvecs[lru - LRU_BASE];
+ if (pagevec_count(pvec))
+ ____pagevec_lru_add(pvec, lru);
+ }
pvec = &per_cpu(lru_rotate_pvecs, cpu);
if (pagevec_count(pvec)) {
@@ -379,7 +396,7 @@ void __pagevec_release_nonlru(struct pag
* Add the passed pages to the LRU, then drop the caller's refcount
* on them. Reinitialises the caller's pagevec.
*/
-void __pagevec_lru_add(struct pagevec *pvec)
+void ____pagevec_lru_add(struct pagevec *pvec, enum lru_list lru)
{
int i;
struct zone *zone = NULL;
@@ -396,7 +413,9 @@ void __pagevec_lru_add(struct pagevec *p
}
VM_BUG_ON(PageLRU(page));
SetPageLRU(page);
- add_page_to_inactive_list(zone, page);
+ if (is_active_lru(lru))
+ SetPageActive(page);
+ add_page_to_lru_list(zone, page, lru);
}
if (zone)
spin_unlock_irq(&zone->lru_lock);
@@ -404,34 +423,7 @@ void __pagevec_lru_add(struct pagevec *p
pagevec_reinit(pvec);
}
-EXPORT_SYMBOL(__pagevec_lru_add);
-
-void __pagevec_lru_add_active(struct pagevec *pvec)
-{
- int i;
- struct zone *zone = NULL;
-
- for (i = 0; i < pagevec_count(pvec); i++) {
- struct page *page = pvec->pages[i];
- struct zone *pagezone = page_zone(page);
-
- if (pagezone != zone) {
- if (zone)
- spin_unlock_irq(&zone->lru_lock);
- zone = pagezone;
- spin_lock_irq(&zone->lru_lock);
- }
- VM_BUG_ON(PageLRU(page));
- SetPageLRU(page);
- VM_BUG_ON(PageActive(page));
- SetPageActive(page);
- add_page_to_active_list(zone, page);
- }
- if (zone)
- spin_unlock_irq(&zone->lru_lock);
- release_pages(pvec->pages, pvec->nr, pvec->cold);
- pagevec_reinit(pvec);
-}
+EXPORT_SYMBOL(____pagevec_lru_add);
/*
* Try to drop buffers from the pages in a pagevec
Index: linux-2.6.26-rc2-mm1/include/linux/pagevec.h
===================================================================
--- linux-2.6.26-rc2-mm1.orig/include/linux/pagevec.h 2008-05-23 14:21:21.000000000 -0400
+++ linux-2.6.26-rc2-mm1/include/linux/pagevec.h 2008-05-23 14:21:33.000000000 -0400
@@ -23,8 +23,7 @@ struct pagevec {
void __pagevec_release(struct pagevec *pvec);
void __pagevec_release_nonlru(struct pagevec *pvec);
void __pagevec_free(struct pagevec *pvec);
-void __pagevec_lru_add(struct pagevec *pvec);
-void __pagevec_lru_add_active(struct pagevec *pvec);
+void ____pagevec_lru_add(struct pagevec *pvec, enum lru_list lru);
void pagevec_strip(struct pagevec *pvec);
unsigned pagevec_lookup(struct pagevec *pvec, struct address_space *mapping,
pgoff_t start, unsigned nr_pages);
@@ -81,6 +80,16 @@ static inline void pagevec_free(struct p
__pagevec_free(pvec);
}
+static inline void __pagevec_lru_add(struct pagevec *pvec)
+{
+ ____pagevec_lru_add(pvec, LRU_INACTIVE);
+}
+
+static inline void __pagevec_lru_add_active(struct pagevec *pvec)
+{
+ ____pagevec_lru_add(pvec, LRU_ACTIVE);
+}
+
static inline void pagevec_lru_add(struct pagevec *pvec)
{
if (pagevec_count(pvec))
Index: linux-2.6.26-rc2-mm1/include/linux/swap.h
===================================================================
--- linux-2.6.26-rc2-mm1.orig/include/linux/swap.h 2008-05-23 14:21:21.000000000 -0400
+++ linux-2.6.26-rc2-mm1/include/linux/swap.h 2008-05-23 14:21:33.000000000 -0400
@@ -171,8 +171,8 @@ extern unsigned int nr_free_pagecache_pa
/* linux/mm/swap.c */
-extern void lru_cache_add(struct page *);
-extern void lru_cache_add_active(struct page *);
+extern void __lru_cache_add(struct page *, enum lru_list lru);
+extern void lru_cache_add_lru(struct page *, enum lru_list lru);
extern void activate_page(struct page *);
extern void mark_page_accessed(struct page *);
extern void lru_add_drain(void);
@@ -180,6 +180,20 @@ extern int lru_add_drain_all(void);
extern void rotate_reclaimable_page(struct page *page);
extern void swap_setup(void);
+/**
+ * lru_cache_add: add a page to the page lists
+ * @page: the page to add
+ */
+static inline void lru_cache_add(struct page *page)
+{
+ __lru_cache_add(page, LRU_INACTIVE);
+}
+
+static inline void lru_cache_add_active(struct page *page)
+{
+ __lru_cache_add(page, LRU_ACTIVE);
+}
+
/* linux/mm/vmscan.c */
extern unsigned long try_to_free_pages(struct zonelist *zonelist, int order,
gfp_t gfp_mask);
Index: linux-2.6.26-rc2-mm1/mm/migrate.c
===================================================================
--- linux-2.6.26-rc2-mm1.orig/mm/migrate.c 2008-05-23 14:21:32.000000000 -0400
+++ linux-2.6.26-rc2-mm1/mm/migrate.c 2008-05-23 14:21:33.000000000 -0400
@@ -55,16 +55,7 @@ int migrate_prep(void)
static inline void move_to_lru(struct page *page)
{
- if (PageActive(page)) {
- /*
- * lru_cache_add_active checks that
- * the PG_active bit is off.
- */
- ClearPageActive(page);
- lru_cache_add_active(page);
- } else {
- lru_cache_add(page);
- }
+ lru_cache_add_lru(page, page_lru(page));
put_page(page);
}
--
All Rights Reversed
* [PATCH -mm 04/25] free swap space on swap-in/activation
2008-06-06 20:28 [PATCH -mm 00/25] VM pageout scalability improvements (V10) Rik van Riel
` (2 preceding siblings ...)
2008-06-06 20:28 ` [PATCH -mm 03/25] use an array for the LRU pagevecs Rik van Riel
@ 2008-06-06 20:28 ` Rik van Riel
2008-06-07 1:04 ` Andrew Morton
2008-06-06 20:28 ` [PATCH -mm 05/25] define page_file_cache() function Rik van Riel
` (21 subsequent siblings)
25 siblings, 1 reply; 102+ messages in thread
From: Rik van Riel @ 2008-06-06 20:28 UTC (permalink / raw)
To: linux-kernel
Cc: Andrew Morton, Lee Schermerhorn, Kosaki Motohiro, MinChan Kim,
Lee Schermerhorn
[-- Attachment #1: rvr-00-linux-2.6-swapfree.patch --]
[-- Type: text/plain, Size: 6190 bytes --]
From: Rik van Riel <riel@redhat.com>
Free swap cache entries when swapping in pages, if vm_swap_full()
(i.e. more than half of swap space is in use). Uses a new pagevec
helper to reduce pressure on the locks.
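The resulting pattern in the reclaim paths (as in the vmscan.c hunks
below) is simply:
	/*
	 * Swap space is more than half full, so drop the swap cache
	 * entries of the pages in this pagevec in one batch instead
	 * of taking the swap locks once per page.
	 */
	if (vm_swap_full())
		pagevec_swap_free(&pvec);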
Signed-off-by: Rik van Riel <riel@redhat.com>
Signed-off-by: Lee Schermerhorn <Lee.Schermerhorn@hp.com>
Signed-off-by: MinChan Kim <minchan.kim@gmail.com>
---
include/linux/pagevec.h | 1 +
include/linux/swap.h | 6 ++++++
mm/swap.c | 18 ++++++++++++++++++
mm/swapfile.c | 25 ++++++++++++++++++++++---
mm/vmscan.c | 7 +++++++
5 files changed, 54 insertions(+), 3 deletions(-)
Index: linux-2.6.26-rc2-mm1/mm/vmscan.c
===================================================================
--- linux-2.6.26-rc2-mm1.orig/mm/vmscan.c 2008-05-23 14:21:33.000000000 -0400
+++ linux-2.6.26-rc2-mm1/mm/vmscan.c 2008-05-23 14:21:33.000000000 -0400
@@ -619,6 +619,9 @@ free_it:
continue;
activate_locked:
+ /* Not a candidate for swapping, so reclaim swap space. */
+ if (PageSwapCache(page) && vm_swap_full())
+ remove_exclusive_swap_page_ref(page);
SetPageActive(page);
pgactivate++;
keep_locked:
@@ -1203,6 +1206,8 @@ static void shrink_active_list(unsigned
__mod_zone_page_state(zone, NR_ACTIVE, pgmoved);
pgmoved = 0;
spin_unlock_irq(&zone->lru_lock);
+ if (vm_swap_full())
+ pagevec_swap_free(&pvec);
__pagevec_release(&pvec);
spin_lock_irq(&zone->lru_lock);
}
@@ -1212,6 +1217,8 @@ static void shrink_active_list(unsigned
__count_zone_vm_events(PGREFILL, zone, pgscanned);
__count_vm_events(PGDEACTIVATE, pgdeactivate);
spin_unlock_irq(&zone->lru_lock);
+ if (vm_swap_full())
+ pagevec_swap_free(&pvec);
pagevec_release(&pvec);
}
Index: linux-2.6.26-rc2-mm1/mm/swap.c
===================================================================
--- linux-2.6.26-rc2-mm1.orig/mm/swap.c 2008-05-23 14:21:33.000000000 -0400
+++ linux-2.6.26-rc2-mm1/mm/swap.c 2008-05-23 14:21:33.000000000 -0400
@@ -443,6 +443,24 @@ void pagevec_strip(struct pagevec *pvec)
}
}
+/*
+ * Try to free swap space from the pages in a pagevec
+ */
+void pagevec_swap_free(struct pagevec *pvec)
+{
+ int i;
+
+ for (i = 0; i < pagevec_count(pvec); i++) {
+ struct page *page = pvec->pages[i];
+
+ if (PageSwapCache(page) && !TestSetPageLocked(page)) {
+ if (PageSwapCache(page))
+ remove_exclusive_swap_page_ref(page);
+ unlock_page(page);
+ }
+ }
+}
+
/**
* pagevec_lookup - gang pagecache lookup
* @pvec: Where the resulting pages are placed
Index: linux-2.6.26-rc2-mm1/include/linux/pagevec.h
===================================================================
--- linux-2.6.26-rc2-mm1.orig/include/linux/pagevec.h 2008-05-23 14:21:33.000000000 -0400
+++ linux-2.6.26-rc2-mm1/include/linux/pagevec.h 2008-05-23 14:21:33.000000000 -0400
@@ -25,6 +25,7 @@ void __pagevec_release_nonlru(struct pag
void __pagevec_free(struct pagevec *pvec);
void ____pagevec_lru_add(struct pagevec *pvec, enum lru_list lru);
void pagevec_strip(struct pagevec *pvec);
+void pagevec_swap_free(struct pagevec *pvec);
unsigned pagevec_lookup(struct pagevec *pvec, struct address_space *mapping,
pgoff_t start, unsigned nr_pages);
unsigned pagevec_lookup_tag(struct pagevec *pvec,
Index: linux-2.6.26-rc2-mm1/include/linux/swap.h
===================================================================
--- linux-2.6.26-rc2-mm1.orig/include/linux/swap.h 2008-05-23 14:21:33.000000000 -0400
+++ linux-2.6.26-rc2-mm1/include/linux/swap.h 2008-05-23 14:21:33.000000000 -0400
@@ -265,6 +265,7 @@ extern sector_t swapdev_block(int, pgoff
extern struct swap_info_struct *get_swap_info_struct(unsigned);
extern int can_share_swap_page(struct page *);
extern int remove_exclusive_swap_page(struct page *);
+extern int remove_exclusive_swap_page_ref(struct page *);
struct backing_dev_info;
/* linux/mm/thrash.c */
@@ -353,6 +354,11 @@ static inline int remove_exclusive_swap_
return 0;
}
+static inline int remove_exclusive_swap_page_ref(struct page *page)
+{
+ return 0;
+}
+
static inline swp_entry_t get_swap_page(void)
{
swp_entry_t entry;
Index: linux-2.6.26-rc2-mm1/mm/swapfile.c
===================================================================
--- linux-2.6.26-rc2-mm1.orig/mm/swapfile.c 2008-05-23 14:21:21.000000000 -0400
+++ linux-2.6.26-rc2-mm1/mm/swapfile.c 2008-05-23 14:21:33.000000000 -0400
@@ -343,7 +343,7 @@ int can_share_swap_page(struct page *pag
* Work out if there are any other processes sharing this
* swap cache page. Free it if you can. Return success.
*/
-int remove_exclusive_swap_page(struct page *page)
+static int remove_exclusive_swap_page_count(struct page *page, int count)
{
int retval;
struct swap_info_struct * p;
@@ -356,7 +356,7 @@ int remove_exclusive_swap_page(struct pa
return 0;
if (PageWriteback(page))
return 0;
- if (page_count(page) != 2) /* 2: us + cache */
+ if (page_count(page) != count) /* us + cache + ptes */
return 0;
entry.val = page_private(page);
@@ -369,7 +369,7 @@ int remove_exclusive_swap_page(struct pa
if (p->swap_map[swp_offset(entry)] == 1) {
/* Recheck the page count with the swapcache lock held.. */
write_lock_irq(&swapper_space.tree_lock);
- if ((page_count(page) == 2) && !PageWriteback(page)) {
+ if ((page_count(page) == count) && !PageWriteback(page)) {
__delete_from_swap_cache(page);
SetPageDirty(page);
retval = 1;
@@ -387,6 +387,25 @@ int remove_exclusive_swap_page(struct pa
}
/*
+ * Most of the time the page should have two references: one for the
+ * process and one for the swap cache.
+ */
+int remove_exclusive_swap_page(struct page *page)
+{
+ return remove_exclusive_swap_page_count(page, 2);
+}
+
+/*
+ * The pageout code holds an extra reference to the page. That raises
+ * the reference count to test for to 2 for a page that is only in the
+ * swap cache plus 1 for each process that maps the page.
+ */
+int remove_exclusive_swap_page_ref(struct page *page)
+{
+ return remove_exclusive_swap_page_count(page, 2 + page_mapcount(page));
+}
+
+/*
* Free the swap entry like above, but also try to
* free the page cache entry if it is the last user.
*/
--
All Rights Reversed
* [PATCH -mm 05/25] define page_file_cache() function
2008-06-06 20:28 [PATCH -mm 00/25] VM pageout scalability improvements (V10) Rik van Riel
` (3 preceding siblings ...)
2008-06-06 20:28 ` [PATCH -mm 04/25] free swap space on swap-in/activation Rik van Riel
@ 2008-06-06 20:28 ` Rik van Riel
2008-06-07 1:04 ` Andrew Morton
2008-06-06 20:28 ` [PATCH -mm 06/25] split LRU lists into anon & file sets Rik van Riel
` (20 subsequent siblings)
25 siblings, 1 reply; 102+ messages in thread
From: Rik van Riel @ 2008-06-06 20:28 UTC (permalink / raw)
To: linux-kernel
Cc: Andrew Morton, Lee Schermerhorn, Kosaki Motohiro, MinChan Kim
[-- Attachment #1: rvr-01-linux-2.6-page_file_cache.patch --]
[-- Type: text/plain, Size: 7664 bytes --]
From: Rik van Riel <riel@redhat.com>
Define the page_file_cache() function to answer the question:
is this page backed by a file?
Originally part of Rik van Riel's split-lru patch. Extracted
to make available for other, independent reclaim patches.
Moved inline function to linux/mm_inline.h where it will
be needed by subsequent "split LRU" and "noreclaim" patches.
Unfortunately this needs to use a page flag, since the
PG_swapbacked state needs to be preserved all the way
to the point where the page is last removed from the
LRU. Trying to derive the status from other info in
the page resulted in wrong VM statistics in earlier
split VM patchsets.
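A hedged sketch of how later patches in this series are expected to use
it (the lru variable and the file-list offset of 2 come from the split
LRU patch that follows; treat the details as illustrative):
	enum lru_list lru = LRU_BASE;
	/*
	 * page_file_cache() returns 2 rather than 1 so its result can be
	 * added directly to an LRU index once the file lists sit at an
	 * offset of 2 from the anon lists.
	 */
	lru += page_file_cache(page);	/* 0 for anon, 2 for file */
	if (PageActive(page))
		lru += LRU_ACTIVE;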
Signed-off-by: Rik van Riel <riel@redhat.com>
Signed-off-by: Lee Schermerhorn <lee.schermerhorn@hp.com>
Signed-off-by: MinChan Kim <minchan.kim@gmail.com>
---
include/linux/mm_inline.h | 24 ++++++++++++++++++++++++
include/linux/page-flags.h | 2 ++
mm/memory.c | 3 +++
mm/migrate.c | 2 ++
mm/page_alloc.c | 4 ++++
mm/shmem.c | 1 +
mm/swap_state.c | 3 +++
7 files changed, 39 insertions(+)
Index: linux-2.6.26-rc2-mm1/include/linux/mm_inline.h
===================================================================
--- linux-2.6.26-rc2-mm1.orig/include/linux/mm_inline.h 2008-05-23 14:21:33.000000000 -0400
+++ linux-2.6.26-rc2-mm1/include/linux/mm_inline.h 2008-05-23 14:21:34.000000000 -0400
@@ -1,3 +1,26 @@
+#ifndef LINUX_MM_INLINE_H
+#define LINUX_MM_INLINE_H
+
+/**
+ * page_file_cache - should the page be on a file LRU or anon LRU?
+ * @page: the page to test
+ *
+ * Returns !0 if @page is page cache page backed by a regular filesystem,
+ * or 0 if @page is anonymous, tmpfs or otherwise ram or swap backed.
+ *
+ * We would like to get this info without a page flag, but the state
+ * needs to survive until the page is last deleted from the LRU, which
+ * could be as far down as __page_cache_release.
+ */
+static inline int page_file_cache(struct page *page)
+{
+ if (PageSwapBacked(page))
+ return 0;
+
+ /* The page is page cache backed by a normal filesystem. */
+ return 2;
+}
+
static inline void
add_page_to_lru_list(struct zone *zone, struct page *page, enum lru_list l)
{
@@ -49,3 +72,4 @@ del_page_from_lru(struct zone *zone, str
__dec_zone_state(zone, NR_INACTIVE + l);
}
+#endif
Index: linux-2.6.26-rc2-mm1/mm/shmem.c
===================================================================
--- linux-2.6.26-rc2-mm1.orig/mm/shmem.c 2008-05-23 14:21:21.000000000 -0400
+++ linux-2.6.26-rc2-mm1/mm/shmem.c 2008-05-23 14:21:34.000000000 -0400
@@ -1378,6 +1378,7 @@ repeat:
goto failed;
}
+ SetPageSwapBacked(filepage);
spin_lock(&info->lock);
entry = shmem_swp_alloc(info, idx, sgp);
if (IS_ERR(entry))
Index: linux-2.6.26-rc2-mm1/include/linux/page-flags.h
===================================================================
--- linux-2.6.26-rc2-mm1.orig/include/linux/page-flags.h 2008-05-23 14:21:21.000000000 -0400
+++ linux-2.6.26-rc2-mm1/include/linux/page-flags.h 2008-05-23 14:21:34.000000000 -0400
@@ -93,6 +93,7 @@ enum pageflags {
PG_mappedtodisk, /* Has blocks allocated on-disk */
PG_reclaim, /* To be reclaimed asap */
PG_buddy, /* Page is free, on buddy lists */
+ PG_swapbacked, /* Page is backed by RAM/swap */
#ifdef CONFIG_IA64_UNCACHED_ALLOCATOR
PG_uncached, /* Page has been mapped as uncached */
#endif
@@ -160,6 +161,7 @@ PAGEFLAG(Pinned, owner_priv_1) TESTSCFLA
PAGEFLAG(Reserved, reserved) __CLEARPAGEFLAG(Reserved, reserved)
PAGEFLAG(Private, private) __CLEARPAGEFLAG(Private, private)
__SETPAGEFLAG(Private, private)
+PAGEFLAG(SwapBacked, swapbacked) __CLEARPAGEFLAG(SwapBacked, swapbacked)
/*
* Only test-and-set exist for PG_writeback. The unconditional operators are
Index: linux-2.6.26-rc2-mm1/mm/memory.c
===================================================================
--- linux-2.6.26-rc2-mm1.orig/mm/memory.c 2008-05-23 14:21:21.000000000 -0400
+++ linux-2.6.26-rc2-mm1/mm/memory.c 2008-05-23 14:21:34.000000000 -0400
@@ -1765,6 +1765,7 @@ gotten:
ptep_clear_flush(vma, address, page_table);
set_pte_at(mm, address, page_table, entry);
update_mmu_cache(vma, address, entry);
+ SetPageSwapBacked(new_page);
lru_cache_add_active(new_page);
page_add_new_anon_rmap(new_page, vma, address);
@@ -2233,6 +2234,7 @@ static int do_anonymous_page(struct mm_s
if (!pte_none(*page_table))
goto release;
inc_mm_counter(mm, anon_rss);
+ SetPageSwapBacked(page);
lru_cache_add_active(page);
page_add_new_anon_rmap(page, vma, address);
set_pte_at(mm, address, page_table, entry);
@@ -2374,6 +2376,7 @@ static int __do_fault(struct mm_struct *
set_pte_at(mm, address, page_table, entry);
if (anon) {
inc_mm_counter(mm, anon_rss);
+ SetPageSwapBacked(page);
lru_cache_add_active(page);
page_add_new_anon_rmap(page, vma, address);
} else {
Index: linux-2.6.26-rc2-mm1/mm/swap_state.c
===================================================================
--- linux-2.6.26-rc2-mm1.orig/mm/swap_state.c 2008-05-23 14:21:21.000000000 -0400
+++ linux-2.6.26-rc2-mm1/mm/swap_state.c 2008-05-23 14:21:34.000000000 -0400
@@ -74,6 +74,7 @@ int add_to_swap_cache(struct page *page,
BUG_ON(!PageLocked(page));
BUG_ON(PageSwapCache(page));
BUG_ON(PagePrivate(page));
+ BUG_ON(!PageSwapBacked(page));
error = radix_tree_preload(gfp_mask);
if (!error) {
write_lock_irq(&swapper_space.tree_lock);
@@ -295,6 +296,7 @@ struct page *read_swap_cache_async(swp_e
* May fail (-ENOMEM) if radix-tree node allocation failed.
*/
SetPageLocked(new_page);
+ SetPageSwapBacked(new_page);
err = add_to_swap_cache(new_page, entry, gfp_mask & GFP_KERNEL);
if (!err) {
/*
@@ -304,6 +306,7 @@ struct page *read_swap_cache_async(swp_e
swap_readpage(NULL, new_page);
return new_page;
}
+ ClearPageSwapBacked(new_page);
ClearPageLocked(new_page);
swap_free(entry);
} while (err != -ENOMEM);
Index: linux-2.6.26-rc2-mm1/mm/migrate.c
===================================================================
--- linux-2.6.26-rc2-mm1.orig/mm/migrate.c 2008-05-23 14:21:33.000000000 -0400
+++ linux-2.6.26-rc2-mm1/mm/migrate.c 2008-05-23 14:21:34.000000000 -0400
@@ -558,6 +558,8 @@ static int move_to_new_page(struct page
/* Prepare mapping for the new page.*/
newpage->index = page->index;
newpage->mapping = page->mapping;
+ if (PageSwapBacked(page))
+ SetPageSwapBacked(newpage);
mapping = page_mapping(page);
if (!mapping)
Index: linux-2.6.26-rc2-mm1/mm/page_alloc.c
===================================================================
--- linux-2.6.26-rc2-mm1.orig/mm/page_alloc.c 2008-05-23 14:21:33.000000000 -0400
+++ linux-2.6.26-rc2-mm1/mm/page_alloc.c 2008-05-23 14:21:34.000000000 -0400
@@ -261,6 +261,7 @@ static void bad_page(struct page *page)
1 << PG_slab |
1 << PG_swapcache |
1 << PG_writeback |
+ 1 << PG_swapbacked |
1 << PG_buddy );
set_page_count(page, 0);
reset_page_mapcount(page);
@@ -494,6 +495,8 @@ static inline int free_pages_check(struc
bad_page(page);
if (PageDirty(page))
__ClearPageDirty(page);
+ if (PageSwapBacked(page))
+ __ClearPageSwapBacked(page);
/*
* For now, we report if PG_reserved was found set, but do not
* clear it, and do not free the page. But we shall soon need
@@ -644,6 +647,7 @@ static int prep_new_page(struct page *pa
1 << PG_swapcache |
1 << PG_writeback |
1 << PG_reserved |
+ 1 << PG_swapbacked |
1 << PG_buddy ))))
bad_page(page);
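Taken together, the hunks above give PG_swapbacked a simple life cycle;
as a recap of the changes in this patch (not additional code):

	SetPageSwapBacked(page);	/* anon fault, shmem and swapin paths */

	if (PageSwapBacked(page))	/* migration copies the bit over */
		SetPageSwapBacked(newpage);

	__ClearPageSwapBacked(page);	/* and it is cleared at page free */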
--
All Rights Reversed
* [PATCH -mm 06/25] split LRU lists into anon & file sets
2008-06-06 20:28 [PATCH -mm 00/25] VM pageout scalability improvements (V10) Rik van Riel, Rik van Riel
` (4 preceding siblings ...)
2008-06-06 20:28 ` [PATCH -mm 05/25] define page_file_cache() function Rik van Riel, Rik van Riel
@ 2008-06-06 20:28 ` Rik van Riel, Rik van Riel
2008-06-07 1:04 ` Andrew Morton
2008-06-06 20:28 ` [PATCH -mm 07/25] second chance replacement for anonymous pages Rik van Riel, Rik van Riel
` (19 subsequent siblings)
25 siblings, 1 reply; 102+ messages in thread
From: Rik van Riel, Rik van Riel @ 2008-06-06 20:28 UTC (permalink / raw)
To: linux-kernel
Cc: Andrew Morton, Lee Schermerhorn, Kosaki Motohiro,
Lee Schermerhorn
[-- Attachment #1: rvr-02-linux-2.6-vm-split-lrus.patch --]
[-- Type: text/plain, Size: 59936 bytes --]
From: Rik van Riel <riel@redhat.com>
Split the LRU lists in two, one set for pages that are backed by
real file systems ("file") and one for pages that are backed by
memory and swap ("anon"). The latter includes tmpfs.
Eventually mlocked pages will be taken off the LRUs altogether.
A patch for that already exists and just needs to be integrated
into this series.
This patch mostly has the infrastructure and a basic policy to
balance how much we scan the anon lists and how much we scan
the file lists. The big policy changes are in separate patches.
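All of the bookkeeping rests on plain index arithmetic between the four
lists. As a readability recap of the mmzone.h hunk below:

	#define LRU_BASE	0
	#define LRU_ACTIVE	1	/* active list == inactive list + 1 */
	#define LRU_FILE	2	/* file lists  == anon lists + 2 */

	enum lru_list {
		LRU_INACTIVE_ANON = LRU_BASE,
		LRU_ACTIVE_ANON   = LRU_BASE + LRU_ACTIVE,
		LRU_INACTIVE_FILE = LRU_BASE + LRU_FILE,
		LRU_ACTIVE_FILE   = LRU_BASE + LRU_FILE + LRU_ACTIVE,
		NR_LRU_LISTS
	};

so an active file page lives at index 0 + 2 + 1, and a page's list can be
computed as LRU_BASE + page_file_cache(page) + (PageActive(page) ?
LRU_ACTIVE : 0).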
Signed-off-by: Rik van Riel <riel@redhat.com>
Signed-off-by: Lee Schermerhorn <Lee.Schermerhorn@hp.com>
---
drivers/base/node.c | 56 +++--
fs/cifs/file.c | 4
fs/nfs/dir.c | 2
fs/ntfs/file.c | 4
fs/proc/proc_misc.c | 81 +++++---
fs/ramfs/file-nommu.c | 4
include/linux/memcontrol.h | 2
include/linux/mm_inline.h | 55 ++++-
include/linux/mmzone.h | 33 ++-
include/linux/pagevec.h | 29 ++-
include/linux/swap.h | 20 +-
include/linux/vmstat.h | 10 +
mm/filemap.c | 9
mm/memcontrol.c | 73 +++----
mm/memory.c | 6
mm/page-writeback.c | 8
mm/page_alloc.c | 24 +-
mm/readahead.c | 2
mm/swap.c | 12 -
mm/swap_state.c | 2
mm/vmscan.c | 428 ++++++++++++++++++++-------------------------
mm/vmstat.c | 14 -
22 files changed, 491 insertions(+), 387 deletions(-)
Index: linux-2.6.26-rc2-mm1/fs/proc/proc_misc.c
===================================================================
--- linux-2.6.26-rc2-mm1.orig/fs/proc/proc_misc.c 2008-05-23 14:21:21.000000000 -0400
+++ linux-2.6.26-rc2-mm1/fs/proc/proc_misc.c 2008-05-23 14:21:34.000000000 -0400
@@ -132,6 +132,10 @@ static int meminfo_read_proc(char *page,
unsigned long allowed;
struct vmalloc_info vmi;
long cached;
+ unsigned long inactive_anon;
+ unsigned long active_anon;
+ unsigned long inactive_file;
+ unsigned long active_file;
/*
* display in kilobytes.
@@ -150,48 +154,61 @@ static int meminfo_read_proc(char *page,
get_vmalloc_info(&vmi);
+ inactive_anon = global_page_state(NR_INACTIVE_ANON);
+ active_anon = global_page_state(NR_ACTIVE_ANON);
+ inactive_file = global_page_state(NR_INACTIVE_FILE);
+ active_file = global_page_state(NR_ACTIVE_FILE);
+
/*
* Tagged format, for easy grepping and expansion.
*/
len = sprintf(page,
- "MemTotal: %8lu kB\n"
- "MemFree: %8lu kB\n"
- "Buffers: %8lu kB\n"
- "Cached: %8lu kB\n"
- "SwapCached: %8lu kB\n"
- "Active: %8lu kB\n"
- "Inactive: %8lu kB\n"
+ "MemTotal: %8lu kB\n"
+ "MemFree: %8lu kB\n"
+ "Buffers: %8lu kB\n"
+ "Cached: %8lu kB\n"
+ "SwapCached: %8lu kB\n"
+ "Active: %8lu kB\n"
+ "Inactive: %8lu kB\n"
+ "Active(anon): %8lu kB\n"
+ "Inactive(anon): %8lu kB\n"
+ "Active(file): %8lu kB\n"
+ "Inactive(file): %8lu kB\n"
#ifdef CONFIG_HIGHMEM
- "HighTotal: %8lu kB\n"
- "HighFree: %8lu kB\n"
- "LowTotal: %8lu kB\n"
- "LowFree: %8lu kB\n"
-#endif
- "SwapTotal: %8lu kB\n"
- "SwapFree: %8lu kB\n"
- "Dirty: %8lu kB\n"
- "Writeback: %8lu kB\n"
- "AnonPages: %8lu kB\n"
- "Mapped: %8lu kB\n"
- "Slab: %8lu kB\n"
- "SReclaimable: %8lu kB\n"
- "SUnreclaim: %8lu kB\n"
- "PageTables: %8lu kB\n"
- "NFS_Unstable: %8lu kB\n"
- "Bounce: %8lu kB\n"
- "WritebackTmp: %8lu kB\n"
- "CommitLimit: %8lu kB\n"
- "Committed_AS: %8lu kB\n"
- "VmallocTotal: %8lu kB\n"
- "VmallocUsed: %8lu kB\n"
- "VmallocChunk: %8lu kB\n",
+ "HighTotal: %8lu kB\n"
+ "HighFree: %8lu kB\n"
+ "LowTotal: %8lu kB\n"
+ "LowFree: %8lu kB\n"
+#endif
+ "SwapTotal: %8lu kB\n"
+ "SwapFree: %8lu kB\n"
+ "Dirty: %8lu kB\n"
+ "Writeback: %8lu kB\n"
+ "AnonPages: %8lu kB\n"
+ "Mapped: %8lu kB\n"
+ "Slab: %8lu kB\n"
+ "SReclaimable: %8lu kB\n"
+ "SUnreclaim: %8lu kB\n"
+ "PageTables: %8lu kB\n"
+ "NFS_Unstable: %8lu kB\n"
+ "Bounce: %8lu kB\n"
+ "WritebackTmp: %8lu kB\n"
+ "CommitLimit: %8lu kB\n"
+ "Committed_AS: %8lu kB\n"
+ "VmallocTotal: %8lu kB\n"
+ "VmallocUsed: %8lu kB\n"
+ "VmallocChunk: %8lu kB\n",
K(i.totalram),
K(i.freeram),
K(i.bufferram),
K(cached),
K(total_swapcache_pages),
- K(global_page_state(NR_ACTIVE)),
- K(global_page_state(NR_INACTIVE)),
+ K(active_anon + active_file),
+ K(inactive_anon + inactive_file),
+ K(active_anon),
+ K(inactive_anon),
+ K(active_file),
+ K(inactive_file),
#ifdef CONFIG_HIGHMEM
K(i.totalhigh),
K(i.freehigh),
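With this change /proc/meminfo reports both the combined and the per-type
LRU sizes. Roughly (sample values invented for illustration; Active and
Inactive are simply the sums of their anon and file parts):

	Active:          3084400 kB
	Inactive:        1459680 kB
	Active(anon):    1752560 kB
	Inactive(anon):   221224 kB
	Active(file):    1331840 kB
	Inactive(file):  1238456 kB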
Index: linux-2.6.26-rc2-mm1/fs/cifs/file.c
===================================================================
--- linux-2.6.26-rc2-mm1.orig/fs/cifs/file.c 2008-05-23 14:21:21.000000000 -0400
+++ linux-2.6.26-rc2-mm1/fs/cifs/file.c 2008-05-23 14:21:34.000000000 -0400
@@ -1778,7 +1778,7 @@ static void cifs_copy_cache_pages(struct
SetPageUptodate(page);
unlock_page(page);
if (!pagevec_add(plru_pvec, page))
- __pagevec_lru_add(plru_pvec);
+ __pagevec_lru_add_file(plru_pvec);
data += PAGE_CACHE_SIZE;
}
return;
@@ -1912,7 +1912,7 @@ static int cifs_readpages(struct file *f
bytes_read = 0;
}
- pagevec_lru_add(&lru_pvec);
+ pagevec_lru_add_file(&lru_pvec);
/* need to free smb_read_data buf before exit */
if (smb_read_data) {
Index: linux-2.6.26-rc2-mm1/fs/ntfs/file.c
===================================================================
--- linux-2.6.26-rc2-mm1.orig/fs/ntfs/file.c 2008-05-23 14:21:21.000000000 -0400
+++ linux-2.6.26-rc2-mm1/fs/ntfs/file.c 2008-05-23 14:21:34.000000000 -0400
@@ -439,7 +439,7 @@ static inline int __ntfs_grab_cache_page
pages[nr] = *cached_page;
page_cache_get(*cached_page);
if (unlikely(!pagevec_add(lru_pvec, *cached_page)))
- __pagevec_lru_add(lru_pvec);
+ __pagevec_lru_add_file(lru_pvec);
*cached_page = NULL;
}
index++;
@@ -2084,7 +2084,7 @@ err_out:
OSYNC_METADATA|OSYNC_DATA);
}
}
- pagevec_lru_add(&lru_pvec);
+ pagevec_lru_add_file(&lru_pvec);
ntfs_debug("Done. Returning %s (written 0x%lx, status %li).",
written ? "written" : "status", (unsigned long)written,
(long)status);
Index: linux-2.6.26-rc2-mm1/fs/nfs/dir.c
===================================================================
--- linux-2.6.26-rc2-mm1.orig/fs/nfs/dir.c 2008-05-23 14:21:21.000000000 -0400
+++ linux-2.6.26-rc2-mm1/fs/nfs/dir.c 2008-05-23 14:21:34.000000000 -0400
@@ -1524,7 +1524,7 @@ static int nfs_symlink(struct inode *dir
if (!add_to_page_cache(page, dentry->d_inode->i_mapping, 0,
GFP_KERNEL)) {
pagevec_add(&lru_pvec, page);
- pagevec_lru_add(&lru_pvec);
+ pagevec_lru_add_file(&lru_pvec);
SetPageUptodate(page);
unlock_page(page);
} else
Index: linux-2.6.26-rc2-mm1/fs/ramfs/file-nommu.c
===================================================================
--- linux-2.6.26-rc2-mm1.orig/fs/ramfs/file-nommu.c 2008-05-23 14:21:21.000000000 -0400
+++ linux-2.6.26-rc2-mm1/fs/ramfs/file-nommu.c 2008-05-23 14:21:34.000000000 -0400
@@ -111,12 +111,12 @@ static int ramfs_nommu_expand_for_mappin
goto add_error;
if (!pagevec_add(&lru_pvec, page))
- __pagevec_lru_add(&lru_pvec);
+ __pagevec_lru_add_file(&lru_pvec);
unlock_page(page);
}
- pagevec_lru_add(&lru_pvec);
+ pagevec_lru_add_file(&lru_pvec);
return 0;
fsize_exceeded:
Index: linux-2.6.26-rc2-mm1/drivers/base/node.c
===================================================================
--- linux-2.6.26-rc2-mm1.orig/drivers/base/node.c 2008-05-23 14:21:21.000000000 -0400
+++ linux-2.6.26-rc2-mm1/drivers/base/node.c 2008-05-23 14:21:34.000000000 -0400
@@ -58,34 +58,44 @@ static ssize_t node_read_meminfo(struct
si_meminfo_node(&i, nid);
n = sprintf(buf, "\n"
- "Node %d MemTotal: %8lu kB\n"
- "Node %d MemFree: %8lu kB\n"
- "Node %d MemUsed: %8lu kB\n"
- "Node %d Active: %8lu kB\n"
- "Node %d Inactive: %8lu kB\n"
+ "Node %d MemTotal: %8lu kB\n"
+ "Node %d MemFree: %8lu kB\n"
+ "Node %d MemUsed: %8lu kB\n"
+ "Node %d Active: %8lu kB\n"
+ "Node %d Inactive: %8lu kB\n"
+ "Node %d Active(anon): %8lu kB\n"
+ "Node %d Inactive(anon): %8lu kB\n"
+ "Node %d Active(file): %8lu kB\n"
+ "Node %d Inactive(file): %8lu kB\n"
#ifdef CONFIG_HIGHMEM
- "Node %d HighTotal: %8lu kB\n"
- "Node %d HighFree: %8lu kB\n"
- "Node %d LowTotal: %8lu kB\n"
- "Node %d LowFree: %8lu kB\n"
+ "Node %d HighTotal: %8lu kB\n"
+ "Node %d HighFree: %8lu kB\n"
+ "Node %d LowTotal: %8lu kB\n"
+ "Node %d LowFree: %8lu kB\n"
#endif
- "Node %d Dirty: %8lu kB\n"
- "Node %d Writeback: %8lu kB\n"
- "Node %d FilePages: %8lu kB\n"
- "Node %d Mapped: %8lu kB\n"
- "Node %d AnonPages: %8lu kB\n"
- "Node %d PageTables: %8lu kB\n"
- "Node %d NFS_Unstable: %8lu kB\n"
- "Node %d Bounce: %8lu kB\n"
- "Node %d WritebackTmp: %8lu kB\n"
- "Node %d Slab: %8lu kB\n"
- "Node %d SReclaimable: %8lu kB\n"
- "Node %d SUnreclaim: %8lu kB\n",
+ "Node %d Dirty: %8lu kB\n"
+ "Node %d Writeback: %8lu kB\n"
+ "Node %d FilePages: %8lu kB\n"
+ "Node %d Mapped: %8lu kB\n"
+ "Node %d AnonPages: %8lu kB\n"
+ "Node %d PageTables: %8lu kB\n"
+ "Node %d NFS_Unstable: %8lu kB\n"
+ "Node %d Bounce: %8lu kB\n"
+ "Node %d WritebackTmp: %8lu kB\n"
+ "Node %d Slab: %8lu kB\n"
+ "Node %d SReclaimable: %8lu kB\n"
+ "Node %d SUnreclaim: %8lu kB\n",
nid, K(i.totalram),
nid, K(i.freeram),
nid, K(i.totalram - i.freeram),
- nid, node_page_state(nid, NR_ACTIVE),
- nid, node_page_state(nid, NR_INACTIVE),
+ nid, node_page_state(nid, NR_ACTIVE_ANON) +
+ node_page_state(nid, NR_ACTIVE_FILE),
+ nid, node_page_state(nid, NR_INACTIVE_ANON) +
+ node_page_state(nid, NR_INACTIVE_FILE),
+ nid, node_page_state(nid, NR_ACTIVE_ANON),
+ nid, node_page_state(nid, NR_INACTIVE_ANON),
+ nid, node_page_state(nid, NR_ACTIVE_FILE),
+ nid, node_page_state(nid, NR_INACTIVE_FILE),
#ifdef CONFIG_HIGHMEM
nid, K(i.totalhigh),
nid, K(i.freehigh),
Index: linux-2.6.26-rc2-mm1/mm/memory.c
===================================================================
--- linux-2.6.26-rc2-mm1.orig/mm/memory.c 2008-05-23 14:21:34.000000000 -0400
+++ linux-2.6.26-rc2-mm1/mm/memory.c 2008-05-23 14:21:34.000000000 -0400
@@ -1766,7 +1766,7 @@ gotten:
set_pte_at(mm, address, page_table, entry);
update_mmu_cache(vma, address, entry);
SetPageSwapBacked(new_page);
- lru_cache_add_active(new_page);
+ lru_cache_add_active_anon(new_page);
page_add_new_anon_rmap(new_page, vma, address);
/* Free the old page.. */
@@ -2235,7 +2235,7 @@ static int do_anonymous_page(struct mm_s
goto release;
inc_mm_counter(mm, anon_rss);
SetPageSwapBacked(page);
- lru_cache_add_active(page);
+ lru_cache_add_active_anon(page);
page_add_new_anon_rmap(page, vma, address);
set_pte_at(mm, address, page_table, entry);
@@ -2377,7 +2377,7 @@ static int __do_fault(struct mm_struct *
if (anon) {
inc_mm_counter(mm, anon_rss);
SetPageSwapBacked(page);
- lru_cache_add_active(page);
+ lru_cache_add_active_anon(page);
page_add_new_anon_rmap(page, vma, address);
} else {
inc_mm_counter(mm, file_rss);
Index: linux-2.6.26-rc2-mm1/mm/page_alloc.c
===================================================================
--- linux-2.6.26-rc2-mm1.orig/mm/page_alloc.c 2008-05-23 14:21:34.000000000 -0400
+++ linux-2.6.26-rc2-mm1/mm/page_alloc.c 2008-05-23 14:21:34.000000000 -0400
@@ -1908,10 +1908,13 @@ void show_free_areas(void)
}
}
- printk("Active:%lu inactive:%lu dirty:%lu writeback:%lu unstable:%lu\n"
+ printk("Active_anon:%lu active_file:%lu inactive_anon%lu\n"
+ " inactive_file:%lu dirty:%lu writeback:%lu unstable:%lu\n"
" free:%lu slab:%lu mapped:%lu pagetables:%lu bounce:%lu\n",
- global_page_state(NR_ACTIVE),
- global_page_state(NR_INACTIVE),
+ global_page_state(NR_ACTIVE_ANON),
+ global_page_state(NR_ACTIVE_FILE),
+ global_page_state(NR_INACTIVE_ANON),
+ global_page_state(NR_INACTIVE_FILE),
global_page_state(NR_FILE_DIRTY),
global_page_state(NR_WRITEBACK),
global_page_state(NR_UNSTABLE_NFS),
@@ -1934,8 +1937,10 @@ void show_free_areas(void)
" min:%lukB"
" low:%lukB"
" high:%lukB"
- " active:%lukB"
- " inactive:%lukB"
+ " active_anon:%lukB"
+ " inactive_anon:%lukB"
+ " active_file:%lukB"
+ " inactive_file:%lukB"
" present:%lukB"
" pages_scanned:%lu"
" all_unreclaimable? %s"
@@ -1945,8 +1950,10 @@ void show_free_areas(void)
K(zone->pages_min),
K(zone->pages_low),
K(zone->pages_high),
- K(zone_page_state(zone, NR_ACTIVE)),
- K(zone_page_state(zone, NR_INACTIVE)),
+ K(zone_page_state(zone, NR_ACTIVE_ANON)),
+ K(zone_page_state(zone, NR_INACTIVE_ANON)),
+ K(zone_page_state(zone, NR_ACTIVE_FILE)),
+ K(zone_page_state(zone, NR_INACTIVE_FILE)),
K(zone->present_pages),
zone->pages_scanned,
(zone_is_all_unreclaimable(zone) ? "yes" : "no")
@@ -3503,6 +3510,9 @@ static void __paginginit free_area_init_
INIT_LIST_HEAD(&zone->list[l]);
zone->nr_scan[l] = 0;
}
+ zone->recent_rotated_anon = 0;
+ zone->recent_rotated_file = 0;
+	/* TODO: recent_scanned_* ??? */
zap_zone_vm_stats(zone);
zone->flags = 0;
if (!size)
Index: linux-2.6.26-rc2-mm1/mm/swap.c
===================================================================
--- linux-2.6.26-rc2-mm1.orig/mm/swap.c 2008-05-23 14:21:33.000000000 -0400
+++ linux-2.6.26-rc2-mm1/mm/swap.c 2008-05-23 14:21:34.000000000 -0400
@@ -108,6 +108,7 @@ enum lru_list page_lru(struct page *page
if (PageActive(page))
lru += LRU_ACTIVE;
+ lru += page_file_cache(page);
return lru;
}
@@ -133,7 +134,8 @@ static void pagevec_move_tail(struct pag
spin_lock(&zone->lru_lock);
}
if (PageLRU(page) && !PageActive(page)) {
- list_move_tail(&page->lru, &zone->list[LRU_INACTIVE]);
+ int lru = page_file_cache(page);
+ list_move_tail(&page->lru, &zone->list[lru]);
pgmoved++;
}
}
@@ -174,9 +176,13 @@ void activate_page(struct page *page)
spin_lock_irq(&zone->lru_lock);
if (PageLRU(page) && !PageActive(page)) {
- del_page_from_inactive_list(zone, page);
+ int lru = LRU_BASE;
+ lru += page_file_cache(page);
+ del_page_from_lru_list(zone, page, lru);
+
SetPageActive(page);
- add_page_to_active_list(zone, page);
+ lru += LRU_ACTIVE;
+ add_page_to_lru_list(zone, page, lru);
__count_vm_event(PGACTIVATE);
mem_cgroup_move_lists(page, true);
}
Index: linux-2.6.26-rc2-mm1/mm/readahead.c
===================================================================
--- linux-2.6.26-rc2-mm1.orig/mm/readahead.c 2008-05-23 14:21:21.000000000 -0400
+++ linux-2.6.26-rc2-mm1/mm/readahead.c 2008-05-23 14:21:34.000000000 -0400
@@ -229,7 +229,7 @@ int do_page_cache_readahead(struct addre
*/
unsigned long max_sane_readahead(unsigned long nr)
{
- return min(nr, (node_page_state(numa_node_id(), NR_INACTIVE)
+ return min(nr, (node_page_state(numa_node_id(), NR_INACTIVE_FILE)
+ node_page_state(numa_node_id(), NR_FREE_PAGES)) / 2);
}
Index: linux-2.6.26-rc2-mm1/mm/filemap.c
===================================================================
--- linux-2.6.26-rc2-mm1.orig/mm/filemap.c 2008-05-23 14:21:21.000000000 -0400
+++ linux-2.6.26-rc2-mm1/mm/filemap.c 2008-05-23 14:21:34.000000000 -0400
@@ -33,6 +33,7 @@
#include <linux/cpuset.h>
#include <linux/hardirq.h> /* for BUG_ON(!in_atomic()) only */
#include <linux/memcontrol.h>
+#include <linux/mm_inline.h> /* for page_file_cache() */
#include "internal.h"
/*
@@ -489,8 +490,12 @@ int add_to_page_cache_lru(struct page *p
pgoff_t offset, gfp_t gfp_mask)
{
int ret = add_to_page_cache(page, mapping, offset, gfp_mask);
- if (ret == 0)
- lru_cache_add(page);
+ if (ret == 0) {
+ if (page_file_cache(page))
+ lru_cache_add_file(page);
+ else
+ lru_cache_add_active_anon(page);
+ }
return ret;
}
Index: linux-2.6.26-rc2-mm1/mm/vmstat.c
===================================================================
--- linux-2.6.26-rc2-mm1.orig/mm/vmstat.c 2008-05-23 14:21:33.000000000 -0400
+++ linux-2.6.26-rc2-mm1/mm/vmstat.c 2008-05-23 14:21:34.000000000 -0400
@@ -695,8 +695,10 @@ const struct seq_operations pagetypeinfo
static const char * const vmstat_text[] = {
/* Zoned VM counters */
"nr_free_pages",
- "nr_inactive",
- "nr_active",
+ "nr_inactive_anon",
+ "nr_active_anon",
+ "nr_inactive_file",
+ "nr_active_file",
"nr_anon_pages",
"nr_mapped",
"nr_file_pages",
@@ -764,7 +766,7 @@ static void zoneinfo_show_print(struct s
"\n min %lu"
"\n low %lu"
"\n high %lu"
- "\n scanned %lu (a: %lu i: %lu)"
+ "\n scanned %lu (aa: %lu ia: %lu af: %lu if: %lu)"
"\n spanned %lu"
"\n present %lu",
zone_page_state(zone, NR_FREE_PAGES),
@@ -772,8 +774,10 @@ static void zoneinfo_show_print(struct s
zone->pages_low,
zone->pages_high,
zone->pages_scanned,
- zone->nr_scan[LRU_ACTIVE],
- zone->nr_scan[LRU_INACTIVE],
+ zone->nr_scan[LRU_ACTIVE_ANON],
+ zone->nr_scan[LRU_INACTIVE_ANON],
+ zone->nr_scan[LRU_ACTIVE_FILE],
+ zone->nr_scan[LRU_INACTIVE_FILE],
zone->spanned_pages,
zone->present_pages);
Index: linux-2.6.26-rc2-mm1/mm/vmscan.c
===================================================================
--- linux-2.6.26-rc2-mm1.orig/mm/vmscan.c 2008-05-23 14:21:33.000000000 -0400
+++ linux-2.6.26-rc2-mm1/mm/vmscan.c 2008-05-23 14:21:34.000000000 -0400
@@ -78,7 +78,7 @@ struct scan_control {
unsigned long (*isolate_pages)(unsigned long nr, struct list_head *dst,
unsigned long *scanned, int order, int mode,
struct zone *z, struct mem_cgroup *mem_cont,
- int active);
+ int active, int file);
};
#define lru_to_page(_head) (list_entry((_head)->prev, struct page, lru))
@@ -236,27 +236,6 @@ unsigned long shrink_slab(unsigned long
return ret;
}
-/* Called without lock on whether page is mapped, so answer is unstable */
-static inline int page_mapping_inuse(struct page *page)
-{
- struct address_space *mapping;
-
- /* Page is in somebody's page tables. */
- if (page_mapped(page))
- return 1;
-
- /* Be more reluctant to reclaim swapcache than pagecache */
- if (PageSwapCache(page))
- return 1;
-
- mapping = page_mapping(page);
- if (!mapping)
- return 0;
-
- /* File is mmap'd by somebody? */
- return mapping_mapped(mapping);
-}
-
static inline int is_page_cache_freeable(struct page *page)
{
return page_count(page) - !!PagePrivate(page) == 2;
@@ -518,8 +497,7 @@ static unsigned long shrink_page_list(st
referenced = page_referenced(page, 1, sc->mem_cgroup);
/* In active use or really unfreeable? Activate it. */
- if (sc->order <= PAGE_ALLOC_COSTLY_ORDER &&
- referenced && page_mapping_inuse(page))
+ if (sc->order <= PAGE_ALLOC_COSTLY_ORDER && referenced)
goto activate_locked;
#ifdef CONFIG_SWAP
@@ -550,8 +528,6 @@ static unsigned long shrink_page_list(st
}
if (PageDirty(page)) {
- if (sc->order <= PAGE_ALLOC_COSTLY_ORDER && referenced)
- goto keep_locked;
if (!may_enter_fs)
goto keep_locked;
if (!sc->may_writepage)
@@ -652,7 +628,7 @@ keep:
*
* returns 0 on success, -ve errno on failure.
*/
-int __isolate_lru_page(struct page *page, int mode)
+int __isolate_lru_page(struct page *page, int mode, int file)
{
int ret = -EINVAL;
@@ -668,6 +644,9 @@ int __isolate_lru_page(struct page *page
if (mode != ISOLATE_BOTH && (!PageActive(page) != !mode))
return ret;
+ if (mode != ISOLATE_BOTH && (!page_file_cache(page) != !file))
+ return ret;
+
ret = -EBUSY;
if (likely(get_page_unless_zero(page))) {
/*
@@ -698,12 +677,13 @@ int __isolate_lru_page(struct page *page
* @scanned: The number of pages that were scanned.
* @order: The caller's attempted allocation order
* @mode: One of the LRU isolation modes
+ * @file: True [1] if isolating file [!anon] pages
*
* returns how many pages were moved onto *@dst.
*/
static unsigned long isolate_lru_pages(unsigned long nr_to_scan,
struct list_head *src, struct list_head *dst,
- unsigned long *scanned, int order, int mode)
+ unsigned long *scanned, int order, int mode, int file)
{
unsigned long nr_taken = 0;
unsigned long scan;
@@ -720,7 +700,7 @@ static unsigned long isolate_lru_pages(u
VM_BUG_ON(!PageLRU(page));
- switch (__isolate_lru_page(page, mode)) {
+ switch (__isolate_lru_page(page, mode, file)) {
case 0:
list_move(&page->lru, dst);
nr_taken++;
@@ -763,10 +743,11 @@ static unsigned long isolate_lru_pages(u
break;
cursor_page = pfn_to_page(pfn);
+
/* Check that we have not crossed a zone boundary. */
if (unlikely(page_zone_id(cursor_page) != zone_id))
continue;
- switch (__isolate_lru_page(cursor_page, mode)) {
+ switch (__isolate_lru_page(cursor_page, mode, file)) {
case 0:
list_move(&cursor_page->lru, dst);
nr_taken++;
@@ -791,30 +772,37 @@ static unsigned long isolate_pages_globa
unsigned long *scanned, int order,
int mode, struct zone *z,
struct mem_cgroup *mem_cont,
- int active)
+ int active, int file)
{
+ int lru = LRU_BASE;
if (active)
- return isolate_lru_pages(nr, &z->list[LRU_ACTIVE], dst,
- scanned, order, mode);
- else
- return isolate_lru_pages(nr, &z->list[LRU_INACTIVE], dst,
- scanned, order, mode);
+ lru += LRU_ACTIVE;
+ if (file)
+ lru += LRU_FILE;
+ return isolate_lru_pages(nr, &z->list[lru], dst, scanned, order,
+ mode, !!file);
}
/*
* clear_active_flags() is a helper for shrink_active_list(), clearing
* any active bits from the pages in the list.
*/
-static unsigned long clear_active_flags(struct list_head *page_list)
+static unsigned long clear_active_flags(struct list_head *page_list,
+ unsigned int *count)
{
int nr_active = 0;
+ int lru;
struct page *page;
- list_for_each_entry(page, page_list, lru)
+ list_for_each_entry(page, page_list, lru) {
+ lru = page_file_cache(page);
if (PageActive(page)) {
+ lru += LRU_ACTIVE;
ClearPageActive(page);
nr_active++;
}
+ count[lru]++;
+ }
return nr_active;
}
@@ -852,12 +840,12 @@ int isolate_lru_page(struct page *page)
spin_lock_irq(&zone->lru_lock);
if (PageLRU(page) && get_page_unless_zero(page)) {
+ int lru = LRU_BASE;
ret = 0;
ClearPageLRU(page);
- if (PageActive(page))
- del_page_from_active_list(zone, page);
- else
- del_page_from_inactive_list(zone, page);
+
+ lru += page_file_cache(page) + !!PageActive(page);
+ del_page_from_lru_list(zone, page, lru);
}
spin_unlock_irq(&zone->lru_lock);
}
@@ -869,7 +857,7 @@ int isolate_lru_page(struct page *page)
* of reclaimed pages
*/
static unsigned long shrink_inactive_list(unsigned long max_scan,
- struct zone *zone, struct scan_control *sc)
+ struct zone *zone, struct scan_control *sc, int file)
{
LIST_HEAD(page_list);
struct pagevec pvec;
@@ -886,18 +874,25 @@ static unsigned long shrink_inactive_lis
unsigned long nr_scan;
unsigned long nr_freed;
unsigned long nr_active;
+ unsigned int count[NR_LRU_LISTS] = { 0, };
+ int mode = (sc->order > PAGE_ALLOC_COSTLY_ORDER) ?
+ ISOLATE_BOTH : ISOLATE_INACTIVE;
nr_taken = sc->isolate_pages(sc->swap_cluster_max,
- &page_list, &nr_scan, sc->order,
- (sc->order > PAGE_ALLOC_COSTLY_ORDER)?
- ISOLATE_BOTH : ISOLATE_INACTIVE,
- zone, sc->mem_cgroup, 0);
- nr_active = clear_active_flags(&page_list);
+ &page_list, &nr_scan, sc->order, mode,
+ zone, sc->mem_cgroup, 0, file);
+ nr_active = clear_active_flags(&page_list, count);
__count_vm_events(PGDEACTIVATE, nr_active);
- __mod_zone_page_state(zone, NR_ACTIVE, -nr_active);
- __mod_zone_page_state(zone, NR_INACTIVE,
- -(nr_taken - nr_active));
+ __mod_zone_page_state(zone, NR_ACTIVE_FILE,
+ -count[LRU_ACTIVE_FILE]);
+ __mod_zone_page_state(zone, NR_INACTIVE_FILE,
+ -count[LRU_INACTIVE_FILE]);
+ __mod_zone_page_state(zone, NR_ACTIVE_ANON,
+ -count[LRU_ACTIVE_ANON]);
+ __mod_zone_page_state(zone, NR_INACTIVE_ANON,
+ -count[LRU_INACTIVE_ANON]);
+
if (scan_global_lru(sc))
zone->pages_scanned += nr_scan;
spin_unlock_irq(&zone->lru_lock);
@@ -919,7 +914,7 @@ static unsigned long shrink_inactive_lis
* The attempt at page out may have made some
* of the pages active, mark them inactive again.
*/
- nr_active = clear_active_flags(&page_list);
+ nr_active = clear_active_flags(&page_list, count);
count_vm_events(PGDEACTIVATE, nr_active);
nr_freed += shrink_page_list(&page_list, sc,
@@ -944,11 +939,20 @@ static unsigned long shrink_inactive_lis
* Put back any unfreeable pages.
*/
while (!list_empty(&page_list)) {
+ int lru = LRU_BASE;
page = lru_to_page(&page_list);
VM_BUG_ON(PageLRU(page));
SetPageLRU(page);
list_del(&page->lru);
- add_page_to_lru_list(zone, page, PageActive(page));
+ if (page_file_cache(page)) {
+ lru += LRU_FILE;
+ zone->recent_rotated_file++;
+ } else {
+ zone->recent_rotated_anon++;
+ }
+ if (PageActive(page))
+ lru += LRU_ACTIVE;
+ add_page_to_lru_list(zone, page, lru);
if (!pagevec_add(&pvec, page)) {
spin_unlock_irq(&zone->lru_lock);
__pagevec_release(&pvec);
@@ -979,115 +983,7 @@ static inline void note_zone_scanning_pr
static inline int zone_is_near_oom(struct zone *zone)
{
- return zone->pages_scanned >= (zone_page_state(zone, NR_ACTIVE)
- + zone_page_state(zone, NR_INACTIVE))*3;
-}
-
-/*
- * Determine we should try to reclaim mapped pages.
- * This is called only when sc->mem_cgroup is NULL.
- */
-static int calc_reclaim_mapped(struct scan_control *sc, struct zone *zone,
- int priority)
-{
- long mapped_ratio;
- long distress;
- long swap_tendency;
- long imbalance;
- int reclaim_mapped = 0;
- int prev_priority;
-
- if (scan_global_lru(sc) && zone_is_near_oom(zone))
- return 1;
- /*
- * `distress' is a measure of how much trouble we're having
- * reclaiming pages. 0 -> no problems. 100 -> great trouble.
- */
- if (scan_global_lru(sc))
- prev_priority = zone->prev_priority;
- else
- prev_priority = mem_cgroup_get_reclaim_priority(sc->mem_cgroup);
-
- distress = 100 >> min(prev_priority, priority);
-
- /*
- * The point of this algorithm is to decide when to start
- * reclaiming mapped memory instead of just pagecache. Work out
- * how much memory
- * is mapped.
- */
- if (scan_global_lru(sc))
- mapped_ratio = ((global_page_state(NR_FILE_MAPPED) +
- global_page_state(NR_ANON_PAGES)) * 100) /
- vm_total_pages;
- else
- mapped_ratio = mem_cgroup_calc_mapped_ratio(sc->mem_cgroup);
-
- /*
- * Now decide how much we really want to unmap some pages. The
- * mapped ratio is downgraded - just because there's a lot of
- * mapped memory doesn't necessarily mean that page reclaim
- * isn't succeeding.
- *
- * The distress ratio is important - we don't want to start
- * going oom.
- *
- * A 100% value of vm_swappiness overrides this algorithm
- * altogether.
- */
- swap_tendency = mapped_ratio / 2 + distress + sc->swappiness;
-
- /*
- * If there's huge imbalance between active and inactive
- * (think active 100 times larger than inactive) we should
- * become more permissive, or the system will take too much
- * cpu before it start swapping during memory pressure.
- * Distress is about avoiding early-oom, this is about
- * making swappiness graceful despite setting it to low
- * values.
- *
- * Avoid div by zero with nr_inactive+1, and max resulting
- * value is vm_total_pages.
- */
- if (scan_global_lru(sc)) {
- imbalance = zone_page_state(zone, NR_ACTIVE);
- imbalance /= zone_page_state(zone, NR_INACTIVE) + 1;
- } else
- imbalance = mem_cgroup_reclaim_imbalance(sc->mem_cgroup);
-
- /*
- * Reduce the effect of imbalance if swappiness is low,
- * this means for a swappiness very low, the imbalance
- * must be much higher than 100 for this logic to make
- * the difference.
- *
- * Max temporary value is vm_total_pages*100.
- */
- imbalance *= (vm_swappiness + 1);
- imbalance /= 100;
-
- /*
- * If not much of the ram is mapped, makes the imbalance
- * less relevant, it's high priority we refill the inactive
- * list with mapped pages only in presence of high ratio of
- * mapped pages.
- *
- * Max temporary value is vm_total_pages*100.
- */
- imbalance *= mapped_ratio;
- imbalance /= 100;
-
- /* apply imbalance feedback to swap_tendency */
- swap_tendency += imbalance;
-
- /*
- * Now use this metric to decide whether to start moving mapped
- * memory onto the inactive list.
- */
- if (swap_tendency >= 100)
- reclaim_mapped = 1;
-
- return reclaim_mapped;
+ return zone->pages_scanned >= (zone_lru_pages(zone) * 3);
}
/*
@@ -1110,7 +1006,7 @@ static int calc_reclaim_mapped(struct sc
static void shrink_active_list(unsigned long nr_pages, struct zone *zone,
- struct scan_control *sc, int priority)
+ struct scan_control *sc, int priority, int file)
{
unsigned long pgmoved;
int pgdeactivate = 0;
@@ -1120,16 +1016,13 @@ static void shrink_active_list(unsigned
LIST_HEAD(l_inactive);
struct page *page;
struct pagevec pvec;
- int reclaim_mapped = 0;
-
- if (sc->may_swap)
- reclaim_mapped = calc_reclaim_mapped(sc, zone, priority);
+ enum lru_list lru;
lru_add_drain();
spin_lock_irq(&zone->lru_lock);
pgmoved = sc->isolate_pages(nr_pages, &l_hold, &pgscanned, sc->order,
ISOLATE_ACTIVE, zone,
- sc->mem_cgroup, 1);
+ sc->mem_cgroup, 1, file);
/*
* zone->pages_scanned is used for detect zone's oom
* mem_cgroup remembers nr_scan by itself.
@@ -1137,29 +1030,29 @@ static void shrink_active_list(unsigned
if (scan_global_lru(sc))
zone->pages_scanned += pgscanned;
- __mod_zone_page_state(zone, NR_ACTIVE, -pgmoved);
+ if (file)
+ __mod_zone_page_state(zone, NR_ACTIVE_FILE, -pgmoved);
+ else
+ __mod_zone_page_state(zone, NR_ACTIVE_ANON, -pgmoved);
spin_unlock_irq(&zone->lru_lock);
while (!list_empty(&l_hold)) {
cond_resched();
page = lru_to_page(&l_hold);
list_del(&page->lru);
- if (page_mapped(page)) {
- if (!reclaim_mapped ||
- (total_swap_pages == 0 && PageAnon(page)) ||
- page_referenced(page, 0, sc->mem_cgroup)) {
- list_add(&page->lru, &l_active);
- continue;
- }
- } else if (TestClearPageReferenced(page)) {
+ if (page_referenced(page, 0, sc->mem_cgroup))
list_add(&page->lru, &l_active);
- continue;
- }
- list_add(&page->lru, &l_inactive);
+ else
+ list_add(&page->lru, &l_inactive);
}
+ /*
+ * Now put the pages back on the appropriate [file or anon] inactive
+ * and active lists.
+ */
pagevec_init(&pvec, 1);
pgmoved = 0;
+ lru = LRU_BASE + file * LRU_FILE;
spin_lock_irq(&zone->lru_lock);
while (!list_empty(&l_inactive)) {
page = lru_to_page(&l_inactive);
@@ -1169,11 +1062,12 @@ static void shrink_active_list(unsigned
VM_BUG_ON(!PageActive(page));
ClearPageActive(page);
- list_move(&page->lru, &zone->list[LRU_INACTIVE]);
+ list_move(&page->lru, &zone->list[lru]);
mem_cgroup_move_lists(page, false);
pgmoved++;
if (!pagevec_add(&pvec, page)) {
- __mod_zone_page_state(zone, NR_INACTIVE, pgmoved);
+ __mod_zone_page_state(zone, NR_INACTIVE_ANON + lru,
+ pgmoved);
spin_unlock_irq(&zone->lru_lock);
pgdeactivate += pgmoved;
pgmoved = 0;
@@ -1183,7 +1077,7 @@ static void shrink_active_list(unsigned
spin_lock_irq(&zone->lru_lock);
}
}
- __mod_zone_page_state(zone, NR_INACTIVE, pgmoved);
+ __mod_zone_page_state(zone, NR_INACTIVE_ANON + lru, pgmoved);
pgdeactivate += pgmoved;
if (buffer_heads_over_limit) {
spin_unlock_irq(&zone->lru_lock);
@@ -1192,6 +1086,7 @@ static void shrink_active_list(unsigned
}
pgmoved = 0;
+ lru = LRU_ACTIVE + file * LRU_FILE;
while (!list_empty(&l_active)) {
page = lru_to_page(&l_active);
prefetchw_prev_lru_page(page, &l_active, flags);
@@ -1199,11 +1094,12 @@ static void shrink_active_list(unsigned
SetPageLRU(page);
VM_BUG_ON(!PageActive(page));
- list_move(&page->lru, &zone->list[LRU_ACTIVE]);
+ list_move(&page->lru, &zone->list[lru]);
mem_cgroup_move_lists(page, true);
pgmoved++;
if (!pagevec_add(&pvec, page)) {
- __mod_zone_page_state(zone, NR_ACTIVE, pgmoved);
+ __mod_zone_page_state(zone, NR_INACTIVE_ANON + lru,
+ pgmoved);
pgmoved = 0;
spin_unlock_irq(&zone->lru_lock);
if (vm_swap_full())
@@ -1212,7 +1108,12 @@ static void shrink_active_list(unsigned
spin_lock_irq(&zone->lru_lock);
}
}
- __mod_zone_page_state(zone, NR_ACTIVE, pgmoved);
+ __mod_zone_page_state(zone, NR_INACTIVE_ANON + lru, pgmoved);
+ if (file) {
+ zone->recent_rotated_file += pgmoved;
+ } else {
+ zone->recent_rotated_anon += pgmoved;
+ }
__count_zone_vm_events(PGREFILL, zone, pgscanned);
__count_vm_events(PGDEACTIVATE, pgdeactivate);
@@ -1223,16 +1124,82 @@ static void shrink_active_list(unsigned
pagevec_release(&pvec);
}
-static unsigned long shrink_list(enum lru_list l, unsigned long nr_to_scan,
+static unsigned long shrink_list(enum lru_list lru, unsigned long nr_to_scan,
struct zone *zone, struct scan_control *sc, int priority)
{
- if (l == LRU_ACTIVE) {
- shrink_active_list(nr_to_scan, zone, sc, priority);
+ int file = is_file_lru(lru);
+
+ if (lru == LRU_ACTIVE_ANON || lru == LRU_ACTIVE_FILE) {
+ shrink_active_list(nr_to_scan, zone, sc, priority, file);
return 0;
}
- return shrink_inactive_list(nr_to_scan, zone, sc);
+ return shrink_inactive_list(nr_to_scan, zone, sc, file);
+}
+
+/*
+ * The utility of the anon and file memory corresponds to the fraction
+ * of pages that were recently referenced in each category. Pageout
+ * pressure is distributed according to the size of each set, the fraction
+ * of recently referenced pages (except used-once file pages) and the
+ * swappiness parameter.
+ *
+ * We return the relative pressures as percentages so shrink_zone can
+ * easily use them.
+ */
+static void get_scan_ratio(struct zone *zone, struct scan_control *sc,
+ unsigned long *percent)
+{
+ unsigned long anon, file;
+ unsigned long anon_prio, file_prio;
+ unsigned long rotate_sum;
+ unsigned long ap, fp;
+
+ anon = zone_page_state(zone, NR_ACTIVE_ANON) +
+ zone_page_state(zone, NR_INACTIVE_ANON);
+ file = zone_page_state(zone, NR_ACTIVE_FILE) +
+ zone_page_state(zone, NR_INACTIVE_FILE);
+
+ rotate_sum = zone->recent_rotated_file + zone->recent_rotated_anon;
+
+ /* Keep a floating average of RECENT references. */
+ if (unlikely(rotate_sum > min(anon, file))) {
+ spin_lock_irq(&zone->lru_lock);
+ zone->recent_rotated_file /= 2;
+ zone->recent_rotated_anon /= 2;
+ spin_unlock_irq(&zone->lru_lock);
+ rotate_sum /= 2;
+ }
+
+ /*
+ * With swappiness at 100, anonymous and file have the same priority.
+ * This scanning priority is essentially the inverse of IO cost.
+ */
+ anon_prio = sc->swappiness;
+ file_prio = 200 - sc->swappiness;
+
+ /*
+ *                  anon       recent_rotated_anon
+ * %anon = 100 * ----------- / ------------------- * IO cost
+ *               anon + file       rotate_sum
+ */
+ ap = (anon_prio * anon) / (anon + file + 1);
+ ap *= rotate_sum / (zone->recent_rotated_anon + 1);
+ if (ap == 0)
+ ap = 1;
+ else if (ap > 100)
+ ap = 100;
+ percent[0] = ap;
+
+ fp = (file_prio * file) / (anon + file + 1);
+ fp *= rotate_sum / (zone->recent_rotated_file + 1);
+ if (fp == 0)
+ fp = 1;
+ else if (fp > 100)
+ fp = 100;
+ percent[1] = fp;
}
+
/*
* This is a basic per-zone page freer. Used by both kswapd and direct reclaim.
*/
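To make get_scan_ratio()'s integer arithmetic concrete, here is a
standalone user-space rework with invented numbers (1000 anon pages,
3000 file pages, an assumed swappiness of 20, equal recent rotation on
both sets); illustrative only, not part of the patch:

	#include <stdio.h>

	int main(void)
	{
		unsigned long anon = 1000, file = 3000;
		unsigned long recent_rotated_anon = 50, recent_rotated_file = 50;
		unsigned long swappiness = 20;

		unsigned long anon_prio = swappiness;
		unsigned long file_prio = 200 - swappiness;
		unsigned long rotate_sum = recent_rotated_anon + recent_rotated_file;

		/* ap = (20 * 1000) / 4001 = 4, then *= 100 / 51 == 1 */
		unsigned long ap = (anon_prio * anon) / (anon + file + 1);
		ap *= rotate_sum / (recent_rotated_anon + 1);
		if (ap == 0)
			ap = 1;
		else if (ap > 100)
			ap = 100;

		/* fp = (180 * 3000) / 4001 = 134, clamped to 100 */
		unsigned long fp = (file_prio * file) / (anon + file + 1);
		fp *= rotate_sum / (recent_rotated_file + 1);
		if (fp == 0)
			fp = 1;
		else if (fp > 100)
			fp = 100;

		/* prints "anon 4% file 100%" */
		printf("anon %lu%% file %lu%%\n", ap, fp);
		return 0;
	}

shrink_zone() below then scales each list's scan target by these
percentages, so with a low swappiness the anon lists are barely scanned
until file pages start rotating much faster than file pages are being
added.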
@@ -1242,36 +1209,38 @@ static unsigned long shrink_zone(int pri
unsigned long nr[NR_LRU_LISTS];
unsigned long nr_to_scan;
unsigned long nr_reclaimed = 0;
+ unsigned long percent[2]; /* anon @ 0; file @ 1 */
enum lru_list l;
- if (scan_global_lru(sc)) {
- /*
- * Add one to nr_to_scan just to make sure that the kernel
- * will slowly sift through the active list.
- */
- for_each_lru(l) {
+ get_scan_ratio(zone, sc, percent);
+
+ for_each_lru(l) {
+ if (scan_global_lru(sc)) {
+ int file = is_file_lru(l);
+ /*
+ * Add one to nr_to_scan just to make sure that the
+ * kernel will slowly sift through the active list.
+ */
zone->nr_scan[l] += (zone_page_state(zone,
- NR_INACTIVE + l) >> priority) + 1;
- nr[l] = zone->nr_scan[l];
+ NR_INACTIVE_ANON + l) >> priority) + 1;
+ nr[l] = zone->nr_scan[l] * percent[file] / 100;
if (nr[l] >= sc->swap_cluster_max)
zone->nr_scan[l] = 0;
else
nr[l] = 0;
+ } else {
+ /*
+ * This reclaim occurs not because zone memory shortage
+ * but because memory controller hits its limit.
+ * Then, don't modify zone reclaim related data.
+ */
+ nr[l] = mem_cgroup_calc_reclaim(sc->mem_cgroup, zone,
+ priority, l);
}
- } else {
- /*
- * This reclaim occurs not because zone memory shortage but
- * because memory controller hits its limit.
- * Then, don't modify zone reclaim related data.
- */
- nr[LRU_ACTIVE] = mem_cgroup_calc_reclaim(sc->mem_cgroup,
- zone, priority, LRU_ACTIVE);
-
- nr[LRU_INACTIVE] = mem_cgroup_calc_reclaim(sc->mem_cgroup,
- zone, priority, LRU_INACTIVE);
}
- while (nr[LRU_ACTIVE] || nr[LRU_INACTIVE]) {
+ while (nr[LRU_ACTIVE_ANON] || nr[LRU_INACTIVE_ANON] ||
+ nr[LRU_ACTIVE_FILE] || nr[LRU_INACTIVE_FILE]) {
for_each_lru(l) {
if (nr[l]) {
nr_to_scan = min(nr[l],
@@ -1344,7 +1313,7 @@ static unsigned long shrink_zones(int pr
return nr_reclaimed;
}
-
+
/*
* This is the main entry point to direct page reclaim.
*
@@ -1385,8 +1354,7 @@ static unsigned long do_try_to_free_page
if (!cpuset_zone_allowed_hardwall(zone, GFP_KERNEL))
continue;
- lru_pages += zone_page_state(zone, NR_ACTIVE)
- + zone_page_state(zone, NR_INACTIVE);
+ lru_pages += zone_lru_pages(zone);
}
}
@@ -1586,8 +1554,7 @@ loop_again:
for (i = 0; i <= end_zone; i++) {
struct zone *zone = pgdat->node_zones + i;
- lru_pages += zone_page_state(zone, NR_ACTIVE)
- + zone_page_state(zone, NR_INACTIVE);
+ lru_pages += zone_lru_pages(zone);
}
/*
@@ -1631,8 +1598,7 @@ loop_again:
if (zone_is_all_unreclaimable(zone))
continue;
if (nr_slab == 0 && zone->pages_scanned >=
- (zone_page_state(zone, NR_ACTIVE)
- + zone_page_state(zone, NR_INACTIVE)) * 6)
+ (zone_lru_pages(zone) * 6))
zone_set_flag(zone,
ZONE_ALL_UNRECLAIMABLE);
/*
@@ -1686,7 +1652,7 @@ out:
/*
* The background pageout daemon, started as a kernel thread
- * from the init process.
+ * from the init process.
*
* This basically trickles out pages so that we have _some_
* free memory available even if there is no other activity
@@ -1780,6 +1746,14 @@ void wakeup_kswapd(struct zone *zone, in
wake_up_interruptible(&pgdat->kswapd_wait);
}
+unsigned long global_lru_pages(void)
+{
+ return global_page_state(NR_ACTIVE_ANON)
+ + global_page_state(NR_ACTIVE_FILE)
+ + global_page_state(NR_INACTIVE_ANON)
+ + global_page_state(NR_INACTIVE_FILE);
+}
+
#ifdef CONFIG_PM
/*
* Helper function for shrink_all_memory(). Tries to reclaim 'nr_pages' pages
@@ -1805,17 +1779,18 @@ static unsigned long shrink_all_zones(un
for_each_lru(l) {
/* For pass = 0 we don't shrink the active list */
- if (pass == 0 && l == LRU_ACTIVE)
+ if (pass == 0 &&
+ (l == LRU_ACTIVE_ANON || l == LRU_ACTIVE_FILE))
continue;
zone->nr_scan[l] +=
- (zone_page_state(zone, NR_INACTIVE + l)
+ (zone_page_state(zone, NR_INACTIVE_ANON + l)
>> prio) + 1;
if (zone->nr_scan[l] >= nr_pages || pass > 3) {
zone->nr_scan[l] = 0;
nr_to_scan = min(nr_pages,
zone_page_state(zone,
- NR_INACTIVE + l));
+ NR_INACTIVE_ANON + l));
ret += shrink_list(l, nr_to_scan, zone,
sc, prio);
if (ret >= nr_pages)
@@ -1827,11 +1802,6 @@ static unsigned long shrink_all_zones(un
return ret;
}
-static unsigned long count_lru_pages(void)
-{
- return global_page_state(NR_ACTIVE) + global_page_state(NR_INACTIVE);
-}
-
/*
* Try to free `nr_pages' of memory, system-wide, and return the number of
* freed pages.
@@ -1857,7 +1827,7 @@ unsigned long shrink_all_memory(unsigned
current->reclaim_state = &reclaim_state;
- lru_pages = count_lru_pages();
+ lru_pages = global_lru_pages();
nr_slab = global_page_state(NR_SLAB_RECLAIMABLE);
/* If slab caches are huge, it's better to hit them first */
while (nr_slab >= lru_pages) {
@@ -1900,7 +1870,7 @@ unsigned long shrink_all_memory(unsigned
reclaim_state.reclaimed_slab = 0;
shrink_slab(sc.nr_scanned, sc.gfp_mask,
- count_lru_pages());
+ global_lru_pages());
ret += reclaim_state.reclaimed_slab;
if (ret >= nr_pages)
goto out;
@@ -1917,7 +1887,7 @@ unsigned long shrink_all_memory(unsigned
if (!ret) {
do {
reclaim_state.reclaimed_slab = 0;
- shrink_slab(nr_pages, sc.gfp_mask, count_lru_pages());
+ shrink_slab(nr_pages, sc.gfp_mask, global_lru_pages());
ret += reclaim_state.reclaimed_slab;
} while (ret < nr_pages && reclaim_state.reclaimed_slab > 0);
}
Index: linux-2.6.26-rc2-mm1/mm/swap_state.c
===================================================================
--- linux-2.6.26-rc2-mm1.orig/mm/swap_state.c 2008-05-23 14:21:34.000000000 -0400
+++ linux-2.6.26-rc2-mm1/mm/swap_state.c 2008-05-23 14:21:34.000000000 -0400
@@ -302,7 +302,7 @@ struct page *read_swap_cache_async(swp_e
/*
* Initiate read into locked page and return.
*/
- lru_cache_add_active(new_page);
+ lru_cache_add_active_anon(new_page);
swap_readpage(NULL, new_page);
return new_page;
}
Index: linux-2.6.26-rc2-mm1/include/linux/mmzone.h
===================================================================
--- linux-2.6.26-rc2-mm1.orig/include/linux/mmzone.h 2008-05-23 14:21:33.000000000 -0400
+++ linux-2.6.26-rc2-mm1/include/linux/mmzone.h 2008-05-23 14:21:34.000000000 -0400
@@ -81,21 +81,23 @@ struct zone_padding {
enum zone_stat_item {
/* First 128 byte cacheline (assuming 64 bit words) */
NR_FREE_PAGES,
- NR_INACTIVE, /* must match order of LRU_[IN]ACTIVE */
- NR_ACTIVE, /* " " " " " */
+ NR_INACTIVE_ANON, /* must match order of LRU_[IN]ACTIVE_* */
+ NR_ACTIVE_ANON, /* " " " " " */
+ NR_INACTIVE_FILE, /* " " " " " */
+ NR_ACTIVE_FILE, /* " " " " " */
NR_ANON_PAGES, /* Mapped anonymous pages */
NR_FILE_MAPPED, /* pagecache pages mapped into pagetables.
only modified from process context */
NR_FILE_PAGES,
NR_FILE_DIRTY,
NR_WRITEBACK,
- /* Second 128 byte cacheline */
NR_SLAB_RECLAIMABLE,
NR_SLAB_UNRECLAIMABLE,
NR_PAGETABLE, /* used for pagetables */
NR_UNSTABLE_NFS, /* NFS unstable pages */
NR_BOUNCE,
NR_VMSCAN_WRITE,
+ /* Second 128 byte cacheline */
NR_WRITEBACK_TEMP, /* Writeback using temporary buffers */
#ifdef CONFIG_NUMA
NUMA_HIT, /* allocated in intended node */
@@ -107,18 +109,33 @@ enum zone_stat_item {
#endif
NR_VM_ZONE_STAT_ITEMS };
+/*
+ * We do arithmetic on the LRU lists in various places in the code,
+ * so it is important to keep the active lists LRU_ACTIVE higher in
+ * the array than the corresponding inactive lists, and to keep
+ * the *_FILE lists LRU_FILE higher than the corresponding _ANON lists.
+ */
#define LRU_BASE 0
+#define LRU_ACTIVE 1
+#define LRU_FILE 2
enum lru_list {
- LRU_INACTIVE = LRU_BASE, /* must match order of NR_[IN]ACTIVE */
- LRU_ACTIVE, /* " " " " " */
+ LRU_INACTIVE_ANON = LRU_BASE,
+ LRU_ACTIVE_ANON = LRU_BASE + LRU_ACTIVE,
+ LRU_INACTIVE_FILE = LRU_BASE + LRU_FILE,
+ LRU_ACTIVE_FILE = LRU_BASE + LRU_FILE + LRU_ACTIVE,
NR_LRU_LISTS };
#define for_each_lru(l) for (l = 0; l < NR_LRU_LISTS; l++)
+static inline int is_file_lru(enum lru_list l)
+{
+ return (l == LRU_INACTIVE_FILE || l == LRU_ACTIVE_FILE);
+}
+
static inline int is_active_lru(enum lru_list l)
{
- return (l == LRU_ACTIVE);
+ return (l == LRU_ACTIVE_ANON || l == LRU_ACTIVE_FILE);
}
enum lru_list page_lru(struct page *page);
@@ -269,6 +286,10 @@ struct zone {
spinlock_t lru_lock;
struct list_head list[NR_LRU_LISTS];
unsigned long nr_scan[NR_LRU_LISTS];
+
+ unsigned long recent_rotated_anon;
+ unsigned long recent_rotated_file;
+
unsigned long pages_scanned; /* since last reclaim */
unsigned long flags; /* zone flags, see below */
Index: linux-2.6.26-rc2-mm1/include/linux/mm_inline.h
===================================================================
--- linux-2.6.26-rc2-mm1.orig/include/linux/mm_inline.h 2008-05-23 14:21:34.000000000 -0400
+++ linux-2.6.26-rc2-mm1/include/linux/mm_inline.h 2008-05-23 14:21:34.000000000 -0400
@@ -5,7 +5,7 @@
* page_file_cache - should the page be on a file LRU or anon LRU?
* @page: the page to test
*
- * Returns !0 if @page is a page cache page backed by a regular filesystem,
+ * Returns LRU_FILE if @page is a page cache page backed by a regular filesystem,
* or 0 if @page is anonymous, tmpfs or otherwise RAM or swap backed.
*
* We would like to get this info without a page flag, but the state
@@ -18,58 +18,83 @@ static inline int page_file_cache(struct
return 0;
/* The page is page cache backed by a normal filesystem. */
- return 2;
+ return LRU_FILE;
}
static inline void
add_page_to_lru_list(struct zone *zone, struct page *page, enum lru_list l)
{
list_add(&page->lru, &zone->list[l]);
- __inc_zone_state(zone, NR_INACTIVE + l);
+ __inc_zone_state(zone, NR_INACTIVE_ANON + l);
}
static inline void
del_page_from_lru_list(struct zone *zone, struct page *page, enum lru_list l)
{
list_del(&page->lru);
- __dec_zone_state(zone, NR_INACTIVE + l);
+ __dec_zone_state(zone, NR_INACTIVE_ANON + l);
}
static inline void
-add_page_to_active_list(struct zone *zone, struct page *page)
+add_page_to_inactive_anon_list(struct zone *zone, struct page *page)
{
- add_page_to_lru_list(zone, page, LRU_ACTIVE);
+ add_page_to_lru_list(zone, page, LRU_INACTIVE_ANON);
}
static inline void
-add_page_to_inactive_list(struct zone *zone, struct page *page)
+add_page_to_active_anon_list(struct zone *zone, struct page *page)
{
- add_page_to_lru_list(zone, page, LRU_INACTIVE);
+ add_page_to_lru_list(zone, page, LRU_ACTIVE_ANON);
}
static inline void
-del_page_from_active_list(struct zone *zone, struct page *page)
+add_page_to_inactive_file_list(struct zone *zone, struct page *page)
{
- del_page_from_lru_list(zone, page, LRU_ACTIVE);
+ add_page_to_lru_list(zone, page, LRU_INACTIVE_FILE);
}
static inline void
-del_page_from_inactive_list(struct zone *zone, struct page *page)
+add_page_to_active_file_list(struct zone *zone, struct page *page)
{
- del_page_from_lru_list(zone, page, LRU_INACTIVE);
+ add_page_to_lru_list(zone, page, LRU_ACTIVE_FILE);
+}
+
+static inline void
+del_page_from_inactive_anon_list(struct zone *zone, struct page *page)
+{
+ del_page_from_lru_list(zone, page, LRU_INACTIVE_ANON);
+}
+
+static inline void
+del_page_from_active_anon_list(struct zone *zone, struct page *page)
+{
+ del_page_from_lru_list(zone, page, LRU_ACTIVE_ANON);
+}
+
+static inline void
+del_page_from_inactive_file_list(struct zone *zone, struct page *page)
+{
+ del_page_from_lru_list(zone, page, LRU_INACTIVE_FILE);
+}
+
+static inline void
+del_page_from_active_file_list(struct zone *zone, struct page *page)
+{
+	del_page_from_lru_list(zone, page, LRU_ACTIVE_FILE);
}
static inline void
del_page_from_lru(struct zone *zone, struct page *page)
{
- enum lru_list l = LRU_INACTIVE;
+ enum lru_list l = LRU_INACTIVE_ANON;
list_del(&page->lru);
if (PageActive(page)) {
__ClearPageActive(page);
- l = LRU_ACTIVE;
+ l += LRU_ACTIVE;
}
- __dec_zone_state(zone, NR_INACTIVE + l);
+ l += page_file_cache(page);
+ __dec_zone_state(zone, NR_INACTIVE_ANON + l);
}
#endif
Index: linux-2.6.26-rc2-mm1/include/linux/pagevec.h
===================================================================
--- linux-2.6.26-rc2-mm1.orig/include/linux/pagevec.h 2008-05-23 14:21:33.000000000 -0400
+++ linux-2.6.26-rc2-mm1/include/linux/pagevec.h 2008-05-23 14:21:34.000000000 -0400
@@ -81,20 +81,37 @@ static inline void pagevec_free(struct p
__pagevec_free(pvec);
}
-static inline void __pagevec_lru_add(struct pagevec *pvec)
+static inline void __pagevec_lru_add_anon(struct pagevec *pvec)
{
- ____pagevec_lru_add(pvec, LRU_INACTIVE);
+ ____pagevec_lru_add(pvec, LRU_INACTIVE_ANON);
}
-static inline void __pagevec_lru_add_active(struct pagevec *pvec)
+static inline void __pagevec_lru_add_active_anon(struct pagevec *pvec)
{
- ____pagevec_lru_add(pvec, LRU_ACTIVE);
+ ____pagevec_lru_add(pvec, LRU_ACTIVE_ANON);
}
-static inline void pagevec_lru_add(struct pagevec *pvec)
+static inline void __pagevec_lru_add_file(struct pagevec *pvec)
+{
+ ____pagevec_lru_add(pvec, LRU_INACTIVE_FILE);
+}
+
+static inline void __pagevec_lru_add_active_file(struct pagevec *pvec)
+{
+ ____pagevec_lru_add(pvec, LRU_ACTIVE_FILE);
+}
+
+
+static inline void pagevec_lru_add_file(struct pagevec *pvec)
+{
+ if (pagevec_count(pvec))
+ __pagevec_lru_add_file(pvec);
+}
+
+static inline void pagevec_lru_add_anon(struct pagevec *pvec)
{
if (pagevec_count(pvec))
- __pagevec_lru_add(pvec);
+ __pagevec_lru_add_anon(pvec);
}
#endif /* _LINUX_PAGEVEC_H */
Index: linux-2.6.26-rc2-mm1/include/linux/vmstat.h
===================================================================
--- linux-2.6.26-rc2-mm1.orig/include/linux/vmstat.h 2008-05-23 14:21:21.000000000 -0400
+++ linux-2.6.26-rc2-mm1/include/linux/vmstat.h 2008-05-23 14:21:34.000000000 -0400
@@ -159,6 +159,16 @@ static inline unsigned long zone_page_st
return x;
}
+extern unsigned long global_lru_pages(void);
+
+static inline unsigned long zone_lru_pages(struct zone *zone)
+{
+ return (zone_page_state(zone, NR_ACTIVE_ANON)
+ + zone_page_state(zone, NR_ACTIVE_FILE)
+ + zone_page_state(zone, NR_INACTIVE_ANON)
+ + zone_page_state(zone, NR_INACTIVE_FILE));
+}
+
#ifdef CONFIG_NUMA
/*
* Determine the per node value of a stat item. This function
Index: linux-2.6.26-rc2-mm1/mm/page-writeback.c
===================================================================
--- linux-2.6.26-rc2-mm1.orig/mm/page-writeback.c 2008-05-23 14:21:21.000000000 -0400
+++ linux-2.6.26-rc2-mm1/mm/page-writeback.c 2008-05-23 14:21:34.000000000 -0400
@@ -331,9 +331,7 @@ static unsigned long highmem_dirtyable_m
struct zone *z =
&NODE_DATA(node)->node_zones[ZONE_HIGHMEM];
- x += zone_page_state(z, NR_FREE_PAGES)
- + zone_page_state(z, NR_INACTIVE)
- + zone_page_state(z, NR_ACTIVE);
+ x += zone_page_state(z, NR_FREE_PAGES) + zone_lru_pages(z);
}
/*
* Make sure that the number of highmem pages is never larger
@@ -351,9 +349,7 @@ static unsigned long determine_dirtyable
{
unsigned long x;
- x = global_page_state(NR_FREE_PAGES)
- + global_page_state(NR_INACTIVE)
- + global_page_state(NR_ACTIVE);
+ x = global_page_state(NR_FREE_PAGES) + global_lru_pages();
if (!vm_highmem_is_dirtyable)
x -= highmem_dirtyable_memory(x);
Index: linux-2.6.26-rc2-mm1/include/linux/swap.h
===================================================================
--- linux-2.6.26-rc2-mm1.orig/include/linux/swap.h 2008-05-23 14:21:33.000000000 -0400
+++ linux-2.6.26-rc2-mm1/include/linux/swap.h 2008-05-23 14:21:34.000000000 -0400
@@ -184,14 +184,24 @@ extern void swap_setup(void);
* lru_cache_add: add a page to the page lists
* @page: the page to add
*/
-static inline void lru_cache_add(struct page *page)
+static inline void lru_cache_add_anon(struct page *page)
{
- __lru_cache_add(page, LRU_INACTIVE);
+ __lru_cache_add(page, LRU_INACTIVE_ANON);
}
-static inline void lru_cache_add_active(struct page *page)
+static inline void lru_cache_add_active_anon(struct page *page)
{
- __lru_cache_add(page, LRU_ACTIVE);
+ __lru_cache_add(page, LRU_ACTIVE_ANON);
+}
+
+static inline void lru_cache_add_file(struct page *page)
+{
+ __lru_cache_add(page, LRU_INACTIVE_FILE);
+}
+
+static inline void lru_cache_add_active_file(struct page *page)
+{
+ __lru_cache_add(page, LRU_ACTIVE_FILE);
}
/* linux/mm/vmscan.c */
@@ -199,7 +209,7 @@ extern unsigned long try_to_free_pages(s
gfp_t gfp_mask);
extern unsigned long try_to_free_mem_cgroup_pages(struct mem_cgroup *mem,
gfp_t gfp_mask);
-extern int __isolate_lru_page(struct page *page, int mode);
+extern int __isolate_lru_page(struct page *page, int mode, int file);
extern unsigned long shrink_all_memory(unsigned long nr_pages);
extern int vm_swappiness;
extern int remove_mapping(struct address_space *mapping, struct page *page);
Index: linux-2.6.26-rc2-mm1/include/linux/memcontrol.h
===================================================================
--- linux-2.6.26-rc2-mm1.orig/include/linux/memcontrol.h 2008-05-23 14:21:33.000000000 -0400
+++ linux-2.6.26-rc2-mm1/include/linux/memcontrol.h 2008-05-23 14:21:34.000000000 -0400
@@ -41,7 +41,7 @@ extern unsigned long mem_cgroup_isolate_
unsigned long *scanned, int order,
int mode, struct zone *z,
struct mem_cgroup *mem_cont,
- int active);
+ int active, int file);
extern void mem_cgroup_out_of_memory(struct mem_cgroup *mem, gfp_t gfp_mask);
int task_in_mem_cgroup(struct task_struct *task, const struct mem_cgroup *mem);
Index: linux-2.6.26-rc2-mm1/mm/memcontrol.c
===================================================================
--- linux-2.6.26-rc2-mm1.orig/mm/memcontrol.c 2008-05-23 14:21:33.000000000 -0400
+++ linux-2.6.26-rc2-mm1/mm/memcontrol.c 2008-05-23 14:21:34.000000000 -0400
@@ -163,6 +163,7 @@ struct page_cgroup {
};
#define PAGE_CGROUP_FLAG_CACHE (0x1) /* charged as cache */
#define PAGE_CGROUP_FLAG_ACTIVE (0x2) /* page is active in this cgroup */
+#define PAGE_CGROUP_FLAG_FILE (0x4) /* page is file system backed */
static int page_cgroup_nid(struct page_cgroup *pc)
{
@@ -280,8 +281,12 @@ static void unlock_page_cgroup(struct pa
static void __mem_cgroup_remove_list(struct mem_cgroup_per_zone *mz,
struct page_cgroup *pc)
{
- int from = pc->flags & PAGE_CGROUP_FLAG_ACTIVE;
- int lru = !!from;
+ int lru = LRU_BASE;
+
+ if (pc->flags & PAGE_CGROUP_FLAG_ACTIVE)
+ lru += LRU_ACTIVE;
+ if (pc->flags & PAGE_CGROUP_FLAG_FILE)
+ lru += LRU_FILE;
MEM_CGROUP_ZSTAT(mz, lru) -= 1;
@@ -292,10 +297,12 @@ static void __mem_cgroup_remove_list(str
static void __mem_cgroup_add_list(struct mem_cgroup_per_zone *mz,
struct page_cgroup *pc)
{
- int lru = LRU_INACTIVE;
+ int lru = LRU_BASE;
if (pc->flags & PAGE_CGROUP_FLAG_ACTIVE)
lru += LRU_ACTIVE;
+ if (pc->flags & PAGE_CGROUP_FLAG_FILE)
+ lru += LRU_FILE;
MEM_CGROUP_ZSTAT(mz, lru) += 1;
list_add(&pc->lru, &mz->lists[lru]);
@@ -306,10 +313,9 @@ static void __mem_cgroup_add_list(struct
static void __mem_cgroup_move_lists(struct page_cgroup *pc, bool active)
{
struct mem_cgroup_per_zone *mz = page_cgroup_zoneinfo(pc);
- int lru = LRU_INACTIVE;
-
- if (pc->flags & PAGE_CGROUP_FLAG_ACTIVE)
- lru += LRU_ACTIVE;
+ int from = pc->flags & PAGE_CGROUP_FLAG_ACTIVE;
+ int file = pc->flags & PAGE_CGROUP_FLAG_FILE;
+ int lru = LRU_FILE * !!file + !!from;
MEM_CGROUP_ZSTAT(mz, lru) -= 1;
@@ -318,7 +324,7 @@ static void __mem_cgroup_move_lists(stru
else
pc->flags &= ~PAGE_CGROUP_FLAG_ACTIVE;
- lru = !!active;
+ lru = LRU_FILE * !!file + !!active;
MEM_CGROUP_ZSTAT(mz, lru) += 1;
list_move(&pc->lru, &mz->lists[lru]);
}
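The LRU_FILE * !!file + !!active expression is the same enum arithmetic
used by the global LRU code; a sketch of the mapping (illustrative
only):

	/*
	 * !!x collapses a flag word to 0/1, so the four combinations
	 * land exactly on the four lru_list slots:
	 *   anon+inactive -> 0   anon+active -> 1
	 *   file+inactive -> 2   file+active -> 3
	 */
	static int example_page_cgroup_lru(int file, int active)
	{
		return LRU_FILE * !!file + !!active;
	}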
@@ -380,21 +386,6 @@ int mem_cgroup_calc_mapped_ratio(struct
}
/*
- * This function is called from vmscan.c. In page reclaiming loop. balance
- * between active and inactive list is calculated. For memory controller
- * page reclaiming, we should use using mem_cgroup's imbalance rather than
- * zone's global lru imbalance.
- */
-long mem_cgroup_reclaim_imbalance(struct mem_cgroup *mem)
-{
- unsigned long active, inactive;
- /* active and inactive are the number of pages. 'long' is ok.*/
- active = mem_cgroup_get_all_zonestat(mem, LRU_ACTIVE);
- inactive = mem_cgroup_get_all_zonestat(mem, LRU_INACTIVE);
- return (long) (active / (inactive + 1));
-}
-
-/*
* prev_priority control...this will be used in memory reclaim path.
*/
int mem_cgroup_get_reclaim_priority(struct mem_cgroup *mem)
@@ -439,7 +430,7 @@ unsigned long mem_cgroup_isolate_pages(u
unsigned long *scanned, int order,
int mode, struct zone *z,
struct mem_cgroup *mem_cont,
- int active)
+ int active, int file)
{
unsigned long nr_taken = 0;
struct page *page;
@@ -450,7 +441,7 @@ unsigned long mem_cgroup_isolate_pages(u
int nid = z->zone_pgdat->node_id;
int zid = zone_idx(z);
struct mem_cgroup_per_zone *mz;
- int lru = !!active;
+ int lru = LRU_FILE * !!file + !!active;
BUG_ON(!mem_cont);
mz = mem_cgroup_zoneinfo(mem_cont, nid, zid);
@@ -466,6 +457,9 @@ unsigned long mem_cgroup_isolate_pages(u
if (unlikely(!PageLRU(page)))
continue;
+ /*
+ * TODO: play better with lumpy reclaim, grabbing anything.
+ */
if (PageActive(page) && !active) {
__mem_cgroup_move_lists(pc, true);
continue;
@@ -478,7 +472,7 @@ unsigned long mem_cgroup_isolate_pages(u
scan++;
list_move(&pc->lru, &pc_list);
- if (__isolate_lru_page(page, mode) == 0) {
+ if (__isolate_lru_page(page, mode, file) == 0) {
list_move(&page->lru, dst);
nr_taken++;
}
@@ -590,9 +584,11 @@ retry:
* If a page is accounted as a page cache, insert to inactive list.
* If anon, insert to active list.
*/
- if (ctype == MEM_CGROUP_CHARGE_TYPE_CACHE)
+ if (ctype == MEM_CGROUP_CHARGE_TYPE_CACHE) {
pc->flags = PAGE_CGROUP_FLAG_CACHE;
- else
+ if (page_file_cache(page))
+ pc->flags |= PAGE_CGROUP_FLAG_FILE;
+ } else
pc->flags = PAGE_CGROUP_FLAG_ACTIVE;
lock_page_cgroup(page);
@@ -890,14 +886,21 @@ static int mem_control_stat_show(struct
}
/* showing # of active pages */
{
- unsigned long active, inactive;
+ unsigned long active_anon, inactive_anon;
+ unsigned long active_file, inactive_file;
- inactive = mem_cgroup_get_all_zonestat(mem_cont,
- LRU_INACTIVE);
- active = mem_cgroup_get_all_zonestat(mem_cont,
- LRU_ACTIVE);
- cb->fill(cb, "active", (active) * PAGE_SIZE);
- cb->fill(cb, "inactive", (inactive) * PAGE_SIZE);
+ inactive_anon = mem_cgroup_get_all_zonestat(mem_cont,
+ LRU_INACTIVE_ANON);
+ active_anon = mem_cgroup_get_all_zonestat(mem_cont,
+ LRU_ACTIVE_ANON);
+ inactive_file = mem_cgroup_get_all_zonestat(mem_cont,
+ LRU_INACTIVE_FILE);
+ active_file = mem_cgroup_get_all_zonestat(mem_cont,
+ LRU_ACTIVE_FILE);
+ cb->fill(cb, "active_anon", (active_anon) * PAGE_SIZE);
+ cb->fill(cb, "inactive_anon", (inactive_anon) * PAGE_SIZE);
+ cb->fill(cb, "active_file", (active_file) * PAGE_SIZE);
+ cb->fill(cb, "inactive_file", (inactive_file) * PAGE_SIZE);
}
return 0;
}
--
All Rights Reversed
* [PATCH -mm 07/25] second chance replacement for anonymous pages
2008-06-06 20:28 [PATCH -mm 00/25] VM pageout scalability improvements (V10) Rik van Riel, Rik van Riel
` (5 preceding siblings ...)
2008-06-06 20:28 ` [PATCH -mm 06/25] split LRU lists into anon & file sets Rik van Riel, Rik van Riel
@ 2008-06-06 20:28 ` Rik van Riel, Rik van Riel
2008-06-07 1:04 ` Andrew Morton
2008-06-06 20:28 ` [PATCH -mm 08/25] add some sanity checks to get_scan_ratio Rik van Riel, Rik van Riel
` (18 subsequent siblings)
25 siblings, 1 reply; 102+ messages in thread
From: Rik van Riel, Rik van Riel @ 2008-06-06 20:28 UTC (permalink / raw)
To: linux-kernel; +Cc: Andrew Morton, Lee Schermerhorn, Kosaki Motohiro
[-- Attachment #1: rvr-03-linux-2.6-vm-anon-clock.patch --]
[-- Type: text/plain, Size: 8223 bytes --]
From: Rik van Riel <riel@redhat.com>
We avoid evicting and scanning anonymous pages for the most part, but
under some workloads we can end up with most of memory filled with
anonymous pages. At that point, we suddenly need to clear the referenced
bits on all of memory, which can take ages on very large memory systems.
We can reduce the maximum number of pages that need to be scanned by
not taking the referenced state into account when deactivating an
anonymous page. After all, every anonymous page starts out referenced,
so why check?
If an anonymous page gets referenced again before it reaches the end
of the inactive list, we move it back to the active list.
To keep the maximum amount of necessary work reasonable, we scale the
active to inactive ratio with the size of memory, using the formula
active:inactive ratio = sqrt(memory in GB * 10).
Kswapd CPU use now seems to scale with the amount of pageout bandwidth,
instead of with the amount of memory present in the system.
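For illustration, here is a minimal user-space sketch (not kernel code) of
the ratio calculation added below; a naive isqrt() stands in for the
kernel's int_sqrt(), and the sample sizes reproduce the table in the
patch comment:

#include <stdio.h>

static unsigned long isqrt(unsigned long x)
{
	unsigned long r = 0;

	while ((r + 1) * (r + 1) <= x)
		r++;
	return r;
}

int main(void)
{
	/* zone sizes in gigabytes; sub-GB zones shift down to 0 */
	unsigned long sizes_gb[] = { 0, 1, 10, 100, 1024, 10240 };
	int i;

	for (i = 0; i < 6; i++) {
		unsigned long ratio = isqrt(10 * sizes_gb[i]);

		if (!ratio)
			ratio = 1;	/* floor, as in the patch */
		printf("%6lu GB -> inactive_ratio %lu\n",
		       sizes_gb[i], ratio);
	}
	return 0;
}

This prints ratios 1, 3, 10, 31, 101 and 320, matching the table in the
setup_per_zone_inactive_ratio() comment below.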
Signed-off-by: Rik van Riel <riel@redhat.com>
Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
---
include/linux/mm_inline.h | 12 ++++++++++++
include/linux/mmzone.h | 5 +++++
mm/page_alloc.c | 40 ++++++++++++++++++++++++++++++++++++++++
mm/vmscan.c | 38 +++++++++++++++++++++++++++++++-------
mm/vmstat.c | 6 ++++--
5 files changed, 92 insertions(+), 9 deletions(-)
Index: linux-2.6.26-rc2-mm1/include/linux/mm_inline.h
===================================================================
--- linux-2.6.26-rc2-mm1.orig/include/linux/mm_inline.h 2008-05-23 14:21:34.000000000 -0400
+++ linux-2.6.26-rc2-mm1/include/linux/mm_inline.h 2008-05-28 12:09:06.000000000 -0400
@@ -97,4 +97,16 @@ del_page_from_lru(struct zone *zone, str
__dec_zone_state(zone, NR_INACTIVE_ANON + l);
}
+static inline int inactive_anon_low(struct zone *zone)
+{
+ unsigned long active, inactive;
+
+ active = zone_page_state(zone, NR_ACTIVE_ANON);
+ inactive = zone_page_state(zone, NR_INACTIVE_ANON);
+
+ if (inactive * zone->inactive_ratio < active)
+ return 1;
+
+ return 0;
+}
#endif
Index: linux-2.6.26-rc2-mm1/include/linux/mmzone.h
===================================================================
--- linux-2.6.26-rc2-mm1.orig/include/linux/mmzone.h 2008-05-23 14:21:34.000000000 -0400
+++ linux-2.6.26-rc2-mm1/include/linux/mmzone.h 2008-05-28 12:09:06.000000000 -0400
@@ -311,6 +311,11 @@ struct zone {
*/
int prev_priority;
+ /*
+ * The ratio of active to inactive pages.
+ */
+ unsigned int inactive_ratio;
+
ZONE_PADDING(_pad2_)
/* Rarely used or read-mostly fields */
Index: linux-2.6.26-rc2-mm1/mm/page_alloc.c
===================================================================
--- linux-2.6.26-rc2-mm1.orig/mm/page_alloc.c 2008-05-23 14:21:34.000000000 -0400
+++ linux-2.6.26-rc2-mm1/mm/page_alloc.c 2008-05-28 12:09:06.000000000 -0400
@@ -4269,6 +4269,45 @@ void setup_per_zone_pages_min(void)
calculate_totalreserve_pages();
}
+/**
+ * setup_per_zone_inactive_ratio - called when min_free_kbytes changes.
+ *
+ * The inactive anon list should be small enough that the VM never has to
+ * do too much work, but large enough that each inactive page has a chance
+ * to be referenced again before it is swapped out.
+ *
+ * The inactive_anon ratio is the ratio of active to inactive anonymous
+ * pages. I.e., a ratio of 3 means 3:1, or 25% of the anonymous pages are
+ * on the inactive list.
+ *
+ * total return max
+ * memory value inactive anon
+ * -------------------------------------
+ * 10MB 1 5MB
+ * 100MB 1 50MB
+ * 1GB 3 250MB
+ * 10GB 10 0.9GB
+ * 100GB 31 3GB
+ * 1TB 101 10GB
+ * 10TB 320 32GB
+ */
+void setup_per_zone_inactive_ratio(void)
+{
+ struct zone *zone;
+
+ for_each_zone(zone) {
+ unsigned int gb, ratio;
+
+ /* Zone size in gigabytes */
+ gb = zone->present_pages >> (30 - PAGE_SHIFT);
+ ratio = int_sqrt(10 * gb);
+ if (!ratio)
+ ratio = 1;
+
+ zone->inactive_ratio = ratio;
+ }
+}
+
/*
* Initialise min_free_kbytes.
*
@@ -4306,6 +4345,7 @@ static int __init init_per_zone_pages_mi
min_free_kbytes = 65536;
setup_per_zone_pages_min();
setup_per_zone_lowmem_reserve();
+ setup_per_zone_inactive_ratio();
return 0;
}
module_init(init_per_zone_pages_min)
Index: linux-2.6.26-rc2-mm1/mm/vmscan.c
===================================================================
--- linux-2.6.26-rc2-mm1.orig/mm/vmscan.c 2008-05-23 14:21:34.000000000 -0400
+++ linux-2.6.26-rc2-mm1/mm/vmscan.c 2008-05-28 12:11:38.000000000 -0400
@@ -114,7 +114,7 @@ struct scan_control {
/*
* From 0 .. 100. Higher means more swappy.
*/
-int vm_swappiness = 60;
+int vm_swappiness = 20;
long vm_total_pages; /* The total number of pages which the VM controls */
static LIST_HEAD(shrinker_list);
@@ -1008,7 +1008,7 @@ static inline int zone_is_near_oom(struc
static void shrink_active_list(unsigned long nr_pages, struct zone *zone,
struct scan_control *sc, int priority, int file)
{
- unsigned long pgmoved;
+ unsigned long pgmoved = 0;
int pgdeactivate = 0;
unsigned long pgscanned;
LIST_HEAD(l_hold); /* The pages which were snipped off */
@@ -1036,17 +1036,32 @@ static void shrink_active_list(unsigned
__mod_zone_page_state(zone, NR_ACTIVE_ANON, -pgmoved);
spin_unlock_irq(&zone->lru_lock);
+ pgmoved = 0;
while (!list_empty(&l_hold)) {
cond_resched();
page = lru_to_page(&l_hold);
list_del(&page->lru);
- if (page_referenced(page, 0, sc->mem_cgroup))
- list_add(&page->lru, &l_active);
- else
+ if (page_referenced(page, 0, sc->mem_cgroup)) {
+ if (file) {
+ /* Referenced file pages stay active. */
+ list_add(&page->lru, &l_active);
+ } else {
+ /* Anonymous pages always get deactivated. */
+ list_add(&page->lru, &l_inactive);
+ pgmoved++;
+ }
+ } else
list_add(&page->lru, &l_inactive);
}
/*
+ * Count the referenced anon pages as rotated, to balance pageout
+ * scan pressure between file and anonymous pages in get_scan_ratio.
+ */
+ if (!file)
+ zone->recent_rotated_anon += pgmoved;
+
+ /*
* Now put the pages back on the appropriate [file or anon] inactive
* and active lists.
*/
@@ -1129,7 +1144,13 @@ static unsigned long shrink_list(enum lr
{
int file = is_file_lru(lru);
- if (lru == LRU_ACTIVE_ANON || lru == LRU_ACTIVE_FILE) {
+ if (lru == LRU_ACTIVE_FILE) {
+ shrink_active_list(nr_to_scan, zone, sc, priority, file);
+ return 0;
+ }
+
+ if (lru == LRU_ACTIVE_ANON &&
+ (!scan_global_lru(sc) || inactive_anon_low(zone))) {
shrink_active_list(nr_to_scan, zone, sc, priority, file);
return 0;
}
@@ -1239,8 +1260,8 @@ static unsigned long shrink_zone(int pri
}
}
- while (nr[LRU_ACTIVE_ANON] || nr[LRU_INACTIVE_ANON] ||
- nr[LRU_ACTIVE_FILE] || nr[LRU_INACTIVE_FILE]) {
+ while (nr[LRU_INACTIVE_ANON] || nr[LRU_ACTIVE_FILE] ||
+ nr[LRU_INACTIVE_FILE]) {
for_each_lru(l) {
if (nr[l]) {
nr_to_scan = min(nr[l],
@@ -1542,6 +1563,14 @@ loop_again:
priority != DEF_PRIORITY)
continue;
+ /*
+ * Do some background aging of the anon list, to give
+ * pages a chance to be referenced before reclaiming.
+ */
+ if (inactive_anon_low(zone))
+ shrink_active_list(SWAP_CLUSTER_MAX, zone,
+ &sc, priority, 0);
+
if (!zone_watermark_ok(zone, order, zone->pages_high,
0, 0)) {
end_zone = i;
Index: linux-2.6.26-rc2-mm1/mm/vmstat.c
===================================================================
--- linux-2.6.26-rc2-mm1.orig/mm/vmstat.c 2008-05-23 14:21:34.000000000 -0400
+++ linux-2.6.26-rc2-mm1/mm/vmstat.c 2008-05-28 12:09:06.000000000 -0400
@@ -814,10 +814,12 @@ static void zoneinfo_show_print(struct s
seq_printf(m,
"\n all_unreclaimable: %u"
"\n prev_priority: %i"
- "\n start_pfn: %lu",
+ "\n start_pfn: %lu"
+ "\n inactive_ratio: %u",
zone_is_all_unreclaimable(zone),
zone->prev_priority,
- zone->zone_start_pfn);
+ zone->zone_start_pfn,
+ zone->inactive_ratio);
seq_putc(m, '\n');
}
--
All Rights Reversed
* [PATCH -mm 08/25] add some sanity checks to get_scan_ratio
2008-06-06 20:28 [PATCH -mm 00/25] VM pageout scalability improvements (V10) Rik van Riel, Rik van Riel
` (6 preceding siblings ...)
2008-06-06 20:28 ` [PATCH -mm 07/25] second chance replacement for anonymous pages Rik van Riel, Rik van Riel
@ 2008-06-06 20:28 ` Rik van Riel, Rik van Riel
2008-06-07 1:04 ` Andrew Morton
2008-06-06 20:28 ` [PATCH -mm 09/25] fix pagecache reclaim referenced bit check Rik van Riel, Rik van Riel
` (17 subsequent siblings)
25 siblings, 1 reply; 102+ messages in thread
From: Rik van Riel, Rik van Riel @ 2008-06-06 20:28 UTC (permalink / raw)
To: linux-kernel; +Cc: Andrew Morton, Lee Schermerhorn, Kosaki Motohiro
[-- Attachment #1: rvr-04-linux-2.6-scan-ratio-fixes.patch --]
[-- Type: text/plain, Size: 8679 bytes --]
From: Rik van Riel <riel@redhat.com>
The access ratio based scan rate determination in get_scan_ratio
works ok in most situations, but needs to be corrected in some
corner cases:
- if we run out of swap space, do not bother scanning the anon LRUs
- if we have already freed all of the page cache, we need to scan
the anon LRUs
- restore the *actual* access ratio based scan rate algorithm; the
previous versions of this patch series had the wrong version
(a sketch of the restored calculation follows this list)
- scale the number of pages added to zone->nr_scan[l]
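As a sketch of the restored calculation (user-space, not kernel code,
with made-up per-zone statistics), assuming anon_prio is derived from
vm_swappiness and file_prio from its complement, which this hunk does
not show:

#include <stdio.h>

int main(void)
{
	unsigned long swappiness = 20;	/* new default from patch 07 */
	unsigned long anon_prio = swappiness;
	unsigned long file_prio = 200 - swappiness;

	/* hypothetical per-zone reclaim statistics */
	unsigned long recent_scanned_anon = 1000, recent_rotated_anon = 100;
	unsigned long recent_scanned_file = 5000, recent_rotated_file = 4000;

	/* lists whose pages rarely rotate back are cheap to scan */
	unsigned long ap = (anon_prio + 1) * (recent_scanned_anon + 1)
				/ (recent_rotated_anon + 1);
	unsigned long fp = (file_prio + 1) * (recent_scanned_file + 1)
				/ (recent_rotated_file + 1);

	unsigned long percent_anon = 100 * ap / (ap + fp + 1);

	printf("anon %lu%% / file %lu%%\n", percent_anon,
	       100 - percent_anon);
	return 0;
}

With an anon list that rarely rotates and a file list that rotates
heavily, this prints roughly 47%/53%: pressure shifts toward the anon
pages even at swappiness 20.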
Signed-off-by: Rik van Riel <riel@redhat.com>
---
include/linux/mmzone.h | 2
mm/page_alloc.c | 3 -
mm/swap.c | 13 +++++-
mm/vmscan.c | 104 ++++++++++++++++++++++++++++++++-----------------
4 files changed, 85 insertions(+), 37 deletions(-)
Index: linux-2.6.26-rc2-mm1/mm/vmscan.c
===================================================================
--- linux-2.6.26-rc2-mm1.orig/mm/vmscan.c 2008-05-28 12:11:38.000000000 -0400
+++ linux-2.6.26-rc2-mm1/mm/vmscan.c 2008-05-28 12:11:51.000000000 -0400
@@ -893,8 +893,13 @@ static unsigned long shrink_inactive_lis
__mod_zone_page_state(zone, NR_INACTIVE_ANON,
-count[LRU_INACTIVE_ANON]);
- if (scan_global_lru(sc))
+ if (scan_global_lru(sc)) {
zone->pages_scanned += nr_scan;
+ zone->recent_scanned_anon += count[LRU_ACTIVE_ANON] +
+ count[LRU_INACTIVE_ANON];
+ zone->recent_scanned_file += count[LRU_ACTIVE_FILE] +
+ count[LRU_INACTIVE_FILE];
+ }
spin_unlock_irq(&zone->lru_lock);
nr_scanned += nr_scan;
@@ -944,11 +949,13 @@ static unsigned long shrink_inactive_lis
VM_BUG_ON(PageLRU(page));
SetPageLRU(page);
list_del(&page->lru);
- if (page_file_cache(page)) {
+ if (page_file_cache(page))
lru += LRU_FILE;
- zone->recent_rotated_file++;
- } else {
- zone->recent_rotated_anon++;
+ if (scan_global_lru(sc)) {
+ if (page_file_cache(page))
+ zone->recent_rotated_file++;
+ else
+ zone->recent_rotated_anon++;
}
if (PageActive(page))
lru += LRU_ACTIVE;
@@ -1027,8 +1034,13 @@ static void shrink_active_list(unsigned
* zone->pages_scanned is used for detect zone's oom
* mem_cgroup remembers nr_scan by itself.
*/
- if (scan_global_lru(sc))
+ if (scan_global_lru(sc)) {
zone->pages_scanned += pgscanned;
+ if (file)
+ zone->recent_scanned_file += pgscanned;
+ else
+ zone->recent_scanned_anon += pgscanned;
+ }
if (file)
__mod_zone_page_state(zone, NR_ACTIVE_FILE, -pgmoved);
@@ -1170,9 +1182,8 @@ static unsigned long shrink_list(enum lr
static void get_scan_ratio(struct zone *zone, struct scan_control * sc,
unsigned long *percent)
{
- unsigned long anon, file;
+ unsigned long anon, file, free;
unsigned long anon_prio, file_prio;
- unsigned long rotate_sum;
unsigned long ap, fp;
anon = zone_page_state(zone, NR_ACTIVE_ANON) +
@@ -1180,15 +1191,19 @@ static void get_scan_ratio(struct zone *
file = zone_page_state(zone, NR_ACTIVE_FILE) +
zone_page_state(zone, NR_INACTIVE_FILE);
- rotate_sum = zone->recent_rotated_file + zone->recent_rotated_anon;
-
/* Keep a floating average of RECENT references. */
- if (unlikely(rotate_sum > min(anon, file))) {
+ if (unlikely(zone->recent_scanned_anon > anon / zone->inactive_ratio)) {
spin_lock_irq(&zone->lru_lock);
- zone->recent_rotated_file /= 2;
+ zone->recent_scanned_anon /= 2;
zone->recent_rotated_anon /= 2;
spin_unlock_irq(&zone->lru_lock);
- rotate_sum /= 2;
+ }
+
+ if (unlikely(zone->recent_scanned_file > file / 4)) {
+ spin_lock_irq(&zone->lru_lock);
+ zone->recent_scanned_file /= 2;
+ zone->recent_rotated_file /= 2;
+ spin_unlock_irq(&zone->lru_lock);
}
/*
@@ -1201,23 +1216,33 @@ static void get_scan_ratio(struct zone *
/*
* anon recent_rotated_anon
* %anon = 100 * ----------- / ------------------- * IO cost
- * anon + file rotate_sum
+ * anon + file recent_scanned_anon
*/
- ap = (anon_prio * anon) / (anon + file + 1);
- ap *= rotate_sum / (zone->recent_rotated_anon + 1);
- if (ap == 0)
- ap = 1;
- else if (ap > 100)
- ap = 100;
- percent[0] = ap;
-
- fp = (file_prio * file) / (anon + file + 1);
- fp *= rotate_sum / (zone->recent_rotated_file + 1);
- if (fp == 0)
- fp = 1;
- else if (fp > 100)
- fp = 100;
- percent[1] = fp;
+ ap = (anon_prio + 1) * (zone->recent_scanned_anon + 1);
+ ap /= zone->recent_rotated_anon + 1;
+
+ fp = (file_prio + 1) * (zone->recent_scanned_file + 1);
+ fp /= zone->recent_rotated_file + 1;
+
+ /* Normalize to percentages */
+ percent[0] = 100 * ap / (ap + fp + 1);
+ percent[1] = 100 - percent[0];
+
+ free = zone_page_state(zone, NR_FREE_PAGES);
+
+ /*
+ * If we have no swap space, do not bother scanning anon pages.
+ */
+ if (nr_swap_pages <= 0) {
+ percent[0] = 0;
+ percent[1] = 100;
+ }
+ /*
+ * If we already freed most file pages, scan the anon pages
+ * regardless of the page access ratios or swappiness setting.
+ */
+ else if (file + free <= zone->pages_high)
+ percent[0] = 100;
}
@@ -1238,13 +1263,17 @@ static unsigned long shrink_zone(int pri
for_each_lru(l) {
if (scan_global_lru(sc)) {
int file = is_file_lru(l);
+ int scan;
/*
* Add one to nr_to_scan just to make sure that the
- * kernel will slowly sift through the active list.
+ * kernel will slowly sift through each list.
*/
- zone->nr_scan[l] += (zone_page_state(zone,
- NR_INACTIVE_ANON + l) >> priority) + 1;
- nr[l] = zone->nr_scan[l] * percent[file] / 100;
+ scan = zone_page_state(zone, NR_INACTIVE_ANON + l);
+ scan >>= priority;
+ scan = (scan * percent[file]) / 100;
+
+ zone->nr_scan[l] += scan + 1;
+ nr[l] = zone->nr_scan[l];
if (nr[l] >= sc->swap_cluster_max)
zone->nr_scan[l] = 0;
else
@@ -1261,7 +1290,7 @@ static unsigned long shrink_zone(int pri
}
while (nr[LRU_INACTIVE_ANON] || nr[LRU_ACTIVE_FILE] ||
- nr[LRU_INACTIVE_FILE]) {
+ nr[LRU_INACTIVE_FILE]) {
for_each_lru(l) {
if (nr[l]) {
nr_to_scan = min(nr[l],
@@ -1274,6 +1303,13 @@ static unsigned long shrink_zone(int pri
}
}
+ /*
+ * Even if we did not try to evict anon pages at all, we want to
+ * rebalance the anon lru active/inactive ratio.
+ */
+ if (scan_global_lru(sc) && inactive_anon_low(zone))
+ shrink_active_list(SWAP_CLUSTER_MAX, zone, sc, priority, 0);
+
throttle_vm_writeout(sc->gfp_mask);
return nr_reclaimed;
}
Index: linux-2.6.26-rc2-mm1/include/linux/mmzone.h
===================================================================
--- linux-2.6.26-rc2-mm1.orig/include/linux/mmzone.h 2008-05-28 12:09:06.000000000 -0400
+++ linux-2.6.26-rc2-mm1/include/linux/mmzone.h 2008-05-28 12:11:51.000000000 -0400
@@ -289,6 +289,8 @@ struct zone {
unsigned long recent_rotated_anon;
unsigned long recent_rotated_file;
+ unsigned long recent_scanned_anon;
+ unsigned long recent_scanned_file;
unsigned long pages_scanned; /* since last reclaim */
unsigned long flags; /* zone flags, see below */
Index: linux-2.6.26-rc2-mm1/mm/page_alloc.c
===================================================================
--- linux-2.6.26-rc2-mm1.orig/mm/page_alloc.c 2008-05-28 12:09:06.000000000 -0400
+++ linux-2.6.26-rc2-mm1/mm/page_alloc.c 2008-05-28 12:11:51.000000000 -0400
@@ -3512,7 +3512,8 @@ static void __paginginit free_area_init_
}
zone->recent_rotated_anon = 0;
zone->recent_rotated_file = 0;
-//TODO recent_scanned_* ???
+ zone->recent_scanned_anon = 0;
+ zone->recent_scanned_file = 0;
zap_zone_vm_stats(zone);
zone->flags = 0;
if (!size)
Index: linux-2.6.26-rc2-mm1/mm/swap.c
===================================================================
--- linux-2.6.26-rc2-mm1.orig/mm/swap.c 2008-05-28 12:09:06.000000000 -0400
+++ linux-2.6.26-rc2-mm1/mm/swap.c 2008-05-28 12:11:51.000000000 -0400
@@ -176,8 +176,8 @@ void activate_page(struct page *page)
spin_lock_irq(&zone->lru_lock);
if (PageLRU(page) && !PageActive(page)) {
- int lru = LRU_BASE;
- lru += page_file_cache(page);
+ int file = page_file_cache(page);
+ int lru = LRU_BASE + file;
del_page_from_lru_list(zone, page, lru);
SetPageActive(page);
@@ -185,6 +185,15 @@ void activate_page(struct page *page)
add_page_to_lru_list(zone, page, lru);
__count_vm_event(PGACTIVATE);
mem_cgroup_move_lists(page, true);
+
+ if (file) {
+ zone->recent_scanned_file++;
+ zone->recent_rotated_file++;
+ } else {
+ /* Can this happen? Maybe through tmpfs... */
+ zone->recent_scanned_anon++;
+ zone->recent_rotated_anon++;
+ }
}
spin_unlock_irq(&zone->lru_lock);
}
--
All Rights Reversed
* [PATCH -mm 09/25] fix pagecache reclaim referenced bit check
2008-06-06 20:28 [PATCH -mm 00/25] VM pageout scalability improvements (V10) Rik van Riel, Rik van Riel
` (7 preceding siblings ...)
2008-06-06 20:28 ` [PATCH -mm 08/25] add some sanity checks to get_scan_ratio Rik van Riel, Rik van Riel
@ 2008-06-06 20:28 ` Rik van Riel, Rik van Riel
2008-06-07 1:04 ` Andrew Morton
2008-06-06 20:28 ` [PATCH -mm 10/25] add newly swapped in pages to the inactive list Rik van Riel, Rik van Riel
` (16 subsequent siblings)
25 siblings, 1 reply; 102+ messages in thread
From: Rik van Riel, Rik van Riel @ 2008-06-06 20:28 UTC (permalink / raw)
To: linux-kernel
Cc: Andrew Morton, Lee Schermerhorn, Kosaki Motohiro, Martin Bligh
[-- Attachment #1: rvr-05-linux-2.6-fix-pagecache-reclaim.patch --]
[-- Type: text/plain, Size: 1346 bytes --]
From: Rik van Riel <riel@redhat.com>
The -mm tree contains the patch
vmscan-give-referenced-active-and-unmapped-pages-a-second-trip-around-the-lru.patch
which gives referenced pagecache pages another trip around
the active list. This seems to help keep frequently accessed
pagecache pages in memory.
However, it means that pagecache pages that get moved to the
inactive list do not have their referenced bit set, and a
reference to the page will not get it moved back to the active
list. This patch sets the referenced bit on pagecache pages
that get deactivated, so the next access to the page will promote
it back to the active list.
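A toy model (user-space, not kernel code) of the two-step promotion done
by mark_page_accessed() shows why pre-setting the referenced bit matters:
with it set, one access re-activates the page; without it, two are
needed:

#include <stdio.h>
#include <stdbool.h>

struct toy_page {
	bool active;
	bool referenced;
};

static void mark_accessed(struct toy_page *p)
{
	if (!p->active && p->referenced) {
		p->active = true;	/* promote to active list */
		p->referenced = false;
	} else if (!p->referenced) {
		p->referenced = true;	/* first strike only */
	}
}

int main(void)
{
	struct toy_page with_fix = { .active = false, .referenced = true };
	struct toy_page without = { .active = false, .referenced = false };

	mark_accessed(&with_fix);	/* one access is enough */
	mark_accessed(&without);	/* only sets the referenced bit */

	printf("with fix: active=%d, without: active=%d\n",
	       with_fix.active, without.active);
	return 0;
}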
---
mm/vmscan.c | 4 ++++
1 file changed, 4 insertions(+)
Index: linux-2.6.26-rc2-mm1/mm/vmscan.c
===================================================================
--- linux-2.6.26-rc2-mm1.orig/mm/vmscan.c 2008-05-28 12:11:51.000000000 -0400
+++ linux-2.6.26-rc2-mm1/mm/vmscan.c 2008-05-28 12:14:34.000000000 -0400
@@ -1062,8 +1062,13 @@ static void shrink_active_list(unsigned
list_add(&page->lru, &l_inactive);
pgmoved++;
}
- } else
+ } else {
list_add(&page->lru, &l_inactive);
+ if (file && !page_mapped(page))
+ /* Bypass use-once, make the next access count.
+ * See mark_page_accessed. */
+ SetPageReferenced(page);
+ }
}
/*
--
All Rights Reversed
* [PATCH -mm 10/25] add newly swapped in pages to the inactive list
2008-06-06 20:28 [PATCH -mm 00/25] VM pageout scalability improvements (V10) Rik van Riel, Rik van Riel
` (8 preceding siblings ...)
2008-06-06 20:28 ` [PATCH -mm 09/25] fix pagecache reclaim referenced bit check Rik van Riel, Rik van Riel
@ 2008-06-06 20:28 ` Rik van Riel, Rik van Riel
2008-06-07 1:04 ` Andrew Morton
2008-06-06 20:28 ` [PATCH -mm 11/25] more aggressively use lumpy reclaim Rik van Riel, Rik van Riel
` (15 subsequent siblings)
25 siblings, 1 reply; 102+ messages in thread
From: Rik van Riel, Rik van Riel @ 2008-06-06 20:28 UTC (permalink / raw)
To: linux-kernel; +Cc: Andrew Morton, Lee Schermerhorn, Kosaki Motohiro
[-- Attachment #1: rvr-swapin-inactive.patch --]
[-- Type: text/plain, Size: 1113 bytes --]
From: Rik van Riel <riel@redhat.com>
Swapin_readahead can read in a lot of data that the processes in
memory never need. Adding swap cache pages to the inactive list
prevents them from putting too much pressure on the working set.
This has the potential to help the programs that are already in
memory, but it could also be a disadvantage to processes that
are trying to get swapped in.
In short, this patch needs testing.
Signed-off-by: Rik van Riel <riel@redhat.com>
---
mm/swap_state.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
Index: linux-2.6.26-rc2-mm1/mm/swap_state.c
===================================================================
--- linux-2.6.26-rc2-mm1.orig/mm/swap_state.c 2008-05-28 09:40:59.000000000 -0400
+++ linux-2.6.26-rc2-mm1/mm/swap_state.c 2008-05-28 09:42:26.000000000 -0400
@@ -302,7 +302,7 @@ struct page *read_swap_cache_async(swp_e
/*
* Initiate read into locked page and return.
*/
- lru_cache_add_active_anon(new_page);
+ lru_cache_add_anon(new_page);
swap_readpage(NULL, new_page);
return new_page;
}
--
All Rights Reversed
* [PATCH -mm 11/25] more aggressively use lumpy reclaim
2008-06-06 20:28 [PATCH -mm 00/25] VM pageout scalability improvements (V10) Rik van Riel, Rik van Riel
` (9 preceding siblings ...)
2008-06-06 20:28 ` [PATCH -mm 10/25] add newly swapped in pages to the inactive list Rik van Riel, Rik van Riel
@ 2008-06-06 20:28 ` Rik van Riel, Rik van Riel
2008-06-07 1:05 ` Andrew Morton
2008-06-06 20:28 ` [PATCH -mm 12/25] pageflag helpers for configed-out flags Rik van Riel, Rik van Riel
` (14 subsequent siblings)
25 siblings, 1 reply; 102+ messages in thread
From: Rik van Riel, Rik van Riel @ 2008-06-06 20:28 UTC (permalink / raw)
To: linux-kernel; +Cc: Andrew Morton, Lee Schermerhorn, Kosaki Motohiro
[-- Attachment #1: lumpy-reclaim-lower-order.patch --]
[-- Type: text/plain, Size: 2358 bytes --]
From: Rik van Riel <riel@redhat.com>
During an AIM7 run on a 16GB system, fork started failing around
32000 threads, despite the system having plenty of free swap and
15GB of pageable memory.
If normal pageout does not result in contiguous free pages for
kernel stacks, fall back to lumpy reclaim instead of failing fork
or doing excessive pageout IO.
I do not know whether this change is needed due to the extreme
stress test or because the inactive list is a smaller fraction
of system memory on huge systems.
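The new isolation mode decision, sketched as a stand-alone function
(user-space, not kernel code) with the constants this tree uses
(PAGE_ALLOC_COSTLY_ORDER is 3, DEF_PRIORITY is 12); the sample calls
assume kernel stacks are order-1 allocations:

#include <stdio.h>

#define PAGE_ALLOC_COSTLY_ORDER	3
#define DEF_PRIORITY		12

enum { ISOLATE_INACTIVE, ISOLATE_BOTH };

static int isolate_mode(int order, int priority)
{
	if (order > PAGE_ALLOC_COSTLY_ORDER)
		return ISOLATE_BOTH;	/* always lumpy for huge requests */
	if (order && priority < DEF_PRIORITY - 2)
		return ISOLATE_BOTH;	/* small order, but struggling */
	return ISOLATE_INACTIVE;
}

int main(void)
{
	/* an order-1 allocation only goes lumpy under pressure */
	printf("order 1, priority 12: %d\n", isolate_mode(1, 12)); /* 0 */
	printf("order 1, priority  9: %d\n", isolate_mode(1, 9));  /* 1 */
	printf("order 4, priority 12: %d\n", isolate_mode(4, 12)); /* 1 */
	return 0;
}

At priority 9, i.e. after three scan passes without meeting the
watermarks, even an order-1 allocation falls back to lumpy reclaim.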
Signed-off-by: Rik van Riel <riel@redhat.com>
---
mm/vmscan.c | 20 ++++++++++++++++----
1 file changed, 16 insertions(+), 4 deletions(-)
Index: linux-2.6.26-rc2-mm1/mm/vmscan.c
===================================================================
--- linux-2.6.26-rc2-mm1.orig/mm/vmscan.c 2008-05-28 12:14:34.000000000 -0400
+++ linux-2.6.26-rc2-mm1/mm/vmscan.c 2008-05-28 12:14:43.000000000 -0400
@@ -857,7 +857,8 @@ int isolate_lru_page(struct page *page)
* of reclaimed pages
*/
static unsigned long shrink_inactive_list(unsigned long max_scan,
- struct zone *zone, struct scan_control *sc, int file)
+ struct zone *zone, struct scan_control *sc,
+ int priority, int file)
{
LIST_HEAD(page_list);
struct pagevec pvec;
@@ -875,8 +876,19 @@ static unsigned long shrink_inactive_lis
unsigned long nr_freed;
unsigned long nr_active;
unsigned int count[NR_LRU_LISTS] = { 0, };
- int mode = (sc->order > PAGE_ALLOC_COSTLY_ORDER) ?
- ISOLATE_BOTH : ISOLATE_INACTIVE;
+ int mode = ISOLATE_INACTIVE;
+
+ /*
+ * If we need a large contiguous chunk of memory, or have
+ * trouble getting a small set of contiguous pages, we
+ * will reclaim both active and inactive pages.
+ *
+ * We use the same threshold as pageout congestion_wait below.
+ */
+ if (sc->order > PAGE_ALLOC_COSTLY_ORDER)
+ mode = ISOLATE_BOTH;
+ else if (sc->order && priority < DEF_PRIORITY - 2)
+ mode = ISOLATE_BOTH;
nr_taken = sc->isolate_pages(sc->swap_cluster_max,
&page_list, &nr_scan, sc->order, mode,
@@ -1171,7 +1183,7 @@ static unsigned long shrink_list(enum lr
shrink_active_list(nr_to_scan, zone, sc, priority, file);
return 0;
}
- return shrink_inactive_list(nr_to_scan, zone, sc, file);
+ return shrink_inactive_list(nr_to_scan, zone, sc, priority, file);
}
/*
--
All Rights Reversed
* [PATCH -mm 12/25] pageflag helpers for configed-out flags
2008-06-06 20:28 [PATCH -mm 00/25] VM pageout scalability improvements (V10) Rik van Riel, Rik van Riel
` (10 preceding siblings ...)
2008-06-06 20:28 ` [PATCH -mm 11/25] more aggressively use lumpy reclaim Rik van Riel, Rik van Riel
@ 2008-06-06 20:28 ` Rik van Riel, Rik van Riel
2008-06-06 20:28 ` [PATCH -mm 13/25] Noreclaim LRU Infrastructure Rik van Riel, Rik van Riel
` (13 subsequent siblings)
25 siblings, 0 replies; 102+ messages in thread
From: Rik van Riel, Rik van Riel @ 2008-06-06 20:28 UTC (permalink / raw)
To: linux-kernel
Cc: Andrew Morton, Lee Schermerhorn, Kosaki Motohiro,
Lee Schermerhorn
[-- Attachment #1: pageflag-helpers.patch --]
[-- Type: text/plain, Size: 1361 bytes --]
From: Lee Schermerhorn <lee.schermerhorn@hp.com>
Define proper false/noop inline functions for noreclaim page
flags when !defined(CONFIG_NORECLAIM_LRU)
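What this buys us, sketched in user-space C: with the option configured
out, the accessors become empty inlines and constant-false tests, so
call sites compile unchanged and the optimizer drops the dead branches.
PageNoreclaim() is hand-written here to stand in for the pre-existing
PAGEFLAG_FALSE() macro; struct page is a stub:

#include <stdio.h>

struct page { unsigned long flags; };

#define SETPAGEFLAG_NOOP(uname) \
static inline void SetPage##uname(struct page *page) { }

#define CLEARPAGEFLAG_NOOP(uname) \
static inline void ClearPage##uname(struct page *page) { }

#define TESTCLEARFLAG_FALSE(uname) \
static inline int TestClearPage##uname(struct page *page) { return 0; }

/* as in the !CONFIG_NORECLAIM_LRU branch of the noreclaim patch */
static inline int PageNoreclaim(struct page *page) { return 0; }
SETPAGEFLAG_NOOP(Noreclaim)
CLEARPAGEFLAG_NOOP(Noreclaim)
TESTCLEARFLAG_FALSE(Noreclaim)

int main(void)
{
	struct page page = { 0 };

	SetPageNoreclaim(&page);	/* no-op */
	if (PageNoreclaim(&page))	/* constant false, branch is dead */
		printf("never printed\n");
	printf("flags unchanged: %lu\n", page.flags);
	return 0;
}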
Signed-off-by: Lee Schermerhorn <lee.schermerhorn@hp.com>
Signed-off-by: Rik van Riel <riel@redhat.com>
---
include/linux/page-flags.h | 12 ++++++++++++
1 file changed, 12 insertions(+)
Index: linux-2.6.26-rc2-mm1/include/linux/page-flags.h
===================================================================
--- linux-2.6.26-rc2-mm1.orig/include/linux/page-flags.h 2008-05-28 09:40:59.000000000 -0400
+++ linux-2.6.26-rc2-mm1/include/linux/page-flags.h 2008-05-28 09:42:35.000000000 -0400
@@ -147,6 +147,18 @@ static inline int Page##uname(struct pag
#define TESTSCFLAG(uname, lname) \
TESTSETFLAG(uname, lname) TESTCLEARFLAG(uname, lname)
+#define SETPAGEFLAG_NOOP(uname) \
+static inline void SetPage##uname(struct page *page) { }
+
+#define CLEARPAGEFLAG_NOOP(uname) \
+static inline void ClearPage##uname(struct page *page) { }
+
+#define __CLEARPAGEFLAG_NOOP(uname) \
+static inline void __ClearPage##uname(struct page *page) { }
+
+#define TESTCLEARFLAG_FALSE(uname) \
+static inline int TestClearPage##uname(struct page *page) { return 0; }
+
struct page; /* forward declaration */
PAGEFLAG(Locked, locked) TESTSCFLAG(Locked, locked)
--
All Rights Reversed
* [PATCH -mm 13/25] Noreclaim LRU Infrastructure
2008-06-06 20:28 [PATCH -mm 00/25] VM pageout scalability improvements (V10) Rik van Riel, Rik van Riel
` (11 preceding siblings ...)
2008-06-06 20:28 ` [PATCH -mm 12/25] pageflag helpers for configed-out flags Rik van Riel, Rik van Riel
@ 2008-06-06 20:28 ` Rik van Riel, Rik van Riel
2008-06-07 1:05 ` Andrew Morton
2008-06-06 20:28 ` [PATCH -mm 14/25] Noreclaim LRU Page Statistics Rik van Riel, Rik van Riel
` (12 subsequent siblings)
25 siblings, 1 reply; 102+ messages in thread
From: Rik van Riel, Rik van Riel @ 2008-06-06 20:28 UTC (permalink / raw)
To: linux-kernel
Cc: Andrew Morton, Lee Schermerhorn, Kosaki Motohiro, linux-mm,
Eric Whitney
[-- Attachment #1: rvr-13-lts-noreclaim-ramfs-pages-are-non-reclaimable.patch --]
[-- Type: text/plain, Size: 33804 bytes --]
From: Lee Schermerhorn <Lee.Schermerhorn@hp.com>
Infrastructure to manage pages excluded from reclaim--i.e., hidden
from vmscan. Based on a patch by Larry Woodman of Red Hat. Reworked
to maintain "nonreclaimable" pages on a separate per-zone LRU list,
to "hide" them from vmscan.
Kosaki Motohiro added the support for the memory controller noreclaim
lru list.
Pages on the noreclaim list have both PG_noreclaim and PG_lru set.
Thus, PG_noreclaim is analogous to and mutually exclusive with
PG_active--it specifies which LRU list the page is on.
The noreclaim infrastructure is enabled by a new mm Kconfig option
[CONFIG_]NORECLAIM_LRU.
A new function 'page_reclaimable(page, vma)' in vmscan.c tests whether
or not a page is reclaimable. Subsequent patches will add the various
!reclaimable tests. We'll want to keep these tests light-weight for
use in shrink_active_list() and, possibly, the fault path.
To avoid races between tasks putting pages [back] onto an LRU list and
tasks that might be moving the page from nonreclaimable to reclaimable
state, one should test reclaimability under page lock and place
nonreclaimable pages directly on the noreclaim list before dropping the
lock. Otherwise, we risk "stranding" reclaimable pages on the noreclaim
list. It's OK to use the pagevec caches for reclaimable pages. The new
function 'putback_lru_page()'--inverse to 'isolate_lru_page()'--handles
this transition, including potential page truncation while the page is
unlocked.
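The decision chain in putback_lru_page(), modeled as a toy user-space
function (a sketch, not the implementation; the real function also
drops the page lock and the isolation reference):

#include <stdio.h>
#include <stdbool.h>

enum where { FREED, LRU_CACHE, NORECLAIM_LIST };

struct toy_page {
	bool has_mapping;	/* false once truncated */
	bool reclaimable;
};

static enum where toy_putback(const struct toy_page *p)
{
	if (!p->has_mapping)
		return FREED;		/* truncated while isolated */
	if (p->reclaimable)
		return LRU_CACHE;	/* pagevec cache is safe here */
	return NORECLAIM_LIST;		/* placed under the page lock */
}

int main(void)
{
	struct toy_page truncated = { false, false };
	struct toy_page normal = { true, true };
	struct toy_page pinned = { true, false };

	printf("truncated=%d normal=%d pinned=%d\n",
	       toy_putback(&truncated), toy_putback(&normal),
	       toy_putback(&pinned));
	return 0;
}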
Signed-off-by: Lee Schermerhorn <lee.schermerhorn@hp.com>
Signed-off-by: Rik van Riel <riel@redhat.com>
Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
---
include/linux/memcontrol.h | 2
include/linux/mm_inline.h | 13 ++-
include/linux/mmzone.h | 24 ++++++
include/linux/page-flags.h | 13 +++
include/linux/pagevec.h | 1
include/linux/swap.h | 12 +++
mm/Kconfig | 10 ++
mm/internal.h | 26 +++++++
mm/memcontrol.c | 73 ++++++++++++--------
mm/mempolicy.c | 2
mm/migrate.c | 68 ++++++++++++------
mm/page_alloc.c | 9 ++
mm/swap.c | 52 +++++++++++---
mm/vmscan.c | 164 +++++++++++++++++++++++++++++++++++++++------
14 files changed, 382 insertions(+), 87 deletions(-)
Index: linux-2.6.26-rc2-mm1/mm/Kconfig
===================================================================
--- linux-2.6.26-rc2-mm1.orig/mm/Kconfig 2008-05-29 16:21:04.000000000 -0400
+++ linux-2.6.26-rc2-mm1/mm/Kconfig 2008-06-06 16:05:15.000000000 -0400
@@ -205,3 +205,13 @@ config NR_QUICK
config VIRT_TO_BUS
def_bool y
depends on !ARCH_NO_VIRT_TO_BUS
+
+config NORECLAIM_LRU
+ bool "Add LRU list to track non-reclaimable pages (EXPERIMENTAL, 64BIT only)"
+ depends on EXPERIMENTAL && 64BIT
+ help
+ Supports tracking of non-reclaimable pages off the [in]active lists
+ to avoid excessive reclaim overhead on large memory systems. Pages
+ may be non-reclaimable because: they are locked into memory, they
+ are anonymous pages for which no swap space exists, or they are anon
+ pages that are expensive to unmap [long anon_vma "related vma" list.]
Index: linux-2.6.26-rc2-mm1/include/linux/page-flags.h
===================================================================
--- linux-2.6.26-rc2-mm1.orig/include/linux/page-flags.h 2008-05-29 16:21:04.000000000 -0400
+++ linux-2.6.26-rc2-mm1/include/linux/page-flags.h 2008-06-06 16:05:15.000000000 -0400
@@ -94,6 +94,9 @@ enum pageflags {
PG_reclaim, /* To be reclaimed asap */
PG_buddy, /* Page is free, on buddy lists */
PG_swapbacked, /* Page is backed by RAM/swap */
+#ifdef CONFIG_NORECLAIM_LRU
+ PG_noreclaim, /* Page is "non-reclaimable" */
+#endif
#ifdef CONFIG_IA64_UNCACHED_ALLOCATOR
PG_uncached, /* Page has been mapped as uncached */
#endif
@@ -167,6 +170,7 @@ PAGEFLAG(Referenced, referenced) TESTCLE
PAGEFLAG(Dirty, dirty) TESTSCFLAG(Dirty, dirty) __CLEARPAGEFLAG(Dirty, dirty)
PAGEFLAG(LRU, lru) __CLEARPAGEFLAG(LRU, lru)
PAGEFLAG(Active, active) __CLEARPAGEFLAG(Active, active)
+ TESTCLEARFLAG(Active, active)
__PAGEFLAG(Slab, slab)
PAGEFLAG(Checked, owner_priv_1) /* Used by some filesystems */
PAGEFLAG(Pinned, owner_priv_1) TESTSCFLAG(Pinned, owner_priv_1) /* Xen */
@@ -203,6 +207,15 @@ PAGEFLAG(SwapCache, swapcache)
PAGEFLAG_FALSE(SwapCache)
#endif
+#ifdef CONFIG_NORECLAIM_LRU
+PAGEFLAG(Noreclaim, noreclaim) __CLEARPAGEFLAG(Noreclaim, noreclaim)
+ TESTCLEARFLAG(Noreclaim, noreclaim)
+#else
+PAGEFLAG_FALSE(Noreclaim) TESTCLEARFLAG_FALSE(Noreclaim)
+ SETPAGEFLAG_NOOP(Noreclaim) CLEARPAGEFLAG_NOOP(Noreclaim)
+ __CLEARPAGEFLAG_NOOP(Noreclaim)
+#endif
+
#ifdef CONFIG_IA64_UNCACHED_ALLOCATOR
PAGEFLAG(Uncached, uncached)
#else
Index: linux-2.6.26-rc2-mm1/include/linux/mmzone.h
===================================================================
--- linux-2.6.26-rc2-mm1.orig/include/linux/mmzone.h 2008-05-29 16:21:04.000000000 -0400
+++ linux-2.6.26-rc2-mm1/include/linux/mmzone.h 2008-06-06 16:05:15.000000000 -0400
@@ -85,6 +85,11 @@ enum zone_stat_item {
NR_ACTIVE_ANON, /* " " " " " */
NR_INACTIVE_FILE, /* " " " " " */
NR_ACTIVE_FILE, /* " " " " " */
+#ifdef CONFIG_NORECLAIM_LRU
+ NR_NORECLAIM, /* " " " " " */
+#else
+ NR_NORECLAIM = NR_ACTIVE_FILE, /* avoid compiler errors in dead code */
+#endif
NR_ANON_PAGES, /* Mapped anonymous pages */
NR_FILE_MAPPED, /* pagecache pages mapped into pagetables.
only modified from process context */
@@ -124,10 +129,18 @@ enum lru_list {
LRU_ACTIVE_ANON = LRU_BASE + LRU_ACTIVE,
LRU_INACTIVE_FILE = LRU_BASE + LRU_FILE,
LRU_ACTIVE_FILE = LRU_BASE + LRU_FILE + LRU_ACTIVE,
- NR_LRU_LISTS };
+#ifdef CONFIG_NORECLAIM_LRU
+ LRU_NORECLAIM,
+#else
+ LRU_NORECLAIM = LRU_ACTIVE_FILE, /* avoid compiler errors in dead code */
+#endif
+ NR_LRU_LISTS
+};
#define for_each_lru(l) for (l = 0; l < NR_LRU_LISTS; l++)
+#define for_each_reclaimable_lru(l) for (l = 0; l <= LRU_ACTIVE_FILE; l++)
+
static inline int is_file_lru(enum lru_list l)
{
return (l == LRU_INACTIVE_FILE || l == LRU_ACTIVE_FILE);
@@ -138,6 +151,15 @@ static inline int is_active_lru(enum lru
return (l == LRU_ACTIVE_ANON || l == LRU_ACTIVE_FILE);
}
+static inline int is_noreclaim_lru(enum lru_list l)
+{
+#ifdef CONFIG_NORECLAIM_LRU
+ return l == LRU_NORECLAIM;
+#else
+ return 0;
+#endif
+}
+
enum lru_list page_lru(struct page *page);
struct per_cpu_pages {
Index: linux-2.6.26-rc2-mm1/mm/page_alloc.c
===================================================================
--- linux-2.6.26-rc2-mm1.orig/mm/page_alloc.c 2008-05-29 16:21:04.000000000 -0400
+++ linux-2.6.26-rc2-mm1/mm/page_alloc.c 2008-06-06 16:05:15.000000000 -0400
@@ -256,6 +256,9 @@ static void bad_page(struct page *page)
1 << PG_private |
1 << PG_locked |
1 << PG_active |
+#ifdef CONFIG_NORECLAIM_LRU
+ 1 << PG_noreclaim |
+#endif
1 << PG_dirty |
1 << PG_reclaim |
1 << PG_slab |
@@ -491,6 +494,9 @@ static inline int free_pages_check(struc
1 << PG_swapcache |
1 << PG_writeback |
1 << PG_reserved |
+#ifdef CONFIG_NORECLAIM_LRU
+ 1 << PG_noreclaim |
+#endif
1 << PG_buddy ))))
bad_page(page);
if (PageDirty(page))
@@ -642,6 +648,9 @@ static int prep_new_page(struct page *pa
1 << PG_private |
1 << PG_locked |
1 << PG_active |
+#ifdef CONFIG_NORECLAIM_LRU
+ 1 << PG_noreclaim |
+#endif
1 << PG_dirty |
1 << PG_slab |
1 << PG_swapcache |
Index: linux-2.6.26-rc2-mm1/include/linux/mm_inline.h
===================================================================
--- linux-2.6.26-rc2-mm1.orig/include/linux/mm_inline.h 2008-05-29 16:21:04.000000000 -0400
+++ linux-2.6.26-rc2-mm1/include/linux/mm_inline.h 2008-06-06 16:05:15.000000000 -0400
@@ -89,11 +89,16 @@ del_page_from_lru(struct zone *zone, str
enum lru_list l = LRU_INACTIVE_ANON;
list_del(&page->lru);
- if (PageActive(page)) {
- __ClearPageActive(page);
- l += LRU_ACTIVE;
+ if (PageNoreclaim(page)) {
+ __ClearPageNoreclaim(page);
+ l = LRU_NORECLAIM;
+ } else {
+ if (PageActive(page)) {
+ __ClearPageActive(page);
+ l += LRU_ACTIVE;
+ }
+ l += page_file_cache(page);
}
- l += page_file_cache(page);
__dec_zone_state(zone, NR_INACTIVE_ANON + l);
}
Index: linux-2.6.26-rc2-mm1/include/linux/swap.h
===================================================================
--- linux-2.6.26-rc2-mm1.orig/include/linux/swap.h 2008-05-29 16:21:04.000000000 -0400
+++ linux-2.6.26-rc2-mm1/include/linux/swap.h 2008-06-06 16:05:15.000000000 -0400
@@ -180,6 +180,8 @@ extern int lru_add_drain_all(void);
extern void rotate_reclaimable_page(struct page *page);
extern void swap_setup(void);
+extern void add_page_to_noreclaim_list(struct page *page);
+
/**
* lru_cache_add: add a page to the page lists
* @page: the page to add
@@ -228,6 +230,16 @@ static inline int zone_reclaim(struct zo
}
#endif
+#ifdef CONFIG_NORECLAIM_LRU
+extern int page_reclaimable(struct page *page, struct vm_area_struct *vma);
+#else
+static inline int page_reclaimable(struct page *page,
+ struct vm_area_struct *vma)
+{
+ return 1;
+}
+#endif
+
extern int kswapd_run(int nid);
#ifdef CONFIG_MMU
Index: linux-2.6.26-rc2-mm1/include/linux/pagevec.h
===================================================================
--- linux-2.6.26-rc2-mm1.orig/include/linux/pagevec.h 2008-05-29 16:21:04.000000000 -0400
+++ linux-2.6.26-rc2-mm1/include/linux/pagevec.h 2008-06-06 16:05:15.000000000 -0400
@@ -101,7 +101,6 @@ static inline void __pagevec_lru_add_act
____pagevec_lru_add(pvec, LRU_ACTIVE_FILE);
}
-
static inline void pagevec_lru_add_file(struct pagevec *pvec)
{
if (pagevec_count(pvec))
Index: linux-2.6.26-rc2-mm1/mm/swap.c
===================================================================
--- linux-2.6.26-rc2-mm1.orig/mm/swap.c 2008-05-29 16:21:04.000000000 -0400
+++ linux-2.6.26-rc2-mm1/mm/swap.c 2008-06-06 16:05:15.000000000 -0400
@@ -106,9 +106,13 @@ enum lru_list page_lru(struct page *page
{
enum lru_list lru = LRU_BASE;
- if (PageActive(page))
- lru += LRU_ACTIVE;
- lru += page_file_cache(page);
+ if (PageNoreclaim(page))
+ lru = LRU_NORECLAIM;
+ else {
+ if (PageActive(page))
+ lru += LRU_ACTIVE;
+ lru += page_file_cache(page);
+ }
return lru;
}
@@ -133,7 +137,8 @@ static void pagevec_move_tail(struct pag
zone = pagezone;
spin_lock(&zone->lru_lock);
}
- if (PageLRU(page) && !PageActive(page)) {
+ if (PageLRU(page) && !PageActive(page) &&
+ !PageNoreclaim(page)) {
int lru = page_file_cache(page);
list_move_tail(&page->lru, &zone->list[lru]);
pgmoved++;
@@ -154,7 +159,7 @@ static void pagevec_move_tail(struct pag
void rotate_reclaimable_page(struct page *page)
{
if (!PageLocked(page) && !PageDirty(page) && !PageActive(page) &&
- PageLRU(page)) {
+ !PageNoreclaim(page) && PageLRU(page)) {
struct pagevec *pvec;
unsigned long flags;
@@ -175,7 +180,7 @@ void activate_page(struct page *page)
struct zone *zone = page_zone(page);
spin_lock_irq(&zone->lru_lock);
- if (PageLRU(page) && !PageActive(page)) {
+ if (PageLRU(page) && !PageActive(page) && !PageNoreclaim(page)) {
int file = page_file_cache(page);
int lru = LRU_BASE + file;
del_page_from_lru_list(zone, page, lru);
@@ -184,7 +189,7 @@ void activate_page(struct page *page)
lru += LRU_ACTIVE;
add_page_to_lru_list(zone, page, lru);
__count_vm_event(PGACTIVATE);
- mem_cgroup_move_lists(page, true);
+ mem_cgroup_move_lists(page, lru);
if (file) {
zone->recent_scanned_file++;
@@ -207,7 +212,8 @@ void activate_page(struct page *page)
*/
void mark_page_accessed(struct page *page)
{
- if (!PageActive(page) && PageReferenced(page) && PageLRU(page)) {
+ if (!PageActive(page) && !PageNoreclaim(page) &&
+ PageReferenced(page) && PageLRU(page)) {
activate_page(page);
ClearPageReferenced(page);
} else if (!PageReferenced(page)) {
@@ -235,13 +241,38 @@ void __lru_cache_add(struct page *page,
void lru_cache_add_lru(struct page *page, enum lru_list lru)
{
if (PageActive(page)) {
+ VM_BUG_ON(PageNoreclaim(page));
ClearPageActive(page);
+ } else if (PageNoreclaim(page)) {
+ VM_BUG_ON(PageActive(page));
+ ClearPageNoreclaim(page);
}
- VM_BUG_ON(PageLRU(page) || PageActive(page));
+ VM_BUG_ON(PageLRU(page) || PageActive(page) || PageNoreclaim(page));
__lru_cache_add(page, lru);
}
+/**
+ * add_page_to_noreclaim_list
+ * @page: the page to be added to the noreclaim list
+ *
+ * Add page directly to its zone's noreclaim list. To avoid races with
+ * tasks that might be making the page reclaimable while it's not on the
+ * lru, we want to add the page while it's locked or otherwise "invisible"
+ * to other tasks. This is difficult to do when using the pagevec cache,
+ * so bypass that.
+ */
+void add_page_to_noreclaim_list(struct page *page)
+{
+ struct zone *zone = page_zone(page);
+
+ spin_lock_irq(&zone->lru_lock);
+ SetPageNoreclaim(page);
+ SetPageLRU(page);
+ add_page_to_lru_list(zone, page, LRU_NORECLAIM);
+ spin_unlock_irq(&zone->lru_lock);
+}
+
/*
* Drain pages out of the cpu's pagevecs.
* Either "cpu" is the current CPU, and preemption has already been
@@ -339,6 +370,7 @@ void release_pages(struct page **pages,
if (PageLRU(page)) {
struct zone *pagezone = page_zone(page);
+
if (pagezone != zone) {
if (zone)
spin_unlock_irqrestore(&zone->lru_lock,
@@ -415,6 +447,7 @@ void ____pagevec_lru_add(struct pagevec
{
int i;
struct zone *zone = NULL;
+ VM_BUG_ON(is_noreclaim_lru(lru));
for (i = 0; i < pagevec_count(pvec); i++) {
struct page *page = pvec->pages[i];
@@ -426,6 +459,7 @@ void ____pagevec_lru_add(struct pagevec
zone = pagezone;
spin_lock_irq(&zone->lru_lock);
}
+ VM_BUG_ON(PageActive(page) || PageNoreclaim(page));
VM_BUG_ON(PageLRU(page));
SetPageLRU(page);
if (is_active_lru(lru))
Index: linux-2.6.26-rc2-mm1/mm/migrate.c
===================================================================
--- linux-2.6.26-rc2-mm1.orig/mm/migrate.c 2008-05-29 16:21:04.000000000 -0400
+++ linux-2.6.26-rc2-mm1/mm/migrate.c 2008-06-06 16:05:15.000000000 -0400
@@ -53,14 +53,9 @@ int migrate_prep(void)
return 0;
}
-static inline void move_to_lru(struct page *page)
-{
- lru_cache_add_lru(page, page_lru(page));
- put_page(page);
-}
-
/*
- * Add isolated pages on the list back to the LRU.
+ * Add isolated pages on the list back to the LRU under page lock
+ * to avoid leaking reclaimable pages back onto noreclaim list.
*
* returns the number of pages put back.
*/
@@ -72,7 +67,9 @@ int putback_lru_pages(struct list_head *
list_for_each_entry_safe(page, page2, l, lru) {
list_del(&page->lru);
- move_to_lru(page);
+ lock_page(page);
+ if (putback_lru_page(page))
+ unlock_page(page);
count++;
}
return count;
@@ -340,8 +337,11 @@ static void migrate_page_copy(struct pag
SetPageReferenced(newpage);
if (PageUptodate(page))
SetPageUptodate(newpage);
- if (PageActive(page))
+ if (TestClearPageActive(page)) {
+ VM_BUG_ON(PageNoreclaim(page));
SetPageActive(newpage);
+ } else
+ noreclaim_migrate_page(newpage, page);
if (PageChecked(page))
SetPageChecked(newpage);
if (PageMappedToDisk(page))
@@ -362,7 +362,6 @@ static void migrate_page_copy(struct pag
#ifdef CONFIG_SWAP
ClearPageSwapCache(page);
#endif
- ClearPageActive(page);
ClearPagePrivate(page);
set_page_private(page, 0);
page->mapping = NULL;
@@ -541,10 +540,15 @@ static int fallback_migrate_page(struct
*
* The new page will have replaced the old page if this function
* is successful.
+ *
+ * Return value:
+ * < 0 - error code
+ * == 0 - success
*/
static int move_to_new_page(struct page *newpage, struct page *page)
{
struct address_space *mapping;
+ int unlock = 1;
int rc;
/*
@@ -579,10 +583,16 @@ static int move_to_new_page(struct page
if (!rc) {
remove_migration_ptes(page, newpage);
+ /*
+ * Put back on LRU while holding page locked to
+ * handle potential race with, e.g., munlock()
+ */
+ unlock = putback_lru_page(newpage);
} else
newpage->mapping = NULL;
- unlock_page(newpage);
+ if (unlock)
+ unlock_page(newpage);
return rc;
}
@@ -599,18 +609,19 @@ static int unmap_and_move(new_page_t get
struct page *newpage = get_new_page(page, private, &result);
int rcu_locked = 0;
int charge = 0;
+ int unlock = 1;
if (!newpage)
return -ENOMEM;
if (page_count(page) == 1)
/* page was freed from under us. So we are done. */
- goto move_newpage;
+ goto end_migration;
charge = mem_cgroup_prepare_migration(page, newpage);
if (charge == -ENOMEM) {
rc = -ENOMEM;
- goto move_newpage;
+ goto end_migration;
}
/* prepare cgroup just returns 0 or -ENOMEM */
BUG_ON(charge);
@@ -618,7 +629,7 @@ static int unmap_and_move(new_page_t get
rc = -EAGAIN;
if (TestSetPageLocked(page)) {
if (!force)
- goto move_newpage;
+ goto end_migration;
lock_page(page);
}
@@ -680,8 +691,6 @@ rcu_unlock:
unlock:
- unlock_page(page);
-
if (rc != -EAGAIN) {
/*
* A page that has been migrated has all references
@@ -690,17 +699,30 @@ unlock:
* restored.
*/
list_del(&page->lru);
- move_to_lru(page);
+ if (!page->mapping) {
+ VM_BUG_ON(page_count(page) != 1);
+ unlock_page(page);
+ put_page(page); /* just free the old page */
+ goto end_migration;
+ } else
+ unlock = putback_lru_page(page);
}
-move_newpage:
+ if (unlock)
+ unlock_page(page);
+
+end_migration:
if (!charge)
mem_cgroup_end_migration(newpage);
- /*
- * Move the new page to the LRU. If migration was not successful
- * then this will free the page.
- */
- move_to_lru(newpage);
+
+ if (!newpage->mapping) {
+ /*
+ * Migration failed or was never attempted.
+ * Free the newpage.
+ */
+ VM_BUG_ON(page_count(newpage) != 1);
+ put_page(newpage);
+ }
if (result) {
if (rc)
*result = rc;
Index: linux-2.6.26-rc2-mm1/mm/vmscan.c
===================================================================
--- linux-2.6.26-rc2-mm1.orig/mm/vmscan.c 2008-05-29 16:21:04.000000000 -0400
+++ linux-2.6.26-rc2-mm1/mm/vmscan.c 2008-06-06 16:05:50.000000000 -0400
@@ -437,6 +437,73 @@ cannot_free:
return 0;
}
+/**
+ * putback_lru_page
+ * @page to be put back to appropriate lru list
+ *
+ * Add previously isolated @page to appropriate LRU list.
+ * Page may still be non-reclaimable for other reasons.
+ *
+ * lru_lock must not be held, interrupts must be enabled.
+ * Must be called with page locked.
+ *
+ * return 1 if page still locked [not truncated], else 0
+ */
+int putback_lru_page(struct page *page)
+{
+ int lru;
+ int ret = 1;
+
+ VM_BUG_ON(!PageLocked(page));
+ VM_BUG_ON(PageLRU(page));
+
+ lru = !!TestClearPageActive(page);
+ ClearPageNoreclaim(page); /* for page_reclaimable() */
+
+ if (unlikely(!page->mapping)) {
+ /*
+ * page truncated. drop lock as put_page() will
+ * free the page.
+ */
+ VM_BUG_ON(page_count(page) != 1);
+ unlock_page(page);
+ ret = 0;
+ } else if (page_reclaimable(page, NULL)) {
+ /*
+ * For reclaimable pages, we can use the cache.
+ * In event of a race, worst case is we end up with a
+ * non-reclaimable page on [in]active list.
+ * We know how to handle that.
+ */
+ lru += page_file_cache(page);
+ lru_cache_add_lru(page, lru);
+ mem_cgroup_move_lists(page, lru);
+ } else {
+ /*
+ * Put non-reclaimable pages directly on zone's noreclaim
+ * list.
+ */
+ add_page_to_noreclaim_list(page);
+ mem_cgroup_move_lists(page, LRU_NORECLAIM);
+ }
+
+ put_page(page); /* drop ref from isolate */
+ return ret; /* ret => "page still locked" */
+}
+
+/*
+ * Cull page that shrink_*_list() has detected to be non-reclaimable
+ * under page lock to close races with other tasks that might be making
+ * the page reclaimable. Avoid stranding a reclaimable page on the
+ * noreclaim list.
+ */
+static inline void cull_nonreclaimable_page(struct page *page)
+{
+ lock_page(page);
+ if (putback_lru_page(page))
+ unlock_page(page);
+}
+
/*
* shrink_page_list() returns the number of reclaimed pages
*/
@@ -470,6 +537,12 @@ static unsigned long shrink_page_list(st
sc->nr_scanned++;
+ if (unlikely(!page_reclaimable(page, NULL))) {
+ if (putback_lru_page(page))
+ unlock_page(page);
+ continue;
+ }
+
if (!sc->may_swap && page_mapped(page))
goto keep_locked;
@@ -566,7 +639,7 @@ static unsigned long shrink_page_list(st
* possible for a page to have PageDirty set, but it is actually
* clean (all its buffers are clean). This happens if the
* buffers were written out directly, with submit_bh(). ext3
- * will do this, as well as the blockdev mapping.
+ * will do this, as well as the blockdev mapping.
* try_to_release_page() will discover that cleanness and will
* drop the buffers and mark the page clean - it can be freed.
*
@@ -598,6 +671,7 @@ activate_locked:
/* Not a candidate for swapping, so reclaim swap space. */
if (PageSwapCache(page) && vm_swap_full())
remove_exclusive_swap_page_ref(page);
+ VM_BUG_ON(PageActive(page));
SetPageActive(page);
pgactivate++;
keep_locked:
@@ -647,6 +721,14 @@ int __isolate_lru_page(struct page *page
if (mode != ISOLATE_BOTH && (!page_file_cache(page) != !file))
return ret;
+ /*
+ * Non-reclaimable pages shouldn't make it onto either the active
+ * nor the inactive list. However, when doing lumpy reclaim of
+ * higher order pages we can still run into them.
+ */
+ if (PageNoreclaim(page))
+ return ret;
+
ret = -EBUSY;
if (likely(get_page_unless_zero(page))) {
/*
@@ -758,7 +840,7 @@ static unsigned long isolate_lru_pages(u
/* else it is being freed elsewhere */
list_move(&cursor_page->lru, src);
default:
- break;
+ break; /* ! on LRU or wrong list */
}
}
}
@@ -818,8 +900,9 @@ static unsigned long clear_active_flags(
* Returns -EBUSY if the page was not on an LRU list.
*
* The returned page will have PageLRU() cleared. If it was found on
- * the active list, it will have PageActive set. That flag may need
- * to be cleared by the caller before letting the page go.
+ * the active list, it will have PageActive set. If it was found on
+ * the noreclaim list, it will have the PageNoreclaim bit set. That flag
+ * may need to be cleared by the caller before letting the page go.
*
* The vmstat statistic corresponding to the list on which the page was
* found will be decremented.
@@ -844,7 +927,13 @@ int isolate_lru_page(struct page *page)
ret = 0;
ClearPageLRU(page);
+ /* Calculate the LRU list for normal pages ... */
lru += page_file_cache(page) + !!PageActive(page);
+
+ /* ... except NoReclaim, which has its own list. */
+ if (PageNoreclaim(page))
+ lru = LRU_NORECLAIM;
+
del_page_from_lru_list(zone, page, lru);
}
spin_unlock_irq(&zone->lru_lock);
@@ -959,19 +1048,27 @@ static unsigned long shrink_inactive_lis
int lru = LRU_BASE;
page = lru_to_page(&page_list);
VM_BUG_ON(PageLRU(page));
- SetPageLRU(page);
list_del(&page->lru);
- if (page_file_cache(page))
- lru += LRU_FILE;
- if (scan_global_lru(sc)) {
+ if (unlikely(!page_reclaimable(page, NULL))) {
+ spin_unlock_irq(&zone->lru_lock);
+ cull_nonreclaimable_page(page);
+ spin_lock_irq(&zone->lru_lock);
+ continue;
+ } else {
if (page_file_cache(page))
- zone->recent_rotated_file++;
- else
- zone->recent_rotated_anon++;
+ lru += LRU_FILE;
+ if (scan_global_lru(sc)) {
+ if (page_file_cache(page))
+ zone->recent_rotated_file++;
+ else
+ zone->recent_rotated_anon++;
+ }
+ if (PageActive(page))
+ lru += LRU_ACTIVE;
}
- if (PageActive(page))
- lru += LRU_ACTIVE;
+ SetPageLRU(page);
add_page_to_lru_list(zone, page, lru);
+ mem_cgroup_move_lists(page, lru);
if (!pagevec_add(&pvec, page)) {
spin_unlock_irq(&zone->lru_lock);
__pagevec_release(&pvec);
@@ -1065,6 +1162,12 @@ static void shrink_active_list(unsigned
cond_resched();
page = lru_to_page(&l_hold);
list_del(&page->lru);
+
+ if (unlikely(!page_reclaimable(page, NULL))) {
+ cull_nonreclaimable_page(page);
+ continue;
+ }
+
if (page_referenced(page, 0, sc->mem_cgroup)) {
if (file) {
/* Referenced file pages stay active. */
@@ -1107,7 +1210,7 @@ static void shrink_active_list(unsigned
ClearPageActive(page);
list_move(&page->lru, &zone->list[lru]);
- mem_cgroup_move_lists(page, false);
+ mem_cgroup_move_lists(page, lru);
pgmoved++;
if (!pagevec_add(&pvec, page)) {
__mod_zone_page_state(zone, NR_INACTIVE_ANON + lru,
@@ -1139,7 +1242,7 @@ static void shrink_active_list(unsigned
VM_BUG_ON(!PageActive(page));
list_move(&page->lru, &zone->list[lru]);
- mem_cgroup_move_lists(page, true);
+ mem_cgroup_move_lists(page, lru);
pgmoved++;
if (!pagevec_add(&pvec, page)) {
__mod_zone_page_state(zone, NR_INACTIVE_ANON + lru,
@@ -1277,7 +1380,7 @@ static unsigned long shrink_zone(int pri
get_scan_ratio(zone, sc, percent);
- for_each_lru(l) {
+ for_each_reclaimable_lru(l) {
if (scan_global_lru(sc)) {
int file = is_file_lru(l);
int scan;
@@ -1308,7 +1411,7 @@ static unsigned long shrink_zone(int pri
while (nr[LRU_INACTIVE_ANON] || nr[LRU_ACTIVE_FILE] ||
nr[LRU_INACTIVE_FILE]) {
- for_each_lru(l) {
+ for_each_reclaimable_lru(l) {
if (nr[l]) {
nr_to_scan = min(nr[l],
(unsigned long)sc->swap_cluster_max);
@@ -1859,8 +1962,8 @@ static unsigned long shrink_all_zones(un
if (zone_is_all_unreclaimable(zone) && prio != DEF_PRIORITY)
continue;
- for_each_lru(l) {
- /* For pass = 0 we don't shrink the active list */
+ for_each_reclaimable_lru(l) {
+ /* For pass = 0, we don't shrink the active list */
if (pass == 0 &&
(l == LRU_ACTIVE_ANON || l == LRU_ACTIVE_FILE))
continue;
@@ -2197,3 +2300,26 @@ int zone_reclaim(struct zone *zone, gfp_
return ret;
}
#endif
+
+#ifdef CONFIG_NORECLAIM_LRU
+/*
+ * page_reclaimable - test whether a page is reclaimable
+ * @page: the page to test
+ * @vma: the VMA in which the page is or will be mapped, may be NULL
+ *
+ * Test whether page is reclaimable--i.e., should be placed on active/inactive
+ * lists vs noreclaim list.
+ *
+ * Reasons page might not be reclaimable:
+ * TODO - later patches
+ */
+int page_reclaimable(struct page *page, struct vm_area_struct *vma)
+{
+
+ VM_BUG_ON(PageNoreclaim(page));
+
+ /* TODO: test page [!]reclaimable conditions */
+
+ return 1;
+}
+#endif
Index: linux-2.6.26-rc2-mm1/mm/mempolicy.c
===================================================================
--- linux-2.6.26-rc2-mm1.orig/mm/mempolicy.c 2008-05-29 16:21:04.000000000 -0400
+++ linux-2.6.26-rc2-mm1/mm/mempolicy.c 2008-06-06 16:05:15.000000000 -0400
@@ -2199,7 +2199,7 @@ static void gather_stats(struct page *pa
if (PageSwapCache(page))
md->swapcache++;
- if (PageActive(page))
+ if (PageActive(page) || PageNoreclaim(page))
md->active++;
if (PageWriteback(page))
Index: linux-2.6.26-rc2-mm1/mm/internal.h
===================================================================
--- linux-2.6.26-rc2-mm1.orig/mm/internal.h 2008-05-29 16:21:04.000000000 -0400
+++ linux-2.6.26-rc2-mm1/mm/internal.h 2008-06-06 16:05:15.000000000 -0400
@@ -34,8 +34,15 @@ static inline void __put_page(struct pag
atomic_dec(&page->_count);
}
+/*
+ * in mm/vmscan.c:
+ */
extern int isolate_lru_page(struct page *page);
+extern int putback_lru_page(struct page *page);
+/*
+ * in mm/page_alloc.c
+ */
extern void __free_pages_bootmem(struct page *page, unsigned int order);
/*
@@ -49,6 +56,25 @@ static inline unsigned long page_order(s
return page_private(page);
}
+#ifdef CONFIG_NORECLAIM_LRU
+/*
+ * noreclaim_migrate_page() called only from migrate_page_copy() to
+ * migrate noreclaim flag to new page.
+ * Note that the old page has been isolated from the LRU lists at this
+ * point so we don't need to worry about LRU statistics.
+ */
+static inline void noreclaim_migrate_page(struct page *new, struct page *old)
+{
+ if (TestClearPageNoreclaim(old))
+ SetPageNoreclaim(new);
+}
+#else
+static inline void noreclaim_migrate_page(struct page *new, struct page *old)
+{
+}
+#endif
+
+
/*
* FLATMEM and DISCONTIGMEM configurations use alloc_bootmem_node,
* so all functions starting at paging_init should be marked __init
Index: linux-2.6.26-rc2-mm1/mm/memcontrol.c
===================================================================
--- linux-2.6.26-rc2-mm1.orig/mm/memcontrol.c 2008-05-23 14:21:34.000000000 -0400
+++ linux-2.6.26-rc2-mm1/mm/memcontrol.c 2008-06-06 16:05:15.000000000 -0400
@@ -161,9 +161,10 @@ struct page_cgroup {
int ref_cnt; /* cached, mapped, migrating */
int flags;
};
-#define PAGE_CGROUP_FLAG_CACHE (0x1) /* charged as cache */
-#define PAGE_CGROUP_FLAG_ACTIVE (0x2) /* page is active in this cgroup */
-#define PAGE_CGROUP_FLAG_FILE (0x4) /* page is file system backed */
+#define PAGE_CGROUP_FLAG_CACHE (0x1) /* charged as cache */
+#define PAGE_CGROUP_FLAG_ACTIVE (0x2) /* page is active in this cgroup */
+#define PAGE_CGROUP_FLAG_FILE (0x4) /* page is file system backed */
+#define PAGE_CGROUP_FLAG_NORECLAIM (0x8) /* page is non-reclaimable */
static int page_cgroup_nid(struct page_cgroup *pc)
{
@@ -283,10 +284,14 @@ static void __mem_cgroup_remove_list(str
{
int lru = LRU_BASE;
- if (pc->flags & PAGE_CGROUP_FLAG_ACTIVE)
- lru += LRU_ACTIVE;
- if (pc->flags & PAGE_CGROUP_FLAG_FILE)
- lru += LRU_FILE;
+ if (pc->flags & PAGE_CGROUP_FLAG_NORECLAIM)
+ lru = LRU_NORECLAIM;
+ else {
+ if (pc->flags & PAGE_CGROUP_FLAG_ACTIVE)
+ lru += LRU_ACTIVE;
+ if (pc->flags & PAGE_CGROUP_FLAG_FILE)
+ lru += LRU_FILE;
+ }
MEM_CGROUP_ZSTAT(mz, lru) -= 1;
@@ -299,10 +304,14 @@ static void __mem_cgroup_add_list(struct
{
int lru = LRU_BASE;
- if (pc->flags & PAGE_CGROUP_FLAG_ACTIVE)
- lru += LRU_ACTIVE;
- if (pc->flags & PAGE_CGROUP_FLAG_FILE)
- lru += LRU_FILE;
+ if (pc->flags & PAGE_CGROUP_FLAG_NORECLAIM)
+ lru = LRU_NORECLAIM;
+ else {
+ if (pc->flags & PAGE_CGROUP_FLAG_ACTIVE)
+ lru += LRU_ACTIVE;
+ if (pc->flags & PAGE_CGROUP_FLAG_FILE)
+ lru += LRU_FILE;
+ }
MEM_CGROUP_ZSTAT(mz, lru) += 1;
list_add(&pc->lru, &mz->lists[lru]);
@@ -310,21 +319,31 @@ static void __mem_cgroup_add_list(struct
mem_cgroup_charge_statistics(pc->mem_cgroup, pc->flags, true);
}
-static void __mem_cgroup_move_lists(struct page_cgroup *pc, bool active)
+static void __mem_cgroup_move_lists(struct page_cgroup *pc, enum lru_list lru)
{
struct mem_cgroup_per_zone *mz = page_cgroup_zoneinfo(pc);
- int from = pc->flags & PAGE_CGROUP_FLAG_ACTIVE;
- int file = pc->flags & PAGE_CGROUP_FLAG_FILE;
- int lru = LRU_FILE * !!file + !!from;
+ int active = pc->flags & PAGE_CGROUP_FLAG_ACTIVE;
+ int file = pc->flags & PAGE_CGROUP_FLAG_FILE;
+ int noreclaim = pc->flags & PAGE_CGROUP_FLAG_NORECLAIM;
+ enum lru_list from = noreclaim ? LRU_NORECLAIM :
+ (LRU_FILE * !!file + !!active);
- MEM_CGROUP_ZSTAT(mz, lru) -= 1;
+ if (lru == from)
+ return;
- if (active)
- pc->flags |= PAGE_CGROUP_FLAG_ACTIVE;
- else
+ MEM_CGROUP_ZSTAT(mz, from) -= 1;
+
+ if (is_noreclaim_lru(lru)) {
pc->flags &= ~PAGE_CGROUP_FLAG_ACTIVE;
+ pc->flags |= PAGE_CGROUP_FLAG_NORECLAIM;
+ } else {
+ if (is_active_lru(lru))
+ pc->flags |= PAGE_CGROUP_FLAG_ACTIVE;
+ else
+ pc->flags &= ~PAGE_CGROUP_FLAG_ACTIVE;
+ pc->flags &= ~PAGE_CGROUP_FLAG_NORECLAIM;
+ }
- lru = LRU_FILE * !!file + !!active;
MEM_CGROUP_ZSTAT(mz, lru) += 1;
list_move(&pc->lru, &mz->lists[lru]);
}
@@ -342,7 +361,7 @@ int task_in_mem_cgroup(struct task_struc
/*
* This routine assumes that the appropriate zone's lru lock is already held
*/
-void mem_cgroup_move_lists(struct page *page, bool active)
+void mem_cgroup_move_lists(struct page *page, enum lru_list lru)
{
struct page_cgroup *pc;
struct mem_cgroup_per_zone *mz;
@@ -362,7 +381,7 @@ void mem_cgroup_move_lists(struct page *
if (pc) {
mz = page_cgroup_zoneinfo(pc);
spin_lock_irqsave(&mz->lru_lock, flags);
- __mem_cgroup_move_lists(pc, active);
+ __mem_cgroup_move_lists(pc, lru);
spin_unlock_irqrestore(&mz->lru_lock, flags);
}
unlock_page_cgroup(page);
@@ -460,12 +479,10 @@ unsigned long mem_cgroup_isolate_pages(u
/*
* TODO: play better with lumpy reclaim, grabbing anything.
*/
- if (PageActive(page) && !active) {
- __mem_cgroup_move_lists(pc, true);
- continue;
- }
- if (!PageActive(page) && active) {
- __mem_cgroup_move_lists(pc, false);
+ if (PageNoreclaim(page) ||
+ (PageActive(page) && !active) ||
+ (!PageActive(page) && active)) {
+ __mem_cgroup_move_lists(pc, page_lru(page));
continue;
}
Index: linux-2.6.26-rc2-mm1/include/linux/memcontrol.h
===================================================================
--- linux-2.6.26-rc2-mm1.orig/include/linux/memcontrol.h 2008-05-23 14:21:34.000000000 -0400
+++ linux-2.6.26-rc2-mm1/include/linux/memcontrol.h 2008-06-06 16:05:15.000000000 -0400
@@ -35,7 +35,7 @@ extern int mem_cgroup_charge(struct page
extern int mem_cgroup_cache_charge(struct page *page, struct mm_struct *mm,
gfp_t gfp_mask);
extern void mem_cgroup_uncharge_page(struct page *page);
-extern void mem_cgroup_move_lists(struct page *page, bool active);
+extern void mem_cgroup_move_lists(struct page *page, enum lru_list lru);
extern unsigned long mem_cgroup_isolate_pages(unsigned long nr_to_scan,
struct list_head *dst,
unsigned long *scanned, int order,
--
All Rights Reversed
* [PATCH -mm 14/25] Noreclaim LRU Page Statistics
2008-06-06 20:28 [PATCH -mm 00/25] VM pageout scalability improvements (V10) Rik van Riel
` (12 preceding siblings ...)
2008-06-06 20:28 ` [PATCH -mm 13/25] Noreclaim LRU Infrastructure Rik van Riel
@ 2008-06-06 20:28 ` Rik van Riel
2008-06-06 20:28 ` [PATCH -mm 15/25] Ramfs and Ram Disk pages are non-reclaimable Rik van Riel
` (11 subsequent siblings)
25 siblings, 0 replies; 102+ messages in thread
From: Rik van Riel @ 2008-06-06 20:28 UTC (permalink / raw)
To: linux-kernel; +Cc: Andrew Morton, Lee Schermerhorn, Kosaki Motohiro
[-- Attachment #1: rvr-14-lts-noreclaim-SHM_LOCKED-pages-are-nonreclaimable.patch --]
[-- Type: text/plain, Size: 5793 bytes --]
From: Lee Schermerhorn <Lee.Schermerhorn@hp.com>
Report the number of non-reclaimable pages per zone and system-wide.
Kosaki Motohiro added support for memory controller noreclaim
statistics.
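For reference, the accounting pattern these statistics depend on can
be sketched as follows [an illustration assuming the LRU_NORECLAIM
list and NR_NORECLAIM item from the infrastructure patch; not part
of the diff below]. Every move onto or off the noreclaim list pairs
the list operation with a zone counter update under zone->lru_lock:

	/* move an already-isolated page onto the noreclaim list */
	spin_lock_irq(&zone->lru_lock);
	SetPageNoreclaim(page);
	list_add(&page->lru, &zone->list[LRU_NORECLAIM]);
	__inc_zone_state(zone, NR_NORECLAIM);
	spin_unlock_irq(&zone->lru_lock);

	/* ... and back off it, once the page is reclaimable again */
	spin_lock_irq(&zone->lru_lock);
	ClearPageNoreclaim(page);
	list_del(&page->lru);
	__dec_zone_state(zone, NR_NORECLAIM);
	spin_unlock_irq(&zone->lru_lock);

The hunks below simply report that NR_NORECLAIM state through
/proc/vmstat, /proc/meminfo, the per-node sysfs meminfo files and
the memory controller statistics.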
Signed-off-by: Lee Schermerhorn <lee.schermerhorn@hp.com>
Signed-off-by: Rik van Riel <riel@redhat.com>
Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
---
drivers/base/node.c | 6 ++++++
fs/proc/proc_misc.c | 6 ++++++
mm/memcontrol.c | 6 ++++++
mm/page_alloc.c | 16 +++++++++++++++-
mm/vmstat.c | 3 +++
5 files changed, 36 insertions(+), 1 deletion(-)
Index: linux-2.6.26-rc2-mm1/mm/page_alloc.c
===================================================================
--- linux-2.6.26-rc2-mm1.orig/mm/page_alloc.c 2008-06-06 16:05:15.000000000 -0400
+++ linux-2.6.26-rc2-mm1/mm/page_alloc.c 2008-06-06 16:05:57.000000000 -0400
@@ -1918,12 +1918,20 @@ void show_free_areas(void)
}
printk("Active_anon:%lu active_file:%lu inactive_anon%lu\n"
- " inactive_file:%lu dirty:%lu writeback:%lu unstable:%lu\n"
+ " inactive_file:%lu"
+//TODO: check/adjust line lengths
+#ifdef CONFIG_NORECLAIM_LRU
+ " noreclaim:%lu"
+#endif
+ " dirty:%lu writeback:%lu unstable:%lu\n"
" free:%lu slab:%lu mapped:%lu pagetables:%lu bounce:%lu\n",
global_page_state(NR_ACTIVE_ANON),
global_page_state(NR_ACTIVE_FILE),
global_page_state(NR_INACTIVE_ANON),
global_page_state(NR_INACTIVE_FILE),
+#ifdef CONFIG_NORECLAIM_LRU
+ global_page_state(NR_NORECLAIM),
+#endif
global_page_state(NR_FILE_DIRTY),
global_page_state(NR_WRITEBACK),
global_page_state(NR_UNSTABLE_NFS),
@@ -1950,6 +1958,9 @@ void show_free_areas(void)
" inactive_anon:%lukB"
" active_file:%lukB"
" inactive_file:%lukB"
+#ifdef CONFIG_NORECLAIM_LRU
+ " noreclaim:%lukB"
+#endif
" present:%lukB"
" pages_scanned:%lu"
" all_unreclaimable? %s"
@@ -1963,6 +1974,9 @@ void show_free_areas(void)
K(zone_page_state(zone, NR_INACTIVE_ANON)),
K(zone_page_state(zone, NR_ACTIVE_FILE)),
K(zone_page_state(zone, NR_INACTIVE_FILE)),
+#ifdef CONFIG_NORECLAIM_LRU
+ K(zone_page_state(zone, NR_NORECLAIM)),
+#endif
K(zone->present_pages),
zone->pages_scanned,
(zone_is_all_unreclaimable(zone) ? "yes" : "no")
Index: linux-2.6.26-rc2-mm1/mm/vmstat.c
===================================================================
--- linux-2.6.26-rc2-mm1.orig/mm/vmstat.c 2008-05-29 16:21:04.000000000 -0400
+++ linux-2.6.26-rc2-mm1/mm/vmstat.c 2008-06-06 16:05:58.000000000 -0400
@@ -699,6 +699,9 @@ static const char * const vmstat_text[]
"nr_active_anon",
"nr_inactive_file",
"nr_active_file",
+#ifdef CONFIG_NORECLAIM_LRU
+ "nr_noreclaim",
+#endif
"nr_anon_pages",
"nr_mapped",
"nr_file_pages",
Index: linux-2.6.26-rc2-mm1/drivers/base/node.c
===================================================================
--- linux-2.6.26-rc2-mm1.orig/drivers/base/node.c 2008-05-29 16:21:04.000000000 -0400
+++ linux-2.6.26-rc2-mm1/drivers/base/node.c 2008-06-06 16:05:58.000000000 -0400
@@ -67,6 +67,9 @@ static ssize_t node_read_meminfo(struct
"Node %d Inactive(anon): %8lu kB\n"
"Node %d Active(file): %8lu kB\n"
"Node %d Inactive(file): %8lu kB\n"
+#ifdef CONFIG_NORECLAIM_LRU
+ "Node %d Noreclaim: %8lu kB\n"
+#endif
#ifdef CONFIG_HIGHMEM
"Node %d HighTotal: %8lu kB\n"
"Node %d HighFree: %8lu kB\n"
@@ -96,6 +99,9 @@ static ssize_t node_read_meminfo(struct
nid, node_page_state(nid, NR_INACTIVE_ANON),
nid, node_page_state(nid, NR_ACTIVE_FILE),
nid, node_page_state(nid, NR_INACTIVE_FILE),
+#ifdef CONFIG_NORECLAIM_LRU
+ nid, node_page_state(nid, NR_NORECLAIM),
+#endif
#ifdef CONFIG_HIGHMEM
nid, K(i.totalhigh),
nid, K(i.freehigh),
Index: linux-2.6.26-rc2-mm1/fs/proc/proc_misc.c
===================================================================
--- linux-2.6.26-rc2-mm1.orig/fs/proc/proc_misc.c 2008-05-29 16:21:04.000000000 -0400
+++ linux-2.6.26-rc2-mm1/fs/proc/proc_misc.c 2008-06-06 16:05:58.000000000 -0400
@@ -174,6 +174,9 @@ static int meminfo_read_proc(char *page,
"Inactive(anon): %8lu kB\n"
"Active(file): %8lu kB\n"
"Inactive(file): %8lu kB\n"
+#ifdef CONFIG_NORECLAIM_LRU
+ "Noreclaim: %8lu kB\n"
+#endif
#ifdef CONFIG_HIGHMEM
"HighTotal: %8lu kB\n"
"HighFree: %8lu kB\n"
@@ -209,6 +212,9 @@ static int meminfo_read_proc(char *page,
K(inactive_anon),
K(active_file),
K(inactive_file),
+#ifdef CONFIG_NORECLAIM_LRU
+ K(global_page_state(NR_NORECLAIM)),
+#endif
#ifdef CONFIG_HIGHMEM
K(i.totalhigh),
K(i.freehigh),
Index: linux-2.6.26-rc2-mm1/mm/memcontrol.c
===================================================================
--- linux-2.6.26-rc2-mm1.orig/mm/memcontrol.c 2008-06-06 16:05:15.000000000 -0400
+++ linux-2.6.26-rc2-mm1/mm/memcontrol.c 2008-06-06 16:05:58.000000000 -0400
@@ -905,6 +905,7 @@ static int mem_control_stat_show(struct
{
unsigned long active_anon, inactive_anon;
unsigned long active_file, inactive_file;
+ unsigned long noreclaim;
inactive_anon = mem_cgroup_get_all_zonestat(mem_cont,
LRU_INACTIVE_ANON);
@@ -914,10 +915,15 @@ static int mem_control_stat_show(struct
LRU_INACTIVE_FILE);
active_file = mem_cgroup_get_all_zonestat(mem_cont,
LRU_ACTIVE_FILE);
+ noreclaim = mem_cgroup_get_all_zonestat(mem_cont,
+ LRU_NORECLAIM);
+
cb->fill(cb, "active_anon", (active_anon) * PAGE_SIZE);
cb->fill(cb, "inactive_anon", (inactive_anon) * PAGE_SIZE);
cb->fill(cb, "active_file", (active_file) * PAGE_SIZE);
cb->fill(cb, "inactive_file", (inactive_file) * PAGE_SIZE);
+ cb->fill(cb, "noreclaim", noreclaim * PAGE_SIZE);
+
}
return 0;
}
--
All Rights Reversed
* [PATCH -mm 15/25] Ramfs and Ram Disk pages are non-reclaimable
2008-06-06 20:28 [PATCH -mm 00/25] VM pageout scalability improvements (V10) Rik van Riel
` (13 preceding siblings ...)
2008-06-06 20:28 ` [PATCH -mm 14/25] Noreclaim LRU Page Statistics Rik van Riel
@ 2008-06-06 20:28 ` Rik van Riel
2008-06-07 1:05 ` Andrew Morton
2008-06-06 20:28 ` [PATCH -mm 16/25] SHM_LOCKED " Rik van Riel
` (10 subsequent siblings)
25 siblings, 1 reply; 102+ messages in thread
From: Rik van Riel @ 2008-06-06 20:28 UTC (permalink / raw)
To: linux-kernel
Cc: Andrew Morton, Lee Schermerhorn, Kosaki Motohiro, linux-mm,
Eric Whitney
[-- Attachment #1: rvr-15-lts-noreclaim-mlocked-pages-are-nonreclaimable.patch --]
[-- Type: text/plain, Size: 4652 bytes --]
From: Lee Schermerhorn <Lee.Schermerhorn@hp.com>
Christoph Lameter pointed out that ram disk pages also clutter the
LRU lists. When vmscan finds them dirty and tries to clean them,
the ram disk writeback function just redirties the page so that it
goes back onto the active list. Round and round she goes...
Define a new address_space flag, AS_NORECLAIM [it shares the
address_space flags member with the mapping's gfp mask], to
indicate that the address space contains only non-reclaimable
pages. This provides for efficient testing of ramdisk pages in
page_reclaimable().
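The bit layout of that shared flags word, in sketch form [bit
positions inferred from the existing AS_EIO/AS_ENOSPC definitions;
the exact values depend on __GFP_BITS_SHIFT for a given config]:

	/*
	 * mapping->flags packs two things into one unsigned long:
	 *
	 *   bits 0..__GFP_BITS_SHIFT-1   cached gfp mask for the mapping
	 *   bit  __GFP_BITS_SHIFT + 0    AS_EIO       [async write error]
	 *   bit  __GFP_BITS_SHIFT + 1    AS_ENOSPC    [ENOSPC on async write]
	 *   bit  __GFP_BITS_SHIFT + 2    AS_NORECLAIM [this patch]
	 */
	set_bit(AS_NORECLAIM, &mapping->flags);
	reclaimable = !test_bit(AS_NORECLAIM, &mapping->flags);

Note that AS_NORECLAIM is a bit *number*, not a mask, so it must
be tested with test_bit() rather than a bitwise AND.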
Also provide wrapper functions to set/test the noreclaim state to
minimize #ifdefs in ramdisk driver and any other users of this
facility.
Set the noreclaim state on address_space structures for new
ramdisk inodes. Test the noreclaim state in page_reclaimable()
to cull non-reclaimable pages.
Similarly, ramfs pages are non-reclaimable. Set the 'noreclaim'
address_space flag for new ramfs inodes.
These changes depend on [CONFIG_]NORECLAIM_LRU.
Signed-off-by: Lee Schermerhorn <lee.schermerhorn@hp.com>
Signed-off-by: Rik van Riel <riel@redhat.com>
---
drivers/block/brd.c | 13 +++++++++++++
fs/ramfs/inode.c | 1 +
include/linux/pagemap.h | 22 ++++++++++++++++++++++
mm/vmscan.c | 5 +++++
4 files changed, 41 insertions(+)
Index: linux-2.6.26-rc2-mm1/include/linux/pagemap.h
===================================================================
--- linux-2.6.26-rc2-mm1.orig/include/linux/pagemap.h 2008-05-29 16:21:04.000000000 -0400
+++ linux-2.6.26-rc2-mm1/include/linux/pagemap.h 2008-06-06 16:06:20.000000000 -0400
@@ -30,6 +30,28 @@ static inline void mapping_set_error(str
}
}
+#ifdef CONFIG_NORECLAIM_LRU
+#define AS_NORECLAIM (__GFP_BITS_SHIFT + 2) /* e.g., ramdisk, SHM_LOCK */
+
+static inline void mapping_set_noreclaim(struct address_space *mapping)
+{
+ set_bit(AS_NORECLAIM, &mapping->flags);
+}
+
+static inline int mapping_non_reclaimable(struct address_space *mapping)
+{
+ if (mapping && (mapping->flags & AS_NORECLAIM))
+ return 1;
+ return 0;
+}
+#else
+static inline void mapping_set_noreclaim(struct address_space *mapping) { }
+static inline int mapping_non_reclaimable(struct address_space *mapping)
+{
+ return 0;
+}
+#endif
+
static inline gfp_t mapping_gfp_mask(struct address_space * mapping)
{
return (__force gfp_t)mapping->flags & __GFP_BITS_MASK;
Index: linux-2.6.26-rc2-mm1/mm/vmscan.c
===================================================================
--- linux-2.6.26-rc2-mm1.orig/mm/vmscan.c 2008-06-06 16:05:50.000000000 -0400
+++ linux-2.6.26-rc2-mm1/mm/vmscan.c 2008-06-06 16:06:20.000000000 -0400
@@ -2311,6 +2311,8 @@ int zone_reclaim(struct zone *zone, gfp_
* lists vs noreclaim list.
*
* Reasons page might not be reclaimable:
+ * (1) page's mapping marked non-reclaimable
+ *
* TODO - later patches
*/
int page_reclaimable(struct page *page, struct vm_area_struct *vma)
@@ -2318,6 +2320,9 @@ int page_reclaimable(struct page *page,
VM_BUG_ON(PageNoreclaim(page));
+ if (mapping_non_reclaimable(page_mapping(page)))
+ return 0;
+
/* TODO: test page [!]reclaimable conditions */
return 1;
Index: linux-2.6.26-rc2-mm1/fs/ramfs/inode.c
===================================================================
--- linux-2.6.26-rc2-mm1.orig/fs/ramfs/inode.c 2008-05-29 16:21:04.000000000 -0400
+++ linux-2.6.26-rc2-mm1/fs/ramfs/inode.c 2008-06-06 16:06:20.000000000 -0400
@@ -61,6 +61,7 @@ struct inode *ramfs_get_inode(struct sup
inode->i_mapping->a_ops = &ramfs_aops;
inode->i_mapping->backing_dev_info = &ramfs_backing_dev_info;
mapping_set_gfp_mask(inode->i_mapping, GFP_HIGHUSER);
+ mapping_set_noreclaim(inode->i_mapping);
inode->i_atime = inode->i_mtime = inode->i_ctime = CURRENT_TIME;
switch (mode & S_IFMT) {
default:
Index: linux-2.6.26-rc2-mm1/drivers/block/brd.c
===================================================================
--- linux-2.6.26-rc2-mm1.orig/drivers/block/brd.c 2008-05-29 16:21:04.000000000 -0400
+++ linux-2.6.26-rc2-mm1/drivers/block/brd.c 2008-06-06 16:06:20.000000000 -0400
@@ -374,8 +374,21 @@ static int brd_ioctl(struct inode *inode
return error;
}
+/*
+ * brd_open():
+ * Just mark the mapping as containing non-reclaimable pages
+ */
+static int brd_open(struct inode *inode, struct file *filp)
+{
+ struct address_space *mapping = inode->i_mapping;
+
+ mapping_set_noreclaim(mapping);
+ return 0;
+}
+
static struct block_device_operations brd_fops = {
.owner = THIS_MODULE,
+ .open = brd_open,
.ioctl = brd_ioctl,
#ifdef CONFIG_BLK_DEV_XIP
.direct_access = brd_direct_access,
--
All Rights Reversed
* [PATCH -mm 16/25] SHM_LOCKED pages are non-reclaimable
2008-06-06 20:28 [PATCH -mm 00/25] VM pageout scalability improvements (V10) Rik van Riel
` (14 preceding siblings ...)
2008-06-06 20:28 ` [PATCH -mm 15/25] Ramfs and Ram Disk pages are non-reclaimable Rik van Riel
@ 2008-06-06 20:28 ` Rik van Riel
2008-06-07 1:05 ` Andrew Morton
2008-06-06 20:28 ` [PATCH -mm 17/25] Mlocked Pages " Rik van Riel
` (9 subsequent siblings)
25 siblings, 1 reply; 102+ messages in thread
From: Rik van Riel @ 2008-06-06 20:28 UTC (permalink / raw)
To: linux-kernel; +Cc: Andrew Morton, Lee Schermerhorn, Kosaki Motohiro
[-- Attachment #1: rvr-16-lts-noreclaim-mlock-vma-pages-under-mmap_sem-held-for-read.patch --]
[-- Type: text/plain, Size: 10273 bytes --]
From: Lee Schermerhorn <Lee.Schermerhorn@hp.com>
Against: 2.6.26-rc2-mm1
While working with Nick Piggin's mlock patches, I noticed that
shmem segments locked via shmctl(SHM_LOCK) were not being handled.
SHM_LOCKed pages work like ramdisk pages--the writeback function
just redirties the page so that it can't be reclaimed. Deal with
these using the same approach as for ram disk pages.
Use the AS_NORECLAIM flag to mark address_space of SHM_LOCKed
shared memory regions as non-reclaimable. Then these pages
will be culled off the normal LRU lists during vmscan.
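Note that the lock side is lazy: SHM_LOCK only flags the mapping,
and pages migrate to the noreclaim list as vmscan next handles
them. Roughly [a sketch built from helpers added by the earlier
patches in this series]:

	/* SHM_LOCK side: just flag the mapping */
	mapping_set_noreclaim(shp->shm_file->f_mapping);

	/* later, in shrink_page_list(), for each page scanned: */
	if (unlikely(!page_reclaimable(page, NULL)))
		putback_lru_page(page);	/* files it on the noreclaim list */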
Add new wrapper function to clear the mapping's noreclaim state
when/if shared memory segment is munlocked.
Add scan_mapping_noreclaim_pages() to mm/vmscan.c to scan all
pages in the shmem segment's mapping [struct address_space] for
reclaimability now that they're no longer locked. Any page found
to be reclaimable is moved to the appropriate zone lru list.
Note that scan_mapping_noreclaim_pages() must be able to sleep
in lock_page(), so we can't call it holding the shmem info
spinlock nor the shmid spinlock. So, we pass the mapping
[address_space] back to shmctl() on SHM_UNLOCK for rescuing any
non-reclaimable pages after dropping the spinlocks. Once we drop
the shmid lock, the backing shmem file can be deleted if the
calling task doesn't have the shm area attached. To handle this,
we take an extra reference on the file before dropping the shmid
lock and drop the reference after scanning the mapping's
noreclaim pages.
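Condensed, the SHM_UNLOCK leg of sys_shmctl() below follows this
pattern [a sketch of the ipc/shm.c hunk, not additional code]:

	mapping = shmem_lock(shp->shm_file, 0, shp->mlock_user);
	shp->shm_perm.mode &= ~SHM_LOCKED;
	shp->mlock_user = NULL;
	if (mapping) {
		shm_file = shp->shm_file;
		get_file(shm_file);	/* pin the file across unlock */
	}
	shm_unlock(shp);		/* drop the shmid lock */
	if (mapping) {
		scan_mapping_noreclaim_pages(mapping);	/* may sleep */
		fput(shm_file);
	}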
Changes depend on [CONFIG_]NORECLAIM_LRU.
Signed-off-by: Lee Schermerhorn <lee.schermerhorn@hp.com>
Signed-off-by: Rik van Riel <riel@redhat.com>
Signed-off-by: Kosaki Motohiro <kosaki.motohiro@jp.fujitsu.com>
---
include/linux/mm.h | 9 ++--
include/linux/pagemap.h | 12 ++++--
include/linux/swap.h | 4 ++
ipc/shm.c | 20 +++++++++-
mm/shmem.c | 10 +++--
mm/vmscan.c | 93 ++++++++++++++++++++++++++++++++++++++++++++++++
6 files changed, 136 insertions(+), 12 deletions(-)
Index: linux-2.6.26-rc2-mm1/mm/shmem.c
===================================================================
--- linux-2.6.26-rc2-mm1.orig/mm/shmem.c 2008-05-29 16:21:04.000000000 -0400
+++ linux-2.6.26-rc2-mm1/mm/shmem.c 2008-06-06 16:06:24.000000000 -0400
@@ -1458,23 +1458,27 @@ static struct mempolicy *shmem_get_polic
}
#endif
-int shmem_lock(struct file *file, int lock, struct user_struct *user)
+struct address_space *shmem_lock(struct file *file, int lock,
+ struct user_struct *user)
{
struct inode *inode = file->f_path.dentry->d_inode;
struct shmem_inode_info *info = SHMEM_I(inode);
- int retval = -ENOMEM;
+ struct address_space *retval = ERR_PTR(-ENOMEM);
spin_lock(&info->lock);
if (lock && !(info->flags & VM_LOCKED)) {
if (!user_shm_lock(inode->i_size, user))
goto out_nomem;
info->flags |= VM_LOCKED;
+ mapping_set_noreclaim(file->f_mapping);
+ retval = NULL;
}
if (!lock && (info->flags & VM_LOCKED) && user) {
user_shm_unlock(inode->i_size, user);
info->flags &= ~VM_LOCKED;
+ mapping_clear_noreclaim(file->f_mapping);
+ retval = file->f_mapping;
}
- retval = 0;
out_nomem:
spin_unlock(&info->lock);
return retval;
Index: linux-2.6.26-rc2-mm1/include/linux/pagemap.h
===================================================================
--- linux-2.6.26-rc2-mm1.orig/include/linux/pagemap.h 2008-06-06 16:06:20.000000000 -0400
+++ linux-2.6.26-rc2-mm1/include/linux/pagemap.h 2008-06-06 16:06:24.000000000 -0400
@@ -38,14 +38,20 @@ static inline void mapping_set_noreclaim
set_bit(AS_NORECLAIM, &mapping->flags);
}
+static inline void mapping_clear_noreclaim(struct address_space *mapping)
+{
+ clear_bit(AS_NORECLAIM, &mapping->flags);
+}
+
static inline int mapping_non_reclaimable(struct address_space *mapping)
{
- if (mapping && (mapping->flags & AS_NORECLAIM))
- return 1;
- return 0;
+ if (likely(mapping))
+ return test_bit(AS_NORECLAIM, &mapping->flags);
+ return !!mapping;
}
#else
static inline void mapping_set_noreclaim(struct address_space *mapping) { }
+static inline void mapping_clear_noreclaim(struct address_space *mapping) { }
static inline int mapping_non_reclaimable(struct address_space *mapping)
{
return 0;
Index: linux-2.6.26-rc2-mm1/mm/vmscan.c
===================================================================
--- linux-2.6.26-rc2-mm1.orig/mm/vmscan.c 2008-06-06 16:06:20.000000000 -0400
+++ linux-2.6.26-rc2-mm1/mm/vmscan.c 2008-06-06 16:06:24.000000000 -0400
@@ -2327,4 +2327,97 @@ int page_reclaimable(struct page *page,
return 1;
}
+
+/**
+ * check_move_noreclaim_page - check page for reclaimability and move to appropriate zone lru list
+ * @page: page to check reclaimability and move to appropriate lru list
+ * @zone: zone page is in
+ *
+ * Checks a page for reclaimability and moves the page to the appropriate
+ * zone lru list.
+ *
+ * Restrictions: zone->lru_lock must be held, page must be on LRU and must
+ * have PageNoreclaim set.
+ */
+static void check_move_noreclaim_page(struct page *page, struct zone *zone)
+{
+
+ ClearPageNoreclaim(page); /* for page_reclaimable() */
+ if (page_reclaimable(page, NULL)) {
+ enum lru_list l = LRU_INACTIVE_ANON + page_file_cache(page);
+ __dec_zone_state(zone, NR_NORECLAIM);
+ list_move(&page->lru, &zone->list[l]);
+ __inc_zone_state(zone, NR_INACTIVE_ANON + l);
+ } else {
+ /*
+ * rotate noreclaim list
+ */
+ SetPageNoreclaim(page);
+ list_move(&page->lru, &zone->list[LRU_NORECLAIM]);
+ }
+}
+
+/**
+ * scan_mapping_noreclaim_pages - scan an address space for reclaimable pages
+ * @mapping: struct address_space to scan for reclaimable pages
+ *
+ * Scan all pages in mapping. Check non-reclaimable pages for
+ * reclaimability and move them to the appropriate zone lru list.
+ */
+void scan_mapping_noreclaim_pages(struct address_space *mapping)
+{
+ pgoff_t next = 0;
+ pgoff_t end = (i_size_read(mapping->host) + PAGE_CACHE_SIZE - 1) >>
+ PAGE_CACHE_SHIFT;
+ struct zone *zone;
+ struct pagevec pvec;
+
+ if (mapping->nrpages == 0)
+ return;
+
+ pagevec_init(&pvec, 0);
+ while (next < end &&
+ pagevec_lookup(&pvec, mapping, next, PAGEVEC_SIZE)) {
+ int i;
+
+ zone = NULL;
+
+ for (i = 0; i < pagevec_count(&pvec); i++) {
+ struct page *page = pvec.pages[i];
+ pgoff_t page_index = page->index;
+ struct zone *pagezone = page_zone(page);
+
+ if (page_index > next)
+ next = page_index;
+ next++;
+
+ if (TestSetPageLocked(page)) {
+ /*
+ * OK, let's do it the hard way...
+ */
+ if (zone)
+ spin_unlock_irq(&zone->lru_lock);
+ zone = NULL;
+ lock_page(page);
+ }
+
+ if (pagezone != zone) {
+ if (zone)
+ spin_unlock_irq(&zone->lru_lock);
+ zone = pagezone;
+ spin_lock_irq(&zone->lru_lock);
+ }
+
+ if (PageLRU(page) && PageNoreclaim(page))
+ check_move_noreclaim_page(page, zone);
+
+ unlock_page(page);
+
+ }
+ if (zone)
+ spin_unlock_irq(&zone->lru_lock);
+ pagevec_release(&pvec);
+ }
+
+}
#endif
Index: linux-2.6.26-rc2-mm1/include/linux/swap.h
===================================================================
--- linux-2.6.26-rc2-mm1.orig/include/linux/swap.h 2008-06-06 16:05:15.000000000 -0400
+++ linux-2.6.26-rc2-mm1/include/linux/swap.h 2008-06-06 16:06:24.000000000 -0400
@@ -232,12 +232,16 @@ static inline int zone_reclaim(struct zo
#ifdef CONFIG_NORECLAIM_LRU
extern int page_reclaimable(struct page *page, struct vm_area_struct *vma);
+extern void scan_mapping_noreclaim_pages(struct address_space *);
#else
static inline int page_reclaimable(struct page *page,
struct vm_area_struct *vma)
{
return 1;
}
+static inline void scan_mapping_noreclaim_pages(struct address_space *mapping)
+{
+}
#endif
extern int kswapd_run(int nid);
Index: linux-2.6.26-rc2-mm1/include/linux/mm.h
===================================================================
--- linux-2.6.26-rc2-mm1.orig/include/linux/mm.h 2008-05-29 16:21:04.000000000 -0400
+++ linux-2.6.26-rc2-mm1/include/linux/mm.h 2008-06-06 16:06:24.000000000 -0400
@@ -694,12 +694,13 @@ static inline int page_mapped(struct pag
extern void show_free_areas(void);
#ifdef CONFIG_SHMEM
-int shmem_lock(struct file *file, int lock, struct user_struct *user);
+extern struct address_space *shmem_lock(struct file *file, int lock,
+ struct user_struct *user);
#else
-static inline int shmem_lock(struct file *file, int lock,
- struct user_struct *user)
+static inline struct address_space *shmem_lock(struct file *file, int lock,
+ struct user_struct *user)
{
- return 0;
+ return NULL;
}
#endif
struct file *shmem_file_setup(char *name, loff_t size, unsigned long flags);
Index: linux-2.6.26-rc2-mm1/ipc/shm.c
===================================================================
--- linux-2.6.26-rc2-mm1.orig/ipc/shm.c 2008-05-29 16:21:04.000000000 -0400
+++ linux-2.6.26-rc2-mm1/ipc/shm.c 2008-06-06 16:06:24.000000000 -0400
@@ -736,6 +736,11 @@ asmlinkage long sys_shmctl(int shmid, in
case SHM_LOCK:
case SHM_UNLOCK:
{
+ struct address_space *mapping = NULL;
+ struct file *uninitialized_var(shm_file);
+
+ lru_add_drain_all(); /* drain pagevecs to lru lists */
+
shp = shm_lock_check(ns, shmid);
if (IS_ERR(shp)) {
err = PTR_ERR(shp);
@@ -763,18 +768,29 @@ asmlinkage long sys_shmctl(int shmid, in
if(cmd==SHM_LOCK) {
struct user_struct * user = current->user;
if (!is_file_hugepages(shp->shm_file)) {
- err = shmem_lock(shp->shm_file, 1, user);
+ mapping = shmem_lock(shp->shm_file, 1, user);
+ if (IS_ERR(mapping))
+ err = PTR_ERR(mapping);
+ mapping = NULL;
if (!err && !(shp->shm_perm.mode & SHM_LOCKED)){
shp->shm_perm.mode |= SHM_LOCKED;
shp->mlock_user = user;
}
}
} else if (!is_file_hugepages(shp->shm_file)) {
- shmem_lock(shp->shm_file, 0, shp->mlock_user);
+ mapping = shmem_lock(shp->shm_file, 0, shp->mlock_user);
shp->shm_perm.mode &= ~SHM_LOCKED;
shp->mlock_user = NULL;
+ if (mapping) {
+ shm_file = shp->shm_file;
+ get_file(shm_file); /* hold across unlock */
+ }
}
shm_unlock(shp);
+ if (mapping) {
+ scan_mapping_noreclaim_pages(mapping);
+ fput(shm_file);
+ }
goto out;
}
case IPC_RMID:
--
All Rights Reversed
* [PATCH -mm 17/25] Mlocked Pages are non-reclaimable
2008-06-06 20:28 [PATCH -mm 00/25] VM pageout scalability improvements (V10) Rik van Riel
` (15 preceding siblings ...)
2008-06-06 20:28 ` [PATCH -mm 16/25] SHM_LOCKED " Rik van Riel
@ 2008-06-06 20:28 ` Rik van Riel
2008-06-07 1:07 ` Andrew Morton
2008-06-06 20:28 ` [PATCH -mm 18/25] Downgrade mmap sem while populating mlocked regions Rik van Riel
` (8 subsequent siblings)
25 siblings, 1 reply; 102+ messages in thread
From: Rik van Riel @ 2008-06-06 20:28 UTC (permalink / raw)
To: linux-kernel
Cc: Andrew Morton, Lee Schermerhorn, Kosaki Motohiro, linux-mm,
Eric Whitney, Nick Piggin
[-- Attachment #1: rvr-17-lts-noreclaim-handle-mlocked-pages-during-map-unmap.patch --]
[-- Type: text/plain, Size: 46099 bytes --]
Originally
From: Nick Piggin <npiggin@suse.de>
Against: 2.6.26-rc2-mm1
This patch:
1) defines the [CONFIG_]NORECLAIM_MLOCK sub-option and the
stub version of the mlock/noreclaim APIs when it's
not configured. Depends on [CONFIG_]NORECLAIM_LRU.
2) add yet another page flag--PG_mlocked--to indicate that
the page is locked for efficient testing in vmscan and,
optionally, fault path. This allows early culling of
nonreclaimable pages, preventing them from getting to
page_referenced()/try_to_unmap(). Also allows separate
accounting of mlock'd pages, as Nick's original patch
did.
Note: Nick's original mlock patch used a PG_mlocked
flag. I had removed this in favor of the PG_noreclaim
flag + an mlock_count [new page struct member]. I
restored the PG_mlocked flag to eliminate the new
count field.
3) add the mlock/noreclaim infrastructure to mm/mlock.c,
with internal APIs in mm/internal.h. This is a rework
of Nick's original patch to these files, taking into
account that mlocked pages are now kept on noreclaim
LRU list.
4) update vmscan.c:page_reclaimable() to check PageMlocked()
and, if vma passed in, the vm_flags. Note that the vma
will only be passed in for new pages in the fault path;
and then only if the "cull nonreclaimable pages in fault
path" patch is included.
5) add try_to_unlock() to rmap.c to walk a page's rmap and
ClearPageMlocked() if no other vmas have it mlocked.
Reuses as much of try_to_unmap() as possible. This
effectively replaces the use of one of the lru list links
as an mlock count. If this mechanism lets pages in mlocked
vmas leak through w/o PG_mlocked set [I don't know that it
does], we should catch them later in try_to_unmap(). One
hopes this will be rare, as it will be relatively expensive.
The VM_LOCKED re-check this relies on is sketched below,
after this list.
6) Kosaki: added munlock page table walk to avoid using
get_user_pages() for unlock. get_user_pages() is unreliable
for some vma protections.
Lee: modified to wait for in-flight migration to complete
to close munlock/migration race that could strand pages.
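The lazy-mlock rule mentioned in 5) can be sketched as follows
[modeled on the try_to_mlock_page() helper in the rmap.c hunk
below]: VM_LOCKED is only trusted while mmap_sem is held for
read, so a concurrent munlock() cannot be missed:

	int mlocked = 0;

	if (down_read_trylock(&vma->vm_mm->mmap_sem)) {
		if (vma->vm_flags & VM_LOCKED) {
			mlock_vma_page(page);	/* move to noreclaim list */
			mlocked = 1;
		}
		up_read(&vma->vm_mm->mmap_sem);
	}
	/*
	 * If we could not get mmap_sem, leave the page alone;
	 * vmscan will run into the VM_LOCKED vma again later.
	 */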
Original mm/internal.h, mm/rmap.c and mm/mlock.c changes:
Signed-off-by: Nick Piggin <npiggin@suse.de>
Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Signed-off-by: Rik van Riel <riel@redhat.com>
Signed-off-by: Lee Schermerhorn <lee.schermerhorn@hp.com>
---
V8:
+ more refinement of rmap interaction, including attempt to
handle mlocked pages in non-linear mappings.
+ cleanup of lockdep reported errors.
+ enhancement of munlock page table walker to detect and
handle pages under migration [migration ptes].
V6:
+ Kosaki-san and Rik van Riel: added check for "page mapped
in vma" to try_to_unlock() processing in try_to_unmap_anon().
+ Kosaki-san added munlock page table walker to avoid use of
get_user_pages() for munlock. get_user_pages() proved to be
unreliable for some types of vmas.
+ added filtering of "special" vmas. Some [_IO||_PFN] we skip
altogether. Others, we just "make_pages_present" to simulate
old behavior--i.e., populate page tables. Clear/don't set
VM_LOCKED in non-mlockable vmas so that we don't try to unlock
at exit/unmap time.
+ rework PG_mlock page flag definitions for new page flags
macros.
+ Clear PageMlocked when COWing a page into a VM_LOCKED vma
so we don't leave an mlocked page in another non-mlocked
vma. If the other vma[s] had the page mlocked, we'll re-mlock
it if/when we try to reclaim it. This is less expensive than
walking the rmap in the COW/fault path.
+ in vmscan:shrink_page_list(), avoid adding anon page to
the swap cache if it's in a VM_LOCKED vma, even tho'
PG_mlocked might not be set. Call try_to_unlock() to
determine this. As a result, we'll never try to unmap
an mlocked anon page.
+ in support of the above change, updated try_to_unlock()
to use same logic as try_to_unmap() when it encounters a
VM_LOCKED vma--call mlock_vma_page() directly. Added
stub try_to_unlock() for vmscan when NORECLAIM_MLOCK
not configured.
V4 -> V5:
+ fixed problem with placement of #ifdef CONFIG_NORECLAIM_MLOCK
in prep_new_page() [Thanks, minchan Kim!].
V3 -> V4:
+ Added #ifdef CONFIG_NORECLAIM_MLOCK, #endif around use of
PG_mlocked in free_page_check(), et al. Not defined for
32-bit builds.
V2 -> V3:
+ rebase to 23-mm1 atop RvR's split lru series
+ fix page flags macros for *PageMlocked() when not configured.
+ ensure lru_add_drain_all() runs on all cpus when NORECLAIM_MLOCK
configured. Was just for NUMA.
V1 -> V2:
+ moved this patch [and related patches] up to right after
ramdisk/ramfs and SHM_LOCKed patches.
+ add [back] missing put_page() in putback_lru_page().
This solved page leakage as seen by stats in previous
version.
+ fix up munlock_vma_page() to isolate page from lru
before calling try_to_unlock(). Think I detected a
race here.
+ use TestClearPageMlock() on old page in migrate.c's
migrate_page_copy() to clean up old page.
+ live dangerously: remove TestSetPageLocked() in
is_mlocked_vma()--should only be called on new pages in
the fault path--iff we chose to cull there [later patch].
+ Add PG_mlocked to free_pages_check() etc to detect mlock
state mismanagement.
NOTE: temporarily [???] commented out--tripping over it
under load. Why?
Rework of Nick Piggin's "mm: move mlocked pages off the LRU" patch
-- part 1 of 2.
include/linux/mm.h | 5
include/linux/page-flags.h | 16 +
include/linux/rmap.h | 14 +
mm/Kconfig | 14 +
mm/internal.h | 70 ++++++++
mm/memory.c | 19 ++
mm/migrate.c | 2
mm/mlock.c | 386 ++++++++++++++++++++++++++++++++++++++++++---
mm/mmap.c | 1
mm/page_alloc.c | 15 +
mm/rmap.c | 252 +++++++++++++++++++++++++----
mm/swap.c | 2
mm/vmscan.c | 40 +++-
13 files changed, 767 insertions(+), 69 deletions(-)
Index: linux-2.6.26-rc2-mm1/mm/Kconfig
===================================================================
--- linux-2.6.26-rc2-mm1.orig/mm/Kconfig 2008-06-06 16:05:15.000000000 -0400
+++ linux-2.6.26-rc2-mm1/mm/Kconfig 2008-06-06 16:06:28.000000000 -0400
@@ -215,3 +215,13 @@ config NORECLAIM_LRU
may be non-reclaimable because: they are locked into memory, they
are anonymous pages for which no swap space exists, or they are anon
pages that are expensive to unmap [long anon_vma "related vma" list.]
+
+config NORECLAIM_MLOCK
+ bool "Exclude mlock'ed pages from reclaim"
+ depends on NORECLAIM_LRU
+ help
+ Treats mlock'ed pages as non-reclaimable. Removing these pages from
+ the LRU [in]active lists avoids the overhead of attempting to reclaim
+ them. Pages marked non-reclaimable for this reason will become
+ reclaimable again when the last mlock is removed.
+
Index: linux-2.6.26-rc2-mm1/mm/internal.h
===================================================================
--- linux-2.6.26-rc2-mm1.orig/mm/internal.h 2008-06-06 16:05:15.000000000 -0400
+++ linux-2.6.26-rc2-mm1/mm/internal.h 2008-06-06 16:06:28.000000000 -0400
@@ -56,6 +56,17 @@ static inline unsigned long page_order(s
return page_private(page);
}
+/*
+ * mlock all pages in this vma range. For mmap()/mremap()/...
+ */
+extern int mlock_vma_pages_range(struct vm_area_struct *vma,
+ unsigned long start, unsigned long end);
+
+/*
+ * munlock all pages in vma. For munmap() and exit().
+ */
+extern void munlock_vma_pages_all(struct vm_area_struct *vma);
+
#ifdef CONFIG_NORECLAIM_LRU
/*
* noreclaim_migrate_page() called only from migrate_page_copy() to
@@ -74,6 +85,65 @@ static inline void noreclaim_migrate_pag
}
#endif
+#ifdef CONFIG_NORECLAIM_MLOCK
+/*
+ * Called only in fault path via page_reclaimable() for a new page
+ * to determine if it's being mapped into a LOCKED vma.
+ * If so, mark page as mlocked.
+ */
+static inline int is_mlocked_vma(struct vm_area_struct *vma, struct page *page)
+{
+ VM_BUG_ON(PageLRU(page));
+
+ if (likely((vma->vm_flags & (VM_LOCKED | VM_SPECIAL)) != VM_LOCKED))
+ return 0;
+
+ SetPageMlocked(page);
+ return 1;
+}
+
+/*
+ * must be called with vma's mmap_sem held for read, and page locked.
+ */
+extern void mlock_vma_page(struct page *page);
+
+/*
+ * Clear the page's PageMlocked(). This can be useful in a situation where
+ * we want to unconditionally remove a page from the pagecache -- e.g.,
+ * on truncation or freeing.
+ *
+ * It is legal to call this function for any page, mlocked or not.
+ * If called for a page that is still mapped by mlocked vmas, all we do
+ * is revert to lazy LRU behaviour -- semantics are not broken.
+ */
+extern void __clear_page_mlock(struct page *page);
+static inline void clear_page_mlock(struct page *page)
+{
+ if (unlikely(TestClearPageMlocked(page)))
+ __clear_page_mlock(page);
+}
+
+/*
+ * mlock_migrate_page - called only from migrate_page_copy() to
+ * migrate the Mlocked page flag
+ */
+static inline void mlock_migrate_page(struct page *newpage, struct page *page)
+{
+ if (TestClearPageMlocked(page))
+ SetPageMlocked(newpage);
+}
+
+
+#else /* CONFIG_NORECLAIM_MLOCK */
+static inline int is_mlocked_vma(struct vm_area_struct *v, struct page *p)
+{
+ return 0;
+}
+static inline void clear_page_mlock(struct page *page) { }
+static inline void mlock_vma_page(struct page *page) { }
+static inline void mlock_migrate_page(struct page *new, struct page *old) { }
+
+#endif /* CONFIG_NORECLAIM_MLOCK */
/*
* FLATMEM and DISCONTIGMEM configurations use alloc_bootmem_node,
Index: linux-2.6.26-rc2-mm1/mm/mlock.c
===================================================================
--- linux-2.6.26-rc2-mm1.orig/mm/mlock.c 2008-05-15 11:20:15.000000000 -0400
+++ linux-2.6.26-rc2-mm1/mm/mlock.c 2008-06-06 16:06:28.000000000 -0400
@@ -8,10 +8,18 @@
#include <linux/capability.h>
#include <linux/mman.h>
#include <linux/mm.h>
+#include <linux/swap.h>
+#include <linux/swapops.h>
+#include <linux/pagemap.h>
#include <linux/mempolicy.h>
#include <linux/syscalls.h>
#include <linux/sched.h>
#include <linux/module.h>
+#include <linux/rmap.h>
+#include <linux/mmzone.h>
+#include <linux/hugetlb.h>
+
+#include "internal.h"
int can_do_mlock(void)
{
@@ -23,17 +31,354 @@ int can_do_mlock(void)
}
EXPORT_SYMBOL(can_do_mlock);
+#ifdef CONFIG_NORECLAIM_MLOCK
+/*
+ * Mlocked pages are marked with PageMlocked() flag for efficient testing
+ * in vmscan and, possibly, the fault path; and to support semi-accurate
+ * statistics.
+ *
+ * An mlocked page [PageMlocked(page)] is non-reclaimable. As such, it will
+ * be placed on the LRU "noreclaim" list, rather than the [in]active lists.
+ * The noreclaim list is an LRU sibling list to the [in]active lists.
+ * PageNoreclaim is set to indicate the non-reclaimable state.
+ *
+ * When lazy mlocking via vmscan, it is important to ensure that the
+ * vma's VM_LOCKED status is not concurrently being modified, otherwise we
+ * may have mlocked a page that is being munlocked. So lazy mlock must take
+ * the mmap_sem for read, and verify that the vma really is locked
+ * (see mm/rmap.c).
+ */
+
+/*
+ * LRU accounting for clear_page_mlock()
+ */
+void __clear_page_mlock(struct page *page)
+{
+ VM_BUG_ON(!PageLocked(page)); /* for LRU isolate/putback */
+
+ if (!isolate_lru_page(page)) {
+ putback_lru_page(page);
+ } else {
+ /*
+ * Try hard not to leak this page ...
+ */
+ lru_add_drain_all();
+ if (!isolate_lru_page(page))
+ putback_lru_page(page);
+ }
+}
+
+/*
+ * Mark page as mlocked if not already.
+ * If page on LRU, isolate and putback to move to noreclaim list.
+ */
+void mlock_vma_page(struct page *page)
+{
+ BUG_ON(!PageLocked(page));
+
+ if (!TestSetPageMlocked(page) && !isolate_lru_page(page))
+ putback_lru_page(page);
+}
+
+/*
+ * called from munlock()/munmap() path with page supposedly on the LRU.
+ *
+ * Note: unlike mlock_vma_page(), we can't just clear the PageMlocked
+ * [in try_to_unlock()] and then attempt to isolate the page. We must
+ * isolate the page to keep others from messing with its noreclaim
+ * and mlocked state while trying to unlock. However, we pre-clear the
+ * mlocked state anyway as we might lose the isolation race and we might
+ * not get another chance to clear PageMlocked. If we successfully
+ * isolate the page and try_to_unlock() detects other VM_LOCKED vmas
+ * mapping the page, it will restore the PageMlocked state, unless the page
+ * is mapped in a non-linear vma. So, we go ahead and SetPageMlocked(),
+ * perhaps redundantly.
+ * If we lose the isolation race, and the page is mapped by other VM_LOCKED
+ * vmas, we'll detect this in vmscan--via try_to_unlock() or try_to_unmap()
+ * either of which will restore the PageMlocked state by calling
+ * mlock_vma_page() above, if it can grab the vma's mmap sem.
+ */
+static void munlock_vma_page(struct page *page)
+{
+ BUG_ON(!PageLocked(page));
+
+ if (TestClearPageMlocked(page) && !isolate_lru_page(page)) {
+ try_to_unlock(page);
+ putback_lru_page(page);
+ }
+}
+
+/*
+ * mlock a range of pages in the vma.
+ *
+ * This takes care of making the pages present too.
+ *
+ * vma->vm_mm->mmap_sem must be held for write.
+ */
+static int __mlock_vma_pages_range(struct vm_area_struct *vma,
+ unsigned long start, unsigned long end)
+{
+ struct mm_struct *mm = vma->vm_mm;
+ unsigned long addr = start;
+ struct page *pages[16]; /* 16 gives a reasonable batch */
+ int write = !!(vma->vm_flags & VM_WRITE);
+ int nr_pages = (end - start) / PAGE_SIZE;
+ int ret;
+
+ VM_BUG_ON(start & ~PAGE_MASK || end & ~PAGE_MASK);
+ VM_BUG_ON(start < vma->vm_start || end > vma->vm_end);
+ VM_BUG_ON(!rwsem_is_locked(&vma->vm_mm->mmap_sem));
+
+ lru_add_drain_all(); /* push cached pages to LRU */
+
+ while (nr_pages > 0) {
+ int i;
+
+ cond_resched();
+
+ /*
+ * get_user_pages makes pages present if we are
+ * setting mlock.
+ */
+ ret = get_user_pages(current, mm, addr,
+ min_t(int, nr_pages, ARRAY_SIZE(pages)),
+ write, 0, pages, NULL);
+ /*
+ * This can happen for, e.g., VM_NONLINEAR regions before
+ * a page has been allocated and mapped at a given offset,
+ * or for addresses that map beyond end of a file.
+ * We'll mlock the pages if/when they get faulted in.
+ */
+ if (ret < 0)
+ break;
+ if (ret == 0) {
+ /*
+ * We know the vma is there, so the only time
+ * we cannot get a single page should be an
+ * error (ret < 0) case.
+ */
+ WARN_ON(1);
+ break;
+ }
+
+ lru_add_drain(); /* push cached pages to LRU */
+
+ for (i = 0; i < ret; i++) {
+ struct page *page = pages[i];
+
+ /*
+ * page might be truncated or migrated out from under
+ * us. Check after acquiring page lock.
+ */
+ lock_page(page);
+ if (page->mapping)
+ mlock_vma_page(page);
+ unlock_page(page);
+ put_page(page); /* ref from get_user_pages() */
+
+ /*
+ * here we assume that get_user_pages() has given us
+ * a list of virtually contiguous pages.
+ */
+ addr += PAGE_SIZE; /* for next get_user_pages() */
+ nr_pages--;
+ }
+ }
+
+ lru_add_drain_all(); /* to update stats */
+
+ return 0; /* count entire vma as locked_vm */
+}
+
+/*
+ * private structure for munlock page table walk
+ */
+struct munlock_page_walk {
+ struct vm_area_struct *vma;
+ pmd_t *pmd; /* for migration_entry_wait() */
+};
+
+/*
+ * munlock normal pages for present ptes
+ */
+static int __munlock_pte_handler(pte_t *ptep, unsigned long addr,
+ unsigned long end, void *private)
+{
+ struct munlock_page_walk *mpw = private;
+ swp_entry_t entry;
+ struct page *page;
+ pte_t pte;
+
+retry:
+ pte = *ptep;
+ /*
+ * If it's a swap pte, we might be racing with page migration.
+ */
+ if (unlikely(!pte_present(pte))) {
+ if (!is_swap_pte(pte))
+ goto out;
+ entry = pte_to_swp_entry(pte);
+ if (is_migration_entry(entry)) {
+ migration_entry_wait(mpw->vma->vm_mm, mpw->pmd, addr);
+ goto retry;
+ }
+ goto out;
+ }
+
+ page = vm_normal_page(mpw->vma, addr, pte);
+ if (!page)
+ goto out;
+
+ lock_page(page);
+ if (!page->mapping) {
+ unlock_page(page);
+ goto retry;
+ }
+ munlock_vma_page(page);
+ unlock_page(page);
+
+out:
+ return 0;
+}
+
+/*
+ * Save pmd for pte handler for waiting on migration entries
+ */
+static int __munlock_pmd_handler(pmd_t *pmd, unsigned long addr,
+ unsigned long end, void *private)
+{
+ struct munlock_page_walk *mpw = private;
+
+ mpw->pmd = pmd;
+ return 0;
+}
+
+static struct mm_walk munlock_page_walk = {
+ .pmd_entry = __munlock_pmd_handler,
+ .pte_entry = __munlock_pte_handler,
+};
+
+/*
+ * munlock a range of pages in the vma using standard page table walk.
+ *
+ * vma->vm_mm->mmap_sem must be held for write.
+ */
+static void __munlock_vma_pages_range(struct vm_area_struct *vma,
+ unsigned long start, unsigned long end)
+{
+ struct mm_struct *mm = vma->vm_mm;
+ struct munlock_page_walk mpw;
+
+ VM_BUG_ON(start & ~PAGE_MASK || end & ~PAGE_MASK);
+ VM_BUG_ON(!rwsem_is_locked(&vma->vm_mm->mmap_sem));
+ VM_BUG_ON(start < vma->vm_start);
+ VM_BUG_ON(end > vma->vm_end);
+
+ lru_add_drain_all(); /* push cached pages to LRU */
+ mpw.vma = vma;
+ (void)walk_page_range(mm, start, end, &munlock_page_walk, &mpw);
+ lru_add_drain_all(); /* to update stats */
+
+}
+
+#else /* CONFIG_NORECLAIM_MLOCK */
+
+/*
+ * Just make pages present if VM_LOCKED. No-op if unlocking.
+ */
+static int __mlock_vma_pages_range(struct vm_area_struct *vma,
+ unsigned long start, unsigned long end)
+{
+ if (vma->vm_flags & VM_LOCKED)
+ make_pages_present(start, end);
+ return 0;
+}
+
+/*
+ * munlock a range of pages in the vma -- no-op.
+ */
+static void __munlock_vma_pages_range(struct vm_area_struct *vma,
+ unsigned long start, unsigned long end)
+{
+}
+#endif /* CONFIG_NORECLAIM_MLOCK */
+
+/*
+ * mlock all pages in this vma range. For mmap()/mremap()/...
+ */
+int mlock_vma_pages_range(struct vm_area_struct *vma,
+ unsigned long start, unsigned long end)
+{
+ int nr_pages = (end - start) / PAGE_SIZE;
+ BUG_ON(!(vma->vm_flags & VM_LOCKED));
+
+ /*
+ * filter unlockable vmas
+ */
+ if (vma->vm_flags & (VM_IO | VM_PFNMAP))
+ goto no_mlock;
+
+ if ((vma->vm_flags & (VM_DONTEXPAND | VM_RESERVED)) ||
+ is_vm_hugetlb_page(vma) ||
+ vma == get_gate_vma(current))
+ goto make_present;
+
+ return __mlock_vma_pages_range(vma, start, end);
+
+make_present:
+ /*
+ * User mapped kernel pages or huge pages:
+ * make these pages present to populate the ptes, but
+ * fall thru' to reset VM_LOCKED--no need to unlock, and
+ * return nr_pages so these don't get counted against task's
+ * locked limit. huge pages are already counted against
+ * locked vm limit.
+ */
+ make_pages_present(start, end);
+
+no_mlock:
+ vma->vm_flags &= ~VM_LOCKED; /* and don't come back! */
+ return nr_pages; /* pages NOT mlocked */
+}
+
+
+/*
+ * munlock all pages in vma. For munmap() and exit().
+ */
+void munlock_vma_pages_all(struct vm_area_struct *vma)
+{
+ vma->vm_flags &= ~VM_LOCKED;
+ __munlock_vma_pages_range(vma, vma->vm_start, vma->vm_end);
+}
+
+/*
+ * mlock_fixup - handle mlock[all]/munlock[all] requests.
+ *
+ * Filters out "special" vmas -- VM_LOCKED never gets set for these, and
+ * munlock is a no-op. However, for some special vmas, we go ahead and
+ * populate the ptes via make_pages_present().
+ *
+ * For vmas that pass the filters, merge/split as appropriate.
+ */
static int mlock_fixup(struct vm_area_struct *vma, struct vm_area_struct **prev,
unsigned long start, unsigned long end, unsigned int newflags)
{
- struct mm_struct * mm = vma->vm_mm;
+ struct mm_struct *mm = vma->vm_mm;
pgoff_t pgoff;
- int pages;
+ int nr_pages;
int ret = 0;
+ int lock = newflags & VM_LOCKED;
- if (newflags == vma->vm_flags) {
- *prev = vma;
- goto out;
+ if (newflags == vma->vm_flags ||
+ (vma->vm_flags & (VM_IO | VM_PFNMAP)))
+ goto out; /* don't set VM_LOCKED, don't count */
+
+ if ((vma->vm_flags & (VM_DONTEXPAND | VM_RESERVED)) ||
+ is_vm_hugetlb_page(vma) ||
+ vma == get_gate_vma(current)) {
+ if (lock)
+ make_pages_present(start, end);
+ goto out; /* don't set VM_LOCKED, don't count */
}
pgoff = vma->vm_pgoff + ((start - vma->vm_start) >> PAGE_SHIFT);
@@ -44,8 +389,6 @@ static int mlock_fixup(struct vm_area_st
goto success;
}
- *prev = vma;
-
if (start != vma->vm_start) {
ret = split_vma(mm, vma, start, 1);
if (ret)
@@ -60,24 +403,31 @@ static int mlock_fixup(struct vm_area_st
success:
/*
+ * Keep track of amount of locked VM.
+ */
+ nr_pages = (end - start) >> PAGE_SHIFT;
+ if (!lock)
+ nr_pages = -nr_pages;
+ mm->locked_vm += nr_pages;
+
+ /*
* vm_flags is protected by the mmap_sem held in write mode.
* It's okay if try_to_unmap_one unmaps a page just after we
- * set VM_LOCKED, make_pages_present below will bring it back.
+ * set VM_LOCKED, __mlock_vma_pages_range will bring it back.
*/
vma->vm_flags = newflags;
- /*
- * Keep track of amount of locked VM.
- */
- pages = (end - start) >> PAGE_SHIFT;
- if (newflags & VM_LOCKED) {
- pages = -pages;
- if (!(newflags & VM_IO))
- ret = make_pages_present(start, end);
- }
+ if (lock) {
+ ret = __mlock_vma_pages_range(vma, start, end);
+ if (ret > 0) {
+ mm->locked_vm -= ret;
+ ret = 0;
+ }
+ } else
+ __munlock_vma_pages_range(vma, start, end);
- mm->locked_vm -= pages;
out:
+ *prev = vma;
if (ret == -ENOMEM)
ret = -EAGAIN;
return ret;
Index: linux-2.6.26-rc2-mm1/mm/vmscan.c
===================================================================
--- linux-2.6.26-rc2-mm1.orig/mm/vmscan.c 2008-06-06 16:06:24.000000000 -0400
+++ linux-2.6.26-rc2-mm1/mm/vmscan.c 2008-06-06 16:06:28.000000000 -0400
@@ -537,11 +537,8 @@ static unsigned long shrink_page_list(st
sc->nr_scanned++;
- if (unlikely(!page_reclaimable(page, NULL))) {
- if (putback_lru_page(page))
- unlock_page(page);
- continue;
- }
+ if (unlikely(!page_reclaimable(page, NULL)))
+ goto cull_mlocked;
if (!sc->may_swap && page_mapped(page))
goto keep_locked;
@@ -578,9 +575,19 @@ static unsigned long shrink_page_list(st
* Anonymous process memory has backing store?
* Try to allocate it some swap space here.
*/
- if (PageAnon(page) && !PageSwapCache(page))
+ if (PageAnon(page) && !PageSwapCache(page)) {
+ switch (try_to_unlock(page)) {
+ case SWAP_FAIL: /* shouldn't happen */
+ case SWAP_AGAIN:
+ goto keep_locked;
+ case SWAP_MLOCK:
+ goto cull_mlocked;
+ case SWAP_SUCCESS:
+ ; /* fall thru'; add to swap cache */
+ }
if (!add_to_swap(page, GFP_ATOMIC))
goto activate_locked;
+ }
#endif /* CONFIG_SWAP */
mapping = page_mapping(page);
@@ -595,6 +602,8 @@ static unsigned long shrink_page_list(st
goto activate_locked;
case SWAP_AGAIN:
goto keep_locked;
+ case SWAP_MLOCK:
+ goto cull_mlocked;
case SWAP_SUCCESS:
; /* try to free the page below */
}
@@ -667,6 +676,11 @@ free_it:
__pagevec_release_nonlru(&freed_pvec);
continue;
+cull_mlocked:
+ if (putback_lru_page(page))
+ unlock_page(page);
+ continue;
+
activate_locked:
/* Not a candidate for swapping, so reclaim swap space. */
if (PageSwapCache(page) && vm_swap_full())
@@ -678,7 +692,7 @@ keep_locked:
unlock_page(page);
keep:
list_add(&page->lru, &ret_pages);
- VM_BUG_ON(PageLRU(page));
+ VM_BUG_ON(PageLRU(page) || PageNoreclaim(page));
}
list_splice(&ret_pages, page_list);
if (pagevec_count(&freed_pvec))
@@ -2308,12 +2322,13 @@ int zone_reclaim(struct zone *zone, gfp_
* @vma: the VMA in which the page is or will be mapped, may be NULL
*
* Test whether page is reclaimable--i.e., should be placed on active/inactive
- * lists vs noreclaim list.
+ * lists vs noreclaim list. The vma argument is !NULL when called from the
+ * fault path to determine how to instantiate a new page.
*
* Reasons page might not be reclaimable:
* (1) page's mapping marked non-reclaimable
+ * (2) page is part of an mlocked VMA
*
- * TODO - later patches
*/
int page_reclaimable(struct page *page, struct vm_area_struct *vma)
{
@@ -2323,13 +2338,16 @@ int page_reclaimable(struct page *page,
if (mapping_non_reclaimable(page_mapping(page)))
return 0;
- /* TODO: test page [!]reclaimable conditions */
+#ifdef CONFIG_NORECLAIM_MLOCK
+ if (PageMlocked(page) || (vma && is_mlocked_vma(vma, page)))
+ return 0;
+#endif
return 1;
}
/**
- * check_move_noreclaim_page - check page for reclaimability and move to appropriate zone lru list
+ * check_move_noreclaim_page - check page for reclaimability and move to appropriate lru list
* @page: page to check reclaimability and move to appropriate lru list
* @zone: zone page is in
*
Index: linux-2.6.26-rc2-mm1/include/linux/page-flags.h
===================================================================
--- linux-2.6.26-rc2-mm1.orig/include/linux/page-flags.h 2008-06-06 16:05:15.000000000 -0400
+++ linux-2.6.26-rc2-mm1/include/linux/page-flags.h 2008-06-06 16:06:28.000000000 -0400
@@ -96,6 +96,9 @@ enum pageflags {
PG_swapbacked, /* Page is backed by RAM/swap */
#ifdef CONFIG_NORECLAIM_LRU
PG_noreclaim, /* Page is "non-reclaimable" */
+#ifdef CONFIG_NORECLAIM_MLOCK
+ PG_mlocked, /* Page is vma mlocked */
+#endif
#endif
#ifdef CONFIG_IA64_UNCACHED_ALLOCATOR
PG_uncached, /* Page has been mapped as uncached */
@@ -210,12 +213,25 @@ PAGEFLAG_FALSE(SwapCache)
#ifdef CONFIG_NORECLAIM_LRU
PAGEFLAG(Noreclaim, noreclaim) __CLEARPAGEFLAG(Noreclaim, noreclaim)
TESTCLEARFLAG(Noreclaim, noreclaim)
+
+#ifdef CONFIG_NORECLAIM_MLOCK
+#define MLOCK_PAGES 1
+PAGEFLAG(Mlocked, mlocked) __CLEARPAGEFLAG(Mlocked, mlocked)
+ TESTSCFLAG(Mlocked, mlocked)
+#endif
+
#else
PAGEFLAG_FALSE(Noreclaim) TESTCLEARFLAG_FALSE(Noreclaim)
SETPAGEFLAG_NOOP(Noreclaim) CLEARPAGEFLAG_NOOP(Noreclaim)
__CLEARPAGEFLAG_NOOP(Noreclaim)
#endif
+#if !defined(CONFIG_NORECLAIM_MLOCK)
+#define MLOCK_PAGES 0
+PAGEFLAG_FALSE(Mlocked)
+ SETPAGEFLAG_NOOP(Mlocked) TESTCLEARFLAG_FALSE(Mlocked)
+#endif
+
#ifdef CONFIG_IA64_UNCACHED_ALLOCATOR
PAGEFLAG(Uncached, uncached)
#else
Index: linux-2.6.26-rc2-mm1/include/linux/rmap.h
===================================================================
--- linux-2.6.26-rc2-mm1.orig/include/linux/rmap.h 2008-05-15 11:21:11.000000000 -0400
+++ linux-2.6.26-rc2-mm1/include/linux/rmap.h 2008-06-06 16:06:28.000000000 -0400
@@ -97,6 +97,19 @@ unsigned long page_address_in_vma(struct
*/
int page_mkclean(struct page *);
+#ifdef CONFIG_NORECLAIM_MLOCK
+/*
+ * called in munlock()/munmap() path to check for other vmas holding
+ * the page mlocked.
+ */
+int try_to_unlock(struct page *);
+#else
+static inline int try_to_unlock(struct page *page)
+{
+ return 0; /* a.k.a. SWAP_SUCCESS */
+}
+#endif
+
#else /* !CONFIG_MMU */
#define anon_vma_init() do {} while (0)
@@ -120,5 +133,6 @@ static inline int page_mkclean(struct pa
#define SWAP_SUCCESS 0
#define SWAP_AGAIN 1
#define SWAP_FAIL 2
+#define SWAP_MLOCK 3
#endif /* _LINUX_RMAP_H */
Index: linux-2.6.26-rc2-mm1/mm/rmap.c
===================================================================
--- linux-2.6.26-rc2-mm1.orig/mm/rmap.c 2008-05-15 11:21:11.000000000 -0400
+++ linux-2.6.26-rc2-mm1/mm/rmap.c 2008-06-06 16:06:28.000000000 -0400
@@ -52,6 +52,8 @@
#include <asm/tlbflush.h>
+#include "internal.h"
+
static struct kmem_cache *anon_vma_cachep;
static inline struct anon_vma *anon_vma_alloc(void)
@@ -273,6 +275,32 @@ pte_t *page_check_address(struct page *p
return NULL;
}
+/**
+ * page_mapped_in_vma - check whether a page is really mapped in a VMA
+ * @page: the page to test
+ * @vma: the VMA to test
+ *
+ * Returns 1 if the page is mapped into the page tables of the VMA, 0
+ * if the page is not mapped into the page tables of this VMA. Only
+ * valid for normal file or anonymous VMAs.
+ */
+static int page_mapped_in_vma(struct page *page, struct vm_area_struct *vma)
+{
+ unsigned long address;
+ pte_t *pte;
+ spinlock_t *ptl;
+
+ address = vma_address(page, vma);
+ if (address == -EFAULT) /* out of vma range */
+ return 0;
+ pte = page_check_address(page, vma->vm_mm, address, &ptl);
+ if (!pte) /* the page is not in this mm */
+ return 0;
+ pte_unmap_unlock(pte, ptl);
+
+ return 1;
+}
+
/*
* Subfunctions of page_referenced: page_referenced_one called
* repeatedly from either page_referenced_anon or page_referenced_file.
@@ -294,10 +322,17 @@ static int page_referenced_one(struct pa
if (!pte)
goto out;
+ /*
+ * Don't want to elevate referenced for mlocked page that gets this far,
+ * in order that it progresses to try_to_unmap and is moved to the
+ * noreclaim list.
+ */
if (vma->vm_flags & VM_LOCKED) {
- referenced++;
*mapcount = 1; /* break early from loop */
- } else if (ptep_clear_flush_young(vma, address, pte))
+ goto out_unmap;
+ }
+
+ if (ptep_clear_flush_young(vma, address, pte))
referenced++;
/* Pretend the page is referenced if the task has the
@@ -306,6 +341,7 @@ static int page_referenced_one(struct pa
rwsem_is_locked(&mm->mmap_sem))
referenced++;
+out_unmap:
(*mapcount)--;
pte_unmap_unlock(pte, ptl);
out:
@@ -395,11 +431,6 @@ static int page_referenced_file(struct p
*/
if (mem_cont && !mm_match_cgroup(vma->vm_mm, mem_cont))
continue;
- if ((vma->vm_flags & (VM_LOCKED|VM_MAYSHARE))
- == (VM_LOCKED|VM_MAYSHARE)) {
- referenced++;
- break;
- }
referenced += page_referenced_one(page, vma, &mapcount);
if (!mapcount)
break;
@@ -726,10 +757,15 @@ static int try_to_unmap_one(struct page
* If it's recently referenced (perhaps page_referenced
* skipped over this mm) then we should reactivate it.
*/
- if (!migration && ((vma->vm_flags & VM_LOCKED) ||
- (ptep_clear_flush_young(vma, address, pte)))) {
- ret = SWAP_FAIL;
- goto out_unmap;
+ if (!migration) {
+ if (vma->vm_flags & VM_LOCKED) {
+ ret = SWAP_MLOCK;
+ goto out_unmap;
+ }
+ if (ptep_clear_flush_young(vma, address, pte)) {
+ ret = SWAP_FAIL;
+ goto out_unmap;
+ }
}
/* Nuke the page table entry. */
@@ -811,12 +847,17 @@ out:
* For very sparsely populated VMAs this is a little inefficient - chances are
there won't be many ptes located within the scan cluster. In this case
* maybe we could scan further - to the end of the pte page, perhaps.
+ *
+ * Mlocked pages: check VM_LOCKED under mmap_sem held for read, if we can
+ * acquire it without blocking. If vma locked, mlock the pages in the cluster,
+ * rather than unmapping them. If we encounter the "check_page" that vmscan is
+ * trying to unmap, return SWAP_MLOCK, else default SWAP_AGAIN.
*/
#define CLUSTER_SIZE min(32*PAGE_SIZE, PMD_SIZE)
#define CLUSTER_MASK (~(CLUSTER_SIZE - 1))
-static void try_to_unmap_cluster(unsigned long cursor,
- unsigned int *mapcount, struct vm_area_struct *vma)
+static int try_to_unmap_cluster(unsigned long cursor, unsigned int *mapcount,
+ struct vm_area_struct *vma, struct page *check_page)
{
struct mm_struct *mm = vma->vm_mm;
pgd_t *pgd;
@@ -828,6 +869,8 @@ static void try_to_unmap_cluster(unsigne
struct page *page;
unsigned long address;
unsigned long end;
+ int ret = SWAP_AGAIN;
+ int locked_vma = 0;
address = (vma->vm_start + cursor) & CLUSTER_MASK;
end = address + CLUSTER_SIZE;
@@ -838,15 +881,26 @@ static void try_to_unmap_cluster(unsigne
pgd = pgd_offset(mm, address);
if (!pgd_present(*pgd))
- return;
+ return ret;
pud = pud_offset(pgd, address);
if (!pud_present(*pud))
- return;
+ return ret;
pmd = pmd_offset(pud, address);
if (!pmd_present(*pmd))
- return;
+ return ret;
+
+ /*
+ * MLOCK_PAGES => feature is configured.
+ * if we can acquire the mmap_sem for read, and vma is VM_LOCKED,
+ * keep the sem while scanning the cluster for mlocking pages.
+ */
+ if (MLOCK_PAGES && down_read_trylock(&vma->vm_mm->mmap_sem)) {
+ locked_vma = (vma->vm_flags & VM_LOCKED);
+ if (!locked_vma)
+ up_read(&vma->vm_mm->mmap_sem); /* don't need it */
+ }
pte = pte_offset_map_lock(mm, pmd, address, &ptl);
@@ -859,6 +913,13 @@ static void try_to_unmap_cluster(unsigne
page = vm_normal_page(vma, address, *pte);
BUG_ON(!page || PageAnon(page));
+ if (locked_vma) {
+ mlock_vma_page(page); /* no-op if already mlocked */
+ if (page == check_page)
+ ret = SWAP_MLOCK;
+ continue; /* don't unmap */
+ }
+
if (ptep_clear_flush_young(vma, address, pte))
continue;
@@ -880,39 +941,104 @@ static void try_to_unmap_cluster(unsigne
(*mapcount)--;
}
pte_unmap_unlock(pte - 1, ptl);
+ if (locked_vma)
+ up_read(&vma->vm_mm->mmap_sem);
+ return ret;
}
-static int try_to_unmap_anon(struct page *page, int migration)
+/*
+ * common handling for pages mapped in VM_LOCKED vmas
+ */
+static int try_to_mlock_page(struct page *page, struct vm_area_struct *vma)
+{
+ int mlocked = 0;
+
+ if (down_read_trylock(&vma->vm_mm->mmap_sem)) {
+ if (vma->vm_flags & VM_LOCKED) {
+ mlock_vma_page(page);
+ mlocked++; /* really mlocked the page */
+ }
+ up_read(&vma->vm_mm->mmap_sem);
+ }
+ return mlocked;
+}
+
+/**
+ * try_to_unmap_anon - unmap or unlock anonymous page using the object-based
+ * rmap method
+ * @page: the page to unmap/unlock
+ * @unlock: request for unlock rather than unmap [unlikely]
+ * @migration: unmapping for migration - ignored if @unlock
+ *
+ * Find all the mappings of a page using the mapping pointer and the vma chains
+ * contained in the anon_vma struct it points to.
+ *
+ * This function is only called from try_to_unmap/try_to_unlock for
+ * anonymous pages.
+ * When called from try_to_unlock(), the mmap_sem of the mm containing the vma
+ * where the page was found will be held for write. So, we won't recheck
+ * vm_flags for that VMA. That should be OK, because that vma shouldn't be
+ * VM_LOCKED.
+ */
+static int try_to_unmap_anon(struct page *page, int unlock, int migration)
{
struct anon_vma *anon_vma;
struct vm_area_struct *vma;
+ unsigned int mlocked = 0;
int ret = SWAP_AGAIN;
+ if (MLOCK_PAGES && unlikely(unlock))
+ ret = SWAP_SUCCESS; /* default for try_to_unlock() */
+
anon_vma = page_lock_anon_vma(page);
if (!anon_vma)
return ret;
list_for_each_entry(vma, &anon_vma->head, anon_vma_node) {
- ret = try_to_unmap_one(page, vma, migration);
- if (ret == SWAP_FAIL || !page_mapped(page))
- break;
+ if (MLOCK_PAGES && unlikely(unlock)) {
+ if (!((vma->vm_flags & VM_LOCKED) &&
+ page_mapped_in_vma(page, vma)))
+ continue; /* must visit all unlocked vmas */
+ ret = SWAP_MLOCK; /* saw at least one mlocked vma */
+ } else {
+ ret = try_to_unmap_one(page, vma, migration);
+ if (ret == SWAP_FAIL || !page_mapped(page))
+ break;
+ }
+ if (ret == SWAP_MLOCK) {
+ mlocked = try_to_mlock_page(page, vma);
+ if (mlocked)
+ break; /* stop if actually mlocked page */
+ }
}
page_unlock_anon_vma(anon_vma);
+
+ if (mlocked)
+ ret = SWAP_MLOCK; /* actually mlocked the page */
+ else if (ret == SWAP_MLOCK)
+ ret = SWAP_AGAIN; /* saw VM_LOCKED vma */
+
return ret;
}
/**
- * try_to_unmap_file - unmap file page using the object-based rmap method
- * @page: the page to unmap
- * @migration: migration flag
+ * try_to_unmap_file - unmap/unlock file page using the object-based rmap method
+ * @page: the page to unmap/unlock
+ * @unlock: request for unlock rather than unmap [unlikely]
+ * @migration: unmapping for migration - ignored if @unlock
*
* Find all the mappings of a page using the mapping pointer and the vma chains
* contained in the address_space struct it points to.
*
- * This function is only called from try_to_unmap for object-based pages.
+ * This function is only called from try_to_unmap/try_to_unlock for
+ * object-based pages.
+ * When called from try_to_unlock(), the mmap_sem of the mm containing the vma
+ * where the page was found will be held for write. So, we won't recheck
+ * vm_flags for that VMA. That should be OK, because that vma shouldn't be
+ * VM_LOCKED.
*/
-static int try_to_unmap_file(struct page *page, int migration)
+static int try_to_unmap_file(struct page *page, int unlock, int migration)
{
struct address_space *mapping = page->mapping;
pgoff_t pgoff = page->index << (PAGE_CACHE_SHIFT - PAGE_SHIFT);
@@ -923,20 +1049,44 @@ static int try_to_unmap_file(struct page
unsigned long max_nl_cursor = 0;
unsigned long max_nl_size = 0;
unsigned int mapcount;
+ unsigned int mlocked = 0;
+
+ if (MLOCK_PAGES && unlikely(unlock))
+ ret = SWAP_SUCCESS; /* default for try_to_unlock() */
spin_lock(&mapping->i_mmap_lock);
vma_prio_tree_foreach(vma, &iter, &mapping->i_mmap, pgoff, pgoff) {
- ret = try_to_unmap_one(page, vma, migration);
- if (ret == SWAP_FAIL || !page_mapped(page))
- goto out;
+ if (MLOCK_PAGES && unlikely(unlock)) {
+ if (!(vma->vm_flags & VM_LOCKED))
+ continue; /* must visit all vmas */
+ ret = SWAP_MLOCK;
+ } else {
+ ret = try_to_unmap_one(page, vma, migration);
+ if (ret == SWAP_FAIL || !page_mapped(page))
+ goto out;
+ }
+ if (ret == SWAP_MLOCK) {
+ mlocked = try_to_mlock_page(page, vma);
+ if (mlocked)
+ break; /* stop if actually mlocked page */
+ }
}
+ if (mlocked)
+ goto out;
+
if (list_empty(&mapping->i_mmap_nonlinear))
goto out;
list_for_each_entry(vma, &mapping->i_mmap_nonlinear,
shared.vm_set.list) {
- if ((vma->vm_flags & VM_LOCKED) && !migration)
+ if (MLOCK_PAGES && unlikely(unlock)) {
+ if (!(vma->vm_flags & VM_LOCKED))
+ continue; /* must visit all vmas */
+ ret = SWAP_MLOCK; /* leave mlocked == 0 */
+ goto out; /* no need to look further */
+ }
+ if (!MLOCK_PAGES && !migration && (vma->vm_flags & VM_LOCKED))
continue;
cursor = (unsigned long) vma->vm_private_data;
if (cursor > max_nl_cursor)
@@ -946,7 +1096,7 @@ static int try_to_unmap_file(struct page
max_nl_size = cursor;
}
- if (max_nl_size == 0) { /* any nonlinears locked or reserved */
+ if (max_nl_size == 0) { /* all nonlinears locked or reserved ? */
ret = SWAP_FAIL;
goto out;
}
@@ -970,12 +1120,16 @@ static int try_to_unmap_file(struct page
do {
list_for_each_entry(vma, &mapping->i_mmap_nonlinear,
shared.vm_set.list) {
- if ((vma->vm_flags & VM_LOCKED) && !migration)
+ if (!MLOCK_PAGES && !migration &&
+ (vma->vm_flags & VM_LOCKED))
continue;
cursor = (unsigned long) vma->vm_private_data;
while ( cursor < max_nl_cursor &&
cursor < vma->vm_end - vma->vm_start) {
- try_to_unmap_cluster(cursor, &mapcount, vma);
+ ret = try_to_unmap_cluster(cursor, &mapcount,
+ vma, page);
+ if (ret == SWAP_MLOCK)
+ mlocked = 2; /* to return below */
cursor += CLUSTER_SIZE;
vma->vm_private_data = (void *) cursor;
if ((int)mapcount <= 0)
@@ -996,6 +1150,10 @@ static int try_to_unmap_file(struct page
vma->vm_private_data = NULL;
out:
spin_unlock(&mapping->i_mmap_lock);
+ if (mlocked)
+ ret = SWAP_MLOCK; /* actually mlocked the page */
+ else if (ret == SWAP_MLOCK)
+ ret = SWAP_AGAIN; /* saw VM_LOCKED vma */
return ret;
}
@@ -1011,6 +1169,7 @@ out:
* SWAP_SUCCESS - we succeeded in removing all mappings
* SWAP_AGAIN - we missed a mapping, try again later
* SWAP_FAIL - the page is unswappable
+ * SWAP_MLOCK - page is mlocked.
*/
int try_to_unmap(struct page *page, int migration)
{
@@ -1019,12 +1178,33 @@ int try_to_unmap(struct page *page, int
BUG_ON(!PageLocked(page));
if (PageAnon(page))
- ret = try_to_unmap_anon(page, migration);
+ ret = try_to_unmap_anon(page, 0, migration);
else
- ret = try_to_unmap_file(page, migration);
-
- if (!page_mapped(page))
+ ret = try_to_unmap_file(page, 0, migration);
+ if (ret != SWAP_MLOCK && !page_mapped(page))
ret = SWAP_SUCCESS;
return ret;
}
+#ifdef CONFIG_NORECLAIM_MLOCK
+/**
+ * try_to_unlock - check page's rmap for other vmas holding the page locked.
+ * @page: the page to be unlocked. will be returned with PG_mlocked
+ * cleared if no vmas are VM_LOCKED.
+ *
+ * Return values are:
+ *
+ * SWAP_SUCCESS - no vma's holding page locked.
+ * SWAP_AGAIN - page mapped in mlocked vma -- couldn't acquire mmap sem
+ * SWAP_MLOCK - page is now mlocked.
+ */
+int try_to_unlock(struct page *page)
+{
+ VM_BUG_ON(!PageLocked(page) || PageLRU(page));
+
+ if (PageAnon(page))
+ return try_to_unmap_anon(page, 1, 0);
+ else
+ return try_to_unmap_file(page, 1, 0);
+}
+#endif
Index: linux-2.6.26-rc2-mm1/mm/migrate.c
===================================================================
--- linux-2.6.26-rc2-mm1.orig/mm/migrate.c 2008-06-06 16:05:15.000000000 -0400
+++ linux-2.6.26-rc2-mm1/mm/migrate.c 2008-06-06 16:06:28.000000000 -0400
@@ -359,6 +359,8 @@ static void migrate_page_copy(struct pag
__set_page_dirty_nobuffers(newpage);
}
+ mlock_migrate_page(newpage, page);
+
#ifdef CONFIG_SWAP
ClearPageSwapCache(page);
#endif
Index: linux-2.6.26-rc2-mm1/mm/page_alloc.c
===================================================================
--- linux-2.6.26-rc2-mm1.orig/mm/page_alloc.c 2008-06-06 16:05:57.000000000 -0400
+++ linux-2.6.26-rc2-mm1/mm/page_alloc.c 2008-06-06 16:06:28.000000000 -0400
@@ -258,6 +258,9 @@ static void bad_page(struct page *page)
1 << PG_active |
#ifdef CONFIG_NORECLAIM_LRU
1 << PG_noreclaim |
+#ifdef CONFIG_NORECLAIM_MLOCK
+ 1 << PG_mlocked |
+#endif
#endif
1 << PG_dirty |
1 << PG_reclaim |
@@ -497,6 +500,9 @@ static inline int free_pages_check(struc
#ifdef CONFIG_NORECLAIM_LRU
1 << PG_noreclaim |
#endif
+#ifdef CONFIG_NORECLAIM_MLOCK
+ 1 << PG_mlocked |
+#endif
1 << PG_buddy ))))
bad_page(page);
if (PageDirty(page))
@@ -650,6 +656,9 @@ static int prep_new_page(struct page *pa
1 << PG_active |
#ifdef CONFIG_NORECLAIM_LRU
1 << PG_noreclaim |
+#ifdef CONFIG_NORECLAIM_MLOCK
+ 1 << PG_mlocked |
+#endif
#endif
1 << PG_dirty |
1 << PG_slab |
@@ -669,7 +678,11 @@ static int prep_new_page(struct page *pa
page->flags &= ~(1 << PG_uptodate | 1 << PG_error | 1 << PG_reclaim |
1 << PG_referenced | 1 << PG_arch_1 |
- 1 << PG_owner_priv_1 | 1 << PG_mappedtodisk);
+ 1 << PG_owner_priv_1 | 1 << PG_mappedtodisk
+#ifdef CONFIG_NORECLAIM_MLOCK
+ | 1 << PG_mlocked
+#endif
+ );
set_page_private(page, 0);
set_page_refcounted(page);
Index: linux-2.6.26-rc2-mm1/mm/swap.c
===================================================================
--- linux-2.6.26-rc2-mm1.orig/mm/swap.c 2008-06-06 16:05:15.000000000 -0400
+++ linux-2.6.26-rc2-mm1/mm/swap.c 2008-06-06 16:06:28.000000000 -0400
@@ -307,7 +307,7 @@ void lru_add_drain(void)
put_cpu();
}
-#ifdef CONFIG_NUMA
+#if defined(CONFIG_NUMA) || defined(CONFIG_NORECLAIM_MLOCK)
static void lru_add_drain_per_cpu(struct work_struct *dummy)
{
lru_add_drain();
Index: linux-2.6.26-rc2-mm1/mm/memory.c
===================================================================
--- linux-2.6.26-rc2-mm1.orig/mm/memory.c 2008-05-23 14:21:34.000000000 -0400
+++ linux-2.6.26-rc2-mm1/mm/memory.c 2008-06-06 16:06:28.000000000 -0400
@@ -61,6 +61,8 @@
#include <linux/swapops.h>
#include <linux/elf.h>
+#include "internal.h"
+
#ifndef CONFIG_NEED_MULTIPLE_NODES
/* use the per-pgdat data instead for discontigmem - mbligh */
unsigned long max_mapnr;
@@ -1734,6 +1736,15 @@ gotten:
new_page = alloc_page_vma(GFP_HIGHUSER_MOVABLE, vma, address);
if (!new_page)
goto oom;
+ /*
+ * Don't let another task, with possibly unlocked vma,
+ * keep the mlocked page.
+ */
+ if (vma->vm_flags & VM_LOCKED) {
+ lock_page(old_page); /* for LRU manipulation */
+ clear_page_mlock(old_page);
+ unlock_page(old_page);
+ }
cow_user_page(new_page, old_page, address, vma);
__SetPageUptodate(new_page);
@@ -2176,7 +2187,7 @@ static int do_swap_page(struct mm_struct
page_add_anon_rmap(page, vma, address);
swap_free(entry);
- if (vm_swap_full())
+ if (vm_swap_full() || (vma->vm_flags & VM_LOCKED) || PageMlocked(page))
remove_exclusive_swap_page(page);
unlock_page(page);
@@ -2316,6 +2327,12 @@ static int __do_fault(struct mm_struct *
ret = VM_FAULT_OOM;
goto out;
}
+ /*
+ * Don't let another task, with possibly unlocked vma,
+ * keep the mlocked page.
+ */
+ if (vma->vm_flags & VM_LOCKED)
+ clear_page_mlock(vmf.page);
copy_user_highpage(page, vmf.page, address, vma);
__SetPageUptodate(page);
} else {
Index: linux-2.6.26-rc2-mm1/mm/mmap.c
===================================================================
--- linux-2.6.26-rc2-mm1.orig/mm/mmap.c 2008-05-15 11:20:57.000000000 -0400
+++ linux-2.6.26-rc2-mm1/mm/mmap.c 2008-06-06 16:06:28.000000000 -0400
@@ -652,7 +652,6 @@ again: remove_next = 1 + (end > next->
* If the vma has a ->close operation then the driver probably needs to release
* per-vma resources, so we don't attempt to merge those.
*/
-#define VM_SPECIAL (VM_IO | VM_DONTEXPAND | VM_RESERVED | VM_PFNMAP)
static inline int is_mergeable_vma(struct vm_area_struct *vma,
struct file *file, unsigned long vm_flags)
Index: linux-2.6.26-rc2-mm1/include/linux/mm.h
===================================================================
--- linux-2.6.26-rc2-mm1.orig/include/linux/mm.h 2008-06-06 16:06:24.000000000 -0400
+++ linux-2.6.26-rc2-mm1/include/linux/mm.h 2008-06-06 16:06:28.000000000 -0400
@@ -126,6 +126,11 @@ extern unsigned int kobjsize(const void
#define VM_RandomReadHint(v) ((v)->vm_flags & VM_RAND_READ)
/*
+ * special vmas that are non-mergeable, non-mlock()able
+ */
+#define VM_SPECIAL (VM_IO | VM_DONTEXPAND | VM_RESERVED | VM_PFNMAP)
+
+/*
* mapping from the currently active vm_flags protection bits (the
* low four bits) to a page protection mask..
*/
--
All Rights Reversed
* [PATCH -mm 18/25] Downgrade mmap sem while populating mlocked regions
2008-06-06 20:28 [PATCH -mm 00/25] VM pageout scalability improvements (V10) Rik van Riel, Rik van Riel
` (16 preceding siblings ...)
2008-06-06 20:28 ` [PATCH -mm 17/25] Mlocked Pages " Rik van Riel, Rik van Riel
@ 2008-06-06 20:28 ` Rik van Riel, Rik van Riel
2008-06-06 20:28 ` [PATCH -mm 19/25] Handle mlocked pages during map, remap, unmap Rik van Riel, Rik van Riel
` (7 subsequent siblings)
25 siblings, 0 replies; 102+ messages in thread
From: Rik van Riel, Rik van Riel @ 2008-06-06 20:28 UTC (permalink / raw)
To: linux-kernel; +Cc: Andrew Morton, Lee Schermerhorn, Kosaki Motohiro
[-- Attachment #1: rvr-18-lts-noreclaim-mlocked-page-statistics.patch --]
[-- Type: text/plain, Size: 4708 bytes --]
From: Lee Schermerhorn <lee.schermerhorn@hp.com>
Against: 2.6.26-rc2-mm1
We need to hold the mmap_sem for write to initiate mlock()/munlock()
because we may need to merge/split vmas. However, this can lead to
very long lock hold times attempting to fault in a large memory region
to mlock it into memory. This can hold off other faults against the
mm [multithreaded tasks] and other scans of the mm, such as via /proc.
To alleviate this, downgrade the mmap_sem to read mode during the
population of the region for locking. This is especially the case
if we need to reclaim memory to lock down the region. We [probably?]
don't need to do this for unlocking as all of the pages should be
resident--they're already mlocked.
Now, the callers of the mlock functions [mlock_fixup() and
mlock_vma_pages_range()] expect the mmap_sem to be returned in write
mode. Changing all callers appears to be way too much effort at this
point. So, restore write mode before returning. Note that this opens
a window where the mmap list could change in a multithreaded process.
So, at least for mlock_fixup(), where we could be called in a loop over
multiple vmas, we check that a vma still exists at the start address
and that vma still covers the page range [start,end). If not, we return
an error, -EAGAIN, and let the caller deal with it.
Return -EAGAIN from mlock_vma_pages_range() function and mlock_fixup()
if the vma at 'start' disappears or changes so that the page range
[start,end) is no longer contained in the vma. Again, let the caller
deal with it. Looks like only sys_remap_file_pages() [via mmap_region()]
should actually care.
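In miniature, the locking dance described above looks like this (a
condensed sketch, not the actual hunk; mlock_pages_and_revalidate() is
an illustrative name and error handling is trimmed):

static int mlock_pages_and_revalidate(struct mm_struct *mm,
				      struct vm_area_struct *vma,
				      unsigned long start, unsigned long end)
{
	int nr_pages;

	/* entered with mmap_sem held for write, as mlock_fixup() is */
	downgrade_write(&mm->mmap_sem);	/* other faults/scans may proceed */

	nr_pages = __mlock_vma_pages_range(vma, start, end);

	/*
	 * No atomic read->write upgrade exists, so drop the sem and
	 * retake it for write; the vma may change or vanish meanwhile.
	 */
	up_read(&mm->mmap_sem);
	down_write(&mm->mmap_sem);

	vma = find_vma(mm, start);
	if (!vma || end > vma->vm_end)
		return -EAGAIN;		/* caller must deal with it */
	return nr_pages;
}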
With this patch, I no longer see processes like ps(1) blocked for seconds
or minutes at a time waiting for a large [multiple gigabyte] region to be
locked down. However, I occasionally see delays while unlocking or
unmapping a large mlocked region. Should we also downgrade the mmap_sem
for the unlock path?
Signed-off-by: Lee Schermerhorn <lee.schermerhorn@hp.com>
Signed-off-by: Rik van Riel <riel@redhat.com>
---
V2 -> V3:
+ rebase to 23-mm1 atop RvR's split lru series [no change]
+ fix function return types [void -> int] to fix build when
not configured.
New in V2.
mm/mlock.c | 43 +++++++++++++++++++++++++++++++++++++++++--
1 file changed, 41 insertions(+), 2 deletions(-)
Index: linux-2.6.26-rc2-mm1/mm/mlock.c
===================================================================
--- linux-2.6.26-rc2-mm1.orig/mm/mlock.c 2008-06-06 16:06:28.000000000 -0400
+++ linux-2.6.26-rc2-mm1/mm/mlock.c 2008-06-06 16:06:32.000000000 -0400
@@ -309,6 +309,7 @@ static void __munlock_vma_pages_range(st
int mlock_vma_pages_range(struct vm_area_struct *vma,
unsigned long start, unsigned long end)
{
+ struct mm_struct *mm = vma->vm_mm;
int nr_pages = (end - start) / PAGE_SIZE;
BUG_ON(!(vma->vm_flags & VM_LOCKED));
@@ -323,7 +324,17 @@ int mlock_vma_pages_range(struct vm_area
vma == get_gate_vma(current))
goto make_present;
- return __mlock_vma_pages_range(vma, start, end);
+ downgrade_write(&mm->mmap_sem);
+ nr_pages = __mlock_vma_pages_range(vma, start, end);
+
+ up_read(&mm->mmap_sem);
+ /* vma can change or disappear */
+ down_write(&mm->mmap_sem);
+ vma = find_vma(mm, start);
+ /* non-NULL vma must contain @start, but need to check @end */
+ if (!vma || end > vma->vm_end)
+ return -EAGAIN;
+ return nr_pages;
make_present:
/*
@@ -418,13 +429,41 @@ success:
vma->vm_flags = newflags;
if (lock) {
+ /*
+ * mmap_sem is currently held for write. Downgrade the write
+ * lock to a read lock so that other faults, mmap scans, etc. can
+ * proceed while we fault in all pages.
+ */
+ downgrade_write(&mm->mmap_sem);
+
ret = __mlock_vma_pages_range(vma, start, end);
if (ret > 0) {
mm->locked_vm -= ret;
ret = 0;
}
- } else
+ /*
+ * Need to reacquire mmap sem in write mode, as our callers
+ * expect this. We have no support for atomically upgrading
+ * a sem to write, so we need to check for ranges while sem
+ * is unlocked.
+ */
+ up_read(&mm->mmap_sem);
+ /* vma can change or disappear */
+ down_write(&mm->mmap_sem);
+ *prev = find_vma(mm, start);
+ /* non-NULL *prev must contain @start, but need to check @end */
+ if (!(*prev) || end > (*prev)->vm_end)
+ ret = -EAGAIN;
+ } else {
+ /*
+ * TODO: for unlocking, pages will already be resident, so
+ * we don't need to wait for allocations/reclaim/pagein, ...
+ * However, unlocking a very large region can still take a
+ * while. Should we downgrade the semaphore for both lock
+ * AND unlock ?
+ */
__munlock_vma_pages_range(vma, start, end);
+ }
out:
*prev = vma;
--
All Rights Reversed
* [PATCH -mm 19/25] Handle mlocked pages during map, remap, unmap
2008-06-06 20:28 [PATCH -mm 00/25] VM pageout scalability improvements (V10) Rik van Riel, Rik van Riel
` (17 preceding siblings ...)
2008-06-06 20:28 ` [PATCH -mm 18/25] Downgrade mmap sem while populating mlocked regions Rik van Riel, Rik van Riel
@ 2008-06-06 20:28 ` Rik van Riel, Rik van Riel
2008-06-06 20:28 ` [PATCH -mm 20/25] Mlocked Pages statistics Rik van Riel, Rik van Riel
` (6 subsequent siblings)
25 siblings, 0 replies; 102+ messages in thread
From: Rik van Riel, Rik van Riel @ 2008-06-06 20:28 UTC (permalink / raw)
To: linux-kernel
Cc: Andrew Morton, Lee Schermerhorn, Kosaki Motohiro, linux-mm,
Eric Whitney
[-- Attachment #1: rvr-19-lts-noreclaim-cull-non-reclaimable-anon-pages-in-fault-path.patch --]
[-- Type: text/plain, Size: 11643 bytes --]
Originally From: Nick Piggin <npiggin@suse.de>
Against: 2.6.26-rc2-mm1
Remove mlocked pages from the LRU using "NoReclaim infrastructure"
during mmap(), munmap(), mremap() and truncate(). Try to move back
to normal LRU lists on munmap() when last mlocked mapping removed.
Remove PageMlocked() status when a page is truncated from its file.
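The unmap side boils down to a walk over the affected vmas before they
are detached; roughly (an illustrative helper, not the literal hunks in
do_munmap()/exit_mmap() below):

static void munlock_vmas(struct vm_area_struct *vma, unsigned long end)
{
	while (vma && vma->vm_start < end) {
		if (vma->vm_flags & VM_LOCKED)
			munlock_vma_pages_all(vma); /* back to normal LRUs */
		vma = vma->vm_next;
	}
}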
Originally Signed-off-by: Nick Piggin <npiggin@suse.de>
Signed-off-by: Lee Schermerhorn <lee.schermerhorn@hp.com>
Signed-off-by: Rik van Riel <riel@redhat.com>
Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
---
V6:
+ munlock page in range of VM_LOCKED vma being covered by
remap_file_pages(), as this is an implied unmap of the
range.
+ in support of special vma filtering, don't account for
non-mlockable vmas as locked_vm.
V2 -> V3:
+ rebase to 23-mm1 atop RvR's split lru series [no changes]
V1 -> V2:
+ modified mmap.c:mmap_region() to return error if mlock_vma_pages_range()
does. This can only occur if the vma gets removed/changed while
we're switching mmap_sem lock modes. Most callers don't care, but
sys_remap_file_pages() appears to.
Rework of Nick Piggin's "mm: move mlocked pages off the LRU" patch
-- part 2 of 2.
mm/fremap.c | 26 +++++++++++++++++---
mm/internal.h | 13 ++++++++--
mm/mlock.c | 10 ++++---
mm/mmap.c | 75 ++++++++++++++++++++++++++++++++++++++++++++--------------
mm/mremap.c | 8 +++---
mm/truncate.c | 4 +++
6 files changed, 106 insertions(+), 30 deletions(-)
Index: linux-2.6.26-rc2-mm1/mm/mmap.c
===================================================================
--- linux-2.6.26-rc2-mm1.orig/mm/mmap.c 2008-06-06 16:06:28.000000000 -0400
+++ linux-2.6.26-rc2-mm1/mm/mmap.c 2008-06-06 16:06:35.000000000 -0400
@@ -32,6 +32,8 @@
#include <asm/tlb.h>
#include <asm/mmu_context.h>
+#include "internal.h"
+
#ifndef arch_mmap_check
#define arch_mmap_check(addr, len, flags) (0)
#endif
@@ -961,6 +963,7 @@ unsigned long do_mmap_pgoff(struct file
return -EPERM;
vm_flags |= VM_LOCKED;
}
+
/* mlock MCL_FUTURE? */
if (vm_flags & VM_LOCKED) {
unsigned long locked, lock_limit;
@@ -1121,10 +1124,12 @@ munmap_back:
* The VM_SHARED test is necessary because shmem_zero_setup
* will create the file object for a shared anonymous map below.
*/
- if (!file && !(vm_flags & VM_SHARED) &&
- vma_merge(mm, prev, addr, addr + len, vm_flags,
- NULL, NULL, pgoff, NULL))
- goto out;
+ if (!file && !(vm_flags & VM_SHARED)) {
+ vma = vma_merge(mm, prev, addr, addr + len, vm_flags,
+ NULL, NULL, pgoff, NULL);
+ if (vma)
+ goto out;
+ }
/*
* Determine the object being mapped and call the appropriate
@@ -1206,10 +1211,14 @@ out:
mm->total_vm += len >> PAGE_SHIFT;
vm_stat_account(mm, vm_flags, file, len >> PAGE_SHIFT);
if (vm_flags & VM_LOCKED) {
- mm->locked_vm += len >> PAGE_SHIFT;
- make_pages_present(addr, addr + len);
- }
- if ((flags & MAP_POPULATE) && !(flags & MAP_NONBLOCK))
+ /*
+ * makes pages present; downgrades, drops, reacquires mmap_sem
+ */
+ int nr_pages = mlock_vma_pages_range(vma, addr, addr + len);
+ if (nr_pages < 0)
+ return nr_pages; /* vma gone! */
+ mm->locked_vm += (len >> PAGE_SHIFT) - nr_pages;
+ } else if ((flags & MAP_POPULATE) && !(flags & MAP_NONBLOCK))
make_pages_present(addr, addr + len);
return addr;
@@ -1682,8 +1691,11 @@ find_extend_vma(struct mm_struct *mm, un
return vma;
if (!prev || expand_stack(prev, addr))
return NULL;
- if (prev->vm_flags & VM_LOCKED)
- make_pages_present(addr, prev->vm_end);
+ if (prev->vm_flags & VM_LOCKED) {
+ int nr_pages = mlock_vma_pages_range(prev, addr, prev->vm_end);
+ if (nr_pages < 0)
+ return NULL; /* vma gone! */
+ }
return prev;
}
#else
@@ -1709,8 +1721,11 @@ find_extend_vma(struct mm_struct * mm, u
start = vma->vm_start;
if (expand_stack(vma, addr))
return NULL;
- if (vma->vm_flags & VM_LOCKED)
- make_pages_present(addr, start);
+ if (vma->vm_flags & VM_LOCKED) {
+ int nr_pages = mlock_vma_pages_range(vma, addr, start);
+ if (nr_pages < 0)
+ return NULL; /* vma gone! */
+ }
return vma;
}
#endif
@@ -1895,6 +1910,18 @@ int do_munmap(struct mm_struct *mm, unsi
vma = prev? prev->vm_next: mm->mmap;
/*
+ * unlock any mlock()ed ranges before detaching vmas
+ */
+ if (mm->locked_vm) {
+ struct vm_area_struct *tmp = vma;
+ while (tmp && tmp->vm_start < end) {
+ if (tmp->vm_flags & VM_LOCKED)
+ munlock_vma_pages_all(tmp);
+ tmp = tmp->vm_next;
+ }
+ }
+
+ /*
* Remove the vma's, and unmap the actual pages
*/
detach_vmas_to_be_unmapped(mm, vma, prev, end);
@@ -2006,8 +2033,9 @@ unsigned long do_brk(unsigned long addr,
return -ENOMEM;
/* Can we just expand an old private anonymous mapping? */
- if (vma_merge(mm, prev, addr, addr + len, flags,
- NULL, NULL, pgoff, NULL))
+ vma = vma_merge(mm, prev, addr, addr + len, flags,
+ NULL, NULL, pgoff, NULL);
+ if (vma)
goto out;
/*
@@ -2029,8 +2057,9 @@ unsigned long do_brk(unsigned long addr,
out:
mm->total_vm += len >> PAGE_SHIFT;
if (flags & VM_LOCKED) {
- mm->locked_vm += len >> PAGE_SHIFT;
- make_pages_present(addr, addr + len);
+ int nr_pages = mlock_vma_pages_range(vma, addr, addr + len);
+ if (nr_pages >= 0)
+ mm->locked_vm += (len >> PAGE_SHIFT) - nr_pages;
}
return addr;
}
@@ -2041,13 +2070,25 @@ EXPORT_SYMBOL(do_brk);
void exit_mmap(struct mm_struct *mm)
{
struct mmu_gather *tlb;
- struct vm_area_struct *vma = mm->mmap;
+ struct vm_area_struct *vma;
unsigned long nr_accounted = 0;
unsigned long end;
/* mm's last user has gone, and it's about to be pulled down */
arch_exit_mmap(mm);
+ if (mm->locked_vm) {
+ vma = mm->mmap;
+ while (vma) {
+ if (vma->vm_flags & VM_LOCKED)
+ munlock_vma_pages_all(vma);
+ vma = vma->vm_next;
+ }
+ }
+
+ vma = mm->mmap;
+
lru_add_drain();
flush_cache_mm(mm);
tlb = tlb_gather_mmu(mm, 1);
Index: linux-2.6.26-rc2-mm1/mm/mremap.c
===================================================================
--- linux-2.6.26-rc2-mm1.orig/mm/mremap.c 2008-05-15 11:20:24.000000000 -0400
+++ linux-2.6.26-rc2-mm1/mm/mremap.c 2008-06-06 16:06:35.000000000 -0400
@@ -23,6 +23,8 @@
#include <asm/cacheflush.h>
#include <asm/tlbflush.h>
+#include "internal.h"
+
static pmd_t *get_old_pmd(struct mm_struct *mm, unsigned long addr)
{
pgd_t *pgd;
@@ -232,8 +234,8 @@ static unsigned long move_vma(struct vm_
if (vm_flags & VM_LOCKED) {
mm->locked_vm += new_len >> PAGE_SHIFT;
if (new_len > old_len)
- make_pages_present(new_addr + old_len,
- new_addr + new_len);
+ mlock_vma_pages_range(new_vma, new_addr + old_len,
+ new_addr + new_len);
}
return new_addr;
@@ -373,7 +375,7 @@ unsigned long do_mremap(unsigned long ad
vm_stat_account(mm, vma->vm_flags, vma->vm_file, pages);
if (vma->vm_flags & VM_LOCKED) {
mm->locked_vm += pages;
- make_pages_present(addr + old_len,
+ mlock_vma_pages_range(vma, addr + old_len,
addr + new_len);
}
ret = addr;
Index: linux-2.6.26-rc2-mm1/mm/truncate.c
===================================================================
--- linux-2.6.26-rc2-mm1.orig/mm/truncate.c 2008-05-15 11:20:57.000000000 -0400
+++ linux-2.6.26-rc2-mm1/mm/truncate.c 2008-06-06 16:06:35.000000000 -0400
@@ -18,6 +18,7 @@
#include <linux/task_io_accounting_ops.h>
#include <linux/buffer_head.h> /* grr. try_to_release_page,
do_invalidatepage */
+#include "internal.h"
/**
@@ -104,6 +105,7 @@ truncate_complete_page(struct address_sp
cancel_dirty_page(page, PAGE_CACHE_SIZE);
remove_from_page_cache(page);
+ clear_page_mlock(page);
ClearPageUptodate(page);
ClearPageMappedToDisk(page);
page_cache_release(page); /* pagecache ref */
@@ -128,6 +130,7 @@ invalidate_complete_page(struct address_
if (PagePrivate(page) && !try_to_release_page(page, 0))
return 0;
+ clear_page_mlock(page);
ret = remove_mapping(mapping, page);
return ret;
@@ -353,6 +356,7 @@ invalidate_complete_page2(struct address
if (PageDirty(page))
goto failed;
+ clear_page_mlock(page);
BUG_ON(PagePrivate(page));
__remove_from_page_cache(page);
write_unlock_irq(&mapping->tree_lock);
Index: linux-2.6.26-rc2-mm1/mm/mlock.c
===================================================================
--- linux-2.6.26-rc2-mm1.orig/mm/mlock.c 2008-06-06 16:06:32.000000000 -0400
+++ linux-2.6.26-rc2-mm1/mm/mlock.c 2008-06-06 16:06:35.000000000 -0400
@@ -270,7 +270,8 @@ static void __munlock_vma_pages_range(st
struct munlock_page_walk mpw;
VM_BUG_ON(start & ~PAGE_MASK || end & ~PAGE_MASK);
- VM_BUG_ON(!rwsem_is_locked(&vma->vm_mm->mmap_sem));
+ VM_BUG_ON((!rwsem_is_locked(&vma->vm_mm->mmap_sem)) &&
+ (atomic_read(&mm->mm_users) != 0));
VM_BUG_ON(start < vma->vm_start);
VM_BUG_ON(end > vma->vm_end);
@@ -354,12 +355,13 @@ no_mlock:
/*
- * munlock all pages in vma. For munmap() and exit().
+ * munlock all pages in the vma range. For mremap(), munmap() and exit().
*/
-void munlock_vma_pages_all(struct vm_area_struct *vma)
+void munlock_vma_pages_range(struct vm_area_struct *vma,
+ unsigned long start, unsigned long end)
{
vma->vm_flags &= ~VM_LOCKED;
- __munlock_vma_pages_range(vma, vma->vm_start, vma->vm_end);
+ __munlock_vma_pages_range(vma, start, end);
}
/*
Index: linux-2.6.26-rc2-mm1/mm/fremap.c
===================================================================
--- linux-2.6.26-rc2-mm1.orig/mm/fremap.c 2008-05-15 11:20:43.000000000 -0400
+++ linux-2.6.26-rc2-mm1/mm/fremap.c 2008-06-06 16:06:35.000000000 -0400
@@ -20,6 +20,8 @@
#include <asm/cacheflush.h>
#include <asm/tlbflush.h>
+#include "internal.h"
+
static void zap_pte(struct mm_struct *mm, struct vm_area_struct *vma,
unsigned long addr, pte_t *ptep)
{
@@ -214,13 +216,29 @@ asmlinkage long sys_remap_file_pages(uns
spin_unlock(&mapping->i_mmap_lock);
}
+ if (vma->vm_flags & VM_LOCKED) {
+ /*
+ * drop PG_Mlocked flag for over-mapped range
+ */
+ unsigned long saved_flags = vma->vm_flags;
+ munlock_vma_pages_range(vma, start, start + size);
+ vma->vm_flags = saved_flags;
+ }
+
err = populate_range(mm, vma, start, size, pgoff);
if (!err && !(flags & MAP_NONBLOCK)) {
- if (unlikely(has_write_lock)) {
- downgrade_write(&mm->mmap_sem);
- has_write_lock = 0;
+ if (vma->vm_flags & VM_LOCKED) {
+ /*
+ * might be mapping previously unmapped range of file
+ */
+ mlock_vma_pages_range(vma, start, start + size);
+ } else {
+ if (unlikely(has_write_lock)) {
+ downgrade_write(&mm->mmap_sem);
+ has_write_lock = 0;
+ }
+ make_pages_present(start, start+size);
}
- make_pages_present(start, start+size);
}
/*
Index: linux-2.6.26-rc2-mm1/mm/internal.h
===================================================================
--- linux-2.6.26-rc2-mm1.orig/mm/internal.h 2008-06-06 16:06:28.000000000 -0400
+++ linux-2.6.26-rc2-mm1/mm/internal.h 2008-06-06 16:06:35.000000000 -0400
@@ -63,9 +63,18 @@ extern int mlock_vma_pages_range(struct
unsigned long start, unsigned long end);
/*
- * munlock all pages in vma. For munmap() and exit().
+ * munlock all pages in vma range. For mremap().
*/
-extern void munlock_vma_pages_all(struct vm_area_struct *vma);
+extern void munlock_vma_pages_range(struct vm_area_struct *vma,
+ unsigned long start, unsigned long end);
+
+/*
+ * munlock all pages in vma. For munmap and exit().
+ */
+static inline void munlock_vma_pages_all(struct vm_area_struct *vma)
+{
+ munlock_vma_pages_range(vma, vma->vm_start, vma->vm_end);
+}
#ifdef CONFIG_NORECLAIM_LRU
/*
--
All Rights Reversed
* [PATCH -mm 20/25] Mlocked Pages statistics
2008-06-06 20:28 [PATCH -mm 00/25] VM pageout scalability improvements (V10) Rik van Riel, Rik van Riel
` (18 preceding siblings ...)
2008-06-06 20:28 ` [PATCH -mm 19/25] Handle mlocked pages during map, remap, unmap Rik van Riel, Rik van Riel
@ 2008-06-06 20:28 ` Rik van Riel, Rik van Riel
2008-06-06 20:28 ` [PATCH -mm 21/25] Cull non-reclaimable pages in fault path Rik van Riel, Rik van Riel
` (5 subsequent siblings)
25 siblings, 0 replies; 102+ messages in thread
From: Rik van Riel, Rik van Riel @ 2008-06-06 20:28 UTC (permalink / raw)
To: linux-kernel
Cc: Andrew Morton, Lee Schermerhorn, Kosaki Motohiro, Nick Piggin
[-- Attachment #1: rvr-20-lts-noreclaim-optional-vm-events-for-debug.patch --]
[-- Type: text/plain, Size: 7634 bytes --]
From: Nick Piggin <npiggin@suse.de>
Add NR_MLOCK zone page state, which provides a (conservative) count of
mlocked pages (actually, the number of mlocked pages moved off the LRU).
Reworked by lts to fit in with the modified mlock page support in the
Reclaim Scalability series.
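The count stays conservative because the increments and decrements ride
on the atomic PG_mlocked transitions, so a page mapped by any number of
VM_LOCKED vmas is counted once; paraphrasing the mm/mlock.c hunks below
(helper names here are illustrative):

static void count_mlock(struct page *page)
{
	if (!TestSetPageMlocked(page))		/* only the 0 -> 1 transition */
		inc_zone_page_state(page, NR_MLOCK);
}

static void count_munlock(struct page *page)
{
	if (TestClearPageMlocked(page))		/* only the 1 -> 0 transition */
		dec_zone_page_state(page, NR_MLOCK);
}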
Signed-off-by: Nick Piggin <npiggin@suse.de>
Signed-off-by: Lee Schermerhorn <lee.schermerhorn@hp.com>
Signed-off-by: Rik van Riel <riel@redhat.com>
Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
---
V2 -> V3:
+ rebase to 23-mm1 atop RvR's split lru series
+ fix definitions of NR_MLOCK to fix build errors when not configured.
V1 -> V2:
+ new in V2 -- pulled in & reworked from Nick's previous series
drivers/base/node.c | 24 +++++++++++++++---------
fs/proc/proc_misc.c | 6 ++++++
include/linux/mmzone.h | 5 +++++
mm/internal.h | 14 +++++++++++---
mm/mlock.c | 22 ++++++++++++++++++----
mm/vmstat.c | 3 +++
6 files changed, 58 insertions(+), 16 deletions(-)
Index: linux-2.6.26-rc2-mm1/drivers/base/node.c
===================================================================
--- linux-2.6.26-rc2-mm1.orig/drivers/base/node.c 2008-06-06 16:05:58.000000000 -0400
+++ linux-2.6.26-rc2-mm1/drivers/base/node.c 2008-06-06 16:06:38.000000000 -0400
@@ -69,6 +69,9 @@ static ssize_t node_read_meminfo(struct
"Node %d Inactive(file): %8lu kB\n"
#ifdef CONFIG_NORECLAIM_LRU
"Node %d Noreclaim: %8lu kB\n"
+#ifdef CONFIG_NORECLAIM_MLOCK
+ "Node %d Mlocked: %8lu kB\n"
+#endif
#endif
#ifdef CONFIG_HIGHMEM
"Node %d HighTotal: %8lu kB\n"
@@ -91,16 +94,19 @@ static ssize_t node_read_meminfo(struct
nid, K(i.totalram),
nid, K(i.freeram),
nid, K(i.totalram - i.freeram),
- nid, node_page_state(nid, NR_ACTIVE_ANON) +
- node_page_state(nid, NR_ACTIVE_FILE),
- nid, node_page_state(nid, NR_INACTIVE_ANON) +
- node_page_state(nid, NR_INACTIVE_FILE),
- nid, node_page_state(nid, NR_ACTIVE_ANON),
- nid, node_page_state(nid, NR_INACTIVE_ANON),
- nid, node_page_state(nid, NR_ACTIVE_FILE),
- nid, node_page_state(nid, NR_INACTIVE_FILE),
+ nid, K(node_page_state(nid, NR_ACTIVE_ANON) +
+ node_page_state(nid, NR_ACTIVE_FILE)),
+ nid, K(node_page_state(nid, NR_INACTIVE_ANON) +
+ node_page_state(nid, NR_INACTIVE_FILE)),
+ nid, K(node_page_state(nid, NR_ACTIVE_ANON)),
+ nid, K(node_page_state(nid, NR_INACTIVE_ANON)),
+ nid, K(node_page_state(nid, NR_ACTIVE_FILE)),
+ nid, K(node_page_state(nid, NR_INACTIVE_FILE)),
#ifdef CONFIG_NORECLAIM_LRU
- nid, node_page_state(nid, NR_NORECLAIM),
+ nid, K(node_page_state(nid, NR_NORECLAIM)),
+#ifdef CONFIG_NORECLAIM_MLOCK
+ nid, K(node_page_state(nid, NR_MLOCK)),
+#endif
#endif
#ifdef CONFIG_HIGHMEM
nid, K(i.totalhigh),
Index: linux-2.6.26-rc2-mm1/fs/proc/proc_misc.c
===================================================================
--- linux-2.6.26-rc2-mm1.orig/fs/proc/proc_misc.c 2008-06-06 16:05:58.000000000 -0400
+++ linux-2.6.26-rc2-mm1/fs/proc/proc_misc.c 2008-06-06 16:06:38.000000000 -0400
@@ -176,6 +176,9 @@ static int meminfo_read_proc(char *page,
"Inactive(file): %8lu kB\n"
#ifdef CONFIG_NORECLAIM_LRU
"Noreclaim: %8lu kB\n"
+#ifdef CONFIG_NORECLAIM_MLOCK
+ "Mlocked: %8lu kB\n"
+#endif
#endif
#ifdef CONFIG_HIGHMEM
"HighTotal: %8lu kB\n"
@@ -214,6 +217,9 @@ static int meminfo_read_proc(char *page,
K(inactive_file),
#ifdef CONFIG_NORECLAIM_LRU
K(global_page_state(NR_NORECLAIM)),
+#ifdef CONFIG_NORECLAIM_MLOCK
+ K(global_page_state(NR_MLOCK)),
+#endif
#endif
#ifdef CONFIG_HIGHMEM
K(i.totalhigh),
Index: linux-2.6.26-rc2-mm1/include/linux/mmzone.h
===================================================================
--- linux-2.6.26-rc2-mm1.orig/include/linux/mmzone.h 2008-06-06 16:05:15.000000000 -0400
+++ linux-2.6.26-rc2-mm1/include/linux/mmzone.h 2008-06-06 16:06:38.000000000 -0400
@@ -90,6 +90,11 @@ enum zone_stat_item {
#else
NR_NORECLAIM = NR_ACTIVE_FILE, /* avoid compiler errors in dead code */
#endif
+#ifdef CONFIG_NORECLAIM_MLOCK
+ NR_MLOCK, /* mlock()ed pages found and moved off LRU */
+#else
+ NR_MLOCK = NR_ACTIVE_FILE, /* avoid compiler errors... */
+#endif
NR_ANON_PAGES, /* Mapped anonymous pages */
NR_FILE_MAPPED, /* pagecache pages mapped into pagetables.
only modified from process context */
Index: linux-2.6.26-rc2-mm1/mm/mlock.c
===================================================================
--- linux-2.6.26-rc2-mm1.orig/mm/mlock.c 2008-06-06 16:06:35.000000000 -0400
+++ linux-2.6.26-rc2-mm1/mm/mlock.c 2008-06-06 16:06:38.000000000 -0400
@@ -56,6 +56,7 @@ void __clear_page_mlock(struct page *pag
{
VM_BUG_ON(!PageLocked(page)); /* for LRU isolate/putback */
+ dec_zone_page_state(page, NR_MLOCK);
if (!isolate_lru_page(page)) {
putback_lru_page(page);
} else {
@@ -76,8 +77,11 @@ void mlock_vma_page(struct page *page)
{
BUG_ON(!PageLocked(page));
- if (!TestSetPageMlocked(page) && !isolate_lru_page(page))
+ if (!TestSetPageMlocked(page)) {
+ inc_zone_page_state(page, NR_MLOCK);
+ if (!isolate_lru_page(page))
putback_lru_page(page);
+ }
}
/*
@@ -102,9 +106,19 @@ static void munlock_vma_page(struct page
{
BUG_ON(!PageLocked(page));
- if (TestClearPageMlocked(page) && !isolate_lru_page(page)) {
- try_to_unlock(page);
- putback_lru_page(page);
+ if (TestClearPageMlocked(page)) {
+ dec_zone_page_state(page, NR_MLOCK);
+ if (!isolate_lru_page(page)) {
+ try_to_unlock(page); /* maybe relock the page */
+ putback_lru_page(page);
+ }
+ /*
+ * Else we lost the race. let try_to_unmap() deal with it.
+ * At least we get the page state and mlock stats right.
+ * However, page is still on the noreclaim list. We'll fix
+ * that up when the page is eventually freed or we scan the
+ * noreclaim list.
+ */
}
}
Index: linux-2.6.26-rc2-mm1/mm/vmstat.c
===================================================================
--- linux-2.6.26-rc2-mm1.orig/mm/vmstat.c 2008-06-06 16:05:58.000000000 -0400
+++ linux-2.6.26-rc2-mm1/mm/vmstat.c 2008-06-06 16:06:38.000000000 -0400
@@ -702,6 +702,9 @@ static const char * const vmstat_text[]
#ifdef CONFIG_NORECLAIM_LRU
"nr_noreclaim",
#endif
+#ifdef CONFIG_NORECLAIM_MLOCK
+ "nr_mlock",
+#endif
"nr_anon_pages",
"nr_mapped",
"nr_file_pages",
Index: linux-2.6.26-rc2-mm1/mm/internal.h
===================================================================
--- linux-2.6.26-rc2-mm1.orig/mm/internal.h 2008-06-06 16:06:35.000000000 -0400
+++ linux-2.6.26-rc2-mm1/mm/internal.h 2008-06-06 16:06:38.000000000 -0400
@@ -107,7 +107,8 @@ static inline int is_mlocked_vma(struct
if (likely((vma->vm_flags & (VM_LOCKED | VM_SPECIAL)) != VM_LOCKED))
return 0;
- SetPageMlocked(page);
+ if (!TestSetPageMlocked(page))
+ inc_zone_page_state(page, NR_MLOCK);
return 1;
}
@@ -134,12 +135,19 @@ static inline void clear_page_mlock(stru
/*
* mlock_migrate_page - called only from migrate_page_copy() to
- * migrate the Mlocked page flag
+ * migrate the Mlocked page flag; update statistics.
*/
static inline void mlock_migrate_page(struct page *newpage, struct page *page)
{
- if (TestClearPageMlocked(page))
+ if (TestClearPageMlocked(page)) {
+ unsigned long flags;
+
+ local_irq_save(flags);
+ __dec_zone_page_state(page, NR_MLOCK);
SetPageMlocked(newpage);
+ __inc_zone_page_state(newpage, NR_MLOCK);
+ local_irq_restore(flags);
+ }
}
--
All Rights Reversed
* [PATCH -mm 21/25] Cull non-reclaimable pages in fault path
2008-06-06 20:28 [PATCH -mm 00/25] VM pageout scalability improvements (V10) Rik van Riel, Rik van Riel
` (19 preceding siblings ...)
2008-06-06 20:28 ` [PATCH -mm 20/25] Mlocked Pages statistics Rik van Riel, Rik van Riel
@ 2008-06-06 20:28 ` Rik van Riel, Rik van Riel
2008-06-06 20:29 ` [PATCH -mm 22/25] Noreclaim and Mlocked pages vm events Rik van Riel, Rik van Riel
` (4 subsequent siblings)
25 siblings, 0 replies; 102+ messages in thread
From: Rik van Riel, Rik van Riel @ 2008-06-06 20:28 UTC (permalink / raw)
To: linux-kernel
Cc: Andrew Morton, Lee Schermerhorn, Kosaki Motohiro, linux-mm,
Eric Whitney
[-- Attachment #1: rvr-21-lts-noreclaim-optional-scan-noreclaim-list-for-reclaimable-pages.patch --]
[-- Type: text/plain, Size: 6191 bytes --]
From: Lee Schermerhorn <lee.schermerhorn@hp.com>
Against: 2.6.26-rc2-mm1
V2 -> V3:
+ rebase to 23-mm1 atop RvR's split lru series.
V1 -> V2:
+ no changes
"Optional" part of "noreclaim infrastructure"
In the fault paths that install new anonymous pages, check whether
the page is reclaimable or not using lru_cache_add_active_or_noreclaim().
If the page is reclaimable, just add it to the active lru list [via
the pagevec cache], else add it to the noreclaim list.
This "proactive" culling in the fault path mimics the handling of
mlocked pages in Nick Piggin's series to keep mlocked pages off
the lru lists.
Notes:
1) This patch is optional--e.g., if one is concerned about the
additional test in the fault path. We can defer the moving of
nonreclaimable pages until when vmscan [shrink_*_list()]
encounters them. Vmscan will only need to handle such pages
once.
2) The 'vma' argument to page_reclaimable() is required to notice that
we're faulting a page into an mlock()ed vma w/o having to scan the
page's rmap in the fault path. Culling mlock()ed anon pages is
currently the only reason for this patch.
3) We can't cull swap pages in read_swap_cache_async() because the
vma argument doesn't necessarily correspond to the swap cache
offset passed in by swapin_readahead(). This could [did!] result
in mlocking pages in non-VM_LOCKED vmas if [when] we tried to
cull in this path.
4) Move set_pte_at() to after where we add page to lru to keep it
hidden from other tasks that might walk the page table.
We already do it in this order in do_anonymous_page(). And,
these are COW'd anon pages. Is this safe? (See the sketch below.)
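For reference, the ordering note 4 asks about ends up as follows (a
sketch only; install_new_anon_page() is an illustrative wrapper around
the statements in the reworked do_wp_page()/__do_fault()):

static void install_new_anon_page(struct mm_struct *mm,
				  struct vm_area_struct *vma,
				  unsigned long address, pte_t *page_table,
				  pte_t entry, struct page *new_page)
{
	SetPageSwapBacked(new_page);
	lru_cache_add_active_or_noreclaim(new_page, vma); /* LRU list first */
	page_add_new_anon_rmap(new_page, vma, address);
	set_pte_at(mm, address, page_table, entry);	/* now visible */
	update_mmu_cache(vma, address, entry);
}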
Signed-off-by: Lee Schermerhorn <lee.schermerhorn@hp.com>
Signed-off-by: Rik van Riel <riel@redhat.com>
include/linux/swap.h | 2 ++
mm/memory.c | 20 ++++++++++++--------
mm/swap.c | 21 +++++++++++++++++++++
3 files changed, 35 insertions(+), 8 deletions(-)
Index: linux-2.6.26-rc2-mm1/mm/memory.c
===================================================================
--- linux-2.6.26-rc2-mm1.orig/mm/memory.c 2008-06-06 16:06:28.000000000 -0400
+++ linux-2.6.26-rc2-mm1/mm/memory.c 2008-06-06 16:06:44.000000000 -0400
@@ -1774,12 +1774,15 @@ gotten:
* thread doing COW.
*/
ptep_clear_flush(vma, address, page_table);
- set_pte_at(mm, address, page_table, entry);
- update_mmu_cache(vma, address, entry);
+
SetPageSwapBacked(new_page);
- lru_cache_add_active_anon(new_page);
+ lru_cache_add_active_or_noreclaim(new_page, vma);
page_add_new_anon_rmap(new_page, vma, address);
+//TODO: is this safe? do_anonymous_page() does it this way.
+ set_pte_at(mm, address, page_table, entry);
+ update_mmu_cache(vma, address, entry);
+
/* Free the old page.. */
new_page = old_page;
ret |= VM_FAULT_WRITE;
@@ -2246,7 +2249,7 @@ static int do_anonymous_page(struct mm_s
goto release;
inc_mm_counter(mm, anon_rss);
SetPageSwapBacked(page);
- lru_cache_add_active_anon(page);
+ lru_cache_add_active_or_noreclaim(page, vma);
page_add_new_anon_rmap(page, vma, address);
set_pte_at(mm, address, page_table, entry);
@@ -2390,12 +2393,11 @@ static int __do_fault(struct mm_struct *
entry = mk_pte(page, vma->vm_page_prot);
if (flags & FAULT_FLAG_WRITE)
entry = maybe_mkwrite(pte_mkdirty(entry), vma);
- set_pte_at(mm, address, page_table, entry);
if (anon) {
- inc_mm_counter(mm, anon_rss);
+ inc_mm_counter(mm, anon_rss);
SetPageSwapBacked(page);
- lru_cache_add_active_anon(page);
- page_add_new_anon_rmap(page, vma, address);
+ lru_cache_add_active_or_noreclaim(page, vma);
+ page_add_new_anon_rmap(page, vma, address);
} else {
inc_mm_counter(mm, file_rss);
page_add_file_rmap(page);
@@ -2404,6 +2406,8 @@ static int __do_fault(struct mm_struct *
get_page(dirty_page);
}
}
+//TODO: is this safe? do_anonymous_page() does it this way.
+ set_pte_at(mm, address, page_table, entry);
/* no need to invalidate: a not-present page won't be cached */
update_mmu_cache(vma, address, entry);
Index: linux-2.6.26-rc2-mm1/include/linux/swap.h
===================================================================
--- linux-2.6.26-rc2-mm1.orig/include/linux/swap.h 2008-06-06 16:06:24.000000000 -0400
+++ linux-2.6.26-rc2-mm1/include/linux/swap.h 2008-06-06 16:06:44.000000000 -0400
@@ -173,6 +173,8 @@ extern unsigned int nr_free_pagecache_pa
/* linux/mm/swap.c */
extern void __lru_cache_add(struct page *, enum lru_list lru);
extern void lru_cache_add_lru(struct page *, enum lru_list lru);
+extern void lru_cache_add_active_or_noreclaim(struct page *,
+ struct vm_area_struct *);
extern void activate_page(struct page *);
extern void mark_page_accessed(struct page *);
extern void lru_add_drain(void);
Index: linux-2.6.26-rc2-mm1/mm/swap.c
===================================================================
--- linux-2.6.26-rc2-mm1.orig/mm/swap.c 2008-06-06 16:06:28.000000000 -0400
+++ linux-2.6.26-rc2-mm1/mm/swap.c 2008-06-06 16:06:44.000000000 -0400
@@ -31,6 +31,8 @@
#include <linux/backing-dev.h>
#include <linux/memcontrol.h>
+#include "internal.h"
+
/* How many pages do we try to swap or page in/out together? */
int page_cluster;
@@ -273,6 +275,25 @@ void add_page_to_noreclaim_list(struct p
spin_unlock_irq(&zone->lru_lock);
}
+/**
+ * lru_cache_add_active_or_noreclaim - add page to active or noreclaim LRU list
+ * @page: the page to be added to LRU
+ * @vma: vma in which page is mapped for determining reclaimability
+ *
+ * place @page on active or noreclaim LRU list, depending on
+ * page_reclaimable(). Note that if the page is not reclaimable,
+ * it goes directly back onto its zone's noreclaim list. It does
+ * NOT use a per cpu pagevec.
+ */
+void lru_cache_add_active_or_noreclaim(struct page *page,
+ struct vm_area_struct *vma)
+{
+ if (page_reclaimable(page, vma))
+ lru_cache_add_lru(page, LRU_ACTIVE + page_file_cache(page));
+ else
+ add_page_to_noreclaim_list(page);
+}
+
/*
* Drain pages out of the cpu's pagevecs.
* Either "cpu" is the current CPU, and preemption has already been
--
All Rights Reversed
* [PATCH -mm 22/25] Noreclaim and Mlocked pages vm events
2008-06-06 20:28 [PATCH -mm 00/25] VM pageout scalability improvements (V10) Rik van Riel, Rik van Riel
` (20 preceding siblings ...)
2008-06-06 20:28 ` [PATCH -mm 21/25] Cull non-reclaimable pages in fault path Rik van Riel, Rik van Riel
@ 2008-06-06 20:29 ` Rik van Riel, Rik van Riel
2008-06-06 20:29 ` [PATCH -mm 23/25] Noreclaim LRU scan sysctl Rik van Riel, Rik van Riel
` (3 subsequent siblings)
25 siblings, 0 replies; 102+ messages in thread
From: Rik van Riel, Rik van Riel @ 2008-06-06 20:29 UTC (permalink / raw)
To: linux-kernel; +Cc: Andrew Morton, Lee Schermerhorn, Kosaki Motohiro
[-- Attachment #1: rvr-22-lts-noreclaim-vm-events.patch --]
[-- Type: text/plain, Size: 7246 bytes --]
From: Lee Schermerhorn <lee.schermerhorn@hp.com>
Against: 2.6.26-rc2-mm1
Add some event counters to vmstats for testing noreclaim/mlock.
Some of these might be interesting enough to keep around.
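Most of these counters key off a page-state transition rather than a
call site, so each page fires at most one event per transition; e.g. the
cull/rescue pair in the putback_lru_page() hunk below reduces to this
(an illustrative helper, not a hunk):

static void count_putback(struct page *page, int reclaimable)
{
	int was_nonreclaimable = TestClearPageNoreclaim(page);

	if (reclaimable) {
		if (was_nonreclaimable)
			count_vm_event(NORECL_PGRESCUED);
	} else {
		if (!was_nonreclaimable)
			count_vm_event(NORECL_PGCULLED);
	}
}

The counters show up in /proc/vmstat under the names added to
vmstat_text[] below.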
Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Signed-off-by: Lee Schermerhorn <lee.schermerhorn@hp.com>
Signed-off-by: Rik van Riel <riel@redhat.com>
---
include/linux/vmstat.h | 11 +++++++++++
mm/internal.h | 4 +++-
mm/mlock.c | 33 +++++++++++++++++++++++++--------
mm/vmscan.c | 16 +++++++++++++++-
mm/vmstat.c | 12 ++++++++++++
5 files changed, 66 insertions(+), 10 deletions(-)
Index: linux-2.6.26-rc2-mm1/include/linux/vmstat.h
===================================================================
--- linux-2.6.26-rc2-mm1.orig/include/linux/vmstat.h 2008-05-23 14:21:34.000000000 -0400
+++ linux-2.6.26-rc2-mm1/include/linux/vmstat.h 2008-06-06 16:06:48.000000000 -0400
@@ -41,6 +41,17 @@ enum vm_event_item { PGPGIN, PGPGOUT, PS
#ifdef CONFIG_HUGETLB_PAGE
HTLB_BUDDY_PGALLOC, HTLB_BUDDY_PGALLOC_FAIL,
#endif
+#ifdef CONFIG_NORECLAIM_LRU
+ NORECL_PGCULLED, /* culled to noreclaim list */
+ NORECL_PGSCANNED, /* scanned for reclaimability */
+ NORECL_PGRESCUED, /* rescued from noreclaim list */
+#ifdef CONFIG_NORECLAIM_MLOCK
+ NORECL_PGMLOCKED,
+ NORECL_PGMUNLOCKED,
+ NORECL_PGCLEARED,
+ NORECL_PGSTRANDED, /* unable to isolate on unlock */
+#endif
+#endif
NR_VM_EVENT_ITEMS
};
Index: linux-2.6.26-rc2-mm1/mm/vmscan.c
===================================================================
--- linux-2.6.26-rc2-mm1.orig/mm/vmscan.c 2008-06-06 16:06:28.000000000 -0400
+++ linux-2.6.26-rc2-mm1/mm/vmscan.c 2008-06-06 16:06:48.000000000 -0400
@@ -453,12 +453,13 @@ int putback_lru_page(struct page *page)
{
int lru;
int ret = 1;
+ int was_nonreclaimable;
VM_BUG_ON(!PageLocked(page));
VM_BUG_ON(PageLRU(page));
lru = !!TestClearPageActive(page);
- ClearPageNoreclaim(page); /* for page_reclaimable() */
+ was_nonreclaimable = TestClearPageNoreclaim(page);
if (unlikely(!page->mapping)) {
/*
@@ -478,6 +479,10 @@ int putback_lru_page(struct page *page)
lru += page_file_cache(page);
lru_cache_add_lru(page, lru);
mem_cgroup_move_lists(page, lru);
+#ifdef CONFIG_NORECLAIM_LRU
+ if (was_nonreclaimable)
+ count_vm_event(NORECL_PGRESCUED);
+#endif
} else {
/*
* Put non-reclaimable pages directly on zone's noreclaim
@@ -485,6 +490,10 @@ int putback_lru_page(struct page *page)
*/
add_page_to_noreclaim_list(page);
mem_cgroup_move_lists(page, LRU_NORECLAIM);
+#ifdef CONFIG_NORECLAIM_LRU
+ if (!was_nonreclaimable)
+ count_vm_event(NORECL_PGCULLED);
+#endif
}
put_page(page); /* drop ref from isolate */
@@ -2366,6 +2375,7 @@ static void check_move_noreclaim_page(st
__dec_zone_state(zone, NR_NORECLAIM);
list_move(&page->lru, &zone->list[l]);
__inc_zone_state(zone, NR_INACTIVE_ANON + l);
+ __count_vm_event(NORECL_PGRESCUED);
} else {
/*
* rotate noreclaim list
@@ -2397,6 +2407,7 @@ void scan_mapping_noreclaim_pages(struct
while (next < end &&
pagevec_lookup(&pvec, mapping, next, PAGEVEC_SIZE)) {
int i;
+ int pg_scanned = 0;
zone = NULL;
@@ -2405,6 +2416,7 @@ void scan_mapping_noreclaim_pages(struct
pgoff_t page_index = page->index;
struct zone *pagezone = page_zone(page);
+ pg_scanned++;
if (page_index > next)
next = page_index;
next++;
@@ -2435,6 +2447,8 @@ void scan_mapping_noreclaim_pages(struct
if (zone)
spin_unlock_irq(&zone->lru_lock);
pagevec_release(&pvec);
+
+ count_vm_events(NORECL_PGSCANNED, pg_scanned);
}
}
Index: linux-2.6.26-rc2-mm1/mm/vmstat.c
===================================================================
--- linux-2.6.26-rc2-mm1.orig/mm/vmstat.c 2008-06-06 16:06:38.000000000 -0400
+++ linux-2.6.26-rc2-mm1/mm/vmstat.c 2008-06-06 16:06:48.000000000 -0400
@@ -759,6 +759,18 @@ static const char * const vmstat_text[]
"htlb_buddy_alloc_success",
"htlb_buddy_alloc_fail",
#endif
+
+#ifdef CONFIG_NORECLAIM_LRU
+ "noreclaim_pgs_culled",
+ "noreclaim_pgs_scanned",
+ "noreclaim_pgs_rescued",
+#ifdef CONFIG_NORECLAIM_MLOCK
+ "noreclaim_pgs_mlocked",
+ "noreclaim_pgs_munlocked",
+ "noreclaim_pgs_cleared",
+ "noreclaim_pgs_stranded",
+#endif
+#endif
#endif
};
Index: linux-2.6.26-rc2-mm1/mm/mlock.c
===================================================================
--- linux-2.6.26-rc2-mm1.orig/mm/mlock.c 2008-06-06 16:06:38.000000000 -0400
+++ linux-2.6.26-rc2-mm1/mm/mlock.c 2008-06-06 16:06:48.000000000 -0400
@@ -18,6 +18,7 @@
#include <linux/rmap.h>
#include <linux/mmzone.h>
#include <linux/hugetlb.h>
+#include <linux/vmstat.h>
#include "internal.h"
@@ -57,6 +58,7 @@ void __clear_page_mlock(struct page *pag
VM_BUG_ON(!PageLocked(page)); /* for LRU isolate/putback */
dec_zone_page_state(page, NR_MLOCK);
+ count_vm_event(NORECL_PGCLEARED);
if (!isolate_lru_page(page)) {
putback_lru_page(page);
} else {
@@ -66,6 +68,8 @@ void __clear_page_mlock(struct page *pag
lru_add_drain_all();
if (!isolate_lru_page(page))
putback_lru_page(page);
+ else if (PageNoreclaim(page))
+ count_vm_event(NORECL_PGSTRANDED);
}
}
@@ -79,6 +83,7 @@ void mlock_vma_page(struct page *page)
if (!TestSetPageMlocked(page)) {
inc_zone_page_state(page, NR_MLOCK);
+ count_vm_event(NORECL_PGMLOCKED);
if (!isolate_lru_page(page))
putback_lru_page(page);
}
@@ -109,16 +114,28 @@ static void munlock_vma_page(struct page
if (TestClearPageMlocked(page)) {
dec_zone_page_state(page, NR_MLOCK);
if (!isolate_lru_page(page)) {
- try_to_unlock(page); /* maybe relock the page */
+ int ret = try_to_unlock(page);
+ /*
+ * did try_to_unlock() succeed or punt?
+ */
+ if (ret == SWAP_SUCCESS || ret == SWAP_AGAIN)
+ count_vm_event(NORECL_PGMUNLOCKED);
+
putback_lru_page(page);
+ } else {
+ /*
+ * We lost the race. let try_to_unmap() deal
+ * with it. At least we get the page state and
+ * mlock stats right. However, page is still on
+ * the noreclaim list. We'll fix that up when
+ * the page is eventually freed or we scan the
+ * noreclaim list.
+ */
+ if (PageNoreclaim(page))
+ count_vm_event(NORECL_PGSTRANDED);
+ else
+ count_vm_event(NORECL_PGMUNLOCKED);
}
- /*
- * Else we lost the race. let try_to_unmap() deal with it.
- * At least we get the page state and mlock stats right.
- * However, page is still on the noreclaim list. We'll fix
- * that up when the page is eventually freed or we scan the
- * noreclaim list.
- */
}
}
Index: linux-2.6.26-rc2-mm1/mm/internal.h
===================================================================
--- linux-2.6.26-rc2-mm1.orig/mm/internal.h 2008-06-06 16:06:38.000000000 -0400
+++ linux-2.6.26-rc2-mm1/mm/internal.h 2008-06-06 16:06:48.000000000 -0400
@@ -107,8 +107,10 @@ static inline int is_mlocked_vma(struct
if (likely((vma->vm_flags & (VM_LOCKED | VM_SPECIAL)) != VM_LOCKED))
return 0;
- if (!TestSetPageMlocked(page))
+ if (!TestSetPageMlocked(page)) {
inc_zone_page_state(page, NR_MLOCK);
+ count_vm_event(NORECL_PGMLOCKED);
+ }
return 1;
}
--
All Rights Reversed
* [PATCH -mm 23/25] Noreclaim LRU scan sysctl
2008-06-06 20:28 [PATCH -mm 00/25] VM pageout scalability improvements (V10) Rik van Riel, Rik van Riel
` (21 preceding siblings ...)
2008-06-06 20:29 ` [PATCH -mm 22/25] Noreclaim and Mlocked pages vm events Rik van Riel, Rik van Riel
@ 2008-06-06 20:29 ` Rik van Riel, Rik van Riel
2008-06-06 20:29 ` [PATCH -mm 24/25] Mlocked Pages: count attempts to free mlocked page Rik van Riel, Rik van Riel
` (2 subsequent siblings)
25 siblings, 0 replies; 102+ messages in thread
From: Rik van Riel, Rik van Riel @ 2008-06-06 20:29 UTC (permalink / raw)
To: linux-kernel
Cc: Andrew Morton, Lee Schermerhorn, Kosaki Motohiro, linux-mm,
Eric Whitney
[-- Attachment #1: rvr-23-lts-noreclaim-lru-scan-sysctl.patch --]
[-- Type: text/plain, Size: 11463 bytes --]
From: Lee Schermerhorn <lee.schermerhorn@hp.com>
Against: 2.6.26-rc2-mm1
V6:
+ moved to end of series as optional debug patch
V2 -> V3:
+ rebase to 23-mm1 atop RvR's split LRU series
New in V2
This patch adds a function to scan individual or all zones' noreclaim
lists and move any pages that have become reclaimable onto the respective
zone's inactive list, where shrink_inactive_list() will deal with them.
It adds a sysctl to scan all nodes, and per-node attributes to scan
individual nodes' zones.
Kosaki:
If a reclaimable page is found on the noreclaim lru during a write to
/proc/sys/vm/scan_noreclaim_pages, print the filename and file offset
of the page.
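For reference, a minimal userspace sketch of triggering a global scan
(assumes the sysctl added below; illustrative only, untested):
    #include <fcntl.h>
    #include <unistd.h>

    int main(void)
    {
        /* any non-zero value written here kicks off the scan */
        int fd = open("/proc/sys/vm/scan_noreclaim_pages", O_WRONLY);

        if (fd < 0)
            return 1;
        write(fd, "1\n", 2);
        close(fd);
        return 0;
    }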
TODO: DEBUGGING ONLY: NOT FOR UPSTREAM MERGE
Signed-off-by: Lee Schermerhorn <lee.schermerhorn@hp.com>
Signed-off-by: Rik van Riel <riel@redhat.com>
Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
drivers/base/node.c | 5 +
include/linux/rmap.h | 3
include/linux/swap.h | 15 ++++
kernel/sysctl.c | 10 +++
mm/rmap.c | 4 -
mm/vmscan.c | 161 +++++++++++++++++++++++++++++++++++++++++++++++++++
6 files changed, 196 insertions(+), 2 deletions(-)
Index: linux-2.6.26-rc2-mm1/include/linux/swap.h
===================================================================
--- linux-2.6.26-rc2-mm1.orig/include/linux/swap.h 2008-06-06 16:06:44.000000000 -0400
+++ linux-2.6.26-rc2-mm1/include/linux/swap.h 2008-06-06 16:06:52.000000000 -0400
@@ -7,6 +7,7 @@
#include <linux/list.h>
#include <linux/memcontrol.h>
#include <linux/sched.h>
+#include <linux/node.h>
#include <asm/atomic.h>
#include <asm/page.h>
@@ -235,15 +236,29 @@ static inline int zone_reclaim(struct zo
#ifdef CONFIG_NORECLAIM_LRU
extern int page_reclaimable(struct page *page, struct vm_area_struct *vma);
extern void scan_mapping_noreclaim_pages(struct address_space *);
+
+extern unsigned long scan_noreclaim_pages;
+extern int scan_noreclaim_handler(struct ctl_table *, int, struct file *,
+ void __user *, size_t *, loff_t *);
+extern int scan_noreclaim_register_node(struct node *node);
+extern void scan_noreclaim_unregister_node(struct node *node);
#else
static inline int page_reclaimable(struct page *page,
struct vm_area_struct *vma)
{
return 1;
}
+
static inline void scan_mapping_noreclaim_pages(struct address_space *mapping)
{
}
+
+static inline int scan_noreclaim_register_node(struct node *node)
+{
+ return 0;
+}
+
+static inline void scan_noreclaim_unregister_node(struct node *node) { }
#endif
extern int kswapd_run(int nid);
Index: linux-2.6.26-rc2-mm1/mm/vmscan.c
===================================================================
--- linux-2.6.26-rc2-mm1.orig/mm/vmscan.c 2008-06-06 16:06:48.000000000 -0400
+++ linux-2.6.26-rc2-mm1/mm/vmscan.c 2008-06-06 16:06:52.000000000 -0400
@@ -39,6 +39,7 @@
#include <linux/kthread.h>
#include <linux/freezer.h>
#include <linux/memcontrol.h>
+#include <linux/sysctl.h>
#include <asm/tlbflush.h>
#include <asm/div64.h>
@@ -2355,6 +2356,37 @@ int page_reclaimable(struct page *page,
return 1;
}
+static void show_page_path(struct page *page)
+{
+ char buf[256];
+ if (page_file_cache(page)) {
+ struct address_space *mapping = page->mapping;
+ struct dentry *dentry;
+ pgoff_t pgoff = page->index << (PAGE_CACHE_SHIFT - PAGE_SHIFT);
+
+ spin_lock(&mapping->i_mmap_lock);
+ dentry = d_find_alias(mapping->host);
+ if (dentry)
+ printk(KERN_INFO "rescued: %s %lu\n",
+ dentry_path(dentry, buf, sizeof(buf)), pgoff);
+ spin_unlock(&mapping->i_mmap_lock);
+ if (dentry)
+ dput(dentry); /* d_find_alias() took a reference */
+ } else {
+ struct anon_vma *anon_vma;
+ struct vm_area_struct *vma;
+
+ anon_vma = page_lock_anon_vma(page);
+ if (!anon_vma)
+ return;
+
+ list_for_each_entry(vma, &anon_vma->head, anon_vma_node) {
+ printk(KERN_INFO "rescued: anon %s\n",
+ vma->vm_mm->owner->comm);
+ break;
+ }
+ page_unlock_anon_vma(anon_vma);
+ }
+}
+
+
/**
* check_move_noreclaim_page - check page for reclaimability and move to appropriate lru list
* @page: page to check reclaimability and move to appropriate lru list
@@ -2372,6 +2404,9 @@ static void check_move_noreclaim_page(st
ClearPageNoreclaim(page); /* for page_reclaimable() */
if (page_reclaimable(page, NULL)) {
enum lru_list l = LRU_INACTIVE_ANON + page_file_cache(page);
+
+ show_page_path(page);
+
__dec_zone_state(zone, NR_NORECLAIM);
list_move(&page->lru, &zone->list[l]);
__inc_zone_state(zone, NR_INACTIVE_ANON + l);
@@ -2452,4 +2487,130 @@ void scan_mapping_noreclaim_pages(struct
}
}
+
+/**
+ * scan_zone_noreclaim_pages - check noreclaim list for reclaimable pages
+ * @zone: zone whose noreclaim list is to be scanned
+ *
+ * Scan @zone's noreclaim LRU lists to check for pages that have become
+ * reclaimable. Move those that have to @zone's inactive list where they
+ * become candidates for reclaim, unless shrink_inactive_zone() decides
+ * to reactivate them. Pages that are still non-reclaimable are rotated
+ * back onto @zone's noreclaim list.
+ */
+#define SCAN_NORECLAIM_BATCH_SIZE 16UL /* arbitrary lock hold batch size */
+void scan_zone_noreclaim_pages(struct zone *zone)
+{
+ struct list_head *l_noreclaim = &zone->list[LRU_NORECLAIM];
+ unsigned long scan;
+ unsigned long nr_to_scan = zone_page_state(zone, NR_NORECLAIM);
+
+ while (nr_to_scan > 0) {
+ unsigned long batch_size = min(nr_to_scan,
+ SCAN_NORECLAIM_BATCH_SIZE);
+
+ spin_lock_irq(&zone->lru_lock);
+ for (scan = 0; scan < batch_size; scan++) {
+ struct page *page = lru_to_page(l_noreclaim);
+
+ if (TestSetPageLocked(page))
+ continue;
+
+ prefetchw_prev_lru_page(page, l_noreclaim, flags);
+
+ if (likely(PageLRU(page) && PageNoreclaim(page)))
+ check_move_noreclaim_page(page, zone);
+
+ unlock_page(page);
+ }
+ spin_unlock_irq(&zone->lru_lock);
+
+ nr_to_scan -= batch_size;
+ }
+}
+
+
+/**
+ * scan_all_zones_noreclaim_pages - scan all noreclaim lists for reclaimable pages
+ *
+ * A really big hammer: scan all zones' noreclaim LRU lists to check for
+ * pages that have become reclaimable. Move those back to the zones'
+ * inactive list where they become candidates for reclaim.
+ * This occurs when, e.g., we have unswappable pages on the noreclaim lists,
+ * and we add swap to the system. As such, it runs in the context of a task
+ * that has possibly/probably made some previously non-reclaimable pages
+ * reclaimable.
+ */
+void scan_all_zones_noreclaim_pages(void)
+{
+ struct zone *zone;
+
+ for_each_zone(zone) {
+ scan_zone_noreclaim_pages(zone);
+ }
+}
+
+/*
+ * scan_noreclaim_pages [vm] sysctl handler. On demand re-scan of
+ * all nodes' noreclaim lists for reclaimable pages
+ */
+unsigned long scan_noreclaim_pages;
+
+int scan_noreclaim_handler(struct ctl_table *table, int write,
+ struct file *file, void __user *buffer,
+ size_t *length, loff_t *ppos)
+{
+ proc_doulongvec_minmax(table, write, file, buffer, length, ppos);
+
+ if (write && *(unsigned long *)table->data)
+ scan_all_zones_noreclaim_pages();
+
+ scan_noreclaim_pages = 0;
+ return 0;
+}
+
+/*
+ * per node 'scan_noreclaim_pages' attribute. On demand re-scan of
+ * a specified node's per zone noreclaim lists for reclaimable pages.
+ */
+
+static ssize_t read_scan_noreclaim_node(struct sys_device *dev, char *buf)
+{
+ return sprintf(buf, "0\n"); /* always zero; should fit... */
+}
+
+static ssize_t write_scan_noreclaim_node(struct sys_device *dev,
+ const char *buf, size_t count)
+{
+ struct zone *node_zones = NODE_DATA(dev->id)->node_zones;
+ struct zone *zone;
+ unsigned long req;
+
+ if (strict_strtoul(buf, 10, &req))
+ return -EINVAL; /* reject malformed input */
+
+ if (!req)
+ return count; /* zero is a no-op */
+
+ for (zone = node_zones; zone - node_zones < MAX_NR_ZONES; ++zone) {
+ if (!populated_zone(zone))
+ continue;
+ scan_zone_noreclaim_pages(zone);
+ }
+ return count;
+}
+
+
+static SYSDEV_ATTR(scan_noreclaim_pages, S_IRUGO | S_IWUSR,
+ read_scan_noreclaim_node,
+ write_scan_noreclaim_node);
+
+int scan_noreclaim_register_node(struct node *node)
+{
+ return sysdev_create_file(&node->sysdev, &attr_scan_noreclaim_pages);
+}
+
+void scan_noreclaim_unregister_node(struct node *node)
+{
+ sysdev_remove_file(&node->sysdev, &attr_scan_noreclaim_pages);
+}
+
#endif
Index: linux-2.6.26-rc2-mm1/kernel/sysctl.c
===================================================================
--- linux-2.6.26-rc2-mm1.orig/kernel/sysctl.c 2008-05-15 11:21:11.000000000 -0400
+++ linux-2.6.26-rc2-mm1/kernel/sysctl.c 2008-06-06 16:06:52.000000000 -0400
@@ -1151,6 +1151,16 @@ static struct ctl_table vm_table[] = {
.extra2 = &one,
},
#endif
+#ifdef CONFIG_NORECLAIM_LRU
+ {
+ .ctl_name = CTL_UNNUMBERED,
+ .procname = "scan_noreclaim_pages",
+ .data = &scan_noreclaim_pages,
+ .maxlen = sizeof(scan_noreclaim_pages),
+ .mode = 0644,
+ .proc_handler = &scan_noreclaim_handler,
+ },
+#endif
/*
* NOTE: do not add new entries to this table unless you have read
* Documentation/sysctl/ctl_unnumbered.txt
Index: linux-2.6.26-rc2-mm1/drivers/base/node.c
===================================================================
--- linux-2.6.26-rc2-mm1.orig/drivers/base/node.c 2008-06-06 16:06:38.000000000 -0400
+++ linux-2.6.26-rc2-mm1/drivers/base/node.c 2008-06-06 16:06:52.000000000 -0400
@@ -13,6 +13,7 @@
#include <linux/nodemask.h>
#include <linux/cpu.h>
#include <linux/device.h>
+#include <linux/swap.h>
static struct sysdev_class node_class = {
.name = "node",
@@ -190,6 +191,8 @@ int register_node(struct node *node, int
sysdev_create_file(&node->sysdev, &attr_meminfo);
sysdev_create_file(&node->sysdev, &attr_numastat);
sysdev_create_file(&node->sysdev, &attr_distance);
+
+ scan_noreclaim_register_node(node);
}
return error;
}
@@ -209,6 +212,8 @@ void unregister_node(struct node *node)
sysdev_remove_file(&node->sysdev, &attr_numastat);
sysdev_remove_file(&node->sysdev, &attr_distance);
+ scan_noreclaim_unregister_node(node);
+
sysdev_unregister(&node->sysdev);
}
Index: linux-2.6.26-rc2-mm1/include/linux/rmap.h
===================================================================
--- linux-2.6.26-rc2-mm1.orig/include/linux/rmap.h 2008-06-06 16:06:28.000000000 -0400
+++ linux-2.6.26-rc2-mm1/include/linux/rmap.h 2008-06-06 16:06:52.000000000 -0400
@@ -55,6 +55,9 @@ void anon_vma_unlink(struct vm_area_stru
void anon_vma_link(struct vm_area_struct *);
void __anon_vma_link(struct vm_area_struct *);
+extern struct anon_vma *page_lock_anon_vma(struct page *page);
+extern void page_unlock_anon_vma(struct anon_vma *anon_vma);
+
/*
* rmap interfaces called when adding or removing pte of page
*/
Index: linux-2.6.26-rc2-mm1/mm/rmap.c
===================================================================
--- linux-2.6.26-rc2-mm1.orig/mm/rmap.c 2008-06-06 16:06:28.000000000 -0400
+++ linux-2.6.26-rc2-mm1/mm/rmap.c 2008-06-06 16:06:52.000000000 -0400
@@ -168,7 +168,7 @@ void __init anon_vma_init(void)
* Getting a lock on a stable anon_vma from a page off the LRU is
* tricky: page_lock_anon_vma rely on RCU to guard against the races.
*/
-static struct anon_vma *page_lock_anon_vma(struct page *page)
+struct anon_vma *page_lock_anon_vma(struct page *page)
{
struct anon_vma *anon_vma;
unsigned long anon_mapping;
@@ -188,7 +188,7 @@ out:
return NULL;
}
-static void page_unlock_anon_vma(struct anon_vma *anon_vma)
+void page_unlock_anon_vma(struct anon_vma *anon_vma)
{
spin_unlock(&anon_vma->lock);
rcu_read_unlock();
--
All Rights Reversed
* [PATCH -mm 24/25] Mlocked Pages: count attempts to free mlocked page
2008-06-06 20:28 [PATCH -mm 00/25] VM pageout scalability improvements (V10) Rik van Riel, Rik van Riel
` (22 preceding siblings ...)
2008-06-06 20:29 ` [PATCH -mm 23/25] Noreclaim LRU scan sysctl Rik van Riel, Rik van Riel
@ 2008-06-06 20:29 ` Rik van Riel, Rik van Riel
2008-06-06 20:29 ` [PATCH -mm 25/25] Noreclaim LRU and Mlocked Pages Documentation Rik van Riel, Rik van Riel
2008-06-06 21:02 ` [PATCH -mm 00/25] VM pageout scalability improvements (V10) Andrew Morton
25 siblings, 0 replies; 102+ messages in thread
From: Rik van Riel, Rik van Riel @ 2008-06-06 20:29 UTC (permalink / raw)
To: linux-kernel; +Cc: Andrew Morton, Lee Schermerhorn, Kosaki Motohiro
[-- Attachment #1: rvr-24-lts-count-free-mlocked-pages.patch --]
[-- Type: text/plain, Size: 3334 bytes --]
From: Lee Schermerhorn <lee.schermerhorn@hp.com>
Allow freeing of mlock()ed pages. This shouldn't happen, but during
development it occasionally did.
This patch allows us to survive that condition, while keeping the
statistics and events correct for debug.
Signed-off-by: Lee Schermerhorn <lee.schermerhorn@hp.com>
Signed-off-by: Rik van Riel <riel@redhat.com>
---
include/linux/vmstat.h | 1 +
mm/internal.h | 17 +++++++++++++++++
mm/page_alloc.c | 1 +
mm/vmstat.c | 1 +
4 files changed, 20 insertions(+)
Index: linux-2.6.26-rc2-mm1/mm/internal.h
===================================================================
--- linux-2.6.26-rc2-mm1.orig/mm/internal.h 2008-06-06 16:06:48.000000000 -0400
+++ linux-2.6.26-rc2-mm1/mm/internal.h 2008-06-06 16:06:56.000000000 -0400
@@ -152,6 +152,22 @@ static inline void mlock_migrate_page(st
}
}
+/*
+ * free_page_mlock() -- clean up attempts to free an mlock()ed page.
+ * Page should not be on lru, so no need to fix that up.
+ * free_pages_check() will verify...
+ */
+static inline void free_page_mlock(struct page *page)
+{
+ if (unlikely(TestClearPageMlocked(page))) {
+ unsigned long flags;
+
+ local_irq_save(flags);
+ __dec_zone_page_state(page, NR_MLOCK);
+ __count_vm_event(NORECL_MLOCKFREED);
+ local_irq_restore(flags);
+ }
+}
#else /* CONFIG_NORECLAIM_MLOCK */
static inline int is_mlocked_vma(struct vm_area_struct *v, struct page *p)
@@ -161,6 +177,7 @@ static inline int is_mlocked_vma(struct
static inline void clear_page_mlock(struct page *page) { }
static inline void mlock_vma_page(struct page *page) { }
static inline void mlock_migrate_page(struct page *new, struct page *old) { }
+static inline void free_page_mlock(struct page *page) { }
#endif /* CONFIG_NORECLAIM_MLOCK */
Index: linux-2.6.26-rc2-mm1/mm/page_alloc.c
===================================================================
--- linux-2.6.26-rc2-mm1.orig/mm/page_alloc.c 2008-06-06 16:06:28.000000000 -0400
+++ linux-2.6.26-rc2-mm1/mm/page_alloc.c 2008-06-06 16:06:56.000000000 -0400
@@ -484,6 +484,7 @@ static inline void __free_one_page(struc
static inline int free_pages_check(struct page *page)
{
+ free_page_mlock(page);
if (unlikely(page_mapcount(page) |
(page->mapping != NULL) |
(page_get_page_cgroup(page) != NULL) |
Index: linux-2.6.26-rc2-mm1/include/linux/vmstat.h
===================================================================
--- linux-2.6.26-rc2-mm1.orig/include/linux/vmstat.h 2008-06-06 16:06:48.000000000 -0400
+++ linux-2.6.26-rc2-mm1/include/linux/vmstat.h 2008-06-06 16:06:56.000000000 -0400
@@ -50,6 +50,7 @@ enum vm_event_item { PGPGIN, PGPGOUT, PS
NORECL_PGMUNLOCKED,
NORECL_PGCLEARED,
NORECL_PGSTRANDED, /* unable to isolate on unlock */
+ NORECL_MLOCKFREED,
#endif
#endif
NR_VM_EVENT_ITEMS
Index: linux-2.6.26-rc2-mm1/mm/vmstat.c
===================================================================
--- linux-2.6.26-rc2-mm1.orig/mm/vmstat.c 2008-06-06 16:06:48.000000000 -0400
+++ linux-2.6.26-rc2-mm1/mm/vmstat.c 2008-06-06 16:06:56.000000000 -0400
@@ -769,6 +769,7 @@ static const char * const vmstat_text[]
"noreclaim_pgs_munlocked",
"noreclaim_pgs_cleared",
"noreclaim_pgs_stranded",
+ "noreclaim_pgs_mlockfreed",
#endif
#endif
#endif
--
All Rights Reversed
* [PATCH -mm 25/25] Noreclaim LRU and Mlocked Pages Documentation
2008-06-06 20:28 [PATCH -mm 00/25] VM pageout scalability improvements (V10) Rik van Riel, Rik van Riel
` (23 preceding siblings ...)
2008-06-06 20:29 ` [PATCH -mm 24/25] Mlocked Pages: count attempts to free mlocked page Rik van Riel, Rik van Riel
@ 2008-06-06 20:29 ` Rik van Riel, Rik van Riel
2008-06-06 21:02 ` [PATCH -mm 00/25] VM pageout scalability improvements (V10) Andrew Morton
25 siblings, 0 replies; 102+ messages in thread
From: Rik van Riel, Rik van Riel @ 2008-06-06 20:29 UTC (permalink / raw)
To: linux-kernel
Cc: Andrew Morton, Lee Schermerhorn, Kosaki Motohiro, linux-mm,
Eric Whitney
[-- Attachment #1: rvr-25-lts-noreclaim-mlock-documentation.patch --]
[-- Type: text/plain, Size: 36022 bytes --]
From: Lee Schermerhorn <lee.schermerhorn@hp.com>
Documentation for noreclaim lru list and its usage.
Signed-off-by: Lee Schermerhorn <lee.schermerhorn@hp.com>
Signed-off-by: Rik van Riel <riel@redhat.com>
---
Documentation/vm/noreclaim-lru.txt | 609 +++++++++++++++++++++++++++++++++++++
1 file changed, 609 insertions(+)
Index: linux-2.6.26-rc2-mm1/Documentation/vm/noreclaim-lru.txt
===================================================================
--- /dev/null 1970-01-01 00:00:00.000000000 +0000
+++ linux-2.6.26-rc2-mm1/Documentation/vm/noreclaim-lru.txt 2008-06-06 16:07:01.000000000 -0400
@@ -0,0 +1,609 @@
+
+This document describes the Linux memory management "Noreclaim LRU"
+infrastructure and the use of this infrastructure to manage several types
+of "non-reclaimable" pages. The document attempts to provide the overall
+rationale behind this mechanism and the rationale for some of the design
+decisions that drove the implementation. The latter design rationale is
+discussed in the context of an implementation description. Admittedly, one
+can obtain the implementation details--the "what does it do?"--by reading the
+code. One hopes that the descriptions below add value by provide the answer
+to "why does it do that?".
+
+Noreclaim LRU Infrastructure:
+
+The Noreclaim LRU adds an additional LRU list to track non-reclaimable pages
+and to hide these pages from vmscan. This mechanism is based on a patch by
+Larry Woodman of Red Hat to address several scalability problems with page
+reclaim in Linux. The problems have been observed at customer sites on large
+memory x86_64 systems. For example, a non-NUMA x86_64 platform with 128GB
+of main memory will have over 32 million 4k pages in a single zone. When a
+large fraction of these pages are not reclaimable for any reason [see below],
+vmscan will spend a lot of time scanning the LRU lists looking for the small
+fraction of pages that are reclaimable. This can result in a situation where
+all cpus are spending 100% of their time in vmscan for hours or days on end,
+with the system completely unresponsive.
+
+The Noreclaim LRU infrastructure addresses the following classes of
+non-reclaimable pages:
+
++ pages owned by ram disks or ramfs
++ pages mapped into SHM_LOCKed shared memory regions
++ pages mapped into VM_LOCKED [mlock()ed] vmas
+
+The infrastructure might be able to handle other conditions that make pages
+nonreclaimable, either by definition or by circumstance, in the future.
+
+
+The Noreclaim LRU List
+
+The Noreclaim LRU infrastructure consists of an additional, per-zone, LRU list
+called the "noreclaim" list and an associated page flag, PG_noreclaim, to
+indicate that the page is being managed on the noreclaim list. The PG_noreclaim
+flag is analogous to, and mutually exclusive with, the PG_active flag in that
+it indicates on which LRU list a page resides when PG_lru is set. The
+noreclaim LRU list is configurable at build time via the NORECLAIM_LRU
+Kconfig option.
+
+Why maintain nonreclaimable pages on an additional LRU list? The Linux memory
+management subsystem has well established protocols for managing pages on the
+LRU. Vmscan is based on LRU lists. LRU lists exist per zone, and we want to
+maintain pages relative to their "home zone". All of these make the use of
+an additional list, parallel to the LRU active and inactive lists, a natural
+mechanism to employ. Note, however, that the noreclaim list does not
+differentiate between file backed and swap backed [anon] pages. This
+differentiation is only important while the pages are, in fact, reclaimable.
+
+The noreclaim LRU list benefits from the "arrayification" of the per-zone
+LRU lists and statistics originally proposed and posted by Christoph Lameter.
+
+Note that the noreclaim list does not use the lru pagevec mechanism. Rather,
+nonreclaimable pages are placed directly on the page's zone's noreclaim
+list under the zone lru_lock. The reason for this is to prevent stranding
+of pages on the noreclaim list when one task has the page isolated from the
+lru and other tasks are changing the "reclaimability" state of the page.
+
+
+Noreclaim LRU and Memory Controller Interaction
+
+The memory controller data structure automatically gets a per zone noreclaim
+lru list as a result of the "arrayification" of the per-zone LRU lists. The
+memory controller tracks the movement of pages to and from the noreclaim list.
+When a memory control group comes under memory pressure, the controller will
+not attempt to reclaim pages on the noreclaim list. This has a couple of
+effects. Because the pages are "hidden" from reclaim on the noreclaim list,
+the reclaim process can be more efficient, dealing only with pages that have
+a chance of being reclaimed. On the other hand, if too many of the pages
+charged to the control group are non-reclaimable, the reclaimable portion of the
+working set of the tasks in the control group may not fit into the available
+memory. This can cause the control group to thrash or to oom-kill tasks.
+
+
+Noreclaim LRU: Detecting Non-reclaimable Pages
+
+The function page_reclaimable(page, vma) in vmscan.c determines whether a
+page is reclaimable or not. For ramfs and ram disk [brd] pages and pages in
+SHM_LOCKed regions, page_reclaimable() tests a new address space flag,
+AS_NORECLAIM, in the page's address space using a wrapper function.
+Wrapper functions are used to set, clear and test the flag to reduce the
+requirement for #ifdef's throughout the source code. AS_NORECLAIM is set on
+ramfs inode/mapping when it is created and on ram disk inode/mappings at open
+time. This flag remains for the life of the inode.
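+
+As a minimal sketch [assuming the wrapper names used elsewhere in this
+series, and that the AS_NORECLAIM bit lives in mapping->flags], the
+wrappers look something like:
+
+    static inline void mapping_set_noreclaim(struct address_space *mapping)
+    {
+        set_bit(AS_NORECLAIM, &mapping->flags);
+    }
+
+    static inline int mapping_non_reclaimable(struct address_space *mapping)
+    {
+        return mapping && test_bit(AS_NORECLAIM, &mapping->flags);
+    }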
+
+For shared memory regions, AS_NORECLAIM is set when an application successfully
+SHM_LOCKs the region and is removed when the region is SHM_UNLOCKed. Note that
+shmctl(SHM_LOCK, ...) does not populate the page tables for the region as does,
+for example, mlock(). So, we make no special effort to push any pages in the
+SHM_LOCKed region to the noreclaim list. Vmscan will do this when/if it
+encounters the pages during reclaim. On SHM_UNLOCK, shmctl() scans the pages
+in the region and "rescues" them from the noreclaim list if no other condition
+keeps them non-reclaimable. If a SHM_LOCKed region is destroyed, the pages
+are also "rescued" from the noreclaim list in the process of freeing them.
+
+page_reclaimable() detects mlock()ed pages by testing an additional page flag,
+PG_mlocked, via the PageMlocked() wrapper. If the page is NOT mlocked, and a
+non-NULL vma is supplied, page_reclaimable() will check whether the vma is
+VM_LOCKED via is_mlocked_vma(). is_mlocked_vma() will SetPageMlocked() and
+update the appropriate statistics if the vma is VM_LOCKED. This method allows
+efficient "culling" of pages in the fault path that are being faulted in to
+VM_LOCKED vmas.
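+
+Condensed into a sketch [the real page_reclaimable() in mm/vmscan.c
+carries more commentary], the checks described above amount to:
+
+    int page_reclaimable(struct page *page, struct vm_area_struct *vma)
+    {
+        /* ramfs, ram disk and SHM_LOCKed pages */
+        if (mapping_non_reclaimable(page_mapping(page)))
+            return 0;
+        /* already mlocked, or being faulted into a VM_LOCKED vma */
+        if (PageMlocked(page) || (vma && is_mlocked_vma(vma, page)))
+            return 0;
+        return 1;
+    }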
+
+
+Non-reclaimable Pages and Vmscan [shrink_*_list()]
+
+If non-reclaimable pages are culled in the fault path, or moved to the
+noreclaim list at mlock() or mmap() time, vmscan will never encounter the pages
+until they have become reclaimable again, for example, via munlock() and have
+been "rescued" from the noreclaim list. However, there may be situations where
+we decide, for the sake of expediency, to leave a non-reclaimable page on one of
+the regular active/inactive LRU lists for vmscan to deal with. Vmscan checks
+for such pages in all of the shrink_{active|inactive|page}_list() functions and
+will "cull" such pages that it encounters--that is, it diverts those pages to
+the noreclaim list for the zone being scanned.
+
+There may be situations where a page is mapped into a VM_LOCKED vma, but the
+page is not marked as PageMlocked. Such pages will make it all the way to
+shrink_page_list() where they will be detected when vmscan walks the reverse
+map in try_to_unmap(). If try_to_unmap() returns SWAP_MLOCK, shrink_page_list()
+will cull the page at that point.
+
+Note that for anonymous pages, shrink_page_list() attempts to add the page to
+the swap cache before it tries to unmap the page. To avoid this unnecessary
+consumption of swap space, shrink_page_list() calls try_to_unlock() to check
+whether any VM_LOCKED vmas map the page without attempting to unmap the page.
+If try_to_unlock() returns SWAP_MLOCK, shrink_page_list() will cull the page
+without consuming swap space. try_to_unlock() will be described below.
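+
+A sketch of that check, roughly as it would appear near the top of
+shrink_page_list() [simplified; the label names are illustrative]:
+
+    if (PageAnon(page) && !PageSwapCache(page)) {
+        /* don't allocate swap space for mlocked pages */
+        if (try_to_unlock(page) == SWAP_MLOCK)
+            goto cull_mlocked;
+        if (!add_to_swap(page, GFP_ATOMIC))
+            goto activate_locked;
+    }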
+
+
+Mlocked Page: Prior Work
+
+The "Noreclaim Mlocked Pages" infrastructure is based on work originally posted
+by Nick Piggin in an RFC patch entitled "mm: mlocked pages off LRU". Nick
+posted his patch as an alternative to a patch posted by Christoph Lameter to
+achieve the same objective--hiding mlocked pages from vmscan. In Nick's patch,
+he used one of the struct page lru list link fields as a count of VM_LOCKED
+vmas that map the page. This use of the link field for a count prevented the
+management of the pages on an LRU list. When Nick's patch was integrated with
+the Noreclaim LRU work, the count was replaced by walking the reverse map to
+determine whether any VM_LOCKED vmas mapped the page. More on this below.
+The primary reason for wanting to keep mlocked pages on an LRU list is that
+mlocked pages are migratable, and the LRU list is used to arbitrate tasks
+attempting to migrate the same page. Whichever task succeeds in "isolating"
+the page from the LRU performs the migration.
+
+
+Mlocked Pages: Basic Management
+
+Mlocked pages--pages mapped into a VM_LOCKED vma--represent one class of
+nonreclaimable pages. When such a page has been "noticed" by the memory
+management subsystem, the page is marked with the PG_mlocked [PageMlocked()]
+flag. A PageMlocked() page will be placed on the noreclaim LRU list when
+it is added to the LRU. Pages can be "noticed" by memory management in
+several places:
+
+1) in the mlock()/mlockall() system call handlers.
+2) in the mmap() system call handler when mmap()ing a region with the
+ MAP_LOCKED flag, or mmap()ing a region in a task that has called
+ mlockall() with the MCL_FUTURE flag. Both of these conditions result
+ in the VM_LOCKED flag being set for the vma.
+3) in the fault path, if mlocked pages are "culled" in the fault path,
+ and when a VM_LOCKED stack segment is expanded.
+4) as mentioned above, in vmscan:shrink_page_list() when attempting to
+ reclaim a page in a VM_LOCKED vma--via try_to_unmap() or try_to_unlock().
+
+Mlocked pages become unlocked and rescued from the noreclaim list when:
+
+1) mapped in a range unlocked via the munlock()/munlockall() system calls.
+2) munmapped() out of the last VM_LOCKED vma that maps the page, including
+ unmapping at task exit.
+3) when the page is truncated from the last VM_LOCKED vma of an mmap()ed file.
+4) before a page is COWed in a VM_LOCKED vma.
+
+
+Mlocked Pages: mlock()/mlockall() System Call Handling
+
+Both [do_]mlock() and [do_]mlockall() system call handlers call mlock_fixup()
+for each vma in the range specified by the call. In the case of mlockall(),
+this is the entire active address space of the task. Note that mlock_fixup()
+is used for both mlock()ing and munlock()ing a range of memory. A call to
+mlock() an already VM_LOCKED vma, or to munlock() a vma that is not VM_LOCKED
+is treated as a no-op--mlock_fixup() simply returns.
+
+If the vma passes some filtering described in "Mlocked Pages: Filtering Vmas"
+below, mlock_fixup() will attempt to merge the vma with its neighbors or split
+off a subset of the vma if the range does not cover the entire vma. Once the
+vma has been merged or split or neither, mlock_fixup() will call
+__mlock_vma_pages_range() to fault in the pages via get_user_pages() and
+to mark the pages as mlocked via mlock_vma_page().
+
+Note that the vma being mlocked might be mapped with PROT_NONE. In this case,
+get_user_pages() will be unable to fault in the pages. That's OK. If pages
+do end up getting faulted into this VM_LOCKED vma, we'll handle them in the
+fault path or in vmscan.
+
+Also note that a page returned by get_user_pages() could be truncated or
+migrated out from under us, while we're trying to mlock it. To detect
+this, __mlock_vma_pages_range() tests the page_mapping after acquiring
+the page lock. If the page is still associated with its mapping, we'll
+go ahead and call mlock_vma_page(). If the mapping is gone, we just
+unlock the page and move on. Worst case, this results in a page mapped
+in a VM_LOCKED vma remaining on a normal LRU list without being
+PageMlocked(). Again, vmscan will detect and cull such pages.
+
+mlock_vma_page(), called with the page locked [N.B., not "mlocked"], will
+TestSetPageMlocked() for each page returned by get_user_pages(). We use
+TestSetPageMlocked() because the page might already be mlocked by another
+task/vma and we don't want to do extra work. We especially do not want to
+count an mlocked page more than once in the statistics. If the page was
+already mlocked, mlock_vma_page() is done.
+
+If the page was NOT already mlocked, mlock_vma_page() attempts to isolate the
+page from the LRU, as it is likely on the appropriate active or inactive list
+at that time. If the isolate_lru_page() succeeds, mlock_vma_page() will
+putback the page--putback_lru_page()--which will notice that the page is now
+mlocked and divert the page to the zone's noreclaim LRU list. If
+mlock_vma_page() is unable to isolate the page from the LRU, vmscan will handle
+it later if/when it attempts to reclaim the page.
+
+
+Mlocked Pages: Filtering Vmas
+
+mlock_fixup() filters several classes of "special" vmas:
+
+1) vmas with VM_IO|VM_PFNMAP set are skipped entirely. The pages behind
+ these mappings are inherently pinned, so we don't need to mark them as
+ mlocked. In any case, most of the pages have no struct page in which to
+ so mark the page. Because of this, get_user_pages() will fail for these
+ vmas, so there is no sense in attempting to visit them.
+
+2) vmas mapping hugetlbfs pages are already effectively pinned into memory.
+ We don't need nor want to mlock() these pages. However, to preserve the
+ prior behavior of mlock()--before the noreclaim/mlock changes--mlock_fixup()
+ will call make_pages_present() in the hugetlbfs vma range to allocate the
+ huge pages and populate the ptes.
+
+3) vmas with VM_DONTEXPAND|VM_RESERVED are generally user space mappings of
+ kernel pages, such as the vdso page, relay channel pages, etc. These pages
+ are inherently non-reclaimable and are not managed on the LRU lists.
+ mlock_fixup() treats these vmas the same as hugetlbfs vmas. It calls
+ make_pages_present() to populate the ptes.
+
+Note that for all of these special vmas, mlock_fixup() does not set the
+VM_LOCKED flag. Therefore, we won't have to deal with them later during
+munlock() or munmap()--for example, at task exit. Neither does mlock_fixup()
+account these vmas against the task's "locked_vm".
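+
+A condensed sketch of the filtering described above [illustrative; the
+real checks live in mlock_fixup()]:
+
+    if (vma->vm_flags & (VM_IO | VM_PFNMAP))
+        goto out;    /* case 1: skip entirely */
+
+    if (is_vm_hugetlb_page(vma) ||
+        (vma->vm_flags & (VM_DONTEXPAND | VM_RESERVED))) {
+        make_pages_present(start, end);    /* cases 2 and 3 */
+        goto out;    /* VM_LOCKED stays clear */
+    }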
+
+Mlocked Pages: Downgrading the Mmap Semaphore.
+
+mlock_fixup() must be called with the mmap semaphore held for write, because
+it may have to merge or split vmas. However, mlocking a large region of
+memory can take a long time--especially if vmscan must reclaim pages to
+satisfy the region's requirements. Faulting in a large region with the mmap
+semaphore held for write can hold off other faults on the address space, in
+the case of a multi-threaded task. It can also hold off scans of the task's
+address space via /proc. While testing under heavy load, it was observed that
+the ps(1) command could be held off for many minutes while a large segment was
+mlock()ed down.
+
+To address this issue, and to make the system more responsive during mlock()ing
+of large segments, mlock_fixup() downgrades the mmap semaphore to read mode
+during the call to __mlock_vma_pages_range(). This works fine. However, the
+callers of mlock_fixup() expect the semaphore to be returned in write mode.
+So, mlock_fixup() "upgrades" the semaphore to write mode. Linux does not
+support an atomic upgrade_sem() call, so mlock_fixup() must drop the semaphore
+and reacquire it in write mode. In a multi-threaded task, it is possible for
+the task memory map to change while the semaphore is dropped. Therefore,
+mlock_fixup() looks up the vma at the range start address after reacquiring
+the semaphore in write mode and verifies that it still covers the original
+range. If not, mlock_fixup() returns an error [-EAGAIN]. All callers of
+mlock_fixup() have been changed to deal with this new error condition.
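+
+A sketch of the downgrade/upgrade dance [simplified from mlock_fixup()]:
+
+    downgrade_write(&mm->mmap_sem);
+    ret = __mlock_vma_pages_range(vma, start, end);
+
+    /* no atomic upgrade: drop and reacquire for write */
+    up_read(&mm->mmap_sem);
+    down_write(&mm->mmap_sem);
+
+    /* the map may have changed while the semaphore was dropped */
+    vma = find_vma(mm, start);
+    if (!vma || vma->vm_start > start || vma->vm_end < end)
+        ret = -EAGAIN;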
+
+Note: when munlocking a region, all of the pages should already be resident--
+unless we have racing threads mlocking() and munlocking() regions. So,
+unlocking should not have to wait for page allocations nor faults of any kind.
+Therefore mlock_fixup() does not downgrade the semaphore for munlock().
+
+
+Mlocked Pages: munlock()/munlockall() System Call Handling
+
+The munlock() and munlockall() system calls are handled by the same functions--
+do_mlock[all]()--as the mlock() and mlockall() system calls with the unlock
+vs lock operation indicated by an argument. So, these system calls are also
+handled by mlock_fixup(). Again, if called for an already munlock()ed vma,
+mlock_fixup() simply returns. Because of the vma filtering discussed above,
+VM_LOCKED will not be set in any "special" vmas. So, these vmas will be
+ignored for munlock.
+
+If the vma is VM_LOCKED, mlock_fixup() again attempts to merge or split off
+the specified range. The range is then munlocked via the function
+__munlock_vma_pages_range(). Because the vma access protections could have
+been changed to PROT_NONE after faulting in and mlocking some pages,
+get_user_pages() is unreliable for visiting these pages for munlocking. We
+don't want to leave pages mlocked(), so __munlock_vma_pages_range() uses a
+custom page table walker to find all pages mapped into the specified range.
+Note that this again assumes that all pages in the mlocked() range are resident
+and mapped by the task's page table.
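+
+The heart of the walk, as a minimal sketch [the real walker also
+descends the pgd/pud/pmd levels, locks each page, and handles the
+races described below]:
+
+    pte = pte_offset_map_lock(mm, pmd, addr, &ptl);
+    for (; addr < end; pte++, addr += PAGE_SIZE) {
+        if (!pte_present(*pte))
+            continue;
+        page = vm_normal_page(vma, addr, *pte);
+        if (page)
+            munlock_vma_page(page);
+    }
+    pte_unmap_unlock(pte - 1, ptl);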
+
+As with __mlock_vma_pages_range(), unlocking can race with truncation and
+migration. It is very important that munlock of a page succeeds, lest we
+leak pages by stranding them in the mlocked state on the noreclaim list.
+The munlock page walk pte handler resolves the race with page migration
+by checking the pte for a special swap pte indicating that the page is
+being migrated. If this is the case, the pte handler will wait for the
+migration entry to be replaced and then refetch the pte for the new page.
+Once the pte handler has locked the page, it checks the page_mapping to
+ensure that it still exists. If not, the handler unlocks the page and
+retries the entire process after refetching the pte.
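+
+A sketch of the migration check inside the pte handler [simplified;
+'retry' is an illustrative label]:
+
+    pte_t pte = *ptep;
+
+    if (!pte_present(pte) && !pte_none(pte)) {
+        swp_entry_t entry = pte_to_swp_entry(pte);
+
+        if (is_migration_entry(entry)) {
+            /* wait for the new page, then look again */
+            migration_entry_wait(mm, pmd, addr);
+            goto retry;
+        }
+    }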
+
+The munlock page walk pte handler unlocks individual pages by calling
+munlock_vma_page(). munlock_vma_page() unconditionally clears the PG_mlocked
+flag using TestClearPageMlocked(). As with mlock_vma_page(), munlock_vma_page()
+uses the Test*PageMlocked() function to handle the case where the page might
+have already been unlocked by another task. If the page was mlocked,
+munlock_vma_page() updates the zone statistics for the number of mlocked
+pages. Note, however, that at this point we haven't checked whether the page
+is mapped by other VM_LOCKED vmas.
+
+We can't call try_to_unlock(), the function that walks the reverse map to check
+for other VM_LOCKED vmas, without first isolating the page from the LRU.
+try_to_unlock() is a variant of try_to_unmap() and thus requires that the page
+not be on an lru list. [More on these below.] However, the call to
+isolate_lru_page() could fail, in which case we couldn't try_to_unlock().
+So, we go ahead and clear PG_mlocked up front, as this might be the only chance
+we have. If we can successfully isolate the page, we go ahead and
+try_to_unlock(), which will restore the PG_mlocked flag and update the zone
+page statistics if it finds another vma holding the page mlocked. If we fail
+to isolate the page, we'll have left a potentially mlocked page on the LRU.
+This is fine, because we'll catch it later when/if vmscan tries to reclaim the
+page. This should be relatively rare.
+
+Mlocked Pages: Migrating Them...
+
+A page that is being migrated has been isolated from the lru lists and is
+held locked across unmapping of the page, updating the page's mapping
+[address_space] entry and copying the contents and state, until the
+page table entry has been replaced with an entry that refers to the new
+page. Linux supports migration of mlocked pages and other non-reclaimable
+pages. This involves simply moving the PageMlocked and PageNoreclaim states
+from the old page to the new page.
+
+Note that page migration can race with mlocking or munlocking of the same
+page. This has been discussed from the mlock/munlock perspective in the
+respective sections above. Both processes [migration, m[un]locking], hold
+the page locked. This provides the first level of synchronization. Page
+migration zeros out the page_mapping of the old page before unlocking it,
+so m[un]lock can skip these pages. However, as discussed above, munlock
+must wait for a migrating page to be replaced with the new page to prevent
+the new page from remaining mlocked outside of any VM_LOCKED vma.
+
+To ensure that we don't strand pages on the noreclaim list because of a
+race between munlock and migration, we must also prevent the munlock pte
+handler from acquiring the old or new page lock from the time that the
+migration subsystem acquires the old page lock, until either migration
+succeeds and the new page is added to the lru or migration fails and
+the old page is put back on the lru. To achieve this coordination,
+the migration subsystem places the new page on success, or the old
+page on failure, back on the lru lists before dropping the respective
+page's lock. It uses the putback_lru_page() function to accomplish this,
+which rechecks the page's overall reclaimability and adjusts the page
+flags accordingly. To free the old page on success or the new page on
+failure, the migration subsystem just drops what it knows to be the last
+page reference via put_page().
+
+
+Mlocked Pages: mmap(MAP_LOCKED) System Call Handling
+
+In addition to the mlock()/mlockall() system calls, an application can request
+that a region of memory be mlocked using the MAP_LOCKED flag with the mmap()
+call. Furthermore, any mmap() call or brk() call that expands the heap by a
+task that has previously called mlockall() with the MCL_FUTURE flag will result
+in the newly mapped memory being mlocked. Before the noreclaim/mlock changes,
+the kernel simply called make_pages_present() to allocate pages and populate
+the page table.
+
+To mlock a range of memory under the noreclaim/mlock infrastructure, the
+mmap() handler and task address space expansion functions call
+mlock_vma_pages_range() specifying the vma and the address range to mlock.
+mlock_vma_pages_range() filters vmas like mlock_fixup(), as described above in
+"Mlocked Pages: Filtering Vmas". It will clear the VM_LOCKED flag, which will
+have already been set by the caller, in filtered vmas. Thus these vmas need
+not be visited for munlock when the region is unmapped.
+
+For "normal" vmas, mlock_vma_pages_range() calls __mlock_vma_pages_range() to
+fault/allocate the pages and mlock them. Again, like mlock_fixup(),
+mlock_vma_pages_range() downgrades the mmap semaphore to read mode before
+attempting to fault/allocate and mlock the pages; and "upgrades" the semaphore
+back to write mode before returning.
+
+The callers of mlock_vma_pages_range() will have already added the memory
+range to be mlocked to the task's "locked_vm". To account for filtered vmas,
+mlock_vma_pages_range() returns the number of pages NOT mlocked. All of the
+callers then subtract a non-negative return value from the task's locked_vm.
+A negative return value represents an error--for example, from get_user_pages()
+attempting to fault in a vma with PROT_NONE access. In this case, we leave
+the memory range accounted as locked_vm, as the protections could be changed
+later and pages allocated into that region.
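+
+Caller-side, the accounting described above reduces to a sketch like:
+
+    long nr = mlock_vma_pages_range(vma, start, end);
+
+    if (nr >= 0)
+        mm->locked_vm -= nr;    /* undo pages in filtered vmas */
+    /* nr < 0: error; leave the range accounted as locked_vm */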
+
+
+Mlocked Pages: munmap()/exit()/exec() System Call Handling
+
+When unmapping an mlocked region of memory, whether by an explicit call to
+munmap() or via an internal unmap from exit() or exec() processing, we must
+munlock the pages if we're removing the last VM_LOCKED vma that maps the pages.
+Before the noreclaim/mlock changes, mlocking did not mark the pages in any way,
+so unmapping them required no processing.
+
+To munlock a range of memory under the noreclaim/mlock infrastructure, the
+munmap() handler and task address space tear down function call
+munlock_vma_pages_all(). The name reflects the observation that one always
+specifies the entire vma range when munlock()ing during unmap of a region.
+Because of the vma filtering when mlocking() regions, only "normal" vmas that
+actually contain mlocked pages will be passed to munlock_vma_pages_all().
+
+munlock_vma_pages_all() clears the VM_LOCKED vma flag and, like mlock_fixup()
+for the munlock case, calls __munlock_vma_pages_range() to walk the page table
+for the vma's memory range and munlock_vma_page() each resident page mapped by
+the vma. This effectively munlocks the page, only if this is the last
+VM_LOCKED vma that maps the page.
+
+
+Mlocked Page: try_to_unmap()
+
+[Note: the code changes represented by this section are really quite small
+compared to the text describing what is happening and why, and discussing the
+implications.]
+
+Pages can, of course, be mapped into multiple vmas. Some of these vmas may
+have VM_LOCKED flag set. It is possible for a page mapped into one or more
+VM_LOCKED vmas not to have the PG_mlocked flag set and therefore reside on one
+of the active or inactive LRU lists. This could happen if, for example, a
+task in the process of munlock()ing the page could not isolate the page from
+the LRU. As a result, vmscan/shrink_page_list() might encounter such a page
+as described in "Non-reclaimable Pages and Vmscan [shrink_*_list()]". To
+handle this situation, try_to_unmap() has been enhanced to check for VM_LOCKED
+vmas while it is walking a page's reverse map.
+
+try_to_unmap() is always called, by either vmscan for reclaim or for page
+migration, with the argument page locked and isolated from the LRU. BUG_ON()
+assertions enforce this requirement. Separate functions handle anonymous and
+mapped file pages, as these types of pages have different reverse map
+mechanisms.
+
+ try_to_unmap_anon()
+
+To unmap anonymous pages, each vma in the list anchored in the anon_vma must be
+visited--at least until a VM_LOCKED vma is encountered. If the page is being
+unmapped for migration, VM_LOCKED vmas do not stop the process because mlocked
+pages are migratable. However, for reclaim, if the page is mapped into a
+VM_LOCKED vma, the scan stops. try_to_unmap() attempts to acquire the mmap
+semaphore of the mm_struct to which the vma belongs in read mode. If this is
+successful, try_to_unmap() will mlock the page via mlock_vma_page()--we
+wouldn't have gotten to try_to_unmap() if the page were already mlocked--and
+will return SWAP_MLOCK, indicating that the page is nonreclaimable. If the
+mmap semaphore cannot be acquired, we are not sure whether the page is really
+nonreclaimable or not. In this case, try_to_unmap() will return SWAP_AGAIN.
+
+ try_to_unmap_file() -- linear mappings
+
+Unmapping of a mapped file page works the same, except that the scan visits
+all vmas that map the page's index/page offset in the page's mapping's
+reverse map priority search tree. It must also visit each vma in the page's
+mapping's non-linear list, if the list is non-empty. As for anonymous pages,
+on encountering a VM_LOCKED vma for a mapped file page, try_to_unmap() will
+attempt to acquire the associated mm_struct's mmap semaphore to mlock the page,
+returning SWAP_MLOCK if this is successful, and SWAP_AGAIN, if not.
+
+ try_to_unmap_file() -- non-linear mappings
+
+If a page's mapping contains a non-empty non-linear mapping vma list, then
+try_to_un{map|lock}() must also visit each vma in that list to determine
+whether the page is mapped in a VM_LOCKED vma. Again, the scan must visit
+all vmas in the non-linear list to ensure that the page is not/should not be
+mlocked. If a VM_LOCKED vma is found in the list, the scan could terminate.
+However, there is no easy way to determine whether the page is actually mapped
+in a given vma--either for unmapping or testing whether the VM_LOCKED vma
+actually pins the page.
+
+So, try_to_unmap_file() handles non-linear mappings by scanning a certain
+number of pages--a "cluster"--in each non-linear vma associated with the page's
+mapping, for each file mapped page that vmscan tries to unmap. If this happens
+to unmap the page we're trying to unmap, try_to_unmap() will notice this on
+return--(page_mapcount(page) == 0)--and return SWAP_SUCCESS. Otherwise, it
+will return SWAP_AGAIN, causing vmscan to recirculate this page. We take
+advantage of the cluster scan in try_to_unmap_cluster() as follows:
+
+For each non-linear vma, try_to_unmap_cluster() attempts to acquire the mmap
+semaphore of the associated mm_struct for read without blocking. If this
+attempt is successful and the vma is VM_LOCKED, try_to_unmap_cluster() will
+retain the mmap semaphore for the scan; otherwise it drops it here. Then,
+for each page in the cluster, if we're holding the mmap semaphore for a locked
+vma, try_to_unmap_cluster() calls mlock_vma_page() to mlock the page. This
+call is a no-op if the page is already locked, but will mlock any pages in
+the non-linear mapping that happen to be unlocked. If one of the pages so
+mlocked is the page passed in to try_to_unmap(), try_to_unmap_cluster() will
+return SWAP_MLOCK, rather than the default SWAP_AGAIN. This will allow vmscan
+to cull the page, rather than recirculating it on the inactive list. Again,
+if try_to_unmap_cluster() cannot acquire the vma's mmap sem, it returns
+SWAP_AGAIN, indicating that the page is mapped by a VM_LOCKED vma, but
+couldn't be mlocked.
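+
+Condensed, the trylock logic in try_to_unmap_cluster() looks roughly
+like this [illustrative variable names]:
+
+    locked_vma = 0;
+    if (down_read_trylock(&vma->vm_mm->mmap_sem)) {
+        if (vma->vm_flags & VM_LOCKED)
+            locked_vma = 1;    /* hold the sem across the scan */
+        else
+            up_read(&vma->vm_mm->mmap_sem);
+    }
+
+    /* then, for each page in the cluster: */
+    if (locked_vma)
+        mlock_vma_page(page);    /* no-op if already mlocked */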
+
+
+Mlocked pages: try_to_unlock() Reverse Map Scan
+
+TODO/FIXME: a better name might be page_mlocked()--analogous to the
+page_referenced() reverse map walker--especially if we continue to call this
+from shrink_page_list(). See related TODO/FIXME below.
+
+When munlock_vma_page()--see "Mlocked Pages: munlock()/munlockall() System
+Call Handling" above--tries to munlock a page, or when shrink_page_list()
+encounters an anonymous page that is not yet in the swap cache, they need to
+determine whether or not the page is mapped by any VM_LOCKED vma, without
+actually attempting to unmap all ptes from the page. For this purpose, the
+noreclaim/mlock infrastructure introduced a variant of try_to_unmap() called
+try_to_unlock().
+
+try_to_unlock() calls the same functions as try_to_unmap() for anonymous and
+mapped file pages with an additional argument specifying unlock versus unmap
+processing. Again, these functions walk the respective reverse maps looking
+for VM_LOCKED vmas. When such a vma is found for anonymous pages and file
+pages mapped in linear VMAs, as in the try_to_unmap() case, the functions
+attempt to acquire the associated mmap semaphore, mlock the page via
+mlock_vma_page() and return SWAP_MLOCK. This effectively undoes the
+pre-clearing of the page's PG_mlocked done by munlock_vma_page() and informs
+shrink_page_list() that the anonymous page should be culled rather than added
+to the swap cache in preparation for a try_to_unmap() that will almost
+certainly fail.
+
+If try_to_unlock() is unable to acquire a VM_LOCKED vma's associated mmap
+semaphore, it will return SWAP_AGAIN. This will allow shrink_page_list()
+to recycle the page on the inactive list and hope that it has better luck
+with the page next time.
+
+For file pages mapped into non-linear vmas, the try_to_unlock() logic works
+slightly differently. On encountering a VM_LOCKED non-linear vma that might
+map the page, try_to_unlock() returns SWAP_AGAIN without actually mlocking
+the page. munlock_vma_page() will just leave the page unlocked and let
+vmscan deal with it--the usual fallback position.
+
+Note that try_to_unlock()'s reverse map walk must visit every vma in a page's
+reverse map to determine that a page is NOT mapped into any VM_LOCKED vma.
+However, the scan can terminate when it encounters a VM_LOCKED vma and can
+successfully acquire the vma's mmap semaphore for read and mlock the page.
+Although try_to_unlock() can be called many [very many!] times when
+munlock()ing a large region or tearing down a large address space that has been
+mlocked via mlockall(), overall this is a fairly rare event. In addition,
+although shrink_page_list() calls try_to_unlock() for every anonymous page that
+it handles that is not yet in the swap cache, on average anonymous pages will
+have very short reverse map lists.
+
+Mlocked Page: Page Reclaim in shrink_*_list()
+
+shrink_active_list() culls any obviously nonreclaimable pages--i.e.,
+!page_reclaimable(page, NULL)--diverting these to the noreclaim lru
+list. However, shrink_active_list() only sees nonreclaimable pages that
+made it onto the active/inactive lru lists. Note that these pages do not
+have PageNoreclaim set--otherwise, they would be on the noreclaim list and
+shrink_active_list would never see them.
+
+Some examples of these nonreclaimable pages on the LRU lists are:
+
+1) ramfs and ram disk pages that have been placed on the lru lists when
+ first allocated.
+
+2) SHM_LOCKed shared memory pages. shmctl(SHM_LOCK) does not attempt to
+ allocate or fault in the pages in the shared memory region; that happens
+ when an application first accesses a page after SHM_LOCKing the segment.
+
+3) Mlocked pages that could not be isolated from the lru and moved to the
+ noreclaim list in mlock_vma_page().
+
+4) Pages mapped into multiple VM_LOCKED vmas, but try_to_unlock() couldn't
+ acquire the vma's mmap semaphore to test the flags and set PageMlocked.
+ munlock_vma_page() was forced to let the page back on to the normal
+ LRU list for vmscan to handle.
+
+shrink_inactive_list() also culls any nonreclaimable pages that it finds
+on the inactive lists, again diverting them to the appropriate zone's noreclaim
+lru list. shrink_inactive_list() should only see SHM_LOCKed pages that became
+SHM_LOCKed after shrink_active_list() had moved them to the inactive list, or
+pages mapped into VM_LOCKED vmas that munlock_vma_page() couldn't isolate from
+the lru to recheck via try_to_unlock(). shrink_inactive_list() won't notice
+the latter, but will pass them on to shrink_page_list().
+
+shrink_page_list() again culls obviously nonreclaimable pages that it could
+encounter for similar reasons to shrink_inactive_list(). As already discussed,
+shrink_page_list() proactively looks for anonymous pages that should have
+PG_mlocked set but don't--these would not be detected by page_reclaimable()--to
+avoid adding them to the swap cache unnecessarily. File pages mapped into
+VM_LOCKED vmas but without PG_mlocked set will make it all the way to
+try_to_unmap(). shrink_page_list() will divert them to the noreclaim list when
+try_to_unmap() returns SWAP_MLOCK, as discussed above.
+
+TODO/FIXME: If we can enhance the swap cache to reliably remove entries
+with page_count(page) > 2, as long as all ptes are mapped to the page and
+not the swap entry, we can probably remove the call to try_to_unlock() in
+shrink_page_list() and just remove the page from the swap cache when
+try_to_unmap() returns SWAP_MLOCK. Currently, remove_exclusive_swap_page()
+doesn't seem to allow that.
+
+
--
All Rights Reversed
* Re: [PATCH -mm 00/25] VM pageout scalability improvements (V10)
2008-06-06 20:28 [PATCH -mm 00/25] VM pageout scalability improvements (V10) Rik van Riel, Rik van Riel
` (24 preceding siblings ...)
2008-06-06 20:29 ` [PATCH -mm 25/25] Noreclaim LRU and Mlocked Pages Documentation Rik van Riel, Rik van Riel
@ 2008-06-06 21:02 ` Andrew Morton
2008-06-06 21:08 ` Rik van Riel
25 siblings, 1 reply; 102+ messages in thread
From: Andrew Morton @ 2008-06-06 21:02 UTC (permalink / raw)
To: Rik van Riel; +Cc: linux-kernel, lee.schermerhorn, kosaki.motohiro
On Fri, 06 Jun 2008 16:28:38 -0400
Rik van Riel <riel@redhat.com> wrote:
> On large memory systems, the VM can spend way too much time scanning
> through pages that it cannot (or should not) evict from memory. Not
> only does it use up CPU time, but it also provokes lock contention
> and can leave large systems under memory presure in a catatonic state.
>
> Against 2.6.26-rc2-mm1
>
> This patch series improves VM scalability by:
-mm has a patch called
vmscan-give-referenced-active-and-unmapped-pages-a-second-trip-around-the-lru.patch
which has been sitting there for some time waiting for us to work out
whether or not it is a desirable thing.
This patchset of yours apparently retains the change which
vmscan-give-referenced-active-and-unmapped-pages-a-second-trip-around-the-lru.patch
makes.
But do we know that it was a good one?
* Re: [PATCH -mm 00/25] VM pageout scalability improvements (V10)
2008-06-06 21:02 ` [PATCH -mm 00/25] VM pageout scalability improvements (V10) Andrew Morton
@ 2008-06-06 21:08 ` Rik van Riel
0 siblings, 0 replies; 102+ messages in thread
From: Rik van Riel @ 2008-06-06 21:08 UTC (permalink / raw)
To: Andrew Morton; +Cc: linux-kernel, lee.schermerhorn, kosaki.motohiro
On Fri, 6 Jun 2008 14:02:16 -0700
Andrew Morton <akpm@linux-foundation.org> wrote:
> On Fri, 06 Jun 2008 16:28:38 -0400
> Rik van Riel <riel@redhat.com> wrote:
>
> > On large memory systems, the VM can spend way too much time scanning
> > through pages that it cannot (or should not) evict from memory. Not
> > only does it use up CPU time, but it also provokes lock contention
> > and can leave large systems under memory presure in a catatonic state.
> >
> > Against 2.6.26-rc2-mm1
> >
> > This patch series improves VM scalability by:
>
> -mm has a patch called
> vmscan-give-referenced-active-and-unmapped-pages-a-second-trip-around-the-lru.patch
> which has been sitting there for some time waiting for us to work out
> whether or not it is a desirable thing.
>
> This patchset of yours apparently retains the change which
> vmscan-give-referenced-active-and-unmapped-pages-a-second-trip-around-the-lru.patch
> makes.
>
> But do we know that it was a good one?
Benchmark results here indicate that it helps some workloads by up
to 10%, but makes the page cache pages that fall off of the active
list more prone to being replaced by streaming IO.
I have added a fix in this series to set the referenced bit on
unmapped page cache pages that get deactivated, so that defect is
resolved.
I've been busy with some other stuff this week; I'll try to get
you some numbers ASAP.
--
All rights reversed.
* Re: [PATCH -mm 02/25] Use an indexed array for LRU variables
2008-06-06 20:28 ` [PATCH -mm 02/25] Use an indexed array for LRU variables Rik van Riel, Rik van Riel
@ 2008-06-07 1:04 ` Andrew Morton
2008-06-07 5:43 ` KOSAKI Motohiro
2008-06-07 18:42 ` Rik van Riel
0 siblings, 2 replies; 102+ messages in thread
From: Andrew Morton @ 2008-06-07 1:04 UTC (permalink / raw)
To: Rik van Riel; +Cc: linux-kernel, lee.schermerhorn, kosaki.motohiro, clameter
On Fri, 06 Jun 2008 16:28:40 -0400
Rik van Riel <riel@redhat.com> wrote:
> From: Christoph Lameter <clameter@sgi.com>
>
> Currently we are defining explicit variables for the inactive
> and active list. An indexed array can be more generic and avoid
> repeating similar code in several places in the reclaim code.
>
> We are saving a few bytes in terms of code size:
>
> Before:
>
> text data bss dec hex filename
> 4097753 573120 4092484 8763357 85b7dd vmlinux
>
> After:
>
> text data bss dec hex filename
> 4097729 573120 4092484 8763333 85b7c5 vmlinux
>
> Having an easy way to add new lru lists may ease future work on
> the reclaim code.
I would have spat the dummy at pointless churn and code uglification
but I see that we end up with five LRU lists so ho hum.
>
> ...
>
>
> /* Fields commonly accessed by the page reclaim scanner */
> spinlock_t lru_lock;
> - struct list_head active_list;
> - struct list_head inactive_list;
> - unsigned long nr_scan_active;
> - unsigned long nr_scan_inactive;
> + struct list_head list[NR_LRU_LISTS];
> + unsigned long nr_scan[NR_LRU_LISTS];
It'd be a little cache-friendlier to lay this out as
struct {
struct list_head list;
unsigned long nr_scan;
} lru_stuff[NR_LRU_LISTS];
> unsigned long pages_scanned; /* since last reclaim */
> unsigned long flags; /* zone flags, see below */
>
> Index: linux-2.6.26-rc2-mm1/include/linux/mm_inline.h
> ===================================================================
> --- linux-2.6.26-rc2-mm1.orig/include/linux/mm_inline.h 2008-05-23 14:21:21.000000000 -0400
> +++ linux-2.6.26-rc2-mm1/include/linux/mm_inline.h 2008-05-23 14:21:33.000000000 -0400
> @@ -1,40 +1,51 @@
> static inline void
> +add_page_to_lru_list(struct zone *zone, struct page *page, enum lru_list l)
> +{
> + list_add(&page->lru, &zone->list[l]);
> + __inc_zone_state(zone, NR_INACTIVE + l);
^ that's a bug, isn't it?
oh, no it isn't.
Can we rename NR_INACTIVE? Maybe VMSCAN_BASE or something?
> +}
> +
>
> ...
>
> @@ -945,10 +945,7 @@ static unsigned long shrink_inactive_lis
> VM_BUG_ON(PageLRU(page));
> SetPageLRU(page);
> list_del(&page->lru);
> - if (PageActive(page))
> - add_page_to_active_list(zone, page);
> - else
> - add_page_to_inactive_list(zone, page);
> + add_page_to_lru_list(zone, page, PageActive(page));
urgh. the third arg to add_page_to_lru_list() is an `enum lru_list'
and here we are secretly coercing PageActive()'s boolean return into a
just-happens-to-be-right `enum lru_list'.
That's pretty nasty?
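The explicit form costs only one extra line; a sketch (LRU_ACTIVE is
from the quoted shrink_list() below, LRU_INACTIVE is assumed to be its
counterpart in the new enum):

	enum lru_list lru = PageActive(page) ? LRU_ACTIVE : LRU_INACTIVE;

	add_page_to_lru_list(zone, page, lru);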
> if (!pagevec_add(&pvec, page)) {
> spin_unlock_irq(&zone->lru_lock);
> __pagevec_release(&pvec);
>
> ...
>
> +static unsigned long shrink_list(enum lru_list l, unsigned long nr_to_scan,
> + struct zone *zone, struct scan_control *sc, int priority)
> +{
> + if (l == LRU_ACTIVE) {
> + shrink_active_list(nr_to_scan, zone, sc, priority);
> + return 0;
> + }
> + return shrink_inactive_list(nr_to_scan, zone, sc);
> +}
I guess a lot of this code gets changed a lot later on.
>
> ...
>
* Re: [PATCH -mm 04/25] free swap space on swap-in/activation
2008-06-06 20:28 ` [PATCH -mm 04/25] free swap space on swap-in/activation Rik van Riel, Rik van Riel
@ 2008-06-07 1:04 ` Andrew Morton
2008-06-07 19:56 ` Rik van Riel
2008-06-09 2:14 ` MinChan Kim
0 siblings, 2 replies; 102+ messages in thread
From: Andrew Morton @ 2008-06-07 1:04 UTC (permalink / raw)
To: Rik van Riel
Cc: linux-kernel, lee.schermerhorn, kosaki.motohiro, minchan.kim,
Lee.Schermerhorn, Hugh Dickins
On Fri, 06 Jun 2008 16:28:42 -0400
Rik van Riel <riel@redhat.com> wrote:
> From: Rik van Riel <riel@redhat.com>
>
> Free swap cache entries when swapping in pages if vm_swap_full()
> [swap space > 1/2 used]. Uses new pagevec to reduce pressure
> on locks.
>
> Signed-off-by: Rik van Riel <riel@redhat.com>
> Signed-off-by: Lee Schermerhorn <Lee.Schermerhorn@hp.com>
> Signed-off-by: MinChan Kim <minchan.kim@gmail.com>
>
> ---
> include/linux/pagevec.h | 1 +
> include/linux/swap.h | 6 ++++++
> mm/swap.c | 18 ++++++++++++++++++
> mm/swapfile.c | 25 ++++++++++++++++++++++---
> mm/vmscan.c | 7 +++++++
> 5 files changed, 54 insertions(+), 3 deletions(-)
>
> Index: linux-2.6.26-rc2-mm1/mm/vmscan.c
> ===================================================================
> --- linux-2.6.26-rc2-mm1.orig/mm/vmscan.c 2008-05-23 14:21:33.000000000 -0400
> +++ linux-2.6.26-rc2-mm1/mm/vmscan.c 2008-05-23 14:21:33.000000000 -0400
> @@ -619,6 +619,9 @@ free_it:
> continue;
>
> activate_locked:
> + /* Not a candidate for swapping, so reclaim swap space. */
> + if (PageSwapCache(page) && vm_swap_full())
The patch puts rather a lot of pressure onto vm_swap_full(). We might
want to look into optimising that.
- Is the 50% thing optimum? Could go higher and perhaps should be
based on amount-of-memory.
- Can precalculate the fraction rather than doing it inline all the time.
- Can make total_swap_pages __read_mostly and have a think about
nr_swap_pages too.
- Can completely optimise the thing away if !CONFIG_SWAP.
Has all this code been tested with CONFIG_SWAP=n?
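For the last two points, a sketch of what the header could look like
(this assumes vm_swap_full() keeps its current "more than half used"
meaning; not from the patch):

	#ifdef CONFIG_SWAP
	/* swap is considered full when more than half of it is in use */
	#define vm_swap_full()	(nr_swap_pages * 2 < total_swap_pages)
	#else
	#define vm_swap_full()	0	/* lets the callers' branches compile away */
	#endif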
> Index: linux-2.6.26-rc2-mm1/mm/swap.c
> ===================================================================
> --- linux-2.6.26-rc2-mm1.orig/mm/swap.c 2008-05-23 14:21:33.000000000 -0400
> +++ linux-2.6.26-rc2-mm1/mm/swap.c 2008-05-23 14:21:33.000000000 -0400
> @@ -443,6 +443,24 @@ void pagevec_strip(struct pagevec *pvec)
> }
> }
>
> +/*
> + * Try to free swap space from the pages in a pagevec
> + */
> +void pagevec_swap_free(struct pagevec *pvec)
> +{
> + int i;
> +
> + for (i = 0; i < pagevec_count(pvec); i++) {
> + struct page *page = pvec->pages[i];
> +
> + if (PageSwapCache(page) && !TestSetPageLocked(page)) {
> + if (PageSwapCache(page))
> + remove_exclusive_swap_page_ref(page);
> + unlock_page(page);
> + }
> + }
> +}
What's going on here?
Normally we'll bump a page's refcount to account for its presence in a
pagevec. This code doesn't do that.
Is it safe? If so, how come?
Suitable code comments should be added which explain this unusual and
dangerous optimisation. Or fix the bug :)
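For contrast, the usual convention, as in lru_cache_add() of this era
(quoted from memory, so double-check the tree):

	void lru_cache_add(struct page *page)
	{
		struct pagevec *pvec = &get_cpu_var(lru_add_pvecs);

		page_cache_get(page);	/* ref accounts for pagevec membership */
		if (!pagevec_add(pvec, page))
			__pagevec_lru_add(pvec);	/* drains and drops the refs */
		put_cpu_var(lru_add_pvecs);
	}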
* Re: [PATCH -mm 05/25] define page_file_cache() function
2008-06-06 20:28 ` [PATCH -mm 05/25] define page_file_cache() function Rik van Riel, Rik van Riel
@ 2008-06-07 1:04 ` Andrew Morton
2008-06-07 23:38 ` Rik van Riel
0 siblings, 1 reply; 102+ messages in thread
From: Andrew Morton @ 2008-06-07 1:04 UTC (permalink / raw)
To: Rik van Riel; +Cc: linux-kernel, lee.schermerhorn, kosaki.motohiro, minchan.kim
On Fri, 06 Jun 2008 16:28:43 -0400
Rik van Riel <riel@redhat.com> wrote:
> From: Rik van Riel <riel@redhat.com>
>
> Define page_file_cache() function to answer the question:
> is page backed by a file?
>
> Originally part of Rik van Riel's split-lru patch. Extracted
> to make available for other, independent reclaim patches.
>
> Moved inline function to linux/mm_inline.h where it will
> be needed by subsequent "split LRU" and "noreclaim" patches.
>
> Unfortunately this needs to use a page flag, since the
> PG_swapbacked state needs to be preserved all the way
> to the point where the page is last removed from the
> LRU. Trying to derive the status from other info in
> the page resulted in wrong VM statistics in earlier
> split VM patchsets.
>
argh. How many are left?
> +#ifndef LINUX_MM_INLINE_H
> +#define LINUX_MM_INLINE_H
> +
> +/**
> + * page_file_cache - should the page be on a file LRU or anon LRU?
> + * @page: the page to test
> + *
> + * Returns !0 if @page is page cache page backed by a regular filesystem,
> + * or 0 if @page is anonymous, tmpfs or otherwise ram or swap backed.
> + *
> + * We would like to get this info without a page flag, but the state
> + * needs to survive until the page is last deleted from the LRU, which
> + * could be as far down as __page_cache_release.
> + */
> +static inline int page_file_cache(struct page *page)
> +{
> + if (PageSwapBacked(page))
> + return 0;
> +
> + /* The page is page cache backed by a normal filesystem. */
> + return 2;
2?
Maybe bool would suit here.
Maybe a better name would be page_is_file_cache(). The gnu (gcc?)
convention of putting _p at the end of predicate functions' names makes
heaps of sense.
This function doesn't do enough stuff to do that which it says it does.
There must be a whole pile of preconditions which the caller must
evaluate before this function can be usefully used. I mean, it would
be a bug to pass an anonymous page or a slab page or whatever into
here?
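The boolean shape would read something like this (a sketch using the
hypothetical rename; note it drops the magic 2, which the series later
leans on):

	static inline bool page_is_file_cache(struct page *page)
	{
		/* anon, tmpfs and other swap backed pages are not file cache */
		return !PageSwapBacked(page);
	}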
> +}
> +
>
> ...
>
> --- linux-2.6.26-rc2-mm1.orig/include/linux/page-flags.h 2008-05-23 14:21:21.000000000 -0400
> +++ linux-2.6.26-rc2-mm1/include/linux/page-flags.h 2008-05-23 14:21:34.000000000 -0400
> @@ -93,6 +93,7 @@ enum pageflags {
> PG_mappedtodisk, /* Has blocks allocated on-disk */
> PG_reclaim, /* To be reclaimed asap */
> PG_buddy, /* Page is free, on buddy lists */
> + PG_swapbacked, /* Page is backed by RAM/swap */
> #ifdef CONFIG_IA64_UNCACHED_ALLOCATOR
> PG_uncached, /* Page has been mapped as uncached */
> #endif
> @@ -160,6 +161,7 @@ PAGEFLAG(Pinned, owner_priv_1) TESTSCFLA
> PAGEFLAG(Reserved, reserved) __CLEARPAGEFLAG(Reserved, reserved)
> PAGEFLAG(Private, private) __CLEARPAGEFLAG(Private, private)
> __SETPAGEFLAG(Private, private)
> +PAGEFLAG(SwapBacked, swapbacked) __CLEARPAGEFLAG(SwapBacked, swapbacked)
Those __ClearPageFoo() functions scare my pants into the next suburb.
They can cause such horridly subtle bugs if misused. Every single
callsite should have special attention and careful justification in
comments, IMO.
> /*
> * Only test-and-set exist for PG_writeback. The unconditional operators are
> Index: linux-2.6.26-rc2-mm1/mm/memory.c
> ===================================================================
> --- linux-2.6.26-rc2-mm1.orig/mm/memory.c 2008-05-23 14:21:21.000000000 -0400
> +++ linux-2.6.26-rc2-mm1/mm/memory.c 2008-05-23 14:21:34.000000000 -0400
> @@ -1765,6 +1765,7 @@ gotten:
> ptep_clear_flush(vma, address, page_table);
> set_pte_at(mm, address, page_table, entry);
> update_mmu_cache(vma, address, entry);
> + SetPageSwapBacked(new_page);
> lru_cache_add_active(new_page);
> page_add_new_anon_rmap(new_page, vma, address);
>
> @@ -2233,6 +2234,7 @@ static int do_anonymous_page(struct mm_s
> if (!pte_none(*page_table))
> goto release;
> inc_mm_counter(mm, anon_rss);
> + SetPageSwapBacked(page);
> lru_cache_add_active(page);
> page_add_new_anon_rmap(page, vma, address);
> set_pte_at(mm, address, page_table, entry);
> @@ -2374,6 +2376,7 @@ static int __do_fault(struct mm_struct *
> set_pte_at(mm, address, page_table, entry);
> if (anon) {
> inc_mm_counter(mm, anon_rss);
> + SetPageSwapBacked(page);
> lru_cache_add_active(page);
> page_add_new_anon_rmap(page, vma, address);
OK, someone lost their tab key and it wasn't you.
<does git-blame>
<blames Nick>
> } else {
>
> ...
>
> @@ -261,6 +261,7 @@ static void bad_page(struct page *page)
> 1 << PG_slab |
> 1 << PG_swapcache |
> 1 << PG_writeback |
> + 1 << PG_swapbacked |
> 1 << PG_buddy );
> set_page_count(page, 0);
> reset_page_mapcount(page);
> @@ -494,6 +495,8 @@ static inline int free_pages_check(struc
> bad_page(page);
> if (PageDirty(page))
> __ClearPageDirty(page);
> + if (PageSwapBacked(page))
> + __ClearPageSwapBacked(page);
OK, that one isn't so scary.
> /*
> * For now, we report if PG_reserved was found set, but do not
> * clear it, and do not free the page. But we shall soon need
> @@ -644,6 +647,7 @@ static int prep_new_page(struct page *pa
> 1 << PG_swapcache |
> 1 << PG_writeback |
> 1 << PG_reserved |
> + 1 << PG_swapbacked |
> 1 << PG_buddy ))))
> bad_page(page);
* Re: [PATCH -mm 06/25] split LRU lists into anon & file sets
2008-06-06 20:28 ` [PATCH -mm 06/25] split LRU lists into anon & file sets Rik van Riel, Rik van Riel
@ 2008-06-07 1:04 ` Andrew Morton
2008-06-07 1:22 ` Rik van Riel
0 siblings, 1 reply; 102+ messages in thread
From: Andrew Morton @ 2008-06-07 1:04 UTC (permalink / raw)
To: Rik van Riel
Cc: linux-kernel, lee.schermerhorn, kosaki.motohiro, Lee.Schermerhorn
On Fri, 06 Jun 2008 16:28:44 -0400
Rik van Riel <riel@redhat.com> wrote:
> From: Rik van Riel <riel@redhat.com>
>
> Split the LRU lists in two, one set for pages that are backed by
> real file systems ("file") and one for pages that are backed by
> memory and swap ("anon"). The latter includes tmpfs.
>
> Eventually mlocked pages will be taken off the LRUs altogether.
> A patch for that already exists and just needs to be integrated
> into this series.
>
> This patch mostly has the infrastructure and a basic policy to
> balance how much we scan the anon lists and how much we scan
> the file lists. The big policy changes are in separate patches.
>
The changelogs are a bit scrappy and could do with some care
- stale assertions such as the above
- "From:<random number of spaces>Lee" in various places
- Some have the --- separator and others don't (this trips me up).
- Stuff like "Against: 2.6.26-rc2-mm1" right in the middle of the
changelog for me to hunt down and squish.
- "TODO: DEBUGGING ONLY: NOT FOR UPSTREAM MERGE" <<-- what's up with this?
- random Capitalisation in Various patch Titles.
- "V2 -> V3:" logging in the main changelog - not relevant in the
final commit hence more for me to edit away.
- Strange inventions like "Originally Signed-off-by:"
- please prefer to prefix the patch titles with a suitable subsystem
identifier. In this case "vmscan: " would suit.
- Other stuff I forgot. A general recheck and cleanup would be nice.
I've actually fixed all of the above but I don't yet know whether
I'll be checking all this in.
> Index: linux-2.6.26-rc2-mm1/fs/proc/proc_misc.c
> ===================================================================
> --- linux-2.6.26-rc2-mm1.orig/fs/proc/proc_misc.c 2008-05-23 14:21:21.000000000 -0400
> +++ linux-2.6.26-rc2-mm1/fs/proc/proc_misc.c 2008-05-23 14:21:34.000000000 -0400
> @@ -132,6 +132,10 @@ static int meminfo_read_proc(char *page,
> unsigned long allowed;
> struct vmalloc_info vmi;
> long cached;
> + unsigned long inactive_anon;
> + unsigned long active_anon;
> + unsigned long inactive_file;
> + unsigned long active_file;
Shouldn't this be an array[NR_LRU_LISTS]?
> /*
> * display in kilobytes.
> @@ -150,48 +154,61 @@ static int meminfo_read_proc(char *page,
>
> get_vmalloc_info(&vmi);
>
> + inactive_anon = global_page_state(NR_INACTIVE_ANON);
> + active_anon = global_page_state(NR_ACTIVE_ANON);
> + inactive_file = global_page_state(NR_INACTIVE_FILE);
> + active_file = global_page_state(NR_ACTIVE_FILE);
then this can perhaps become a loop.
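i.e. roughly (a sketch; lru_pages is a made-up local, and it assumes
for_each_lru() and the NR_INACTIVE_ANON + lru correspondence used
elsewhere in the series):

	unsigned long lru_pages[NR_LRU_LISTS];
	enum lru_list l;

	for_each_lru(l)
		lru_pages[l] = global_page_state(NR_INACTIVE_ANON + l);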
> /*
> * Tagged format, for easy grepping and expansion.
> */
> len = sprintf(page,
> - "MemTotal: %8lu kB\n"
> - "MemFree: %8lu kB\n"
> - "Buffers: %8lu kB\n"
> - "Cached: %8lu kB\n"
> - "SwapCached: %8lu kB\n"
> - "Active: %8lu kB\n"
> - "Inactive: %8lu kB\n"
> + "MemTotal: %8lu kB\n"
> + "MemFree: %8lu kB\n"
> + "Buffers: %8lu kB\n"
> + "Cached: %8lu kB\n"
> + "SwapCached: %8lu kB\n"
> + "Active: %8lu kB\n"
> + "Inactive: %8lu kB\n"
> + "Active(anon): %8lu kB\n"
> + "Inactive(anon): %8lu kB\n"
> + "Active(file): %8lu kB\n"
> + "Inactive(file): %8lu kB\n"
> #ifdef CONFIG_HIGHMEM
> - "HighTotal: %8lu kB\n"
> - "HighFree: %8lu kB\n"
> - "LowTotal: %8lu kB\n"
> - "LowFree: %8lu kB\n"
> -#endif
> - "SwapTotal: %8lu kB\n"
> - "SwapFree: %8lu kB\n"
> - "Dirty: %8lu kB\n"
> - "Writeback: %8lu kB\n"
> - "AnonPages: %8lu kB\n"
> - "Mapped: %8lu kB\n"
> - "Slab: %8lu kB\n"
> - "SReclaimable: %8lu kB\n"
> - "SUnreclaim: %8lu kB\n"
> - "PageTables: %8lu kB\n"
> - "NFS_Unstable: %8lu kB\n"
> - "Bounce: %8lu kB\n"
> - "WritebackTmp: %8lu kB\n"
> - "CommitLimit: %8lu kB\n"
> - "Committed_AS: %8lu kB\n"
> - "VmallocTotal: %8lu kB\n"
> - "VmallocUsed: %8lu kB\n"
> - "VmallocChunk: %8lu kB\n",
> + "HighTotal: %8lu kB\n"
> + "HighFree: %8lu kB\n"
> + "LowTotal: %8lu kB\n"
> + "LowFree: %8lu kB\n"
> +#endif
> + "SwapTotal: %8lu kB\n"
> + "SwapFree: %8lu kB\n"
> + "Dirty: %8lu kB\n"
> + "Writeback: %8lu kB\n"
> + "AnonPages: %8lu kB\n"
> + "Mapped: %8lu kB\n"
> + "Slab: %8lu kB\n"
> + "SReclaimable: %8lu kB\n"
> + "SUnreclaim: %8lu kB\n"
> + "PageTables: %8lu kB\n"
> + "NFS_Unstable: %8lu kB\n"
> + "Bounce: %8lu kB\n"
> + "WritebackTmp: %8lu kB\n"
> + "CommitLimit: %8lu kB\n"
> + "Committed_AS: %8lu kB\n"
> + "VmallocTotal: %8lu kB\n"
> + "VmallocUsed: %8lu kB\n"
> + "VmallocChunk: %8lu kB\n",
> K(i.totalram),
> K(i.freeram),
> K(i.bufferram),
> K(cached),
> K(total_swapcache_pages),
> - K(global_page_state(NR_ACTIVE)),
> - K(global_page_state(NR_INACTIVE)),
> + K(active_anon + active_file),
> + K(inactive_anon + inactive_file),
> + K(active_anon),
> + K(inactive_anon),
> + K(active_file),
> + K(inactive_file),
Do we really want to put all this stuff into /proc/meminfo?
Would it be better to aggregate it in some manner for meminfo and show
the fine-grained info in /proc/vmstat?
* Re: [PATCH -mm 07/25] second chance replacement for anonymous pages
2008-06-06 20:28 ` [PATCH -mm 07/25] second chance replacement for anonymous pages Rik van Riel, Rik van Riel
@ 2008-06-07 1:04 ` Andrew Morton
2008-06-07 6:03 ` KOSAKI Motohiro
2008-06-08 15:04 ` Rik van Riel
0 siblings, 2 replies; 102+ messages in thread
From: Andrew Morton @ 2008-06-07 1:04 UTC (permalink / raw)
To: Rik van Riel; +Cc: linux-kernel, lee.schermerhorn, kosaki.motohiro
On Fri, 06 Jun 2008 16:28:45 -0400
Rik van Riel <riel@redhat.com> wrote:
> From: Rik van Riel <riel@redhat.com>
>
> We avoid evicting and scanning anonymous pages for the most part, but
> under some workloads we can end up with most of memory filled with
> anonymous pages. At that point, we suddenly need to clear the referenced
> bits on all of memory, which can take ages on very large memory systems.
>
> We can reduce the maximum number of pages that need to be scanned by
> not taking the referenced state into account when deactivating an
> anonymous page. After all, every anonymous page starts out referenced,
> so why check?
>
> If an anonymous page gets referenced again before it reaches the end
> of the inactive list, we move it back to the active list.
>
> To keep the maximum amount of necessary work reasonable, we scale the
> active to inactive ratio with the size of memory, using the formula
> active:inactive ratio = sqrt(memory in GB * 10).
Should be scaled by PAGE_SIZE?
> Kswapd CPU use now seems to scale by the amount of pageout bandwidth,
> instead of by the amount of memory present in the system.
>
> Signed-off-by: Rik van Riel <riel@redhat.com>
> Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
>
> ---
> include/linux/mm_inline.h | 12 ++++++++++++
> include/linux/mmzone.h | 5 +++++
> mm/page_alloc.c | 40 ++++++++++++++++++++++++++++++++++++++++
> mm/vmscan.c | 38 +++++++++++++++++++++++++++++++-------
> mm/vmstat.c | 6 ++++--
> 5 files changed, 92 insertions(+), 9 deletions(-)
>
> Index: linux-2.6.26-rc2-mm1/include/linux/mm_inline.h
> ===================================================================
> --- linux-2.6.26-rc2-mm1.orig/include/linux/mm_inline.h 2008-05-23 14:21:34.000000000 -0400
> +++ linux-2.6.26-rc2-mm1/include/linux/mm_inline.h 2008-05-28 12:09:06.000000000 -0400
> @@ -97,4 +97,16 @@ del_page_from_lru(struct zone *zone, str
> __dec_zone_state(zone, NR_INACTIVE_ANON + l);
> }
>
> +static inline int inactive_anon_low(struct zone *zone)
> +{
> + unsigned long active, inactive;
> +
> + active = zone_page_state(zone, NR_ACTIVE_ANON);
> + inactive = zone_page_state(zone, NR_INACTIVE_ANON);
> +
> + if (inactive * zone->inactive_ratio < active)
> + return 1;
> +
> + return 0;
> +}
inactive_anon_low: "number of inactive anonymous pages which are in lowmem"?
Nope.
Needs a comment. And maybe a better name, like inactive_anon_is_low.
Although making the return type a bool kind-of does that.
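i.e., a sketch of the suggested shape:

	static inline bool inactive_anon_is_low(struct zone *zone)
	{
		unsigned long active = zone_page_state(zone, NR_ACTIVE_ANON);
		unsigned long inactive = zone_page_state(zone, NR_INACTIVE_ANON);

		/* true when the inactive anon list is below its target size */
		return inactive * zone->inactive_ratio < active;
	}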
> #endif
> Index: linux-2.6.26-rc2-mm1/include/linux/mmzone.h
> ===================================================================
> --- linux-2.6.26-rc2-mm1.orig/include/linux/mmzone.h 2008-05-23 14:21:34.000000000 -0400
> +++ linux-2.6.26-rc2-mm1/include/linux/mmzone.h 2008-05-28 12:09:06.000000000 -0400
> @@ -311,6 +311,11 @@ struct zone {
> */
> int prev_priority;
>
> + /*
> + * The ratio of active to inactive pages.
> + */
> + unsigned int inactive_ratio;
That comment needs a lot of help please. For a start, it's plain wrong
- inactive_ratio would need to be a float to be able to record that ratio.
The comment should describe the units too.
Now poor-old-reviewer has to go off and work out what this thing is.
>
> ZONE_PADDING(_pad2_)
> /* Rarely used or read-mostly fields */
> Index: linux-2.6.26-rc2-mm1/mm/page_alloc.c
> ===================================================================
> --- linux-2.6.26-rc2-mm1.orig/mm/page_alloc.c 2008-05-23 14:21:34.000000000 -0400
> +++ linux-2.6.26-rc2-mm1/mm/page_alloc.c 2008-05-28 12:09:06.000000000 -0400
> @@ -4269,6 +4269,45 @@ void setup_per_zone_pages_min(void)
> calculate_totalreserve_pages();
> }
>
> +/**
> + * setup_per_zone_inactive_ratio - called when min_free_kbytes changes.
> + *
> + * The inactive anon list should be small enough that the VM never has to
> + * do too much work, but large enough that each inactive page has a chance
> + * to be referenced again before it is swapped out.
> + *
> + * The inactive_anon ratio is the ratio of active to inactive anonymous
target ratio? Desired ratio?
> + * pages. Ie. a ratio of 3 means 3:1 or 25% of the anonymous pages are
> + * on the inactive list.
> + *
> + * total return max
> + * memory value inactive anon
This function doesn't "return" a "value".
> + * -------------------------------------
> + * 10MB 1 5MB
> + * 100MB 1 50MB
> + * 1GB 3 250MB
> + * 10GB 10 0.9GB
> + * 100GB 31 3GB
> + * 1TB 101 10GB
> + * 10TB 320 32GB
> + */
> +void setup_per_zone_inactive_ratio(void)
> +{
> + struct zone *zone;
> +
> + for_each_zone(zone) {
> + unsigned int gb, ratio;
> +
> + /* Zone size in gigabytes */
> + gb = zone->present_pages >> (30 - PAGE_SHIFT);
> + ratio = int_sqrt(10 * gb);
> + if (!ratio)
> + ratio = 1;
> +
> + zone->inactive_ratio = ratio;
> + }
> +}
OK, so inactive_ratio is an integer 1 .. N which determines our target
number of inactive pages according to the formula
nr_inactive = nr_active / inactive_ratio
yes?
Can nr_inactive get larger than this? I assume so. I guess that
doesn't matter much. Except the problems which you're trying to solve
here can reoccur. What would I need to do to trigger that?
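(Worked example from the quoted code, for a 16GB zone:

	gb    = present_pages >> (30 - PAGE_SHIFT) = 16
	ratio = int_sqrt(10 * 16) = 12

so the target works out to nr_inactive = nr_active / 12, i.e. around
1.2GB of inactive anon if all 16GB were anonymous, consistent with the
table in the quoted comment.)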
> /*
> * Initialise min_free_kbytes.
> *
> @@ -4306,6 +4345,7 @@ static int __init init_per_zone_pages_mi
> min_free_kbytes = 65536;
> setup_per_zone_pages_min();
> setup_per_zone_lowmem_reserve();
> + setup_per_zone_inactive_ratio();
> return 0;
> }
> module_init(init_per_zone_pages_min)
> Index: linux-2.6.26-rc2-mm1/mm/vmscan.c
> ===================================================================
> --- linux-2.6.26-rc2-mm1.orig/mm/vmscan.c 2008-05-23 14:21:34.000000000 -0400
> +++ linux-2.6.26-rc2-mm1/mm/vmscan.c 2008-05-28 12:11:38.000000000 -0400
> @@ -114,7 +114,7 @@ struct scan_control {
> /*
> * From 0 .. 100. Higher means more swappy.
> */
> -int vm_swappiness = 60;
> +int vm_swappiness = 20;
<goes back to check the changelog>
Whoa. Where'd this come from?
> long vm_total_pages; /* The total number of pages which the VM controls */
>
> static LIST_HEAD(shrinker_list);
> @@ -1008,7 +1008,7 @@ static inline int zone_is_near_oom(struc
> static void shrink_active_list(unsigned long nr_pages, struct zone *zone,
> struct scan_control *sc, int priority, int file)
> {
> - unsigned long pgmoved;
> + unsigned long pgmoved = 0;
> int pgdeactivate = 0;
> unsigned long pgscanned;
> LIST_HEAD(l_hold); /* The pages which were snipped off */
> @@ -1036,17 +1036,32 @@ static void shrink_active_list(unsigned
> __mod_zone_page_state(zone, NR_ACTIVE_ANON, -pgmoved);
> spin_unlock_irq(&zone->lru_lock);
>
> + pgmoved = 0;
didn't we just do that?
> while (!list_empty(&l_hold)) {
> cond_resched();
> page = lru_to_page(&l_hold);
> list_del(&page->lru);
> - if (page_referenced(page, 0, sc->mem_cgroup))
> - list_add(&page->lru, &l_active);
> - else
> + if (page_referenced(page, 0, sc->mem_cgroup)) {
> + if (file) {
> + /* Referenced file pages stay active. */
> + list_add(&page->lru, &l_active);
> + } else {
> + /* Anonymous pages always get deactivated. */
hm. That's going to make the machine swap like hell. I guess I don't
understand all this yet.
> + list_add(&page->lru, &l_inactive);
> + pgmoved++;
> + }
> + } else
> list_add(&page->lru, &l_inactive);
> }
>
> /*
> + * Count the referenced anon pages as rotated, to balance pageout
> + * scan pressure between file and anonymous pages in get_sacn_ratio.
tpyo
* Re: [PATCH -mm 08/25] add some sanity checks to get_scan_ratio
2008-06-06 20:28 ` [PATCH -mm 08/25] add some sanity checks to get_scan_ratio Rik van Riel, Rik van Riel
@ 2008-06-07 1:04 ` Andrew Morton
2008-06-08 15:11 ` Rik van Riel
0 siblings, 1 reply; 102+ messages in thread
From: Andrew Morton @ 2008-06-07 1:04 UTC (permalink / raw)
To: Rik van Riel; +Cc: linux-kernel, lee.schermerhorn, kosaki.motohiro
On Fri, 06 Jun 2008 16:28:46 -0400
Rik van Riel <riel@redhat.com> wrote:
> From: Rik van Riel <riel@redhat.com>
>
> The access ratio based scan rate determination in get_scan_ratio
> works ok in most situations, but needs to be corrected in some
> corner cases:
> - if we run out of swap space, do not bother scanning the anon LRUs
> - if we have already freed all of the page cache, we need to scan
> the anon LRUs
Strange. We'll never free *all* the pagecache?
> - restore the *actual* access ratio based scan rate algorithm, the
> previous versions of this patch series had the wrong version
> - scale the number of pages added to zone->nr_scan[l]
>
> ...
>
> @@ -1180,15 +1191,19 @@ static void get_scan_ratio(struct zone *
> file = zone_page_state(zone, NR_ACTIVE_FILE) +
> zone_page_state(zone, NR_INACTIVE_FILE);
>
> - rotate_sum = zone->recent_rotated_file + zone->recent_rotated_anon;
> -
> /* Keep a floating average of RECENT references. */
> - if (unlikely(rotate_sum > min(anon, file))) {
> + if (unlikely(zone->recent_scanned_anon > anon / zone->inactive_ratio)) {
> spin_lock_irq(&zone->lru_lock);
> - zone->recent_rotated_file /= 2;
> + zone->recent_scanned_anon /= 2;
> zone->recent_rotated_anon /= 2;
> spin_unlock_irq(&zone->lru_lock);
> - rotate_sum /= 2;
> + }
> +
> + if (unlikely(zone->recent_scanned_file > file / 4)) {
I see nothing in the changelog about this and there are no comments.
How can a reader possibly work out what you were thinking when this
was typed in??
> + spin_lock_irq(&zone->lru_lock);
> + zone->recent_scanned_file /= 2;
> + zone->recent_rotated_file /= 2;
> + spin_unlock_irq(&zone->lru_lock);
> }
>
> /*
> @@ -1201,23 +1216,33 @@ static void get_scan_ratio(struct zone *
> /*
> * anon recent_rotated_anon
> * %anon = 100 * ----------- / ------------------- * IO cost
> - * anon + file rotate_sum
> + * anon + file recent_scanned_anon
> */
> - ap = (anon_prio * anon) / (anon + file + 1);
> - ap *= rotate_sum / (zone->recent_rotated_anon + 1);
> - if (ap == 0)
> - ap = 1;
> - else if (ap > 100)
> - ap = 100;
> - percent[0] = ap;
> -
> - fp = (file_prio * file) / (anon + file + 1);
> - fp *= rotate_sum / (zone->recent_rotated_file + 1);
> - if (fp == 0)
> - fp = 1;
> - else if (fp > 100)
> - fp = 100;
> - percent[1] = fp;
> + ap = (anon_prio + 1) * (zone->recent_scanned_anon + 1);
> + ap /= zone->recent_rotated_anon + 1;
> +
> + fp = (file_prio + 1) * (zone->recent_scanned_file + 1);
> + fp /= zone->recent_rotated_file + 1;
> +
> + /* Normalize to percentages */
> + percent[0] = 100 * ap / (ap + fp + 1);
> + percent[1] = 100 - percent[0];
> +
> + free = zone_page_state(zone, NR_FREE_PAGES);
> +
> + /*
> + * If we have no swap space, do not bother scanning anon pages.
> + */
> + if (nr_swap_pages <= 0) {
> + percent[0] = 0;
> + percent[1] = 100;
> + }
> + /*
> + * If we already freed most file pages, scan the anon pages
> + * regardless of the page access ratios or swappiness setting.
> + */
> + else if (file + free <= zone->pages_high)
> + percent[0] = 100;
> }
Perhaps the (nr_swap_pages <= 0) test could happen earlier on.
Please quadruple-check this code like a paranoid maniac looking for
underflows, overflows and divides-by-zero. Bear in mind that x/(y+1)
can get a div-by-zero for sufficiently-unexpected values of y.
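(Concretely: if y is unsigned and an accounting bug ever leaves it at
ULONG_MAX, y + 1 wraps to 0.)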
The layout of the last bit is misleading, IMO. Better and more typical
would be:
if (nr_swap_pages <= 0) {
/*
* If we have no swap space, do not bother scanning anon pages.
*/
percent[0] = 0;
percent[1] = 100;
} else if (file + free <= zone->pages_high) {
/*
* If we already freed most file pages, scan the anon pages
* regardless of the page access ratios or swappiness setting.
*/
percent[0] = 100;
}
(Was there no need to write percent[1] here?)
>
> @@ -1238,13 +1263,17 @@ static unsigned long shrink_zone(int pri
> for_each_lru(l) {
> if (scan_global_lru(sc)) {
> int file = is_file_lru(l);
> + int scan;
> /*
> * Add one to nr_to_scan just to make sure that the
> - * kernel will slowly sift through the active list.
> + * kernel will slowly sift through each list.
> */
> - zone->nr_scan[l] += (zone_page_state(zone,
> - NR_INACTIVE_ANON + l) >> priority) + 1;
> - nr[l] = zone->nr_scan[l] * percent[file] / 100;
> + scan = zone_page_state(zone, NR_INACTIVE_ANON + l);
> + scan >>= priority;
> + scan = (scan * percent[file]) / 100;
Oh, so that's what the [0] and [1] in get_scan_ratio() mean. Perhaps
doing this:
if (nr_swap_pages <= 0) {
percent[0] = 0; /* anon */
percent[1] = 100; /* file */
would clarify things. But much better would be
/* comment goes here */
struct scan_ratios {
unsigned long anon;
unsigned long file;
};
no?
> + zone->nr_scan[l] += scan + 1;
> + nr[l] = zone->nr_scan[l];
> if (nr[l] >= sc->swap_cluster_max)
> zone->nr_scan[l] = 0;
> else
> @@ -1261,7 +1290,7 @@ static unsigned long shrink_zone(int pri
> }
>
> while (nr[LRU_INACTIVE_ANON] || nr[LRU_ACTIVE_FILE] ||
> - nr[LRU_INACTIVE_FILE]) {
> + nr[LRU_INACTIVE_FILE]) {
> for_each_lru(l) {
> if (nr[l]) {
> nr_to_scan = min(nr[l],
> @@ -1274,6 +1303,13 @@ static unsigned long shrink_zone(int pri
> }
> }
>
> + /*
> + * Even if we did not try to evict anon pages at all, we want to
> + * rebalance the anon lru active/inactive ratio.
> + */
> + if (scan_global_lru(sc) && inactive_anon_low(zone))
> + shrink_active_list(SWAP_CLUSTER_MAX, zone, sc, priority, 0);
> +
> throttle_vm_writeout(sc->gfp_mask);
> return nr_reclaimed;
> }
> Index: linux-2.6.26-rc2-mm1/include/linux/mmzone.h
> ===================================================================
> --- linux-2.6.26-rc2-mm1.orig/include/linux/mmzone.h 2008-05-28 12:09:06.000000000 -0400
> +++ linux-2.6.26-rc2-mm1/include/linux/mmzone.h 2008-05-28 12:11:51.000000000 -0400
> @@ -289,6 +289,8 @@ struct zone {
>
> unsigned long recent_rotated_anon;
> unsigned long recent_rotated_file;
> + unsigned long recent_scanned_anon;
> + unsigned long recent_scanned_file;
I think struct zone is sufficiently important and obscure that
field-by-field /*documentation*/ is needed. Not as kerneldoc, please -
better to do it at the definition site
> unsigned long pages_scanned; /* since last reclaim */
> unsigned long flags; /* zone flags, see below */
> Index: linux-2.6.26-rc2-mm1/mm/page_alloc.c
> ===================================================================
> --- linux-2.6.26-rc2-mm1.orig/mm/page_alloc.c 2008-05-28 12:09:06.000000000 -0400
> +++ linux-2.6.26-rc2-mm1/mm/page_alloc.c 2008-05-28 12:11:51.000000000 -0400
> @@ -3512,7 +3512,8 @@ static void __paginginit free_area_init_
> }
> zone->recent_rotated_anon = 0;
> zone->recent_rotated_file = 0;
> -//TODO recent_scanned_* ???
> + zone->recent_scanned_anon = 0;
> + zone->recent_scanned_file = 0;
> zap_zone_vm_stats(zone);
> zone->flags = 0;
> if (!size)
> Index: linux-2.6.26-rc2-mm1/mm/swap.c
> ===================================================================
> --- linux-2.6.26-rc2-mm1.orig/mm/swap.c 2008-05-28 12:09:06.000000000 -0400
> +++ linux-2.6.26-rc2-mm1/mm/swap.c 2008-05-28 12:11:51.000000000 -0400
> @@ -176,8 +176,8 @@ void activate_page(struct page *page)
>
> spin_lock_irq(&zone->lru_lock);
> if (PageLRU(page) && !PageActive(page)) {
> - int lru = LRU_BASE;
> - lru += page_file_cache(page);
> + int file = page_file_cache(page);
> + int lru = LRU_BASE + file;
> del_page_from_lru_list(zone, page, lru);
>
> SetPageActive(page);
> @@ -185,6 +185,15 @@ void activate_page(struct page *page)
> add_page_to_lru_list(zone, page, lru);
> __count_vm_event(PGACTIVATE);
> mem_cgroup_move_lists(page, true);
> +
> + if (file) {
> + zone->recent_scanned_file++;
> + zone->recent_rotated_file++;
> + } else {
> + /* Can this happen? Maybe through tmpfs... */
What's the status here?
> + zone->recent_scanned_anon++;
> + zone->recent_rotated_anon++;
> + }
> }
> spin_unlock_irq(&zone->lru_lock);
> }
* Re: [PATCH -mm 09/25] fix pagecache reclaim referenced bit check
2008-06-06 20:28 ` [PATCH -mm 09/25] fix pagecache reclaim referenced bit check Rik van Riel, Rik van Riel
@ 2008-06-07 1:04 ` Andrew Morton
2008-06-07 1:08 ` Rik van Riel
0 siblings, 1 reply; 102+ messages in thread
From: Andrew Morton @ 2008-06-07 1:04 UTC (permalink / raw)
To: Rik van Riel; +Cc: linux-kernel, lee.schermerhorn, kosaki.motohiro, mbligh
On Fri, 06 Jun 2008 16:28:47 -0400
Rik van Riel <riel@redhat.com> wrote:
> From: Rik van Riel <riel@redhat.com>
>
> The -mm tree contains the patch
> vmscan-give-referenced-active-and-unmapped-pages-a-second-trip-around-the-lru.patch
> which gives referenced pagecache pages another trip around
> the active list. This seems to help keep frequently accessed
> pagecache pages in memory.
>
> However, it means that pagecache pages that get moved to the
> inactive list do not have their referenced bit set, and a
> reference to the page will not get it moved back to the active
> list
Should that be "the next reference"? Because two references _will_ cause
the page to be activated?
>. This patch sets the referenced bit on pagecache pages
> that get deactivated, so the next access to the page will promote
> it back to the active list.
>
> ---
> mm/vmscan.c | 4 ++++
> 1 file changed, 4 insertions(+)
>
> Index: linux-2.6.26-rc2-mm1/mm/vmscan.c
> ===================================================================
> --- linux-2.6.26-rc2-mm1.orig/mm/vmscan.c 2008-05-28 12:11:51.000000000 -0400
> +++ linux-2.6.26-rc2-mm1/mm/vmscan.c 2008-05-28 12:14:34.000000000 -0400
> @@ -1062,8 +1062,13 @@ static void shrink_active_list(unsigned
> list_add(&page->lru, &l_inactive);
> pgmoved++;
> }
> - } else
> + } else {
> list_add(&page->lru, &l_inactive);
> + if (file && !page_mapped(page))
> + /* Bypass use-once, make the next access count.
> + * See mark_page_accessed. */
> + SetPageReferenced(page);
> + }
> }
Will this change also cause these pages to get a second trip around the
inactive list? Or do we at the end of the patch series end up
reclaiming pagecache regardless of their PageReferenced()? If so, it
seems that we're making the pagecache pages a _lot_ stickier with this
change and my one - an additional trip around the active list and an
additional one around the inactive list.
Changelog should spell all this out, I guess.
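For reference, the two-touch rule being piggybacked on here, namely
mark_page_accessed() as it stood then (quoted from memory, worth
double-checking against the tree):

	void mark_page_accessed(struct page *page)
	{
		if (!PageActive(page) && PageReferenced(page) && PageLRU(page)) {
			/* second touch: promote to the active list */
			activate_page(page);
			ClearPageReferenced(page);
		} else if (!PageReferenced(page)) {
			/* first touch: just remember it */
			SetPageReferenced(page);
		}
	}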
* Re: [PATCH -mm 10/25] add newly swapped in pages to the inactive list
2008-06-06 20:28 ` [PATCH -mm 10/25] add newly swapped in pages to the inactive list Rik van Riel, Rik van Riel
@ 2008-06-07 1:04 ` Andrew Morton
0 siblings, 0 replies; 102+ messages in thread
From: Andrew Morton @ 2008-06-07 1:04 UTC (permalink / raw)
To: Rik van Riel; +Cc: linux-kernel, lee.schermerhorn, kosaki.motohiro
On Fri, 06 Jun 2008 16:28:48 -0400
Rik van Riel <riel@redhat.com> wrote:
> From: Rik van Riel <riel@redhat.com>
>
> Swapin_readahead can read in a lot of data that the processes in
> memory never need. Adding swap cache pages to the inactive list
> prevents them from putting too much pressure on the working set.
>
> This has the potential to help the programs that are already in
> memory, but it could also be a disadvantage to processes that
> are trying to get swapped in.
>
> In short, this patch needs testing.
>
<hopes that the changelog is out of date>
>
> ---
> mm/swap_state.c | 2 +-
> 1 file changed, 1 insertion(+), 1 deletion(-)
>
> Index: linux-2.6.26-rc2-mm1/mm/swap_state.c
> ===================================================================
> --- linux-2.6.26-rc2-mm1.orig/mm/swap_state.c 2008-05-28 09:40:59.000000000 -0400
> +++ linux-2.6.26-rc2-mm1/mm/swap_state.c 2008-05-28 09:42:26.000000000 -0400
> @@ -302,7 +302,7 @@ struct page *read_swap_cache_async(swp_e
> /*
> * Initiate read into locked page and return.
> */
> - lru_cache_add_active_anon(new_page);
> + lru_cache_add_anon(new_page);
> swap_readpage(NULL, new_page);
> return new_page;
> }
* Re: [PATCH -mm 11/25] more aggressively use lumpy reclaim
2008-06-06 20:28 ` [PATCH -mm 11/25] more aggressively use lumpy reclaim Rik van Riel, Rik van Riel
@ 2008-06-07 1:05 ` Andrew Morton
0 siblings, 0 replies; 102+ messages in thread
From: Andrew Morton @ 2008-06-07 1:05 UTC (permalink / raw)
To: Rik van Riel; +Cc: linux-kernel, lee.schermerhorn, kosaki.motohiro
On Fri, 06 Jun 2008 16:28:49 -0400
Rik van Riel <riel@redhat.com> wrote:
> From: Rik van Riel <riel@redhat.com>
>
> During an AIM7 run on a 16GB system, fork started failing around
> 32000 threads, despite the system having plenty of free swap and
> 15GB of pageable memory.
Can we update the changelog to explain why this actually happened?
From reading the patch I _assume_ that
a) the kernel was using 8k (2-page) stacks and
b) all the memory was stuck on the active list, so reclaim wasn't
able to find any order-1 pages and wasn't able to find any order-0
pages which gave it allocatable order-1 pages.
?
> If normal pageout does not result in contiguous free pages for
> kernel stacks, fall back to lumpy reclaim instead of failing fork
> or doing excessive pageout IO.
hm, I guess that this para kinda says that. Not sure what the
"excessive pageout IO" part is referring to?
> I do not know whether this change is needed due to the extreme
> stress test or because the inactive list is a smaller fraction
> of system memory on huge systems.
>
I guess that tweaking the inactive_ratio could be used to determine
this?
>
> ---
> mm/vmscan.c | 20 ++++++++++++++++----
> 1 file changed, 16 insertions(+), 4 deletions(-)
>
> Index: linux-2.6.26-rc2-mm1/mm/vmscan.c
> ===================================================================
> --- linux-2.6.26-rc2-mm1.orig/mm/vmscan.c 2008-05-28 12:14:34.000000000 -0400
> +++ linux-2.6.26-rc2-mm1/mm/vmscan.c 2008-05-28 12:14:43.000000000 -0400
> @@ -857,7 +857,8 @@ int isolate_lru_page(struct page *page)
> * of reclaimed pages
> */
> static unsigned long shrink_inactive_list(unsigned long max_scan,
> - struct zone *zone, struct scan_control *sc, int file)
> + struct zone *zone, struct scan_control *sc,
> + int priority, int file)
> {
> LIST_HEAD(page_list);
> struct pagevec pvec;
> @@ -875,8 +876,19 @@ static unsigned long shrink_inactive_lis
> unsigned long nr_freed;
> unsigned long nr_active;
> unsigned int count[NR_LRU_LISTS] = { 0, };
> - int mode = (sc->order > PAGE_ALLOC_COSTLY_ORDER) ?
> - ISOLATE_BOTH : ISOLATE_INACTIVE;
> + int mode = ISOLATE_INACTIVE;
> +
> + /*
> + * If we need a large contiguous chunk of memory, or have
> + * trouble getting a small set of contiguous pages, we
> + * will reclaim both active and inactive pages.
> + *
> + * We use the same threshold as pageout congestion_wait below.
> + */
> + if (sc->order > PAGE_ALLOC_COSTLY_ORDER)
> + mode = ISOLATE_BOTH;
> + else if (sc->order && priority < DEF_PRIORITY - 2)
> + mode = ISOLATE_BOTH;
>
> nr_taken = sc->isolate_pages(sc->swap_cluster_max,
> &page_list, &nr_scan, sc->order, mode,
> @@ -1171,7 +1183,7 @@ static unsigned long shrink_list(enum lr
> shrink_active_list(nr_to_scan, zone, sc, priority, file);
> return 0;
> }
> - return shrink_inactive_list(nr_to_scan, zone, sc, file);
> + return shrink_inactive_list(nr_to_scan, zone, sc, priority, file);
> }
* Re: [PATCH -mm 13/25] Noreclaim LRU Infrastructure
2008-06-06 20:28 ` [PATCH -mm 13/25] Noreclaim LRU Infrastructure Rik van Riel, Rik van Riel
@ 2008-06-07 1:05 ` Andrew Morton
2008-06-08 20:34 ` Rik van Riel
2008-06-10 20:09 ` Rik van Riel
0 siblings, 2 replies; 102+ messages in thread
From: Andrew Morton @ 2008-06-07 1:05 UTC (permalink / raw)
To: Rik van Riel
Cc: linux-kernel, lee.schermerhorn, kosaki.motohiro, linux-mm,
eric.whitney
On Fri, 06 Jun 2008 16:28:51 -0400
Rik van Riel <riel@redhat.com> wrote:
>
> From: Lee Schermerhorn <Lee.Schermerhorn@hp.com>
>
> Infrastructure to manage pages excluded from reclaim--i.e., hidden
> from vmscan. Based on a patch by Larry Woodman of Red Hat. Reworked
> to maintain "nonreclaimable" pages on a separate per-zone LRU list,
> to "hide" them from vmscan.
>
> Kosaki Motohiro added the support for the memory controller noreclaim
> lru list.
>
> Pages on the noreclaim list have both PG_noreclaim and PG_lru set.
> Thus, PG_noreclaim is analogous to and mutually exclusive with
> PG_active--it specifies which LRU list the page is on.
>
> The noreclaim infrastructure is enabled by a new mm Kconfig option
> [CONFIG_]NORECLAIM_LRU.
Having a config option for this really sucks, and needs extra-special
justification, rather than none.
Plus..
akpm:/usr/src/25> find . -name '*.[ch]' | xargs grep CONFIG_NORECLAIM_LRU
./drivers/base/node.c:#ifdef CONFIG_NORECLAIM_LRU
./drivers/base/node.c:#ifdef CONFIG_NORECLAIM_LRU
./fs/proc/proc_misc.c:#ifdef CONFIG_NORECLAIM_LRU
./fs/proc/proc_misc.c:#ifdef CONFIG_NORECLAIM_LRU
./include/linux/mmzone.h:#ifdef CONFIG_NORECLAIM_LRU
./include/linux/mmzone.h:#ifdef CONFIG_NORECLAIM_LRU
./include/linux/mmzone.h:#ifdef CONFIG_NORECLAIM_LRU
./include/linux/page-flags.h:#ifdef CONFIG_NORECLAIM_LRU
./include/linux/page-flags.h:#ifdef CONFIG_NORECLAIM_LRU
./include/linux/pagemap.h:#ifdef CONFIG_NORECLAIM_LRU
./include/linux/swap.h:#ifdef CONFIG_NORECLAIM_LRU
./include/linux/vmstat.h:#ifdef CONFIG_NORECLAIM_LRU
./kernel/sysctl.c:#ifdef CONFIG_NORECLAIM_LRU
./mm/internal.h:#ifdef CONFIG_NORECLAIM_LRU
./mm/page_alloc.c:#ifdef CONFIG_NORECLAIM_LRU
./mm/page_alloc.c:#ifdef CONFIG_NORECLAIM_LRU
./mm/page_alloc.c:#ifdef CONFIG_NORECLAIM_LRU
./mm/page_alloc.c:#ifdef CONFIG_NORECLAIM_LRU
./mm/page_alloc.c:#ifdef CONFIG_NORECLAIM_LRU
./mm/page_alloc.c:#ifdef CONFIG_NORECLAIM_LRU
./mm/page_alloc.c:#ifdef CONFIG_NORECLAIM_LRU
./mm/vmscan.c:#ifdef CONFIG_NORECLAIM_LRU
./mm/vmscan.c:#ifdef CONFIG_NORECLAIM_LRU
./mm/vmscan.c:#ifdef CONFIG_NORECLAIM_LRU
./mm/vmstat.c:#ifdef CONFIG_NORECLAIM_LRU
./mm/vmstat.c:#ifdef CONFIG_NORECLAIM_LRU
> A new function 'page_reclaimable(page, vma)' in vmscan.c tests whether
> or not a page is reclaimable. Subsequent patches will add the various
> !reclaimable tests. We'll want to keep these tests light-weight for
> use in shrink_active_list() and, possibly, the fault path.
>
> To avoid races between tasks putting pages [back] onto an LRU list and
> tasks that might be moving the page from nonreclaimable to reclaimable
> state, one should test reclaimability under page lock and place
> nonreclaimable pages directly on the noreclaim list before dropping the
> lock. Otherwise, we risk "stranding" reclaimable pages on the noreclaim
> list. It's OK to use the pagevec caches for reclaimable pages. The new
> function 'putback_lru_page()'--inverse to 'isolate_lru_page()'--handles
> this transition, including potential page truncation while the page is
> unlocked.
>
The changelog doesn't even mention, let alone explain and justify the
fact that this feature is not available on 32-bit systems. This is a
large drawback - it means that a (hopefully useful) feature is
unavailable to the large majority of Linux systems and that it reduces
the testing coverage and that it adversely impacts MM maintainability.
> Index: linux-2.6.26-rc2-mm1/mm/Kconfig
> ===================================================================
> --- linux-2.6.26-rc2-mm1.orig/mm/Kconfig 2008-05-29 16:21:04.000000000 -0400
> +++ linux-2.6.26-rc2-mm1/mm/Kconfig 2008-06-06 16:05:15.000000000 -0400
> @@ -205,3 +205,13 @@ config NR_QUICK
> config VIRT_TO_BUS
> def_bool y
> depends on !ARCH_NO_VIRT_TO_BUS
> +
> +config NORECLAIM_LRU
> + bool "Add LRU list to track non-reclaimable pages (EXPERIMENTAL, 64BIT only)"
> + depends on EXPERIMENTAL && 64BIT
> + help
> + Supports tracking of non-reclaimable pages off the [in]active lists
> + to avoid excessive reclaim overhead on large memory systems. Pages
> + may be non-reclaimable because: they are locked into memory, they
> + are anonymous pages for which no swap space exists, or they are anon
> + pages that are expensive to unmap [long anon_vma "related vma" list.]
Aunt Tillie might be struggling with some of that.
> Index: linux-2.6.26-rc2-mm1/include/linux/page-flags.h
> ===================================================================
> --- linux-2.6.26-rc2-mm1.orig/include/linux/page-flags.h 2008-05-29 16:21:04.000000000 -0400
> +++ linux-2.6.26-rc2-mm1/include/linux/page-flags.h 2008-06-06 16:05:15.000000000 -0400
> @@ -94,6 +94,9 @@ enum pageflags {
> PG_reclaim, /* To be reclaimed asap */
> PG_buddy, /* Page is free, on buddy lists */
> PG_swapbacked, /* Page is backed by RAM/swap */
> +#ifdef CONFIG_NORECLAIM_LRU
> + PG_noreclaim, /* Page is "non-reclaimable" */
> +#endif
I fear that we're messing up the terminology here.
Go into your 2.6.25 tree and do `grep -i reclaimable */*.c'. The term
already means a few different things, but in the vmscan context,
"reclaimable" means that the page is unreferenced, clean and can be
stolen. "reclaimable" also means a lot of other things, and we just
made that worse.
Can we think of a new term which uniquely describes this new concept
and use that, rather than flogging the old horse?
>
> ...
>
> +/**
> + * add_page_to_noreclaim_list
> + * @page: the page to be added to the noreclaim list
> + *
> + * Add page directly to its zone's noreclaim list. To avoid races with
> + * tasks that might be making the page reclaimble while it's not on the
> + * lru, we want to add the page while it's locked or otherwise "invisible"
> + * to other tasks. This is difficult to do when using the pagevec cache,
> + * so bypass that.
> + */
How does a task "make a page reclaimable"? munlock()? fsync()?
exit()?
Choice of terminology matters...
> +void add_page_to_noreclaim_list(struct page *page)
> +{
> + struct zone *zone = page_zone(page);
> +
> + spin_lock_irq(&zone->lru_lock);
> + SetPageNoreclaim(page);
> + SetPageLRU(page);
> + add_page_to_lru_list(zone, page, LRU_NORECLAIM);
> + spin_unlock_irq(&zone->lru_lock);
> +}
> +
> /*
> * Drain pages out of the cpu's pagevecs.
> * Either "cpu" is the current CPU, and preemption has already been
> @@ -339,6 +370,7 @@ void release_pages(struct page **pages,
>
> if (PageLRU(page)) {
> struct zone *pagezone = page_zone(page);
> +
> if (pagezone != zone) {
> if (zone)
> spin_unlock_irqrestore(&zone->lru_lock,
> @@ -415,6 +447,7 @@ void ____pagevec_lru_add(struct pagevec
> {
> int i;
> struct zone *zone = NULL;
> + VM_BUG_ON(is_noreclaim_lru(lru));
>
> for (i = 0; i < pagevec_count(pvec); i++) {
> struct page *page = pvec->pages[i];
> @@ -426,6 +459,7 @@ void ____pagevec_lru_add(struct pagevec
> zone = pagezone;
> spin_lock_irq(&zone->lru_lock);
> }
> + VM_BUG_ON(PageActive(page) || PageNoreclaim(page));
If this ever triggers, you'll wish that it had been coded with two
separate assertions.
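i.e., the split form:

	VM_BUG_ON(PageActive(page));
	VM_BUG_ON(PageNoreclaim(page));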
> VM_BUG_ON(PageLRU(page));
> SetPageLRU(page);
> if (is_active_lru(lru))
>
> ...
>
> +/**
> + * putback_lru_page
> + * @page to be put back to appropriate lru list
> + *
> + * Add previously isolated @page to appropriate LRU list.
> + * Page may still be non-reclaimable for other reasons.
> + *
> + * lru_lock must not be held, interrupts must be enabled.
> + * Must be called with page locked.
> + *
> + * return 1 if page still locked [not truncated], else 0
> + */
The kerneldoc function description is missing.
> +int putback_lru_page(struct page *page)
> +{
> + int lru;
> + int ret = 1;
> +
> + VM_BUG_ON(!PageLocked(page));
> + VM_BUG_ON(PageLRU(page));
> +
> + lru = !!TestClearPageActive(page);
> + ClearPageNoreclaim(page); /* for page_reclaimable() */
> +
> + if (unlikely(!page->mapping)) {
> + /*
> + * page truncated. drop lock as put_page() will
> + * free the page.
> + */
> + VM_BUG_ON(page_count(page) != 1);
> + unlock_page(page);
> + ret = 0;
> + } else if (page_reclaimable(page, NULL)) {
> + /*
> + * For reclaimable pages, we can use the cache.
> + * In event of a race, worst case is we end up with a
> + * non-reclaimable page on [in]active list.
> + * We know how to handle that.
> + */
> + lru += page_file_cache(page);
> + lru_cache_add_lru(page, lru);
> + mem_cgroup_move_lists(page, lru);
> + } else {
> + /*
> + * Put non-reclaimable pages directly on zone's noreclaim
> + * list.
> + */
> + add_page_to_noreclaim_list(page);
> + mem_cgroup_move_lists(page, LRU_NORECLAIM);
> + }
> +
> + put_page(page); /* drop ref from isolate */
> + return ret; /* ret => "page still locked" */
> +}
<stares for a while>
<penny drops>
So THAT'S what the magical "return 2" is doing in page_file_cache()!
<looks>
OK, after all the patches are applied, the "2" becomes LRU_FILE and the
enumeration of `enum lru_list' reflects that.
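For anyone else chasing it, the resulting layout is roughly this
(reconstructed, not quoted from the patch):

	#define LRU_BASE	0
	#define LRU_ACTIVE	1
	#define LRU_FILE	2

	enum lru_list {
		LRU_INACTIVE_ANON = LRU_BASE,
		LRU_ACTIVE_ANON = LRU_BASE + LRU_ACTIVE,
		LRU_INACTIVE_FILE = LRU_BASE + LRU_FILE,
		LRU_ACTIVE_FILE = LRU_BASE + LRU_FILE + LRU_ACTIVE,
		LRU_NORECLAIM,
		NR_LRU_LISTS
	};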
> +/*
> + * Cull page that shrink_*_list() has detected to be non-reclaimable
> + * under page lock to close races with other tasks that might be making
> + * the page reclaimable. Avoid stranding a reclaimable page on the
> + * noreclaim list.
> + */
> +static inline void cull_nonreclaimable_page(struct page *page)
> +{
> + lock_page(page);
> + if (putback_lru_page(page))
> + unlock_page(page);
> +}
Again, the terminology is quite overloaded and confusing. What does
"non-reclaimable" mean in this context? _Any_ page which was dirty or
which had an elevated refcount? Surely not referenced pages, which the
scanner also can treat as non-reclaimable.
Did you check whether all these inlined functions really should have
been inlined? Even ones like this are probably too large.
> /*
> * shrink_page_list() returns the number of reclaimed pages
> */
>
> ...
>
> @@ -647,6 +721,14 @@ int __isolate_lru_page(struct page *page
> if (mode != ISOLATE_BOTH && (!page_file_cache(page) != !file))
> return ret;
>
> + /*
> + * Non-reclaimable pages shouldn't make it onto either the active
> + * nor the inactive list. However, when doing lumpy reclaim of
> + * higher order pages we can still run into them.
I guess that something along the lines of "when this function is being
called for lumpy reclaim we can still .." would be clearer.
> + */
> + if (PageNoreclaim(page))
> + return ret;
> +
> ret = -EBUSY;
> if (likely(get_page_unless_zero(page))) {
> /*
* Re: [PATCH -mm 15/25] Ramfs and Ram Disk pages are non-reclaimable
2008-06-06 20:28 ` [PATCH -mm 15/25] Ramfs and Ram Disk pages are non-reclaimable Rik van Riel, Rik van Riel
@ 2008-06-07 1:05 ` Andrew Morton
2008-06-08 4:32 ` Greg KH
0 siblings, 1 reply; 102+ messages in thread
From: Andrew Morton @ 2008-06-07 1:05 UTC (permalink / raw)
To: Rik van Riel
Cc: linux-kernel, lee.schermerhorn, kosaki.motohiro, linux-mm,
eric.whitney
On Fri, 06 Jun 2008 16:28:53 -0400
Rik van Riel <riel@redhat.com> wrote:
>
> From: Lee Schermerhorn <Lee.Schermerhorn@hp.com>
>
> Christoph Lameter pointed out that ram disk pages also clutter the
> LRU lists. When vmscan finds them dirty and tries to clean them,
> the ram disk writeback function just redirties the page so that it
> goes back onto the active list. Round and round she goes...
>
> Define new address_space flag [shares address_space flags member
> with mapping's gfp mask] to indicate that the address space contains
> all non-reclaimable pages. This will provide for efficient testing
> of ramdisk pages in page_reclaimable().
>
> Also provide wrapper functions to set/test the noreclaim state to
> minimize #ifdefs in ramdisk driver and any other users of this
> facility.
>
> Set the noreclaim state on address_space structures for new
> ramdisk inodes. Test the noreclaim state in page_reclaimable()
> to cull non-reclaimable pages.
>
> Similarly, ramfs pages are non-reclaimable. Set the 'noreclaim'
> address_space flag for new ramfs inodes.
>
> These changes depend on [CONFIG_]NORECLAIM_LRU.
hm
>
> @@ -61,6 +61,7 @@ struct inode *ramfs_get_inode(struct sup
> inode->i_mapping->a_ops = &ramfs_aops;
> inode->i_mapping->backing_dev_info = &ramfs_backing_dev_info;
> mapping_set_gfp_mask(inode->i_mapping, GFP_HIGHUSER);
> + mapping_set_noreclaim(inode->i_mapping);
> inode->i_atime = inode->i_mtime = inode->i_ctime = CURRENT_TIME;
> switch (mode & S_IFMT) {
> default:
That's OK.
> Index: linux-2.6.26-rc2-mm1/drivers/block/brd.c
> ===================================================================
> --- linux-2.6.26-rc2-mm1.orig/drivers/block/brd.c 2008-05-29 16:21:04.000000000 -0400
> +++ linux-2.6.26-rc2-mm1/drivers/block/brd.c 2008-06-06 16:06:20.000000000 -0400
> @@ -374,8 +374,21 @@ static int brd_ioctl(struct inode *inode
> return error;
> }
>
> +/*
> + * brd_open():
> + * Just mark the mapping as containing non-reclaimable pages
> + */
> +static int brd_open(struct inode *inode, struct file *filp)
> +{
> + struct address_space *mapping = inode->i_mapping;
> +
> + mapping_set_noreclaim(mapping);
> + return 0;
> +}
> +
> static struct block_device_operations brd_fops = {
> .owner = THIS_MODULE,
> + .open = brd_open,
> .ioctl = brd_ioctl,
> #ifdef CONFIG_BLK_DEV_XIP
> .direct_access = brd_direct_access,
But this only works for pagecache in /dev/ramN. afaict the pagecache
for files which are written onto that "block device" remains on the LRU.
But that's OK, isn't it? For the ramdisk driver these pages _do_ have
backing store and _can_ be written back and reclaimed, yes?
Still, I'm unsure about the whole implementation. We already maintain
this sort of information in the backing_dev. Would it not be better to
just avoid ever putting such pages onto the LRU in the first place?
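The backing_dev signal in question could be tested with the existing
helper, something like this inside a page_reclaimable()-style check
(a sketch; whether it covers every such mapping is exactly the open
question):

	/* dirty pages that the backing device can never write back
	 * will only ever be redirtied by vmscan */
	if (page->mapping && !mapping_cap_writeback_dirty(page->mapping))
		return 0;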
Also, I expect there are a whole host of pseudo-filesystems (sysfs?)
which have this problem. Does the patch address all of them? If not,
can we come up with something which _does_ address them all without
having to hunt down and change every such fs?
* Re: [PATCH -mm 16/25] SHM_LOCKED pages are non-reclaimable
2008-06-06 20:28 ` [PATCH -mm 16/25] SHM_LOCKED " Rik van Riel, Rik van Riel
@ 2008-06-07 1:05 ` Andrew Morton
2008-06-07 5:21 ` KOSAKI Motohiro
2008-06-10 21:03 ` Rik van Riel
0 siblings, 2 replies; 102+ messages in thread
From: Andrew Morton @ 2008-06-07 1:05 UTC (permalink / raw)
To: Rik van Riel; +Cc: linux-kernel, lee.schermerhorn, kosaki.motohiro
On Fri, 06 Jun 2008 16:28:54 -0400
Rik van Riel <riel@redhat.com> wrote:
> From: Lee Schermerhorn <Lee.Schermerhorn@hp.com>
>
> Against: 2.6.26-rc2-mm1
>
> While working with Nick Piggin's mlock patches,
Change log refers to information which its reader has not got a hope
of actually locating.
> I noticed that
> shmem segments locked via shmctl(SHM_LOCKED) were not being handled.
> SHM_LOCKed pages work like ramdisk pages
Well, OK. As long as one remembers that "ramdisk pages" are different
from "pages of a file which is on ramdisk". Tricky, huh?
> --the writeback function
> just redirties the page so that it can't be reclaimed. Deal with
> these using the same approach as for ram disk pages.
>
> Use the AS_NORECLAIM flag to mark address_space of SHM_LOCKed
> shared memory regions as non-reclaimable. Then these pages
> will be culled off the normal LRU lists during vmscan.
So I guess there's more justification for handling these pages in this
manner, because someone could come along later and unlock them. But
that isn't true of /dev/ram0 pages and ramfs pages, etc.
> Add new wrapper function to clear the mapping's noreclaim state
> when/if shared memory segment is munlocked.
>
> Add 'scan_mapping_noreclaim_page()' to mm/vmscan.c to scan all
> pages in the shmem segment's mapping [struct address_space] for
> reclaimability now that they're no longer locked. If so, move
> them to the appropriate zone lru list. Note that
> scan_mapping_noreclaim_page() must be able to sleep on page_lock(),
> so we can't call it holding the shmem info spinlock nor the shmid
> spinlock. So, we pass the mapping [address_space] back to shmctl()
> on SHM_UNLOCK for rescuing any nonreclaimable pages after dropping
> the spinlocks. Once we drop the shmid lock, the backing shmem file
> can be deleted if the calling task doesn't have the shm area
> attached. To handle this, we take an extra reference on the file
> before dropping the shmid lock and drop the reference after scanning
> the mapping's noreclaim pages.
>
>
> ...
>
> +
> +/**
> + * check_move_noreclaim_page - check page for reclaimability and move to appropriate zone lru list
> + * @page: page to check reclaimability and move to appropriate lru list
> + * @zone: zone page is in
> + *
> + * Checks a page for reclaimability and moves the page to the appropriate
> + * zone lru list.
> + *
> + * Restrictions: zone->lru_lock must be held, page must be on LRU and must
> + * have PageNoreclaim set.
> + */
> +static void check_move_noreclaim_page(struct page *page, struct zone *zone)
> +{
> +
> + ClearPageNoreclaim(page); /* for page_reclaimable() */
Confused. Didn't we just lose track of our NR_NORECLAIM accounting?
> + if (page_reclaimable(page, NULL)) {
> + enum lru_list l = LRU_INACTIVE_ANON + page_file_cache(page);
> + __dec_zone_state(zone, NR_NORECLAIM);
> + list_move(&page->lru, &zone->list[l]);
> + __inc_zone_state(zone, NR_INACTIVE_ANON + l);
> + } else {
> + /*
> + * rotate noreclaim list
> + */
> + SetPageNoreclaim(page);
> + list_move(&page->lru, &zone->list[LRU_NORECLAIM]);
> + }
> +}
> +
> +/**
> + * scan_mapping_noreclaim_pages - scan an address space for reclaimable pages
> + * @mapping: struct address_space to scan for reclaimable pages
> + *
> + * Scan all pages in mapping. Check non-reclaimable pages for
> + * reclaimability and move them to the appropriate zone lru list.
> + */
> +void scan_mapping_noreclaim_pages(struct address_space *mapping)
> +{
> + pgoff_t next = 0;
> + pgoff_t end = (i_size_read(mapping->host) + PAGE_CACHE_SIZE - 1) >>
> + PAGE_CACHE_SHIFT;
> + struct zone *zone;
> + struct pagevec pvec;
> +
> + if (mapping->nrpages == 0)
> + return;
> +
> + pagevec_init(&pvec, 0);
> + while (next < end &&
> + pagevec_lookup(&pvec, mapping, next, PAGEVEC_SIZE)) {
> + int i;
> +
> + zone = NULL;
> +
> + for (i = 0; i < pagevec_count(&pvec); i++) {
> + struct page *page = pvec.pages[i];
> + pgoff_t page_index = page->index;
> + struct zone *pagezone = page_zone(page);
> +
> + if (page_index > next)
> + next = page_index;
> + next++;
> +
> + if (TestSetPageLocked(page)) {
> + /*
> + * OK, let's do it the hard way...
> + */
> + if (zone)
> + spin_unlock_irq(&zone->lru_lock);
> + zone = NULL;
> + lock_page(page);
> + }
> +
> + if (pagezone != zone) {
> + if (zone)
> + spin_unlock_irq(&zone->lru_lock);
> + zone = pagezone;
> + spin_lock_irq(&zone->lru_lock);
> + }
> +
> + if (PageLRU(page) && PageNoreclaim(page))
> + check_move_noreclaim_page(page, zone);
> +
> + unlock_page(page);
> +
> + }
> + if (zone)
> + spin_unlock_irq(&zone->lru_lock);
> + pagevec_release(&pvec);
> + }
> +
> +}
This function can spend fantastically large amounts of time under
spin_lock_irq().
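One mitigation (an untested sketch, nothing more) would be to give
other lockers and the scheduler a chance between pagevecs:

		if (zone)
			spin_unlock_irq(&zone->lru_lock);
		pagevec_release(&pvec);
		cond_resched();	/* bound the irq-off, lock-held stretches */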
^ permalink raw reply [flat|nested] 102+ messages in thread
* Re: [PATCH -mm 17/25] Mlocked Pages are non-reclaimable
2008-06-06 20:28 ` [PATCH -mm 17/25] Mlocked Pages " Rik van Riel, Rik van Riel
@ 2008-06-07 1:07 ` Andrew Morton
2008-06-07 5:38 ` KOSAKI Motohiro
` (2 more replies)
0 siblings, 3 replies; 102+ messages in thread
From: Andrew Morton @ 2008-06-07 1:07 UTC (permalink / raw)
To: Rik van Riel
Cc: linux-kernel, lee.schermerhorn, kosaki.motohiro, linux-mm,
eric.whitney, npiggin
On Fri, 06 Jun 2008 16:28:55 -0400
Rik van Riel <riel@redhat.com> wrote:
> Originally
> From: Nick Piggin <npiggin@suse.de>
>
> Against: 2.6.26-rc2-mm1
>
> This patch:
>
> 1) defines the [CONFIG_]NORECLAIM_MLOCK sub-option and the
> stub version of the mlock/noreclaim APIs when it's
> not configured. Depends on [CONFIG_]NORECLAIM_LRU.
Oh sob.
akpm:/usr/src/25> find . -name '*.[ch]' | xargs grep CONFIG_NORECLAIM | wc -l
51
why oh why? Must we really really do this to ourselves? Cheerfully
unchangeloggedly?
> 2) add yet another page flag--PG_mlocked--to indicate that
> the page is locked for efficient testing in vmscan and,
> optionally, fault path. This allows early culling of
> nonreclaimable pages, preventing them from getting to
> page_referenced()/try_to_unmap(). Also allows separate
> accounting of mlock'd pages, as Nick's original patch
> did.
>
> Note: Nick's original mlock patch used a PG_mlocked
> flag. I had removed this in favor of the PG_noreclaim
> flag + an mlock_count [new page struct member]. I
> restored the PG_mlocked flag to eliminate the new
> count field.
How many page flags are left? I keep on asking this and I end up
either a) not being told or b) forgetting. I thought that we had
a whopping big comment somewhere which describes how all these
flags are allocated but I can't immediately locate it.
> 3) add the mlock/noreclaim infrastructure to mm/mlock.c,
> with internal APIs in mm/internal.h. This is a rework
> of Nick's original patch to these files, taking into
> account that mlocked pages are now kept on noreclaim
> LRU list.
>
> 4) update vmscan.c:page_reclaimable() to check PageMlocked()
> and, if vma passed in, the vm_flags. Note that the vma
> will only be passed in for new pages in the fault path;
> and then only if the "cull nonreclaimable pages in fault
> path" patch is included.
>
> 5) add try_to_unlock() to rmap.c to walk a page's rmap and
> ClearPageMlocked() if no other vmas have it mlocked.
> Reuses as much of try_to_unmap() as possible. This
> effectively replaces the use of one of the lru list links
> as an mlock count. If this mechanism let's pages in mlocked
> vmas leak through w/o PG_mlocked set [I don't know that it
> does], we should catch them later in try_to_unmap(). One
> hopes this will be rare, as it will be relatively expensive.
>
> 6) Kosaki: added munlock page table walk to avoid using
> get_user_pages() for unlock. get_user_pages() is unreliable
> for some vma protections.
> Lee: modified to wait for in-flight migration to complete
> to close munlock/migration race that could strand pages.
None of which is available on 32-bit machines. That's pretty significant.
Do we do per-zone or global number-of-mlocked-pages accounting for
/proc/meminfo or /proc/vmstat, etc? Seems not..
> --- linux-2.6.26-rc2-mm1.orig/mm/Kconfig 2008-06-06 16:05:15.000000000 -0400
> +++ linux-2.6.26-rc2-mm1/mm/Kconfig 2008-06-06 16:06:28.000000000 -0400
> @@ -215,3 +215,17 @@ config NORECLAIM_LRU
> may be non-reclaimable because: they are locked into memory, they
> are anonymous pages for which no swap space exists, or they are anon
> pages that are expensive to unmap [long anon_vma "related vma" list.]
> +
> +config NORECLAIM_MLOCK
> + bool "Exclude mlock'ed pages from reclaim"
> + depends on NORECLAIM_LRU
> + help
> + Treats mlock'ed pages as no-reclaimable. Removing these pages from
> + the LRU [in]active lists avoids the overhead of attempting to reclaim
> + them. Pages marked non-reclaimable for this reason will become
> + reclaimable again when the last mlock is removed.
> + when no swap space exists. Removing these pages from the LRU lists
> + avoids the overhead of attempting to reclaim them. Pages marked
> + non-reclaimable for this reason will become reclaimable again when/if
> + sufficient swap space is added to the system.
The sentence "when no swap space exists." a) lacks capitalisation and
b) makes no sense.
The paramedics are caring for Aunt Tillie.
> Index: linux-2.6.26-rc2-mm1/mm/internal.h
> ===================================================================
> --- linux-2.6.26-rc2-mm1.orig/mm/internal.h 2008-06-06 16:05:15.000000000 -0400
> +++ linux-2.6.26-rc2-mm1/mm/internal.h 2008-06-06 16:06:28.000000000 -0400
> @@ -56,6 +56,17 @@ static inline unsigned long page_order(s
> return page_private(page);
> }
>
> +/*
> + * mlock all pages in this vma range. For mmap()/mremap()/...
> + */
> +extern int mlock_vma_pages_range(struct vm_area_struct *vma,
> + unsigned long start, unsigned long end);
> +
> +/*
> + * munlock all pages in vma. For munmap() and exit().
> + */
> +extern void munlock_vma_pages_all(struct vm_area_struct *vma);
I don't think it's desirable that interfaces be documented in two
places. The documentation which you have at the definition site is
more complete than this, and is at the place where people will expect
to find it.
> #ifdef CONFIG_NORECLAIM_LRU
> /*
> * noreclaim_migrate_page() called only from migrate_page_copy() to
> @@ -74,6 +85,65 @@ static inline void noreclaim_migrate_pag
> }
> #endif
>
> +#ifdef CONFIG_NORECLAIM_MLOCK
> +/*
> + * Called only in fault path via page_reclaimable() for a new page
> + * to determine if it's being mapped into a LOCKED vma.
> + * If so, mark page as mlocked.
> + */
> +static inline int is_mlocked_vma(struct vm_area_struct *vma, struct page *page)
> +{
> + VM_BUG_ON(PageLRU(page));
> +
> + if (likely((vma->vm_flags & (VM_LOCKED | VM_SPECIAL)) != VM_LOCKED))
> + return 0;
> +
> + SetPageMlocked(page);
> + return 1;
> +}
bool? If you like that sort of thing. It makes sense here...
> +/*
> + * must be called with vma's mmap_sem held for read, and page locked.
> + */
> +extern void mlock_vma_page(struct page *page);
> +
> +/*
> + * Clear the page's PageMlocked(). This can be useful in a situation where
> + * we want to unconditionally remove a page from the pagecache -- e.g.,
> + * on truncation or freeing.
> + *
> + * It is legal to call this function for any page, mlocked or not.
> + * If called for a page that is still mapped by mlocked vmas, all we do
> + * is revert to lazy LRU behaviour -- semantics are not broken.
> + */
> +extern void __clear_page_mlock(struct page *page);
> +static inline void clear_page_mlock(struct page *page)
> +{
> + if (unlikely(TestClearPageMlocked(page)))
> + __clear_page_mlock(page);
> +}
> +
> +/*
> + * mlock_migrate_page - called only from migrate_page_copy() to
> + * migrate the Mlocked page flag
> + */
So maybe just nuke it and open-code those two lines in mm/migrate.c?
> +static inline void mlock_migrate_page(struct page *newpage, struct page *page)
> +{
> + if (TestClearPageMlocked(page))
> + SetPageMlocked(newpage);
> +}
> +
> +
> +#else /* CONFIG_NORECLAIM_MLOCK */
> +static inline int is_mlocked_vma(struct vm_area_struct *v, struct page *p)
> +{
> + return 0;
> +}
> +static inline void clear_page_mlock(struct page *page) { }
> +static inline void mlock_vma_page(struct page *page) { }
> +static inline void mlock_migrate_page(struct page *new, struct page *old) { }
It would be neater if the arguments to the two versions of
mlock_migrate_page() had the same names.
> +#endif /* CONFIG_NORECLAIM_MLOCK */
>
> /*
> * FLATMEM and DISCONTIGMEM configurations use alloc_bootmem_node,
> Index: linux-2.6.26-rc2-mm1/mm/mlock.c
> ===================================================================
> --- linux-2.6.26-rc2-mm1.orig/mm/mlock.c 2008-05-15 11:20:15.000000000 -0400
> +++ linux-2.6.26-rc2-mm1/mm/mlock.c 2008-06-06 16:06:28.000000000 -0400
> @@ -8,10 +8,18 @@
> #include <linux/capability.h>
> #include <linux/mman.h>
> #include <linux/mm.h>
> +#include <linux/swap.h>
> +#include <linux/swapops.h>
> +#include <linux/pagemap.h>
> #include <linux/mempolicy.h>
> #include <linux/syscalls.h>
> #include <linux/sched.h>
> #include <linux/module.h>
> +#include <linux/rmap.h>
> +#include <linux/mmzone.h>
> +#include <linux/hugetlb.h>
> +
> +#include "internal.h"
>
> int can_do_mlock(void)
> {
> @@ -23,17 +31,354 @@ int can_do_mlock(void)
> }
> EXPORT_SYMBOL(can_do_mlock);
>
> +#ifdef CONFIG_NORECLAIM_MLOCK
> +/*
> + * Mlocked pages are marked with PageMlocked() flag for efficient testing
> + * in vmscan and, possibly, the fault path; and to support semi-accurate
> + * statistics.
> + *
> + * An mlocked page [PageMlocked(page)] is non-reclaimable. As such, it will
> + * be placed on the LRU "noreclaim" list, rather than the [in]active lists.
> + * The noreclaim list is an LRU sibling list to the [in]active lists.
> + * PageNoreclaim is set to indicate the non-reclaimable state.
> + *
> + * When lazy mlocking via vmscan, it is important to ensure that the
> + * vma's VM_LOCKED status is not concurrently being modified, otherwise we
> + * may have mlocked a page that is being munlocked. So lazy mlock must take
> + * the mmap_sem for read, and verify that the vma really is locked
> + * (see mm/rmap.c).
> + */
That's a useful comment.
Where would the reader (and indeed the reviewer) go to find out about
"lazy mlocking"? "grep -i 'lazy mlock' */*.c" doesn't work...
> +/*
> + * LRU accounting for clear_page_mlock()
> + */
> +void __clear_page_mlock(struct page *page)
> +{
> + VM_BUG_ON(!PageLocked(page)); /* for LRU islolate/putback */
typo
> +
> + if (!isolate_lru_page(page)) {
> + putback_lru_page(page);
> + } else {
> + /*
> + * Try hard not to leak this page ...
> + */
> + lru_add_drain_all();
> + if (!isolate_lru_page(page))
> + putback_lru_page(page);
> + }
> +}
When I review code I often come across stuff which I don't understand
(at least, which I don't understand sufficiently easily). So I'll ask
questions, and I do think the best way in which those questions should
be answered is by adding a code comment to fix the problem for ever.
When I look at the isolate_lru_page()-failed cases above I wonder what
just happened. We now have a page which is still on the LRU (how did
it get there in the first place?). Well no. I _think_ what happened is
that this function is using isolate_lru_page() and putback_lru_page()
to move a page off a now-inappropriate LRU list and to put it back onto
the proper one. But heck, maybe I just don't know what this function
is doing at all?
If I _am_ right, and if the isolate_lru_page() _did_ fail (and under
what circumstances?) then... what? We now have a page which is on an
inappropriate LRU? Why is this OK? Do we handle it elsewhere? How?
etc.
> +/*
> + * Mark page as mlocked if not already.
> + * If page on LRU, isolate and putback to move to noreclaim list.
> + */
> +void mlock_vma_page(struct page *page)
> +{
> + BUG_ON(!PageLocked(page));
> +
> + if (!TestSetPageMlocked(page) && !isolate_lru_page(page))
> + putback_lru_page(page);
> +}
extra tab.
> +/*
> + * called from munlock()/munmap() path with page supposedly on the LRU.
> + *
> + * Note: unlike mlock_vma_page(), we can't just clear the PageMlocked
> + * [in try_to_unlock()] and then attempt to isolate the page. We must
> + * isolate the page() to keep others from messing with its noreclaim
page()?
> + * and mlocked state while trying to unlock. However, we pre-clear the
"unlock"? (See exhasperated comment against try_to_unlock(), below)
> + * mlocked state anyway as we might lose the isolation race and we might
> + * not get another chance to clear PageMlocked. If we successfully
> + * isolate the page and try_to_unlock() detects other VM_LOCKED vmas
> + * mapping the page, it will restore the PageMlocked state, unless the page
> + * is mapped in a non-linear vma. So, we go ahead and SetPageMlocked(),
> + * perhaps redundantly.
> + * If we lose the isolation race, and the page is mapped by other VM_LOCKED
> + * vmas, we'll detect this in vmscan--via try_to_unlock() or try_to_unmap()
> + * either of which will restore the PageMlocked state by calling
> + * mlock_vma_page() above, if it can grab the vma's mmap sem.
> + */
OK, you officially lost me here. Two hours are up and I guess I need
to have another run at [patch 17/25].
I must say that having tried to absorb the above, my confidence in the
overall correctness of this code is not great. Hopefully wrong, but
gee.
> +static void munlock_vma_page(struct page *page)
> +{
> + BUG_ON(!PageLocked(page));
> +
> + if (TestClearPageMlocked(page) && !isolate_lru_page(page)) {
> + try_to_unlock(page);
> + putback_lru_page(page);
> + }
> +}
> +
> +/*
> + * mlock a range of pages in the vma.
> + *
> + * This takes care of making the pages present too.
> + *
> + * vma->vm_mm->mmap_sem must be held for write.
> + */
> +static int __mlock_vma_pages_range(struct vm_area_struct *vma,
> + unsigned long start, unsigned long end)
> +{
> + struct mm_struct *mm = vma->vm_mm;
> + unsigned long addr = start;
> + struct page *pages[16]; /* 16 gives a reasonable batch */
> + int write = !!(vma->vm_flags & VM_WRITE);
> + int nr_pages = (end - start) / PAGE_SIZE;
> + int ret;
> +
> + VM_BUG_ON(start & ~PAGE_MASK || end & ~PAGE_MASK);
> + VM_BUG_ON(start < vma->vm_start || end > vma->vm_end);
> + VM_BUG_ON(!rwsem_is_locked(&vma->vm_mm->mmap_sem));
> +
> + lru_add_drain_all(); /* push cached pages to LRU */
> +
> + while (nr_pages > 0) {
> + int i;
> +
> + cond_resched();
> +
> + /*
> + * get_user_pages makes pages present if we are
> + * setting mlock.
> + */
> + ret = get_user_pages(current, mm, addr,
> + min_t(int, nr_pages, ARRAY_SIZE(pages)),
> + write, 0, pages, NULL);
Doesn't mlock already do a make_pages_present(), or did that get
removed and moved to here?
> + /*
> + * This can happen for, e.g., VM_NONLINEAR regions before
> + * a page has been allocated and mapped at a given offset,
> + * or for addresses that map beyond end of a file.
> + * We'll mlock the pages if/when they get faulted in.
> + */
> + if (ret < 0)
> + break;
> + if (ret == 0) {
> + /*
> + * We know the vma is there, so the only time
> + * we cannot get a single page should be an
> + * error (ret < 0) case.
> + */
> + WARN_ON(1);
> + break;
> + }
> +
> + lru_add_drain(); /* push cached pages to LRU */
> +
> + for (i = 0; i < ret; i++) {
> + struct page *page = pages[i];
> +
> + /*
> + * page might be truncated or migrated out from under
> + * us. Check after acquiring page lock.
> + */
> + lock_page(page);
> + if (page->mapping)
> + mlock_vma_page(page);
> + unlock_page(page);
> + put_page(page); /* ref from get_user_pages() */
> +
> + /*
> + * here we assume that get_user_pages() has given us
> + * a list of virtually contiguous pages.
> + */
Good assumption, that ;)
> + addr += PAGE_SIZE; /* for next get_user_pages() */
Could be moved outside the loop I guess.
> + nr_pages--;
Ditto.
> + }
> + }
> +
> + lru_add_drain_all(); /* to update stats */
> +
> + return 0; /* count entire vma as locked_vm */
> +}
>
> ...
>
> +/*
> + * munlock a range of pages in the vma using standard page table walk.
> + *
> + * vma->vm_mm->mmap_sem must be held for write.
> + */
> +static void __munlock_vma_pages_range(struct vm_area_struct *vma,
> + unsigned long start, unsigned long end)
> +{
> + struct mm_struct *mm = vma->vm_mm;
> + struct munlock_page_walk mpw;
> +
> + VM_BUG_ON(start & ~PAGE_MASK || end & ~PAGE_MASK);
> + VM_BUG_ON(!rwsem_is_locked(&vma->vm_mm->mmap_sem));
> + VM_BUG_ON(start < vma->vm_start);
> + VM_BUG_ON(end > vma->vm_end);
> +
> + lru_add_drain_all(); /* push cached pages to LRU */
> + mpw.vma = vma;
> + (void)walk_page_range(mm, start, end, &munlock_page_walk, &mpw);
The (void) is un-kernely.
> + lru_add_drain_all(); /* to update stats */
> +
random newline.
> +}
> +
> +#else /* CONFIG_NORECLAIM_MLOCK */
>
> ...
>
> +int mlock_vma_pages_range(struct vm_area_struct *vma,
> + unsigned long start, unsigned long end)
> +{
> + int nr_pages = (end - start) / PAGE_SIZE;
> + BUG_ON(!(vma->vm_flags & VM_LOCKED));
> +
> + /*
> + * filter unlockable vmas
> + */
> + if (vma->vm_flags & (VM_IO | VM_PFNMAP))
> + goto no_mlock;
> +
> + if ((vma->vm_flags & (VM_DONTEXPAND | VM_RESERVED)) ||
> + is_vm_hugetlb_page(vma) ||
> + vma == get_gate_vma(current))
> + goto make_present;
> +
> + return __mlock_vma_pages_range(vma, start, end);
Invert the `if' expression, remove the goto?
> +make_present:
> + /*
> + * User mapped kernel pages or huge pages:
> + * make these pages present to populate the ptes, but
> + * fall thru' to reset VM_LOCKED--no need to unlock, and
> + * return nr_pages so these don't get counted against task's
> + * locked limit. huge pages are already counted against
> + * locked vm limit.
> + */
> + make_pages_present(start, end);
> +
> +no_mlock:
> + vma->vm_flags &= ~VM_LOCKED; /* and don't come back! */
> + return nr_pages; /* pages NOT mlocked */
> +}
> +
> +
>
> ...
>
> +#ifdef CONFIG_NORECLAIM_MLOCK
> +/**
> + * try_to_unlock - Check page's rmap for other vma's holding page locked.
> + * @page: the page to be unlocked. will be returned with PG_mlocked
> + * cleared if no vmas are VM_LOCKED.
I think kerneldoc will barf over the newline in @page's description.
> + * Return values are:
> + *
> + * SWAP_SUCCESS - no vma's holding page locked.
> + * SWAP_AGAIN - page mapped in mlocked vma -- couldn't acquire mmap sem
> + * SWAP_MLOCK - page is now mlocked.
> + */
> +int try_to_unlock(struct page *page)
> +{
> + VM_BUG_ON(!PageLocked(page) || PageLRU(page));
> +
> + if (PageAnon(page))
> + return try_to_unmap_anon(page, 1, 0);
> + else
> + return try_to_unmap_file(page, 1, 0);
> +}
> +#endif
OK, this function is clear as mud. My first reaction was "what's wrong
with just doing unlock_page()?". The term "unlock" is waaaaaaaaaaay
overloaded in this context and its use here was an awful decision.
Can we please come up with a more specific name and add some comments
which give the reader some chance of working out what it is that is
actually being unlocked?
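Something like this might do (just a sketch; pick whatever name
survives the bikeshed):

	/**
	 * try_to_munlock - check if any VM_LOCKED vma still maps the page
	 * @page: locked, off-LRU page whose PG_mlocked we may clear
	 *
	 * "Unlock" here refers to the mlock state, not the page lock.
	 */
	int try_to_munlock(struct page *page);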
>
> ...
>
> @@ -652,7 +652,6 @@ again: remove_next = 1 + (end > next->
> * If the vma has a ->close operation then the driver probably needs to release
> * per-vma resources, so we don't attempt to merge those.
> */
> -#define VM_SPECIAL (VM_IO | VM_DONTEXPAND | VM_RESERVED | VM_PFNMAP)
>
> static inline int is_mergeable_vma(struct vm_area_struct *vma,
> struct file *file, unsigned long vm_flags)
hm, so the old definition of VM_SPECIAL managed to wedge itself between
is_mergeable_vma() and is_mergeable_vma()'s comment. Had me confused
there.
pls remove the blank line between the comment and the start of
is_mergeable_vma() so people don't go sticking more things in there.
^ permalink raw reply [flat|nested] 102+ messages in thread
* Re: [PATCH -mm 09/25] fix pagecache reclaim referenced bit check
2008-06-07 1:04 ` Andrew Morton
@ 2008-06-07 1:08 ` Rik van Riel
2008-06-08 10:02 ` Peter Zijlstra
0 siblings, 1 reply; 102+ messages in thread
From: Rik van Riel @ 2008-06-07 1:08 UTC (permalink / raw)
To: Andrew Morton; +Cc: linux-kernel, lee.schermerhorn, kosaki.motohiro, mbligh
On Fri, 6 Jun 2008 18:04:51 -0700
Andrew Morton <akpm@linux-foundation.org> wrote:
> On Fri, 06 Jun 2008 16:28:47 -0400
> Rik van Riel <riel@redhat.com> wrote:
>
> > From: Rik van Riel <riel@redhat.com>
> >
> > The -mm tree contains the patch
> > vmscan-give-referenced-active-and-unmapped-pages-a-second-trip-around-the-lru.patch
> > which gives referenced pagecache pages another trip around
> > the active list. This seems to help keep frequently accessed
> > pagecache pages in memory.
> >
> > However, it means that pagecache pages that get moved to the
> > inactive list do not have their referenced bit set, and a
> > reference to the page will not get it moved back to the active
> > list
>
> Should that be "the next reference"? Because two references _will_ cause
> the page to be activated?
Indeed, next reference would be a more accurate description.
> >. This patch sets the referenced bit on pagecache pages
> > that get deactivated, so the next access to the page will promote
> > it back to the active list.
> >
> > ---
> > mm/vmscan.c | 4 ++++
> > 1 file changed, 4 insertions(+)
> >
> > Index: linux-2.6.26-rc2-mm1/mm/vmscan.c
> > ===================================================================
> > --- linux-2.6.26-rc2-mm1.orig/mm/vmscan.c 2008-05-28 12:11:51.000000000 -0400
> > +++ linux-2.6.26-rc2-mm1/mm/vmscan.c 2008-05-28 12:14:34.000000000 -0400
> > @@ -1062,8 +1062,13 @@ static void shrink_active_list(unsigned
> > list_add(&page->lru, &l_inactive);
> > pgmoved++;
> > }
> > - } else
> > + } else {
> > list_add(&page->lru, &l_inactive);
> > + if (file && !page_mapped(page))
> > + /* Bypass use-once, make the next access count.
> > + * See mark_page_accessed. */
> > + SetPageReferenced(page);
> > + }
> > }
>
> Will this change also cause these pages to get a second trip around the
> inactive list? Or do we at the end of the patch series end up
> reclaiming pagecache regardless of their PageReferenced()? If so, it
> seems that we're making the pagecache pages a _lot_ stickier with this
> change and my one - an additional trip around the active list and an
> additional one around the inactive list.
At the end of the inactive list, we reclaim it regardless of
PageReferenced(). As for making the pagecache pages stickier,
we can balance that out by tweaking swappiness now that we
have page cache and swap/ram backed pages living on separate
sets of LRU lists.
The balancing becomes a lot easier than before.
> Changelog should spell all this out, I guess.
It's not the Changelog I'm worried about as much as the
people who are trying to read the code afterwards.
--
All rights reversed.
^ permalink raw reply [flat|nested] 102+ messages in thread
* Re: [PATCH -mm 06/25] split LRU lists into anon & file sets
2008-06-07 1:04 ` Andrew Morton
@ 2008-06-07 1:22 ` Rik van Riel
2008-06-07 1:52 ` Andrew Morton
0 siblings, 1 reply; 102+ messages in thread
From: Rik van Riel @ 2008-06-07 1:22 UTC (permalink / raw)
To: Andrew Morton; +Cc: linux-kernel, lee.schermerhorn, kosaki.motohiro
On Fri, 6 Jun 2008 18:04:39 -0700
Andrew Morton <akpm@linux-foundation.org> wrote:
> The changelogs are a bit scrappy and could do with some care
>
> - stale assertions such as the above
>
> - "From:<random number of spaces>Lee" in various places
>
> - Some have the --- separator and others don't (this trips me up).
>
> - Stuff like "Against: 2.6.26-rc2-mm1" right in the middle of the
> changelog for me to hunt down and squish.
I got rid of all of those - until I merged in Lee's latest :(
> - "TODO: DEBUGGING ONLY: NOT FOR UPSTREAM MERGE" <<-- what's up with this?
I'll remove this one.
> - random Capitalisation in Various patch Titles.
>
> - "V2 -> V3:" logging in the main changelog - not relevant in the
> final commit hence more for me to edit away.
I got rid of all of those - again, before merging Lee's latest.
Then I got rid of most of them, but apparently missed a few...
> - Strange inventions like "Originally Signed-off-by:"
>
> - please prefer to prefix the patch titles with a suitable subsystem
> identifier. In this case "vmscan: " would suit.
Will do.
> - Other stuff I forgot. A general recheck and cleanup would be nice.
>
> I've actually fixed all of the above but I don't yet know whether
> I'll be checking all this in.
>
> > Index: linux-2.6.26-rc2-mm1/fs/proc/proc_misc.c
> > ===================================================================
> > --- linux-2.6.26-rc2-mm1.orig/fs/proc/proc_misc.c 2008-05-23 14:21:21.000000000 -0400
> > +++ linux-2.6.26-rc2-mm1/fs/proc/proc_misc.c 2008-05-23 14:21:34.000000000 -0400
> > @@ -132,6 +132,10 @@ static int meminfo_read_proc(char *page,
> > unsigned long allowed;
> > struct vmalloc_info vmi;
> > long cached;
> > + unsigned long inactive_anon;
> > + unsigned long active_anon;
> > + unsigned long inactive_file;
> > + unsigned long active_file;
>
> Shouldn't this be an array[NR_LRU_LISTS]?
>
> > /*
> > * display in kilobytes.
> > @@ -150,48 +154,61 @@ static int meminfo_read_proc(char *page,
> >
> > get_vmalloc_info(&vmi);
> >
> > + inactive_anon = global_page_state(NR_INACTIVE_ANON);
> > + active_anon = global_page_state(NR_ACTIVE_ANON);
> > + inactive_file = global_page_state(NR_INACTIVE_FILE);
> > + active_file = global_page_state(NR_ACTIVE_FILE);
>
> then this can perhaps become a loop.
Sure, I can do that.
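Probably something like this (untested sketch, using the NR_* base
offset however it ends up being named):

	unsigned long lru_pages[NR_LRU_LISTS];
	enum lru_list l;

	for (l = 0; l < NR_LRU_LISTS; l++)
		lru_pages[l] = global_page_state(NR_LRU_BASE + l);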
> > /*
> > * Tagged format, for easy grepping and expansion.
> > */
> > len = sprintf(page,
> > - "MemTotal: %8lu kB\n"
> > - "MemFree: %8lu kB\n"
> > - "Buffers: %8lu kB\n"
> > - "Cached: %8lu kB\n"
> > - "SwapCached: %8lu kB\n"
> > - "Active: %8lu kB\n"
> > - "Inactive: %8lu kB\n"
> > + "MemTotal: %8lu kB\n"
> > + "MemFree: %8lu kB\n"
> > + "Buffers: %8lu kB\n"
> > + "Cached: %8lu kB\n"
> > + "SwapCached: %8lu kB\n"
> > + "Active: %8lu kB\n"
> > + "Inactive: %8lu kB\n"
> > + "Active(anon): %8lu kB\n"
> > + "Inactive(anon): %8lu kB\n"
> > + "Active(file): %8lu kB\n"
> > + "Inactive(file): %8lu kB\n"
> > #ifdef CONFIG_HIGHMEM
> > - "HighTotal: %8lu kB\n"
> > - "HighFree: %8lu kB\n"
> > - "LowTotal: %8lu kB\n"
> > - "LowFree: %8lu kB\n"
> > -#endif
> > - "SwapTotal: %8lu kB\n"
> > - "SwapFree: %8lu kB\n"
> > - "Dirty: %8lu kB\n"
> > - "Writeback: %8lu kB\n"
> > - "AnonPages: %8lu kB\n"
> > - "Mapped: %8lu kB\n"
> > - "Slab: %8lu kB\n"
> > - "SReclaimable: %8lu kB\n"
> > - "SUnreclaim: %8lu kB\n"
> > - "PageTables: %8lu kB\n"
> > - "NFS_Unstable: %8lu kB\n"
> > - "Bounce: %8lu kB\n"
> > - "WritebackTmp: %8lu kB\n"
> > - "CommitLimit: %8lu kB\n"
> > - "Committed_AS: %8lu kB\n"
> > - "VmallocTotal: %8lu kB\n"
> > - "VmallocUsed: %8lu kB\n"
> > - "VmallocChunk: %8lu kB\n",
> > + "HighTotal: %8lu kB\n"
> > + "HighFree: %8lu kB\n"
> > + "LowTotal: %8lu kB\n"
> > + "LowFree: %8lu kB\n"
> > +#endif
> > + "SwapTotal: %8lu kB\n"
> > + "SwapFree: %8lu kB\n"
> > + "Dirty: %8lu kB\n"
> > + "Writeback: %8lu kB\n"
> > + "AnonPages: %8lu kB\n"
> > + "Mapped: %8lu kB\n"
> > + "Slab: %8lu kB\n"
> > + "SReclaimable: %8lu kB\n"
> > + "SUnreclaim: %8lu kB\n"
> > + "PageTables: %8lu kB\n"
> > + "NFS_Unstable: %8lu kB\n"
> > + "Bounce: %8lu kB\n"
> > + "WritebackTmp: %8lu kB\n"
> > + "CommitLimit: %8lu kB\n"
> > + "Committed_AS: %8lu kB\n"
> > + "VmallocTotal: %8lu kB\n"
> > + "VmallocUsed: %8lu kB\n"
> > + "VmallocChunk: %8lu kB\n",
> > K(i.totalram),
> > K(i.freeram),
> > K(i.bufferram),
> > K(cached),
> > K(total_swapcache_pages),
> > - K(global_page_state(NR_ACTIVE)),
> > - K(global_page_state(NR_INACTIVE)),
> > + K(active_anon + active_file),
> > + K(inactive_anon + inactive_file),
> > + K(active_anon),
> > + K(inactive_anon),
> > + K(active_file),
> > + K(inactive_file),
>
> Do we really want to put all this stuff into /proc/meminfo?
>
> Would it be better to aggregate it in some manner for meminfo and show
> the fine-grained info in /proc/vmstat?
Good question. I believe we'll want the memory usage statistics
in /proc/meminfo, but more temporary internal stuff like
"writebacktmp" and "nfs_unstable" might not belong there.
--
All rights reversed.
^ permalink raw reply [flat|nested] 102+ messages in thread
* Re: [PATCH -mm 06/25] split LRU lists into anon & file sets
2008-06-07 1:22 ` Rik van Riel
@ 2008-06-07 1:52 ` Andrew Morton
0 siblings, 0 replies; 102+ messages in thread
From: Andrew Morton @ 2008-06-07 1:52 UTC (permalink / raw)
To: Rik van Riel; +Cc: linux-kernel, lee.schermerhorn, kosaki.motohiro
On Fri, 6 Jun 2008 21:22:31 -0400 Rik van Riel <riel@redhat.com> wrote:
> On Fri, 6 Jun 2008 18:04:39 -0700
> Andrew Morton <akpm@linux-foundation.org> wrote:
>
> > The changelogs are a bit scrappy and could do with some care
> >
> > - stale assertions such as the above
> >
> > - "From:<random number of spaces>Lee" in various places
> >
> > - Some have the --- separator and others don't (this trips me up).
> >
> > - Stuff like "Against: 2.6.26-rc2-mm1" right in the middle of the
> > changelog for me to hunt down and squish.
>
> I got rid of all of those - until I merged in Lee's latest :(
>
> > - "TODO: DEBUGGING ONLY: NOT FOR UPSTREAM MERGE" <<-- what's up with this?
>
> I'll remove this one.
>
> > - random Capitalisation in Various patch Titles.
> >
> > - "V2 -> V3:" logging in the main changelog - not relevant in the
> > final commit hence more for me to edit away.
>
> I got rid of all of those - again, before merging Lee's latest.
> Then I got rid of most of them, but apparently missed a few...
People often maintain that sort of info below the ^--- line.
>
...
>
> > Do we really want to put all this stuff into /proc/meminfo?
> >
> > Would it be better to aggregate it in some manner for meminfo and show
> > the fine-grained info in /proc/vmstat?
>
> Good question. I believe we'll want the memory usage statistics
> in /proc/meminfo, but more temporary internal stuff like
> "writebacktmp" and "nfs_unstable" might not belong there.
Well, I was just asking.. I haven't actually sat down and worked out
what the proposed new meminfo will look like yet.
(It's documented in Documentation/filesystems/proc.txt btw. Or used
to be ;))
^ permalink raw reply [flat|nested] 102+ messages in thread
* Re: [PATCH -mm 16/25] SHM_LOCKED pages are non-reclaimable
2008-06-07 1:05 ` Andrew Morton
@ 2008-06-07 5:21 ` KOSAKI Motohiro
2008-06-10 21:03 ` Rik van Riel
1 sibling, 0 replies; 102+ messages in thread
From: KOSAKI Motohiro @ 2008-06-07 5:21 UTC (permalink / raw)
To: Andrew Morton
Cc: kosaki.motohiro, Rik van Riel, linux-kernel, lee.schermerhorn
Hi
> > +static void check_move_noreclaim_page(struct page *page, struct zone *zone)
> > +{
> > +
> > + ClearPageNoreclaim(page); /* for page_reclaimable() */
>
> Confused. Didn't we just lose track of our NR_NORECLAIM accounting?
I think this code is right; the patch tracks the accounting in the
code below. But would adding a comment make it clearer?
> > + if (page_reclaimable(page, NULL)) {
> > + enum lru_list l = LRU_INACTIVE_ANON + page_file_cache(page);
> > + __dec_zone_state(zone, NR_NORECLAIM);
> > + list_move(&page->lru, &zone->list[l]);
> > + __inc_zone_state(zone, NR_INACTIVE_ANON + l);
> > + } else {
> > + /*
> > + * rotate noreclaim list
> > + */
> > + SetPageNoreclaim(page);
> > + list_move(&page->lru, &zone->list[LRU_NORECLAIM]);
> > + }
> > +}
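Maybe something like this above the clear (just a sketch):

	/*
	 * No accounting is lost here: the page stays counted in
	 * NR_NORECLAIM until we know where it goes.  If it proves
	 * reclaimable, the __dec_zone_state(zone, NR_NORECLAIM)
	 * below moves the count to the new LRU; otherwise the page
	 * is rotated back onto the noreclaim list and the count
	 * still stands.
	 */
	ClearPageNoreclaim(page);	/* for page_reclaimable() */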
^ permalink raw reply [flat|nested] 102+ messages in thread
* Re: [PATCH -mm 17/25] Mlocked Pages are non-reclaimable
2008-06-07 1:07 ` Andrew Morton
@ 2008-06-07 5:38 ` KOSAKI Motohiro
2008-06-10 3:31 ` Nick Piggin
2008-06-11 1:00 ` Rik van Riel
2 siblings, 0 replies; 102+ messages in thread
From: KOSAKI Motohiro @ 2008-06-07 5:38 UTC (permalink / raw)
To: Andrew Morton
Cc: kosaki.motohiro, Rik van Riel, linux-kernel, lee.schermerhorn,
linux-mm, eric.whitney, npiggin
Hi
> > + if (!isolate_lru_page(page)) {
> > + putback_lru_page(page);
> > + } else {
> > + /*
> > + * Try hard not to leak this page ...
> > + */
> > + lru_add_drain_all();
> > + if (!isolate_lru_page(page))
> > + putback_lru_page(page);
> > + }
> > +}
>
> When I review code I often come across stuff which I don't understand
> (at least, which I don't understand sufficiently easily). So I'll ask
> questions, and I do think the best way in which those questions should
> be answered is by adding a code comment to fix the problem for ever.
>
> When I look at the isolate_lru_page()-failed cases above I wonder what
> just happened. We now have a page which is still on the LRU (how did
> it get there in the first place?). Well no. I _think_ what happened is
> that this function is using isolate_lru_page() and putback_lru_page()
> to move a page off a now-inappropriate LRU list and to put it back onto
> the proper one. But heck, maybe I just don't know what this function
> is doing at all?
>
> If I _am_ right, and if the isolate_lru_page() _did_ fail (and under
> what circumstances?) then... what? We now have a page which is on an
> inappropriate LRU? Why is this OK? Do we handle it elsewhere? How?
I think this code is OK, but the "Try hard not to leak this page ..."
comment is misleading and not true.
An isolate_lru_page() failure means the page has already been isolated
by someone else; later, that someone puts the page back onto the proper
LRU via putback_lru_page(). (putback_lru_page() always puts pages back
on the right LRU.) No leak happens.
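So a more accurate comment might read (sketch):

	if (!isolate_lru_page(page)) {
		putback_lru_page(page);	/* re-sort onto the right LRU */
	} else {
		/*
		 * The page was already isolated by someone else, and
		 * their putback_lru_page() will place it on the LRU
		 * appropriate to its new state, so nothing is leaked.
		 * Drain the pagevecs and retry once, in case the page
		 * was merely sitting in a per-cpu pagevec.
		 */
		lru_add_drain_all();
		if (!isolate_lru_page(page))
			putback_lru_page(page);
	}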
> > +static int __mlock_vma_pages_range(struct vm_area_struct *vma,
> > + unsigned long start, unsigned long end)
> > +{
> > + struct mm_struct *mm = vma->vm_mm;
> > + unsigned long addr = start;
> > + struct page *pages[16]; /* 16 gives a reasonable batch */
> > + int write = !!(vma->vm_flags & VM_WRITE);
> > + int nr_pages = (end - start) / PAGE_SIZE;
> > + int ret;
> > +
> > + VM_BUG_ON(start & ~PAGE_MASK || end & ~PAGE_MASK);
> > + VM_BUG_ON(start < vma->vm_start || end > vma->vm_end);
> > + VM_BUG_ON(!rwsem_is_locked(&vma->vm_mm->mmap_sem));
> > +
> > + lru_add_drain_all(); /* push cached pages to LRU */
> > +
> > + while (nr_pages > 0) {
> > + int i;
> > +
> > + cond_resched();
> > +
> > + /*
> > + * get_user_pages makes pages present if we are
> > + * setting mlock.
> > + */
> > + ret = get_user_pages(current, mm, addr,
> > + min_t(int, nr_pages, ARRAY_SIZE(pages)),
> > + write, 0, pages, NULL);
>
> Doesn't mlock already do a make_pages_present(), or did that get
> removed and moved to here?
I think it works like this:
vanilla: mlock calls make_pages_present().
this series: mlock calls __mlock_vma_pages_range() instead.
Thus, this code is right.
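In call-path form (illustrative only):

	vanilla:      sys_mlock() -> mlock_fixup() -> make_pages_present()
	this series:  sys_mlock() -> mlock_fixup() -> mlock_vma_pages_range()
	                  -> __mlock_vma_pages_range() -> get_user_pages()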
^ permalink raw reply [flat|nested] 102+ messages in thread
* Re: [PATCH -mm 02/25] Use an indexed array for LRU variables
2008-06-07 1:04 ` Andrew Morton
@ 2008-06-07 5:43 ` KOSAKI Motohiro
2008-06-07 14:47 ` Rik van Riel
2008-06-07 18:42 ` Rik van Riel
1 sibling, 1 reply; 102+ messages in thread
From: KOSAKI Motohiro @ 2008-06-07 5:43 UTC (permalink / raw)
To: Andrew Morton
Cc: kosaki.motohiro, Rik van Riel, linux-kernel, lee.schermerhorn,
clameter
> > Index: linux-2.6.26-rc2-mm1/include/linux/mm_inline.h
> > ===================================================================
> > --- linux-2.6.26-rc2-mm1.orig/include/linux/mm_inline.h 2008-05-23 14:21:21.000000000 -0400
> > +++ linux-2.6.26-rc2-mm1/include/linux/mm_inline.h 2008-05-23 14:21:33.000000000 -0400
> > @@ -1,40 +1,51 @@
> > static inline void
> > +add_page_to_lru_list(struct zone *zone, struct page *page, enum lru_list l)
> > +{
> > + list_add(&page->lru, &zone->list[l]);
> > + __inc_zone_state(zone, NR_INACTIVE + l);
>
> ^ that's a bug, isn't it?
this is definitely a bug.
> oh, no it isn't.
> Can we rename NR_INACTIVE? Maybe VMSCAN_BASE or something?
As far as I remember, an older version used LRU_INACTIVE.
^ permalink raw reply [flat|nested] 102+ messages in thread
* Re: [PATCH -mm 07/25] second chance replacement for anonymous pages
2008-06-07 1:04 ` Andrew Morton
@ 2008-06-07 6:03 ` KOSAKI Motohiro
2008-06-07 6:43 ` Andrew Morton
2008-06-08 15:04 ` Rik van Riel
1 sibling, 1 reply; 102+ messages in thread
From: KOSAKI Motohiro @ 2008-06-07 6:03 UTC (permalink / raw)
To: Andrew Morton
Cc: kosaki.motohiro, Rik van Riel, linux-kernel, lee.schermerhorn
> > + * pages. Ie. a ratio of 3 means 3:1 or 25% of the anonymous pages are
> > + * on the inactive list.
> > + *
> > + * total return max
> > + * memory value inactive anon
>
> This function doesn't "return" a "value".
Sorry, my bad.
In my country, "return value" is the usual term.
Would s/value/ratio/ be better?
>
> > + * -------------------------------------
> > + * 10MB 1 5MB
> > + * 100MB 1 50MB
> > + * 1GB 3 250MB
> > + * 10GB 10 0.9GB
> > + * 100GB 31 3GB
> > + * 1TB 101 10GB
> > + * 10TB 320 32GB
> > + */
^ permalink raw reply [flat|nested] 102+ messages in thread
* Re: [PATCH -mm 07/25] second chance replacement for anonymous pages
2008-06-07 6:03 ` KOSAKI Motohiro
@ 2008-06-07 6:43 ` Andrew Morton
0 siblings, 0 replies; 102+ messages in thread
From: Andrew Morton @ 2008-06-07 6:43 UTC (permalink / raw)
To: KOSAKI Motohiro; +Cc: Rik van Riel, linux-kernel, lee.schermerhorn
On Sat, 07 Jun 2008 15:03:42 +0900 KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> wrote:
> > > + * pages. Ie. a ratio of 3 means 3:1 or 25% of the anonymous pages are
> > > + * on the inactive list.
> > > + *
> > > + * total return max
> > > + * memory value inactive anon
> >
> > This function doesn't "return" a "value".
>
> Sorry, my bad.
>
> In my country, "return value" is the usual term.
Not for a function which returns void!
> Would s/value/ratio/ be better?
Well it's referring to the value of zone->inactive_ratio. So
replacing "return value" with "zone->inactive_ratio" would be ideal :)
> >
> > > + * -------------------------------------
> > > + * 10MB 1 5MB
> > > + * 100MB 1 50MB
> > > + * 1GB 3 250MB
> > > + * 10GB 10 0.9GB
> > > + * 100GB 31 3GB
> > > + * 1TB 101 10GB
> > > + * 10TB 320 32GB
> > > + */
>
^ permalink raw reply [flat|nested] 102+ messages in thread
* Re: [PATCH -mm 02/25] Use an indexed array for LRU variables
2008-06-07 5:43 ` KOSAKI Motohiro
@ 2008-06-07 14:47 ` Rik van Riel
2008-06-08 11:22 ` KOSAKI Motohiro
0 siblings, 1 reply; 102+ messages in thread
From: Rik van Riel @ 2008-06-07 14:47 UTC (permalink / raw)
To: KOSAKI Motohiro
Cc: Andrew Morton, kosaki.motohiro, linux-kernel, lee.schermerhorn,
clameter
On Sat, 07 Jun 2008 14:43:50 +0900
KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> wrote:
> > > Index: linux-2.6.26-rc2-mm1/include/linux/mm_inline.h
> > > ===================================================================
> > > --- linux-2.6.26-rc2-mm1.orig/include/linux/mm_inline.h 2008-05-23 14:21:21.000000000 -0400
> > > +++ linux-2.6.26-rc2-mm1/include/linux/mm_inline.h 2008-05-23 14:21:33.000000000 -0400
> > > @@ -1,40 +1,51 @@
> > > static inline void
> > > +add_page_to_lru_list(struct zone *zone, struct page *page, enum lru_list l)
> > > +{
> > > + list_add(&page->lru, &zone->list[l]);
> > > + __inc_zone_state(zone, NR_INACTIVE + l);
> >
> > ^ that's a bug, isn't it?
>
> this is definitely a bug.
I believe this is correct, actually. I will rename/alias it to
VMSCAN_BASE or something along those lines.
> > oh, no it isn't.
> > Can we rename NR_INACTIVE? Maybe VMSCAN_BASE or something?
>
> As far as I remember, an older version used LRU_INACTIVE.
LRU_* is used to index LRU arrays.
NR_* is used as an offset into the zone state.
--
All rights reversed.
^ permalink raw reply [flat|nested] 102+ messages in thread
* Re: [PATCH -mm 02/25] Use an indexed array for LRU variables
2008-06-07 1:04 ` Andrew Morton
2008-06-07 5:43 ` KOSAKI Motohiro
@ 2008-06-07 18:42 ` Rik van Riel
1 sibling, 0 replies; 102+ messages in thread
From: Rik van Riel @ 2008-06-07 18:42 UTC (permalink / raw)
To: Andrew Morton; +Cc: linux-kernel, lee.schermerhorn, kosaki.motohiro, clameter
On Fri, 6 Jun 2008 18:04:26 -0700
Andrew Morton <akpm@linux-foundation.org> wrote:
> I would have spat the dummy at pointless churn and code uglification
but I see that we end up with five LRU lists so ho hum.
> > /* Fields commonly accessed by the page reclaim scanner */
> > spinlock_t lru_lock;
> > - struct list_head active_list;
> > - struct list_head inactive_list;
> > - unsigned long nr_scan_active;
> > - unsigned long nr_scan_inactive;
> > + struct list_head list[NR_LRU_LISTS];
> > + unsigned long nr_scan[NR_LRU_LISTS];
>
> It'd be a little cache-friendlier to lay this out as
>
> struct {
> struct list_head list;
> unsigned long nr_scan;
> } lru_stuff[NR_LRU_LISTS];
OK, done. As zone.lru though for brevity.
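So the zone ends up with, roughly:

	/* Fields commonly accessed by the page reclaim scanner */
	spinlock_t		lru_lock;
	struct {
		struct list_head list;
		unsigned long nr_scan;
	} lru[NR_LRU_LISTS];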
> > +++ linux-2.6.26-rc2-mm1/include/linux/mm_inline.h 2008-05-23 14:21:33.000000000 -0400
> > @@ -1,40 +1,51 @@
> > static inline void
> > +add_page_to_lru_list(struct zone *zone, struct page *page, enum lru_list l)
> > +{
> > + list_add(&page->lru, &zone->list[l]);
> > + __inc_zone_state(zone, NR_INACTIVE + l);
>
> ^ that's a bug, isn't it?
>
> oh, no it isn't.
>
> Can we rename NR_INACTIVE? Maybe VMSCAN_BASE or something?
I turned it into NR_LRU_BASE, since everything in zone_stat_item
starts with NR_
enum zone_stat_item {
/* First 128 byte cacheline (assuming 64 bit words) */
NR_FREE_PAGES,
NR_LRU_BASE,
NR_INACTIVE = NR_LRU_BASE, /* must match order of LRU_[IN]ACTIVE */
NR_ACTIVE, /* " " " " " */
> > - if (PageActive(page))
> > - add_page_to_active_list(zone, page);
> > - else
> > - add_page_to_inactive_list(zone, page);
> > + add_page_to_lru_list(zone, page, PageActive(page));
>
> urgh. the third arg to add_page_to_lru_list() is an `enum lru_list'
> and here we are secretly coercing PageActive()'s boolean return into a
> just-happens-to-be-right `enum lru_list'.
>
> That's pretty nasty?
I've pulled page_lru() back from patch 03/25 into 02/25 and
will use that.
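At this point in the series it only has to pick between the two
lists, i.e. something like (sketch):

	static inline enum lru_list page_lru(struct page *page)
	{
		return PageActive(page) ? LRU_ACTIVE : LRU_INACTIVE;
	}

so the call site reads add_page_to_lru_list(zone, page, page_lru(page))
without the boolean coercion.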
--
All rights reversed.
^ permalink raw reply [flat|nested] 102+ messages in thread
* Re: [PATCH -mm 04/25] free swap space on swap-in/activation
2008-06-07 1:04 ` Andrew Morton
@ 2008-06-07 19:56 ` Rik van Riel
2008-06-09 2:14 ` MinChan Kim
1 sibling, 0 replies; 102+ messages in thread
From: Rik van Riel @ 2008-06-07 19:56 UTC (permalink / raw)
To: Andrew Morton
Cc: linux-kernel, lee.schermerhorn, kosaki.motohiro, minchan.kim,
Hugh Dickins
On Fri, 6 Jun 2008 18:04:30 -0700
Andrew Morton <akpm@linux-foundation.org> wrote:
> On Fri, 06 Jun 2008 16:28:42 -0400
> Rik van Riel <riel@redhat.com> wrote:
> > From: Rik van Riel <riel@redhat.com>
> >
> > Free swap cache entries when swapping in pages if vm_swap_full()
> > [swap space > 1/2 used]. Uses new pagevec to reduce pressure
> > on locks.
> >
> > Signed-off-by: Rik van Riel <riel@redhat.com>
> > Signed-off-by: Lee Schermerhorn <Lee.Schermerhorn@hp.com>
> > Signed-off-by: MinChan Kim <minchan.kim@gmail.com>
> >
> > ---
> > include/linux/pagevec.h | 1 +
> > include/linux/swap.h | 6 ++++++
> > mm/swap.c | 18 ++++++++++++++++++
> > mm/swapfile.c | 25 ++++++++++++++++++++++---
> > mm/vmscan.c | 7 +++++++
> > 5 files changed, 54 insertions(+), 3 deletions(-)
> >
> > Index: linux-2.6.26-rc2-mm1/mm/vmscan.c
> > ===================================================================
> > --- linux-2.6.26-rc2-mm1.orig/mm/vmscan.c 2008-05-23 14:21:33.000000000 -0400
> > +++ linux-2.6.26-rc2-mm1/mm/vmscan.c 2008-05-23 14:21:33.000000000 -0400
> > @@ -619,6 +619,9 @@ free_it:
> > continue;
> >
> > activate_locked:
> > + /* Not a candidate for swapping, so reclaim swap space. */
> > + if (PageSwapCache(page) && vm_swap_full())
>
> The patch puts rather a lot of pressure onto vm_swap_full(). We might
> want to look into optimising that.
>
> - Is the 50% thing optimum? Could go higher and perhaps should be
> based on amount-of-memory.
>
> - Can precalculate the fraction rather than doing it inline all the time.
I do not know if 50% is optimum. It is just what the upstream kernel
has had since 2.4.10 or so. Before that it used to be 75%. This same
percentage is used to free swap space at swapin time.
> - Can make total_swap_pages __read_mostly and have a think about
> nr_swap_pages too.
I wonder if we wouldn't be best off simply placing the two on their
own cache line. After all, they are often handled together.
I believe that should be a separate patch though, since I am not
changing that situation from what is already upstream and this
patch series contains more than enough stuff already.
> - Can completely optimise the thing away if !CONFIG_SWAP.
>
>
> Has all this code been tested with CONFIG_SWAP=n?
With CONFIG_SWAP=n the macro PageSwapCache(page) will always be
declared false due to the declarations in page-flags.h. That
means that vm_swap_full() will never be evaluated, and the compiler
should optimize the whole test away.
#ifdef CONFIG_SWAP
PAGEFLAG(SwapCache, swapcache)
#else
PAGEFLAG_FALSE(SwapCache)
#endif
> > +void pagevec_swap_free(struct pagevec *pvec)
> What's going on here.
>
> Normally we'll bump a page's refcount to account for its presence in a
> pagevec. This code doesn't do that.
>
> Is it safe? If so, how come?
It is safe because the callers already hold an extra reference to
each page. I have added a full kerneldoc comment to this function
to explain that.
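Roughly along these lines (a sketch, not the final wording):

	/**
	 * pagevec_swap_free - try to free swap space from the pages in a pagevec
	 * @pvec: pagevec with swapcache pages to free the swap space of
	 *
	 * The callers must hold their own reference to each page, so
	 * presence in the pagevec does not pin the pages; the pagevec
	 * is only used to batch the lock traffic.
	 */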
--
All rights reversed.
^ permalink raw reply [flat|nested] 102+ messages in thread
* Re: [PATCH -mm 05/25] define page_file_cache() function
2008-06-07 1:04 ` Andrew Morton
@ 2008-06-07 23:38 ` Rik van Riel
0 siblings, 0 replies; 102+ messages in thread
From: Rik van Riel @ 2008-06-07 23:38 UTC (permalink / raw)
To: Andrew Morton
Cc: linux-kernel, lee.schermerhorn, kosaki.motohiro, minchan.kim
On Fri, 6 Jun 2008 18:04:34 -0700
Andrew Morton <akpm@linux-foundation.org> wrote:
> > Unfortunately this needs to use a page flag, since the
> > PG_swapbacked state needs to be preserved all the way
> > to the point where the page is last removed from the
> > LRU. Trying to derive the status from other info in
> > the page resulted in wrong VM statistics in earlier
> > split VM patchsets.
>
> argh. How many are left?
After this patch, 19 are in use. I believe the way we
keep track of zones means we have 24 total on i386, but
the exact value depends on CONFIG_NODES_SHIFT, which
determines the maximum number of NUMA nodes supported.
The code in include/linux/mm.h makes sure the user
cannot compile the kernel with too high a NODES_SHIFT
value.
On 64 bits, we have more than enough flags.
> > +static inline int page_file_cache(struct page *page)
> > +{
> > + if (PageSwapBacked(page))
> > + return 0;
> > +
> > + /* The page is page cache backed by a normal filesystem. */
> > + return 2;
>
> 2?
>
> Maybe bool would suit here.
It will be replaced with LRU_FILE in a later patch. I'll change it
to 1 in this patch.
> Maybe a better name would be page_is_file_cache(). The gnu (gcc?)
> convention of putting _p at the end of predicate functions' names makes
> heaps of sense.
I'll change it to page_is_file_cache().
> This function doesn't do enough stuff to do that which it says it does.
> There must be a whole pile of preconditions which the caller must
> evaluate before this function can be usefully used. I mean, it would
> be a bug to pass an anonymous page or a slab page or whatever into
> here?
Passing in a slab page would indeed not give a useful result. This
function is meant to help functions that manipulate the LRU lists
(which already know the page is or should be an LRU page) sort the
page onto the right list.
I have amended the comment to reflect this.
--
All rights reversed.
^ permalink raw reply [flat|nested] 102+ messages in thread
* Re: [PATCH -mm 15/25] Ramfs and Ram Disk pages are non-reclaimable
2008-06-07 1:05 ` Andrew Morton
@ 2008-06-08 4:32 ` Greg KH
0 siblings, 0 replies; 102+ messages in thread
From: Greg KH @ 2008-06-08 4:32 UTC (permalink / raw)
To: Andrew Morton
Cc: Rik van Riel, linux-kernel, lee.schermerhorn, kosaki.motohiro,
linux-mm, eric.whitney
On Fri, Jun 06, 2008 at 06:05:10PM -0700, Andrew Morton wrote:
>
> Also, I expect there are a whole host of pseudo-filesystems (sysfs?)
> which have this problem. Does the patch address all of them? If not,
> can we come up with something which _does_ address them all without
> having to hunt down and change every such fs?
sysfs used to have this issue, until the people at IBM rewrote the whole
backing store for sysfs so that now it is reclaimable and pages out
quite nicely when there is memory pressure. That's how they run 20,000
disks on the s390 boxes with no memory :)
But it would be nice to solve the issue "generically" for ram based
filesystems, if possible (usbfs, securityfs, debugfs, etc.)
thanks,
greg k-h
^ permalink raw reply [flat|nested] 102+ messages in thread
* Re: [PATCH -mm 09/25] fix pagecache reclaim referenced bit check
2008-06-07 1:08 ` Rik van Riel
@ 2008-06-08 10:02 ` Peter Zijlstra
0 siblings, 0 replies; 102+ messages in thread
From: Peter Zijlstra @ 2008-06-08 10:02 UTC (permalink / raw)
To: Rik van Riel
Cc: Andrew Morton, linux-kernel, lee.schermerhorn, kosaki.motohiro,
mbligh
On Fri, 2008-06-06 at 21:08 -0400, Rik van Riel wrote:
> On Fri, 6 Jun 2008 18:04:51 -0700
> Andrew Morton <akpm@linux-foundation.org> wrote:
> > Changelog should spell all this out, I guess.
>
> It's not the Changelog I'm worried about as much as the
> people who are trying to read the code afterwards.
Guess what these people turn to when they look for the rationale of a
change? - the Changelog :-)
^ permalink raw reply [flat|nested] 102+ messages in thread
* Re: [PATCH -mm 02/25] Use an indexed array for LRU variables
2008-06-07 14:47 ` Rik van Riel
@ 2008-06-08 11:22 ` KOSAKI Motohiro
0 siblings, 0 replies; 102+ messages in thread
From: KOSAKI Motohiro @ 2008-06-08 11:22 UTC (permalink / raw)
To: Rik van Riel; +Cc: Andrew Morton, linux-kernel, lee.schermerhorn, clameter
>> > > @@ -1,40 +1,51 @@
>> > > static inline void
>> > > +add_page_to_lru_list(struct zone *zone, struct page *page, enum lru_list l)
>> > > +{
>> > > + list_add(&page->lru, &zone->list[l]);
>> > > + __inc_zone_state(zone, NR_INACTIVE + l);
>> >
>> > ^ that's a bug, isn't it?
>>
>> this is definitely a bug.
>
> I believe this is correct, actually. I will rename/alias it to
> VMSCAN_BASE or something along those lines.
Ah, sorry, you are right.
^ permalink raw reply [flat|nested] 102+ messages in thread
* Re: [PATCH -mm 07/25] second chance replacement for anonymous pages
2008-06-07 1:04 ` Andrew Morton
2008-06-07 6:03 ` KOSAKI Motohiro
@ 2008-06-08 15:04 ` Rik van Riel
1 sibling, 0 replies; 102+ messages in thread
From: Rik van Riel @ 2008-06-08 15:04 UTC (permalink / raw)
To: Andrew Morton; +Cc: linux-kernel, lee.schermerhorn, kosaki.motohiro
On Fri, 6 Jun 2008 18:04:43 -0700
Andrew Morton <akpm@linux-foundation.org> wrote:
> > To keep the maximum amount of necessary work reasonable, we scale the
> > active to inactive ratio with the size of memory, using the formula
> > active:inactive ratio = sqrt(memory in GB * 10).
>
> Should be scaled by PAGE_SIZE?
I suspect the value does not matter all that much. It is meant to
make the number of inactive anon pages scale sub-linearly with the
size of memory in the system, so anon pages stay on the inactive
list long enough to get referenced again, while limiting the total
number of pages the VM needs to scan to something hopefully reasonable.
This formula has worked well in our testing, but maybe wider testing
in -mm will show us it needs to be tweaked. Too early to tell whether
scaling by PAGE_SIZE will be needed at all.
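The calculation itself is essentially (a sketch; exact names may
differ in the patch):

	/* zone size in gigabytes, rounded down */
	unsigned int gb = zone->present_pages >> (30 - PAGE_SHIFT);

	if (gb)	/* e.g. 1GB -> 3, 10GB -> 10, 1TB -> 101 */
		zone->inactive_ratio = int_sqrt(10 * gb);
	else	/* small zones get a 1:1 target */
		zone->inactive_ratio = 1;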
> > +static inline int inactive_anon_low(struct zone *zone)
> > +{
> > + unsigned long active, inactive;
> > +
> > + active = zone_page_state(zone, NR_ACTIVE_ANON);
> > + inactive = zone_page_state(zone, NR_INACTIVE_ANON);
> > +
> > + if (inactive * zone->inactive_ratio < active)
> > + return 1;
> > +
> > + return 0;
> > +}
>
> inactive_anon_low: "number of inactive anonymous pages which are in lowmem"?
>
> Nope.
>
> Needs a comment. And maybe a better name, like inactive_anon_is_low.
> Although making the return type a bool kind-of does that.
Added a comment and renamed the function.
> > + /*
> > + * The ratio of active to inactive pages.
> > + */
> > + unsigned int inactive_ratio;
>
> That comment needs a lot of help please. For a start, it's plain wrong
> - inactive_ratio would need to be a float to be able to record that ratio.
>
> The comment should describe the units too.
Commented the hell out of the inactive_ratio stuff :)
> OK, so inactive_ratio is an integer 1 .. N which determines our target
> number of inactive pages according to the formula
>
> nr_inactive = nr_active / inactive_ratio
>
> yes?
>
> Can nr_inactive get larger than this? I assume so. I guess that
> doesn't matter much. Except the problems which you're trying to sovle
> here can reoccur. What would I need to do to trigger that?
All new anon pages start out on the active list.
The only way you could trigger this problem is by swapping a lot
of memory out through allocation of new memory, then freeing that
new memory and swapping the old memory back in.
That can only happen with the "add newly swapped in pages to the
inactive list" patch applied, which is why that patch may need some
wider exposure in -mm.
It has not been problematic in our tests so far.
> > long vm_total_pages; /* The total number of pages which the VM controls */
> >
> > static LIST_HEAD(shrinker_list);
> > @@ -1008,7 +1008,7 @@ static inline int zone_is_near_oom(struc
> > static void shrink_active_list(unsigned long nr_pages, struct zone *zone,
> > struct scan_control *sc, int priority, int file)
> > {
> > - unsigned long pgmoved;
> > + unsigned long pgmoved = 0;
> > int pgdeactivate = 0;
> > unsigned long pgscanned;
> > LIST_HEAD(l_hold); /* The pages which were snipped off */
> > @@ -1036,17 +1036,32 @@ static void shrink_active_list(unsigned
> > __mod_zone_page_state(zone, NR_ACTIVE_ANON, -pgmoved);
> > spin_unlock_irq(&zone->lru_lock);
> >
> > + pgmoved = 0;
>
> didn't we just do that?
pgmoved was used in a call above. I have gotten rid of the top
initialization instead, since it's assigned the return value from
a function.
> > while (!list_empty(&l_hold)) {
> > cond_resched();
> > page = lru_to_page(&l_hold);
> > list_del(&page->lru);
> > - if (page_referenced(page, 0, sc->mem_cgroup))
> > - list_add(&page->lru, &l_active);
> > - else
> > + if (page_referenced(page, 0, sc->mem_cgroup)) {
> > + if (file) {
> > + /* Referenced file pages stay active. */
> > + list_add(&page->lru, &l_active);
> > + } else {
> > + /* Anonymous pages always get deactivated. */
>
> hm. That's going to make the machine swap like hell. I guess I don't
> understand all this yet.
The file pages live on a separate LRU from the anon pages. The anon
LRU will generally be scanned much slower than the file LRU, which
makes the always deactivation harmless.
> > + list_add(&page->lru, &l_inactive);
> > + pgmoved++;
> > + }
> > + } else
> > list_add(&page->lru, &l_inactive);
> > }
> >
> > /*
> > + * Count the referenced anon pages as rotated, to balance pageout
> > + * scan pressure between file and anonymous pages in get_sacn_ratio.
>
> tpyo
Fixed.
--
All rights reversed.
^ permalink raw reply [flat|nested] 102+ messages in thread
* Re: [PATCH -mm 08/25] add some sanity checks to get_scan_ratio
2008-06-07 1:04 ` Andrew Morton
@ 2008-06-08 15:11 ` Rik van Riel
0 siblings, 0 replies; 102+ messages in thread
From: Rik van Riel @ 2008-06-08 15:11 UTC (permalink / raw)
To: Andrew Morton; +Cc: linux-kernel, lee.schermerhorn, kosaki.motohiro
On Fri, 6 Jun 2008 18:04:47 -0700
Andrew Morton <akpm@linux-foundation.org> wrote:
> > + if (unlikely(zone->recent_scanned_file > file / 4)) {
>
> I see nothing in the changelog about this and there are no comments.
> How can a reader possibly work out what you were thinking when this
> was typed in??
Pulled into the main split LRU patch and commented.
> Perhaps the (nr_swap_pages <= 0) test could happen earlier on.
Done.
> Please quadruple-check this code like a paranoid maniac looking for
> underflows, overflows and divides-by-zero. Bear in mind that x/(y+1)
> can get a div-by-zero for sufficiently-unexpected values of y.
Done that already.
> Oh, so that's what the [0] and [1] in get_scan_ratio() mean. Perhaps
> doing this:
>
> if (nr_swap_pages <= 0) {
> percent[0] = 0; /* anon */
> percent[1] = 100; /* file */
>
> would clarify things.
Added lots of comments on this.
> > +++ linux-2.6.26-rc2-mm1/include/linux/mmzone.h 2008-05-28 12:11:51.000000000 -0400
> > @@ -289,6 +289,8 @@ struct zone {
> >
> > unsigned long recent_rotated_anon;
> > unsigned long recent_rotated_file;
> > + unsigned long recent_scanned_anon;
> > + unsigned long recent_scanned_file;
>
> I think struct zone is sufficiently important and obscure that
> field-by-field /*documentation*/ is needed. Not as kerneldoc, please -
> better to do it at the definition site
Added documentation for these.
--
All rights reversed.
^ permalink raw reply [flat|nested] 102+ messages in thread
* Re: [PATCH -mm 13/25] Noreclaim LRU Infrastructure
2008-06-07 1:05 ` Andrew Morton
@ 2008-06-08 20:34 ` Rik van Riel
2008-06-08 20:57 ` Andrew Morton
2008-06-08 21:07 ` KOSAKI Motohiro
2008-06-10 20:09 ` Rik van Riel
1 sibling, 2 replies; 102+ messages in thread
From: Rik van Riel @ 2008-06-08 20:34 UTC (permalink / raw)
To: Andrew Morton
Cc: linux-kernel, lee.schermerhorn, kosaki.motohiro, linux-mm,
eric.whitney
On Fri, 6 Jun 2008 18:05:06 -0700
Andrew Morton <akpm@linux-foundation.org> wrote:
> On Fri, 06 Jun 2008 16:28:51 -0400
> Rik van Riel <riel@redhat.com> wrote:
>
> >
> > From: Lee Schermerhorn <Lee.Schermerhorn@hp.com>
> > The noreclaim infrastructure is enabled by a new mm Kconfig option
> > [CONFIG_]NORECLAIM_LRU.
>
> Having a config option for this really sucks, and needs extra-special
> justification, rather than none.
I believe the justification is that it uses a page flag.
PG_noreclaim would be the 20th page flag used, meaning there are
4 more free if 8 bits are used for zone and node info, which would
give 6 bits for NODE_SHIFT or 64 NUMA nodes - probably overkill
for 32 bit x86.
If you want I'll get rid of CONFIG_NORECLAIM_LRU and make everything
just compile in always.
Please let me know what your preference is.
> > --- linux-2.6.26-rc2-mm1.orig/include/linux/page-flags.h 2008-05-29 16:21:04.000000000 -0400
> > +++ linux-2.6.26-rc2-mm1/include/linux/page-flags.h 2008-06-06 16:05:15.000000000 -0400
> > @@ -94,6 +94,9 @@ enum pageflags {
> > PG_reclaim, /* To be reclaimed asap */
> > PG_buddy, /* Page is free, on buddy lists */
> > PG_swapbacked, /* Page is backed by RAM/swap */
> > +#ifdef CONFIG_NORECLAIM_LRU
> > + PG_noreclaim, /* Page is "non-reclaimable" */
> > +#endif
>
> I fear that we're messing up the terminology here.
>
> Go into your 2.6.25 tree and do `grep -i reclaimable */*.c'. The term
> already means a few different things, but in the vmscan context,
> "reclaimable" means that the page is unreferenced, clean and can be
> stolen. "reclaimable" also means a lot of other things, and we just
> made that worse.
>
> Can we think of a new term which uniquely describes this new concept
> and use that, rather than flogging the old horse?
Want to reuse the BSD term "pinned" instead?
> > +/**
> > + * add_page_to_noreclaim_list
> > + * @page: the page to be added to the noreclaim list
> > + *
> > + * Add page directly to its zone's noreclaim list. To avoid races with
> > + * tasks that might be making the page reclaimable while it's not on the
> > + * lru, we want to add the page while it's locked or otherwise "invisible"
> > + * to other tasks. This is difficult to do when using the pagevec cache,
> > + * so bypass that.
> > + */
>
> How does a task "make a page reclaimable"? munlock()? fsync()?
> exit()?
>
> Choice of terminology matters...
Lee? Kosaki-san?
--
All rights reversed.
^ permalink raw reply [flat|nested] 102+ messages in thread
* Re: [PATCH -mm 13/25] Noreclaim LRU Infrastructure
2008-06-08 20:34 ` Rik van Riel
@ 2008-06-08 20:57 ` Andrew Morton
2008-06-08 21:32 ` Rik van Riel
2008-06-08 22:03 ` Rik van Riel
2008-06-08 21:07 ` KOSAKI Motohiro
1 sibling, 2 replies; 102+ messages in thread
From: Andrew Morton @ 2008-06-08 20:57 UTC (permalink / raw)
To: Rik van Riel
Cc: linux-kernel, lee.schermerhorn, kosaki.motohiro, linux-mm,
eric.whitney
On Sun, 8 Jun 2008 16:34:13 -0400 Rik van Riel <riel@redhat.com> wrote:
> On Fri, 6 Jun 2008 18:05:06 -0700
> Andrew Morton <akpm@linux-foundation.org> wrote:
> > On Fri, 06 Jun 2008 16:28:51 -0400
> > Rik van Riel <riel@redhat.com> wrote:
> >
> > >
> > > From: Lee Schermerhorn <Lee.Schermerhorn@hp.com>
>
> > > The noreclaim infrastructure is enabled by a new mm Kconfig option
> > > [CONFIG_]NORECLAIM_LRU.
> >
> > Having a config option for this really sucks, and needs extra-special
> > justification, rather than none.
>
> I believe the justification is that it uses a page flag.
>
> PG_noreclaim would be the 20th page flag used, meaning there are
> 4 more free if 8 bits are used for zone and node info, which would
> give 6 bits for NODE_SHIFT or 64 NUMA nodes - probably overkill
> for 32 bit x86.
>
> If you want I'll get rid of CONFIG_NORECLAIM_LRU and make everything
> just compile in always.
Seems unlikely to be useful? The only way in which this would be an
advantage is if we have some other feature which also needs a page flag
but which will never be concurrently enabled with this one.
> Please let me know what your preference is.
Don't use another page flag?
> > > --- linux-2.6.26-rc2-mm1.orig/include/linux/page-flags.h 2008-05-29 16:21:04.000000000 -0400
> > > +++ linux-2.6.26-rc2-mm1/include/linux/page-flags.h 2008-06-06 16:05:15.000000000 -0400
> > > @@ -94,6 +94,9 @@ enum pageflags {
> > > PG_reclaim, /* To be reclaimed asap */
> > > PG_buddy, /* Page is free, on buddy lists */
> > > PG_swapbacked, /* Page is backed by RAM/swap */
> > > +#ifdef CONFIG_NORECLAIM_LRU
> > > + PG_noreclaim, /* Page is "non-reclaimable" */
> > > +#endif
> >
> > I fear that we're messing up the terminology here.
> >
> > Go into your 2.6.25 tree and do `grep -i reclaimable */*.c'. The term
> > already means a few different things, but in the vmscan context,
> > "reclaimable" means that the page is unreferenced, clean and can be
> > stolen. "reclaimable" also means a lot of other things, and we just
> > made that worse.
> >
> > Can we think of a new term which uniquely describes this new concept
> > and use that, rather than flogging the old horse?
>
> Want to reuse the BSD term "pinned" instead?
mm, "pinned" in Linuxland means "someone took a ref on it to prevent it
from being reclaimed".
As a starting point: what, in your english-language-paragraph-length
words, does this flag mean?
> > > +/**
> > > + * add_page_to_noreclaim_list
> > > + * @page: the page to be added to the noreclaim list
> > > + *
> > > + * Add page directly to its zone's noreclaim list. To avoid races with
> > > + * tasks that might be making the page reclaimable while it's not on the
> > > + * lru, we want to add the page while it's locked or otherwise "invisible"
> > > + * to other tasks. This is difficult to do when using the pagevec cache,
> > > + * so bypass that.
> > > + */
> >
> > How does a task "make a page reclaimable"? munlock()? fsync()?
> > exit()?
> >
> > Choice of terminology matters...
>
> Lee? Kosaki-san?
^ permalink raw reply [flat|nested] 102+ messages in thread
* Re: [PATCH -mm 13/25] Noreclaim LRU Infrastructure
2008-06-08 20:34 ` Rik van Riel
2008-06-08 20:57 ` Andrew Morton
@ 2008-06-08 21:07 ` KOSAKI Motohiro
1 sibling, 0 replies; 102+ messages in thread
From: KOSAKI Motohiro @ 2008-06-08 21:07 UTC (permalink / raw)
To: Rik van Riel
Cc: Andrew Morton, linux-kernel, lee.schermerhorn, linux-mm,
eric.whitney
>> > +#ifdef CONFIG_NORECLAIM_LRU
>> > + PG_noreclaim, /* Page is "non-reclaimable" */
>> > +#endif
>>
>> I fear that we're messing up the terminology here.
>>
>> Go into your 2.6.25 tree and do `grep -i reclaimable */*.c'. The term
>> already means a few different things, but in the vmscan context,
>> "reclaimable" means that the page is unreferenced, clean and can be
>> stolen. "reclaimable" also means a lot of other things, and we just
>> made that worse.
>>
>> Can we think of a new term which uniquely describes this new concept
>> and use that, rather than flogging the old horse?
>
> Want to reuse the BSD term "pinned" instead?
I like this term :)
but I am afraid somebody might confuse it with the Xen/KVM meaning of a pinned page.
IOW, I guess somebody might imagine, from "pinned page", the flag below:
#define PG_pinned PG_owner_priv_1 /* Xen pinned pagetable */
I have no idea....
>> > +/**
>> > + * add_page_to_noreclaim_list
>> > + * @page: the page to be added to the noreclaim list
>> > + *
>> > + * Add page directly to its zone's noreclaim list. To avoid races with
>> > + * tasks that might be making the page reclaimable while it's not on the
>> > + * lru, we want to add the page while it's locked or otherwise "invisible"
>> > + * to other tasks. This is difficult to do when using the pagevec cache,
>> > + * so bypass that.
>> > + */
>>
>> How does a task "make a page reclaimable"? munlock()? fsync()?
>> exit()?
>>
>> Choice of terminology matters...
>
> Lee? Kosaki-san?
AFAIK, moving a page from the noreclaim list back to a reclaim list happens in the situations below.
mlock'ed page
- all processes that mlocked it exit.
- all processes that mlocked it call munlock().
- the vma the page is related to vanishes
(e.g. munmap, mmap, remap_file_pages)
SHM_LOCKed page
- shmctl(SHM_UNLOCK) is called.
^ permalink raw reply [flat|nested] 102+ messages in thread
* Re: [PATCH -mm 13/25] Noreclaim LRU Infrastructure
2008-06-08 20:57 ` Andrew Morton
@ 2008-06-08 21:32 ` Rik van Riel
2008-06-08 21:43 ` Ray Lee
2008-06-08 23:22 ` Andrew Morton
2008-06-08 22:03 ` Rik van Riel
1 sibling, 2 replies; 102+ messages in thread
From: Rik van Riel @ 2008-06-08 21:32 UTC (permalink / raw)
To: Andrew Morton
Cc: linux-kernel, lee.schermerhorn, kosaki.motohiro, linux-mm,
eric.whitney
On Sun, 8 Jun 2008 13:57:04 -0700
Andrew Morton <akpm@linux-foundation.org> wrote:
> > > > From: Lee Schermerhorn <Lee.Schermerhorn@hp.com>
> >
> > > > The noreclaim infrastructure is enabled by a new mm Kconfig option
> > > > [CONFIG_]NORECLAIM_LRU.
> > >
> > > Having a config option for this really sucks, and needs extra-special
> > > justification, rather than none.
> >
> > I believe the justification is that it uses a page flag.
> >
> > PG_noreclaim would be the 20th page flag used, meaning there are
> > 4 more free if 8 bits are used for zone and node info, which would
> > give 6 bits for NODE_SHIFT or 64 NUMA nodes - probably overkill
> > for 32 bit x86.
> >
> > If you want I'll get rid of CONFIG_NORECLAIM_LRU and make everything
> > just compile in always.
>
> Seems unlikely to be useful? The only way in which this would be an
> advantage is if we have some other feature which also needs a page flag
> but which will never be concurrently enabled with this one.
>
> > Please let me know what your preference is.
>
> Don't use another page flag?
I don't see how that would work. We need a way to identify
the status of the page.
> > > > +#ifdef CONFIG_NORECLAIM_LRU
> > > > + PG_noreclaim, /* Page is "non-reclaimable" */
> > > > +#endif
> > >
> > > I fear that we're messing up the terminology here.
> > >
> > > Go into your 2.6.25 tree and do `grep -i reclaimable */*.c'. The term
> > > already means a few different things, but in the vmscan context,
> > > "reclaimable" means that the page is unreferenced, clean and can be
> > > stolen. "reclaimable" also means a lot of other things, and we just
> > > made that worse.
> > >
> > > Can we think of a new term which uniquely describes this new concept
> > > and use that, rather than flogging the old horse?
> >
> > Want to reuse the BSD term "pinned" instead?
>
> mm, "pinned" in Linuxland means "someone took a ref on it to prevent it
> from being reclaimed".
>
> As a starting point: what, in your english-language-paragraph-length
> words, does this flag mean?
"Cannot be reclaimed because someone has it locked in memory
through mlock, or the page belongs to something that cannot
be evicted like ramfs."
--
All rights reversed.
^ permalink raw reply [flat|nested] 102+ messages in thread
* Re: [PATCH -mm 13/25] Noreclaim LRU Infrastructure
2008-06-08 21:32 ` Rik van Riel
@ 2008-06-08 21:43 ` Ray Lee
2008-06-08 23:22 ` Andrew Morton
1 sibling, 0 replies; 102+ messages in thread
From: Ray Lee @ 2008-06-08 21:43 UTC (permalink / raw)
To: Rik van Riel
Cc: Andrew Morton, linux-kernel, lee.schermerhorn, kosaki.motohiro,
linux-mm, eric.whitney
On Sun, Jun 8, 2008 at 2:32 PM, Rik van Riel <riel@redhat.com> wrote:
> On Sun, 8 Jun 2008 13:57:04 -0700
> Andrew Morton <akpm@linux-foundation.org> wrote:
>
>> > > > From: Lee Schermerhorn <Lee.Schermerhorn@hp.com>
>> >
>> > > > The noreclaim infrastructure is enabled by a new mm Kconfig option
>> > > > [CONFIG_]NORECLAIM_LRU.
>> > >
>> > > Having a config option for this really sucks, and needs extra-special
>> > > justification, rather than none.
>> >
>> > I believe the justification is that it uses a page flag.
>> >
>> > PG_noreclaim would be the 20th page flag used, meaning there are
>> > 4 more free if 8 bits are used for zone and node info, which would
>> > give 6 bits for NODE_SHIFT or 64 NUMA nodes - probably overkill
>> > for 32 bit x86.
>> >
>> > If you want I'll get rid of CONFIG_NORECLAIM_LRU and make everything
>> > just compile in always.
>>
>> Seems unlikely to be useful? The only way in which this would be an
>> advantage is if we have some other feature which also needs a page flag
>> but which will never be concurrently enabled with this one.
>>
>> > Please let me know what your preference is.
>>
>> Don't use another page flag?
>
> I don't see how that would work. We need a way to identify
> the status of the page.
>
>> > > > +#ifdef CONFIG_NORECLAIM_LRU
>> > > > + PG_noreclaim, /* Page is "non-reclaimable" */
>> > > > +#endif
>> > >
>> > > I fear that we're messing up the terminology here.
>> > >
>> > > Go into your 2.6.25 tree and do `grep -i reclaimable */*.c'. The term
>> > > already means a few different things, but in the vmscan context,
>> > > "reclaimable" means that the page is unreferenced, clean and can be
>> > > stolen. "reclaimable" also means a lot of other things, and we just
>> > > made that worse.
>> > >
>> > > Can we think of a new term which uniquely describes this new concept
>> > > and use that, rather than flogging the old horse?
>> >
>> > Want to reuse the BSD term "pinned" instead?
>>
>> mm, "pinned" in Linuxland means "someone took a ref on it to prevent it
>> from being reclaimed".
>>
>> As a starting point: what, in your english-language-paragraph-length
>> words, does this flag mean?
>
> "Cannot be reclaimed because someone has it locked in memory
> through mlock, or the page belongs to something that cannot
> be evicted like ramfs."
"Unevictable"
^ permalink raw reply [flat|nested] 102+ messages in thread
* Re: [PATCH -mm 13/25] Noreclaim LRU Infrastructure
2008-06-08 20:57 ` Andrew Morton
2008-06-08 21:32 ` Rik van Riel
@ 2008-06-08 22:03 ` Rik van Riel
1 sibling, 0 replies; 102+ messages in thread
From: Rik van Riel @ 2008-06-08 22:03 UTC (permalink / raw)
To: Andrew Morton
Cc: linux-kernel, lee.schermerhorn, kosaki.motohiro, linux-mm,
eric.whitney
On Sun, 8 Jun 2008 13:57:04 -0700
Andrew Morton <akpm@linux-foundation.org> wrote:
> > If you want I'll get rid of CONFIG_NORECLAIM_LRU and make everything
> > just compile in always.
>
> Seems unlikely to be useful? The only way in which this would be an
> advantage is if we have some other feature which also needs a page flag
> but which will never be concurrently enabled with this one.
>
> > Please let me know what your preference is.
>
> Don't use another page flag?
To explain in more detail why we need the page flag:
When we move a page from the active or inactive list onto the
noreclaim list, we need to know what list it was on, in order
to adjust the zone counts for that list (NR_ACTIVE_ANON, etc).
For the same reason, we need to be able to identify whether
a page is already on the noreclaim list, so we can adjust
the statistics for the noreclaim pages, too. We cannot afford
to accidentally move a page onto the noreclaim list twice, or
try to remove it from the noreclaim list twice.
We need to know how many pages of each type there are in
each zone, and we need a way to specify that a page has
just become noreclaim. If a page is sitting in a pagevec
somewhere, and it has just become unreclaimable, we want
that page to end up on the noreclaim list once that
pagevec is flushed.
As far as I can see, this requires a page flag.
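A minimal sketch of that last point, with fake_page and fake_zone as
hypothetical stand-ins for the real structures: when the deferred
pagevec is finally drained, the page itself must carry the information
about which list it belongs on:

#include <stdbool.h>

struct fake_page {
        bool noreclaim;         /* stands in for PG_noreclaim */
        bool active;            /* stands in for PG_active */
};

struct fake_zone {
        unsigned long nr_active, nr_inactive, nr_noreclaim;
};

/* Called once per page when a deferred pagevec is drained. */
static void putback_sketch(struct fake_zone *zone, struct fake_page *page)
{
        if (page->noreclaim) {
                /*
                 * Without a per-page bit there would be no way to know,
                 * at flush time, that this page must bypass the normal
                 * LRUs and be counted on the noreclaim list instead.
                 */
                zone->nr_noreclaim++;
        } else if (page->active) {
                zone->nr_active++;
        } else {
                zone->nr_inactive++;
        }
}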
--
All rights reversed.
^ permalink raw reply [flat|nested] 102+ messages in thread
* Re: [PATCH -mm 13/25] Noreclaim LRU Infrastructure
2008-06-08 21:32 ` Rik van Riel
2008-06-08 21:43 ` Ray Lee
@ 2008-06-08 23:22 ` Andrew Morton
2008-06-08 23:34 ` Rik van Riel
1 sibling, 1 reply; 102+ messages in thread
From: Andrew Morton @ 2008-06-08 23:22 UTC (permalink / raw)
To: Rik van Riel
Cc: linux-kernel, lee.schermerhorn, kosaki.motohiro, linux-mm,
eric.whitney
On Sun, 8 Jun 2008 17:32:44 -0400 Rik van Riel <riel@redhat.com> wrote:
> On Sun, 8 Jun 2008 13:57:04 -0700
> Andrew Morton <akpm@linux-foundation.org> wrote:
>
> > > > > From: Lee Schermerhorn <Lee.Schermerhorn@hp.com>
> > >
> > > > > The noreclaim infrastructure is enabled by a new mm Kconfig option
> > > > > [CONFIG_]NORECLAIM_LRU.
> > > >
> > > > Having a config option for this really sucks, and needs extra-special
> > > > justification, rather than none.
> > >
> > > I believe the justification is that it uses a page flag.
> > >
> > > PG_noreclaim would be the 20th page flag used, meaning there are
> > > 4 more free if 8 bits are used for zone and node info, which would
> > > give 6 bits for NODE_SHIFT or 64 NUMA nodes - probably overkill
> > > for 32 bit x86.
This feature isn't available on 32-bit CPUs, is it?
> > > If you want I'll get rid of CONFIG_NORECLAIM_LRU and make everything
> > > just compile in always.
> >
> > Seems unlikely to be useful? The only way in which this would be an
> > advantage is if we have some other feature which also needs a page flag
> > but which will never be concurrently enabled with this one.
^^this?
> > > Please let me know what your preference is.
> >
> > Don't use another page flag?
>
> I don't see how that would work. We need a way to identify
> the status of the page.
We'll run out one day. Then we will have little choice but to increase
the size of the pageframe.
This is a direct downside of adding more lru lists.
The this-is-64-bit-only problem really sucks, IMO. We still don't know
the reason for that decision. Presumably it was because we've already
run out of page flags? If so, the time for the larger pageframe is
upon us.
> > > > > +#ifdef CONFIG_NORECLAIM_LRU
> > > > > + PG_noreclaim, /* Page is "non-reclaimable" */
> > > > > +#endif
> > > >
> > > > I fear that we're messing up the terminology here.
> > > >
> > > > Go into your 2.6.25 tree and do `grep -i reclaimable */*.c'. The term
> > > > already means a few different things, but in the vmscan context,
> > > > "reclaimable" means that the page is unreferenced, clean and can be
> > > > stolen. "reclaimable" also means a lot of other things, and we just
> > > > made that worse.
> > > >
> > > > Can we think of a new term which uniquely describes this new concept
> > > > and use that, rather than flogging the old horse?
> > >
> > > Want to reuse the BSD term "pinned" instead?
> >
> > mm, "pinned" in Linuxland means "someone took a ref on it to prevent it
> > from being reclaimed".
> >
> > As a starting point: what, in your english-language-paragraph-length
> > words, does this flag mean?
>
> "Cannot be reclaimed because someone has it locked in memory
> through mlock, or the page belongs to something that cannot
> be evicted like ramfs."
Ray's "unevictable" sounds good. It's not a term we've used elsewhere.
It's all a bit arbitrary, but it's just a label which maps onto a
concept and if we all honour that mapping carefully in our code and
writings, VM maintenance becomes that bit easier.
^ permalink raw reply [flat|nested] 102+ messages in thread
* Re: [PATCH -mm 13/25] Noreclaim LRU Infrastructure
2008-06-08 23:22 ` Andrew Morton
@ 2008-06-08 23:34 ` Rik van Riel
2008-06-08 23:54 ` Andrew Morton
0 siblings, 1 reply; 102+ messages in thread
From: Rik van Riel @ 2008-06-08 23:34 UTC (permalink / raw)
To: Andrew Morton
Cc: linux-kernel, lee.schermerhorn, kosaki.motohiro, linux-mm,
eric.whitney
On Sun, 8 Jun 2008 16:22:08 -0700
Andrew Morton <akpm@linux-foundation.org> wrote:
> The this-is-64-bit-only problem really sucks, IMO. We still don't know
> the reason for that decision. Presumably it was because we've already
> run out of page flags? If so, the time for the larger pageframe is
> upon us.
32 bit machines are unlikely to have so much memory that they run
into big scalability issues with mlocked memory.
The obvious exception to that are large PAE systems, which run
into other bottlenecks already and will probably hit the wall in
some other way before suffering greatly from the "kswapd is
scanning unevictable pages" problem.
I'll leave it up to you to decide whether you want this feature
64 bit only, or whether you want to use up the page flag on 32
bit systems too.
Please let me know which direction I should take, so I can fix
up the patch set accordingly.
> > > As a starting point: what, in your english-language-paragraph-length
> > > words, does this flag mean?
> >
> > "Cannot be reclaimed because someone has it locked in memory
> > through mlock, or the page belongs to something that cannot
> > be evicted like ramfs."
>
> Ray's "unevictable" sounds good. It's not a term we've used elsewhere.
>
> It's all a bit arbitrary, but it's just a label which maps onto a
> concept and if we all honour that mapping carefully in our code and
> writings, VM maintenance becomes that bit easier.
OK, I'll rename everything to unevictable and will add documentation
to clear up the meaning.
--
All rights reversed.
^ permalink raw reply [flat|nested] 102+ messages in thread
* Re: [PATCH -mm 13/25] Noreclaim LRU Infrastructure
2008-06-08 23:34 ` Rik van Riel
@ 2008-06-08 23:54 ` Andrew Morton
2008-06-09 0:56 ` Rik van Riel
` (2 more replies)
0 siblings, 3 replies; 102+ messages in thread
From: Andrew Morton @ 2008-06-08 23:54 UTC (permalink / raw)
To: Rik van Riel
Cc: linux-kernel, lee.schermerhorn, kosaki.motohiro, linux-mm,
eric.whitney
On Sun, 8 Jun 2008 19:34:20 -0400 Rik van Riel <riel@redhat.com> wrote:
> On Sun, 8 Jun 2008 16:22:08 -0700
> Andrew Morton <akpm@linux-foundation.org> wrote:
>
> > The this-is-64-bit-only problem really sucks, IMO. We still don't know
> > the reason for that decision. Presumably it was because we've already
> > run out of page flags? If so, the time for the larger pageframe is
> > upon us.
>
> 32 bit machines are unlikely to have so much memory that they run
> into big scalability issues with mlocked memory.
>
> The obvious exception to that are large PAE systems, which run
> into other bottlenecks already and will probably hit the wall in
> some other way before suffering greatly from the "kswapd is
> scanning unevictable pages" problem.
>
> I'll leave it up to you to decide whether you want this feature
> 64 bit only, or whether you want to use up the page flag on 32
> bit systems too.
>
> Please let me know which direction I should take, so I can fix
> up the patch set accordingly.
I'm getting rather wobbly about all of this.
This is, afair, by far the most intrusive and high-risk change we've
looked at doing since 2.5.x, for small values of x.
I mean, it's taken many years of work to get reclaim into its current
state (and the reduction in reported problems will in part be due to
the quadrupling-odd of memory over that time). And we're now proposing
radical changes which again will take years to sort out, all on behalf
of a small number of workloads upon a minority of 64-bit machines which
themselves are a minority of the Linux base.
And it will take longer to get those problems sorted out if 32-bit
machines aren't even compiling the new code in.
Are all of these changes really justified?
ho hum. Can you remind us what problems this patchset actually
addresses? Preferably in order of seriousness? (The [0/n] description
told us about the implementation but forgot to tell us anything about
what it was fixing). Because I guess we should have a think about
alternative approaches.
^ permalink raw reply [flat|nested] 102+ messages in thread
* Re: [PATCH -mm 13/25] Noreclaim LRU Infrastructure
2008-06-08 23:54 ` Andrew Morton
@ 2008-06-09 0:56 ` Rik van Riel
2008-06-09 6:10 ` Andrew Morton
2008-06-09 2:58 ` Rik van Riel
2008-06-10 19:17 ` Christoph Lameter
2 siblings, 1 reply; 102+ messages in thread
From: Rik van Riel @ 2008-06-09 0:56 UTC (permalink / raw)
To: Andrew Morton
Cc: linux-kernel, lee.schermerhorn, kosaki.motohiro, linux-mm,
eric.whitney
On Sun, 8 Jun 2008 16:54:34 -0700
Andrew Morton <akpm@linux-foundation.org> wrote:
> On Sun, 8 Jun 2008 19:34:20 -0400 Rik van Riel <riel@redhat.com> wrote:
> > Please let me know which direction I should take, so I can fix
> > up the patch set accordingly.
>
> I'm getting rather wobbly about all of this.
>
> This is, afair, by far the most intrusive and high-risk change we've
> looked at doing since 2.5.x, for small values of x.
Nowhere near as intrusive or risky as eg. the timer changes that went
in a few releases ago.
> I mean, it's taken many years of work to get reclaim into its current
> state (and the reduction in reported problems will in part be due to
> the quadrupling-odd of memory over that time).
Actually, memory is now getting so large that the current code no
longer works right. On machines 16GB and up, we have discovered
really pathetic behaviour by the VM currently upstream.
Things like the VM scanning over the (locked) shared memory segment
over and over and over again, to get at the 1GB of freeable pagecache
memory in the system. Or the system scanning over all anonymous
memory over and over again, despite the fact that there is no more
swap space left.
With heavy anonymous memory workloads, Linux can stall for minutes
once memory runs low and something needs to be swapped out, because
pretty much all memory is anonymous and everything has the referenced
bit set. We have seen systems with 128GB of RAM hang overnight, once
every CPU got wedged in the pageout scanning code. Typically the VM
decides on a first page to swap out in 2-3 minutes though, and then
it will start several gigabytes of swap IO at once...
Definitely not acceptable behaviour.
> And we're now proposing radical changes which again will take years to sort
> out, all on behalf of a small number of workloads upon a minority of 64-bit
> machines which themselves are a minority of the Linux base.
Hardware gets larger. 4 years ago few people cared about systems
with more than 4GB of memory, but nowadays people have that in their
desktops.
> And it will take longer to get those problems sorted out if 32-bit
> machines aren't even compiling the new code in.
32 bit systems will still get the file/anon LRU split. The only
thing that is 64 bit only in the current patch set is keeping the
unevictable pages off of the LRU lists.
This means that balancing between file and anon eviction will be
the same on 32 and 64 bit systems and things should get sorted out
on both systems at the same time.
> Are all of these changes really justified?
People with large Linux servers are experiencing system stalls
of several minutes, or at worst complete livelocks, with the
current VM.
I believe that those issues need to be fixed.
After discussing this for a long time with Larry Woodman,
Lee Schermerhorn and others, I am convinced that they can
not be fixed by putting a bandaid on the current code.
After all, the fundamental problem often is that the file backed
and mem/swap backed pages are on the same LRU.
Think of a case that is becoming more and more common: a database
server with 128GB of RAM, 2GB of (hardly ever used) swap, 80GB of
locked shared memory segment, 30GB of other anonymous memory and
5GB of page cache.
Do you think it is reasonable for the VM to have to scan over
110GB of essentially unevictable memory, just to get at the 5GB
of page cache?
> Because I guess we should have a think about alternative approaches.
We have. We failed to come up with anything that avoids the
problem without actually fixing the fundamental issues.
If you have an idea, please let us know.
Otherwise, please give us a chance to shake things out in -mm.
I will prepare kernel RPMs for Fedora so users in the community can
easily test these patches too, and help find scenarios where these
patches do not perform as well as what the current kernel has.
I have time to track down and fix any issues that people find.
--
All rights reversed.
^ permalink raw reply [flat|nested] 102+ messages in thread
* Re: [PATCH -mm 04/25] free swap space on swap-in/activation
2008-06-07 1:04 ` Andrew Morton
2008-06-07 19:56 ` Rik van Riel
@ 2008-06-09 2:14 ` MinChan Kim
2008-06-09 2:42 ` Rik van Riel
2008-06-09 13:38 ` KOSAKI Motohiro
1 sibling, 2 replies; 102+ messages in thread
From: MinChan Kim @ 2008-06-09 2:14 UTC (permalink / raw)
To: Rik van Riel
Cc: Andrew Morton, linux-kernel, lee.schermerhorn, kosaki.motohiro,
Hugh Dickins
> - Can completely optimise the thing away if !CONFIG_SWAP.
>
I think we can optimize even more in the !CONFIG_SWAP case.
If the system is !CONFIG_SWAP, we can never swap out anonymous pages.
Do we even need to manage anonymous pages with an LRU list?
Such a system doesn't need to insert anonymous pages into the lru list when a fault occurs.
It is just needless overhead.
Also, if we can't reclaim anonymous pages, do we still need the anon rmap facility?
I don't know exactly which subsystems use rmap.
If pageout is the only user of anonymous rmap,
we could remove the anonymous rmap code in the !CONFIG_SWAP case.
What is your opinion?
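A rough sketch of the shape of that idea; add_anon_page_to_lru is a
made-up helper name, and lru_cache_add_active() is the pre-split entry
point, so this is only an illustration of the compile-time split:

struct page;
void lru_cache_add_active(struct page *page);

#ifdef CONFIG_SWAP
static inline void add_anon_page_to_lru(struct page *page)
{
        lru_cache_add_active(page);     /* the page may be evicted later */
}
#else
static inline void add_anon_page_to_lru(struct page *page)
{
        /*
         * No swap: the page can never be evicted, so skipping the LRU
         * saves the insertion now and the useless scanning later.
         */
        (void)page;
}
#endif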
--
Kind regards,
MinChan Kim
^ permalink raw reply [flat|nested] 102+ messages in thread
* Re: [PATCH -mm 04/25] free swap space on swap-in/activation
2008-06-09 2:14 ` MinChan Kim
@ 2008-06-09 2:42 ` Rik van Riel
2008-06-09 13:38 ` KOSAKI Motohiro
1 sibling, 0 replies; 102+ messages in thread
From: Rik van Riel @ 2008-06-09 2:42 UTC (permalink / raw)
To: MinChan Kim
Cc: Andrew Morton, linux-kernel, lee.schermerhorn, kosaki.motohiro,
Hugh Dickins
On Mon, 9 Jun 2008 11:14:55 +0900
"MinChan Kim" <minchan.kim@gmail.com> wrote:
> > - Can completely optimise the thing away if !CONFIG_SWAP.
> How about your opinion ?
You are right that it can be done. However, doing all of those
!CONFIG_SWAP optimizations are not my priority right now. Most
of them would also apply separately to the current upstream VM.
Patches are welcome, though.
--
All rights reversed.
^ permalink raw reply [flat|nested] 102+ messages in thread
* Re: [PATCH -mm 13/25] Noreclaim LRU Infrastructure
2008-06-08 23:54 ` Andrew Morton
2008-06-09 0:56 ` Rik van Riel
@ 2008-06-09 2:58 ` Rik van Riel
2008-06-09 5:44 ` Andrew Morton
2008-06-10 19:17 ` Christoph Lameter
2 siblings, 1 reply; 102+ messages in thread
From: Rik van Riel @ 2008-06-09 2:58 UTC (permalink / raw)
To: Andrew Morton
Cc: linux-kernel, lee.schermerhorn, kosaki.motohiro, linux-mm,
eric.whitney
On Sun, 8 Jun 2008 16:54:34 -0700
Andrew Morton <akpm@linux-foundation.org> wrote:
> ho hum. Can you remind us what problems this patchset actually
> addresses? Preferably in order of seriousness?
Here are some other problems that my patch series can easily fix,
because file cache and anon/swap backed pages live on separate
LRUs:
http://feedblog.org/2007/09/29/using-o_direct-on-linux-and-innodb-to-fix-swap-insanity/
http://blogs.smugmug.com/don/2008/05/01/mysql-and-the-linux-swap-problem/
I do not know for sure whether the patch set does fix it yet for
everyone, or whether it needs some more tuning first, but it is
fairly easily fixable by tweaking the relative pressure on both
sets of LRU lists.
No tricks required, like skipping over one type of page while
scanning, or treating the referenced bits differently when the moon
is in some particular phase - just one set of lists for each type
of page, and variable pressure between the two.
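The "variable pressure" part is simple enough to sketch in a few
lines; scan_target() is a hypothetical helper, not code from the
series:

/*
 * nr_to_scan for one LRU: scale the usual priority-based window by
 * the percentage that get_scan_ratio-style balancing assigned to
 * this set of lists (percent[0] = anon, percent[1] = file).
 */
static unsigned long scan_target(unsigned long lru_size,
                                 unsigned int percent, int priority)
{
        return (lru_size >> priority) * percent / 100;
}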
--
All rights reversed.
^ permalink raw reply [flat|nested] 102+ messages in thread
* Re: [PATCH -mm 13/25] Noreclaim LRU Infrastructure
2008-06-09 2:58 ` Rik van Riel
@ 2008-06-09 5:44 ` Andrew Morton
0 siblings, 0 replies; 102+ messages in thread
From: Andrew Morton @ 2008-06-09 5:44 UTC (permalink / raw)
To: Rik van Riel
Cc: linux-kernel, lee.schermerhorn, kosaki.motohiro, linux-mm,
eric.whitney
On Sun, 8 Jun 2008 22:58:00 -0400 Rik van Riel <riel@redhat.com> wrote:
> On Sun, 8 Jun 2008 16:54:34 -0700
> Andrew Morton <akpm@linux-foundation.org> wrote:
>
> > ho hum. Can you remind us what problems this patchset actually
> > addresses? Preferably in order of seriousness?
>
> Here are some other problems that my patch series can easily fix,
> because file cache and anon/swap backed pages live on separate
> LRUs:
>
> http://feedblog.org/2007/09/29/using-o_direct-on-linux-and-innodb-to-fix-swap-insanity/
>
> http://blogs.smugmug.com/don/2008/05/01/mysql-and-the-linux-swap-problem/
Sorry, but sending us off to look at random bug reports (from people
who didn't report a bug) is not how we discuss or changelog kernel
patches.
It is for good reasons that we like to see an accurate and detailed
analysis of the problems which are being addressed, and a description
of the means by which they were solved.
> I do not know for sure whether the patch set does fix it yet for
> everyone, or whether it needs some more tuning first, but it is
> fairly easily fixable by tweaking the relative pressure on both
> sets of LRU lists.
I expect it will help, yes. On 64-bit systems. It's unclear whether
mlock or SHM_LOCK is part of the issue here - if it is then 32-bit
systems will still be exposed to these things.
I also expect that it will introduce new problems, ones which can take a
very long time to diagnose and fix. Inevitable, but hopefully acceptable,
if the benefit is there.
> No tricks of skipping over one type of pages while scanning, or
> treating the referenced bits differently when the moon is in some
> particular phase required - one set of lists for each type of
> pages, and variable pressure between the two.
For the unevictable pages we have previously considered just taking
them off the LRU and leaving them off - reattach them at
SHM_UNLOCK-time and at munlock()-time (potentially subject to
reexamination of any other vmas which map each page).
I believe that Andrea had code which leaves the anon pages off the LRU
as well, but I forget the details.
^ permalink raw reply [flat|nested] 102+ messages in thread
* Re: [PATCH -mm 13/25] Noreclaim LRU Infrastructure
2008-06-09 0:56 ` Rik van Riel
@ 2008-06-09 6:10 ` Andrew Morton
2008-06-09 13:44 ` Rik van Riel
0 siblings, 1 reply; 102+ messages in thread
From: Andrew Morton @ 2008-06-09 6:10 UTC (permalink / raw)
To: Rik van Riel
Cc: linux-kernel, lee.schermerhorn, kosaki.motohiro, linux-mm,
eric.whitney
On Sun, 8 Jun 2008 20:56:29 -0400 Rik van Riel <riel@redhat.com> wrote:
> On Sun, 8 Jun 2008 16:54:34 -0700
> Andrew Morton <akpm@linux-foundation.org> wrote:
> > On Sun, 8 Jun 2008 19:34:20 -0400 Rik van Riel <riel@redhat.com> wrote:
>
> > > Please let me know which direction I should take, so I can fix
> > > up the patch set accordingly.
> >
> > I'm getting rather wobbly about all of this.
> >
> > This is, afair, by far the most intrusive and high-risk change we've
> > looked at doing since 2.5.x, for small values of x.
>
> Nowhere near as intrusive or risky as eg. the timer changes that went
> in a few releases ago.
Well. Intrusiveness doesn't matter much. But no, you're dead wrong -
this stuff is far more risky than timer changes. Because things like
the timer changes are trivial to detect errors in - it either works or
it doesn't.
Whereas reclaim problems can take *years* to identify and are often
very hard for the programmers to understand, reproduce and diagnose.
> > I mean, it's taken many years of work to get reclaim into its current
> > state (and the reduction in reported problems will in part be due to
> > the quadrupling-odd of memory over that time).
>
> Actually, memory is now getting so large that the current code no
> longer works right. On machines 16GB and up, we have discovered
> really pathetic behaviour by the VM currently upstream.
>
> Things like the VM scanning over the (locked) shared memory segment
> over and over and over again, to get at the 1GB of freeable pagecache
> memory in the system.
Earlier discussion about removing these pages from ALL LRUs reached a
quite detailed stage, but nobody seemed to finish any code.
> Or the system scanning over all anonymous
> memory over and over again, despite the fact that there is no more
> swap space left.
We shouldn't rewrite core VM to cater for incorrectly configured
systems.
> With heavy anonymous memory workloads, Linux can stall for minutes
> once memory runs low and something needs to be swapped out, because
> pretty much all memory is anonymous and everything has the referenced
> bit set. We have seen systems with 128GB of RAM hang overnight, once
> every CPU got wedged in the pageout scanning code. Typically the VM
> decides on a first page to swap out in 2-3 minutes though, and then
> it will start several gigabytes of swap IO at once...
>
> Definitely not acceptable behaviour.
I see handwavy non-bug-reports loosely associated with a vast pile of
code and vague expressions of hope that one will fix the other.
Where's the meat in this, Rik? This is engineering.
Do you or do you not have a test case which demonstrates this problem?
It doesn't sound terribly hard. Where are the before-and-after test
results?
> > And we're now proposing radical changes which again will take years to sort
> > out, all on behalf of a small number of workloads upon a minority of 64-bit
> > machines which themselves are a minority of the Linux base.
>
> Hardware gets larger. 4 years ago few people cared about systems
> with more than 4GB of memory, but nowadays people have that in their
> desktops.
>
> > And it will take longer to get those problems sorted out if 32-bit
> > machines aren't even compiling the new code in.
>
> 32 bit systems will still get the file/anon LRU split. The only
> thing that is 64 bit only in the current patch set is keeping the
> unevictable pages off of the LRU lists.
>
> This means that balancing between file and anon eviction will be
> the same on 32 and 64 bit systems and things should get sorted out
> on both systems at the same time.
>
> > Are all of these changes really justified?
>
> People with large Linux servers are experiencing system stalls
> of several minutes, or at worst complete livelocks, with the
> current VM.
>
> I believe that those issues need to be fixed.
I'd love to see hard evidence that they have been. And that doesn't
mean getting palmed off on wikis and random blog pages.
Also, it is incumbent upon us to consider the other design proposals,
such as removing anon pages from the LRUs, removing mlocked pages from
the LRUs.
> After discussing this for a long time with Larry Woodman,
> Lee Schermerhorn and others, I am convinced that they can
> not be fixed by putting a bandaid on the current code.
>
> After all, the fundamental problem often is that the file backed
> and mem/swap backed pages are on the same LRU.
That actually isn't a fundamental problem.
It _becomes_ a problem because we try to treat the two types of pages
differently.
Stupid question: did anyone try setting swappiness=100? What happened?
> Think of a case that is becoming more and more common: a database
> server with 128GB of RAM, 2GB of (hardly ever used) swap, 80GB of
> locked shared memory segment, 30GB of other anonymous memory and
> 5GB of page cache.
>
> Do you think it is reasonable for the VM to have to scan over
> 110GB of essentially unevictable memory, just to get at the 5GB
> of page cache?
Well for starters that system was grossly misconfigured. It is
incumbent upon you, in your design document (that thing we call a
changelog) to justify why the VM design needs to be altered to cater
for such misconfigured systems. It just drives me up the wall having
to engage in a 20-email discussion to be able to squeeze these little
revelations out. Only to have them lost again later.
Secondly, I expect that removal of mlocked pages from the LRU (as was
discussed a year or two ago and perhaps implemented by Andrea) along
with swappiness=100 might get us towards a fix. Don't know.
> > Because I guess we should have a think about alternative approaches.
>
> We have. We failed to come up with anything that avoids the
> problem without actually fixing the fundamental issues.
Unless I missed it, none of your patch descriptions even attempt to
describe these fundamental issues. It's all buried in 20-deep email
threads.
> If you have an idea, please let us know.
I see no fundamental reason why we need to put mlocked or SHM_LOCKED
pages onto a VM LRU at all.
One cause of problems is that we attempt to prioritise anon pages over
file-backed pagecache. And we prioritise mmapped pages, which your patches
don't address, do they? Stopping doing that would, I expect, prevent a
range of these problems. It would introduce others, probably.
> Otherwise, please give us a chance to shake things out in -mm.
-mm isn't a very useful testing place any more, I'm afraid. The
patches would be better off in linux-next, but then they would screw up
all the other pending MM patches, and it's probably a bit early for
getting them into linux-next.
Once I get sections of -mm feeding into linux-next, things will be better.
> I will prepare kernel RPMs for Fedora so users in the community can
> easily test these patches too, and help find scenarios where these
> patches do not perform as well as what the current kernel has.
>
> I have time to track down and fix any issues that people find.
That helps.
^ permalink raw reply [flat|nested] 102+ messages in thread
* Re: [PATCH -mm 04/25] free swap space on swap-in/activation
2008-06-09 2:14 ` MinChan Kim
2008-06-09 2:42 ` Rik van Riel
@ 2008-06-09 13:38 ` KOSAKI Motohiro
2008-06-10 2:30 ` MinChan Kim
1 sibling, 1 reply; 102+ messages in thread
From: KOSAKI Motohiro @ 2008-06-09 13:38 UTC (permalink / raw)
To: MinChan Kim
Cc: Rik van Riel, Andrew Morton, linux-kernel, lee.schermerhorn,
Hugh Dickins
Hi Kim-san,
>> - Can completely optimise the thing away if !CONFIG_SWAP.
>
> I think we can optimize even more in the !CONFIG_SWAP case.
> If the system is !CONFIG_SWAP, we can never swap out anonymous pages.
>
> Do we even need to manage anonymous pages with an LRU list?
>
> Such a system doesn't need to insert anonymous pages into the lru list when a fault occurs.
> It is just needless overhead.
>
> Also, if we can't reclaim anonymous pages, do we still need the anon rmap facility?
> I don't know exactly which subsystems use rmap.
>
> If pageout is the only user of anonymous rmap,
> we could remove the anonymous rmap code in the !CONFIG_SWAP case.
You are right,
but I don't think rmap is a bottleneck on embedded systems.
Why do you want to remove it?
^ permalink raw reply [flat|nested] 102+ messages in thread
* Re: [PATCH -mm 13/25] Noreclaim LRU Infrastructure
2008-06-09 6:10 ` Andrew Morton
@ 2008-06-09 13:44 ` Rik van Riel
0 siblings, 0 replies; 102+ messages in thread
From: Rik van Riel @ 2008-06-09 13:44 UTC (permalink / raw)
To: Andrew Morton
Cc: linux-kernel, lee.schermerhorn, kosaki.motohiro, linux-mm,
eric.whitney
On Sun, 8 Jun 2008 23:10:53 -0700
Andrew Morton <akpm@linux-foundation.org> wrote:
> Also, it is incumbent upon us to consider the other design proposals,
> such as removing anon pages from the LRUs, removing mlocked pages from
> the LRUs.
That is certainly an option. We'll still need to keep track of
what kind of page the page is, though, otherwise we won't know
whether or not we can put it back onto the LRU lists at munlock
time.
> > After discussing this for a long time with Larry Woodman,
> > Lee Schermerhorn and others, I am convinced that they can
> > not be fixed by putting a bandaid on the current code.
> >
> > After all, the fundamental problem often is that the file backed
> > and mem/swap backed pages are on the same LRU.
>
> That actually isn't a fundamental problem.
>
> It _becomes_ a problem because we try to treat the two types of pages
> differently.
>
> Stupid question: did anyone try setting swappiness=100? What happened?
The database shared memory segment got swapped out and the
system crawled to a halt.
Swap IO usually is less efficient than page cache IO, because
page cache IO happens in larger chunks and does not involve
a swap-out first and a swap-in later - the data is just read,
which at least halves the disk IO compared to swap.
Readahead tilts the IO cost even more in favor of evicting
page cache pages, vs. swapping something out.
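A back-of-envelope example, assuming one disk operation per step:

  anon page via swap:    1 write (swap-out) + 1 read (swap-in) = 2 IOs
  clean page cache page: no write, 1 read when next needed     = 1 IO

which is where "at least halves the disk IO" comes from, before
readahead tilts things further toward evicting page cache.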
> > Think of a case that is becoming more and more common: a database
> > server with 128GB of RAM, 2GB of (hardly ever used) swap, 80GB of
> > locked shared memory segment, 30GB of other anonymous memory and
> > 5GB of page cache.
> >
> > Do you think it is reasonable for the VM to have to scan over
> > 110GB of essentially unevictable memory, just to get at the 5GB
> > of page cache?
>
> Well for starters that system was grossly misconfigured.
Swapping out the database shared memory segment is not an option,
because it is mlocked. Even if it was an option, swapping it out
would be a bad idea because swap IO is simply less efficient than
page cache IO (see above).
> Secondly, I expect that removal of mlocked pages from the LRU (as was
> discussed a year or two ago and perhaps implemented by Andrea) along
> with swappiness=100 might get us towards a fix. Don't know.
Removing mlocked pages from the LRU can be done, but I suspect
we'll still want to keep track of how many of these pages there
are, right?
> > > Because I guess we should have a think about alternative approaches.
> >
> > We have. We failed to come up with anything that avoids the
> > problem without actually fixing the fundamental issues.
>
> Unless I missed it, none of your patch descriptions even attempt to
> describe these fundamental issues. It's all buried in 20-deep email
> threads.
I'll add more problem descriptions to the next patch submission.
I'm halfway through the patch series, making all the cleanups and
changes you suggested.
> One cause of problems is that we attempt to prioritise anon pages over
> file-backed pagecache. And we prioritise mmapped pages, which your patches
> don't address, do they? Stopping doing that would, I expect, prevent a
> range of these problems. It would introduce others, probably.
Try running a database with swappiness=100 and then doing a
backup of the system simultaneously. The database will end
up being swapped out, which slows down the database, causes
extra IO and ends up slowing down the backup, too.
The backup does not benefit from having its data cached,
since it only reads everything once.
> > Otherwise, please give us a chance to shake things out in -mm.
>
> -mm isn't a very useful testing place any more, I'm afraid.
That's a problem. I can run tests on the VM patches, but you know
as well as I do that the code needs to be shaken out by lots of
users before we can be truly confident in it...
> > I will prepare kernel RPMs for Fedora so users in the community can
> > easily test these patches too, and help find scenarios where these
> > patches do not perform as well as what the current kernel has.
> >
> > I have time to track down and fix any issues that people find.
>
> That helps.
I sure hope so.
I'll send you a cleaned-up patch series soon. Hopefully tonight
or tomorrow.
--
All rights reversed.
^ permalink raw reply [flat|nested] 102+ messages in thread
* Re: [PATCH -mm 04/25] free swap space on swap-in/activation
2008-06-09 13:38 ` KOSAKI Motohiro
@ 2008-06-10 2:30 ` MinChan Kim
0 siblings, 0 replies; 102+ messages in thread
From: MinChan Kim @ 2008-06-10 2:30 UTC (permalink / raw)
To: KOSAKI Motohiro
Cc: Rik van Riel, Andrew Morton, linux-kernel, lee.schermerhorn,
Hugh Dickins
On Mon, Jun 9, 2008 at 10:38 PM, KOSAKI Motohiro
<kosaki.motohiro@jp.fujitsu.com> wrote:
> Hi Kim-san,
>
>>> - Can completely optimise the thing away if !CONFIG_SWAP.
>>
>> I think we can optimize even more in the !CONFIG_SWAP case.
>> If the system is !CONFIG_SWAP, we can never swap out anonymous pages.
>>
>> Do we even need to manage anonymous pages with an LRU list?
>>
>> Such a system doesn't need to insert anonymous pages into the lru list when a fault occurs.
>> It is just needless overhead.
>>
>> Also, if we can't reclaim anonymous pages, do we still need the anon rmap facility?
>> I don't know exactly which subsystems use rmap.
>>
>> If pageout is the only user of anonymous rmap,
>> we could remove the anonymous rmap code in the !CONFIG_SWAP case.
>
> You are right,
> but I don't think rmap is a bottleneck on embedded systems.
> Why do you want to remove it?
I think unnecessary code execution may increase power consumption and
cache footprint.
But I am not sure that pageout is the only user of rmap.
If another subsystem (e.g. Xen, KVM) uses rmap for managing memory, we
cannot remove rmap.
--
Kind regards,
MinChan Kim
^ permalink raw reply [flat|nested] 102+ messages in thread
* Re: [PATCH -mm 17/25] Mlocked Pages are non-reclaimable
2008-06-07 1:07 ` Andrew Morton
2008-06-07 5:38 ` KOSAKI Motohiro
@ 2008-06-10 3:31 ` Nick Piggin
2008-06-10 12:50 ` Rik van Riel
2008-06-10 21:14 ` Rik van Riel
2008-06-11 1:00 ` Rik van Riel
2 siblings, 2 replies; 102+ messages in thread
From: Nick Piggin @ 2008-06-10 3:31 UTC (permalink / raw)
To: Andrew Morton
Cc: Rik van Riel, linux-kernel, lee.schermerhorn, kosaki.motohiro,
linux-mm, eric.whitney
On Fri, Jun 06, 2008 at 06:07:46PM -0700, Andrew Morton wrote:
> On Fri, 06 Jun 2008 16:28:55 -0400
> Rik van Riel <riel@redhat.com> wrote:
>
> > Originally
> > From: Nick Piggin <npiggin@suse.de>
> >
> > Against: 2.6.26-rc2-mm1
> >
> > This patch:
> >
> > 1) defines the [CONFIG_]NORECLAIM_MLOCK sub-option and the
> > stub version of the mlock/noreclaim APIs when it's
> > not configured. Depends on [CONFIG_]NORECLAIM_LRU.
>
> Oh sob.
>
> akpm:/usr/src/25> find . -name '*.[ch]' | xargs grep CONFIG_NORECLAIM | wc -l
> 51
>
> why oh why? Must we really really do this to ourselves? Cheerfully
> unchangeloggedly?
>
> > 2) add yet another page flag--PG_mlocked--to indicate that
> > the page is locked for efficient testing in vmscan and,
> > optionally, fault path. This allows early culling of
> > nonreclaimable pages, preventing them from getting to
> > page_referenced()/try_to_unmap(). Also allows separate
> > accounting of mlock'd pages, as Nick's original patch
> > did.
> >
> > Note: Nick's original mlock patch used a PG_mlocked
> > flag. I had removed this in favor of the PG_noreclaim
> > flag + an mlock_count [new page struct member]. I
> > restored the PG_mlocked flag to eliminate the new
> > count field.
>
> How many page flags are left? I keep on asking this and I end up
> either a) not being told or b) forgetting. I thought that we had
> a whopping big comment somewhere which describes how all these
> flags are allocated but I can't immediately locate it.
>
> > 3) add the mlock/noreclaim infrastructure to mm/mlock.c,
> > with internal APIs in mm/internal.h. This is a rework
> > of Nick's original patch to these files, taking into
> > account that mlocked pages are now kept on noreclaim
> > LRU list.
> >
> > 4) update vmscan.c:page_reclaimable() to check PageMlocked()
> > and, if vma passed in, the vm_flags. Note that the vma
> > will only be passed in for new pages in the fault path;
> > and then only if the "cull nonreclaimable pages in fault
> > path" patch is included.
> >
> > 5) add try_to_unlock() to rmap.c to walk a page's rmap and
> > ClearPageMlocked() if no other vmas have it mlocked.
> > Reuses as much of try_to_unmap() as possible. This
> > effectively replaces the use of one of the lru list links
> > as an mlock count. If this mechanism lets pages in mlocked
> > vmas leak through w/o PG_mlocked set [I don't know that it
> > does], we should catch them later in try_to_unmap(). One
> > hopes this will be rare, as it will be relatively expensive.
> >
> > 6) Kosaki: added munlock page table walk to avoid using
> > get_user_pages() for unlock. get_user_pages() is unreliable
> > for some vma protections.
> > Lee: modified to wait for in-flight migration to complete
> > to close munlock/migration race that could strand pages.
>
> None of which is available on 32-bit machines. That's pretty significant.
It should definitely be enabled for 32-bit machines, and enabled by default.
The argument is that 32 bit machines won't have much memory so it won't
be a problem, but a) it also has to work well on other machines without
much memory, and b) it is a nightmare to have significant behaviour changes
like this, for kernel development as well as for running kernels.
If we eventually run out of page flags on 32 bit, then sure this might be
one we could look at getting rid of. Once the code has proven itself.
^ permalink raw reply [flat|nested] 102+ messages in thread
* Re: [PATCH -mm 17/25] Mlocked Pages are non-reclaimable
2008-06-10 3:31 ` Nick Piggin
@ 2008-06-10 12:50 ` Rik van Riel
2008-06-10 21:14 ` Rik van Riel
1 sibling, 0 replies; 102+ messages in thread
From: Rik van Riel @ 2008-06-10 12:50 UTC (permalink / raw)
To: Nick Piggin
Cc: Andrew Morton, linux-kernel, lee.schermerhorn, kosaki.motohiro,
linux-mm, eric.whitney
On Tue, 10 Jun 2008 05:31:30 +0200
Nick Piggin <npiggin@suse.de> wrote:
> It should definitely be enabled for 32-bit machines, and enabled by default.
> The argument is that 32 bit machines won't have much memory so it won't
> be a problem, but a) it also has to work well on other machines without
> much memory, and b) it is a nightmare to have significant behaviour changes
> like this. For kernel development as well as kernel running.
>
> If we eventually run out of page flags on 32 bit, then sure this might be
> one we could look at geting rid of. Once the code has proven itself.
Alternatively, we tell the 32 bit people not to compile their kernel
with support for 64 NUMA nodes :)
The number of page flags on 32 bits is (32 - ZONE_SHIFT - NODE_SHIFT)
after Christoph's cleanup and no longer a fixed number.
Does anyone compile a 32 bit kernel with a large (ZONE_SHIFT + NODE_SHIFT)?
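As a back-of-envelope check of that formula (a standalone
illustration; the real layout lives in page-flags.h and mm.h):

#include <stdio.h>

int main(void)
{
        /* a worst-ish case 32 bit config: 4 zones, 64 NUMA nodes */
        int bits = 32, zone_shift = 2, node_shift = 6;

        printf("page flag bits left: %d\n",
               bits - zone_shift - node_shift);     /* prints 24 */
        return 0;
}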
--
All rights reversed.
^ permalink raw reply [flat|nested] 102+ messages in thread
* Re: [PATCH -mm 13/25] Noreclaim LRU Infrastructure
2008-06-08 23:54 ` Andrew Morton
2008-06-09 0:56 ` Rik van Riel
2008-06-09 2:58 ` Rik van Riel
@ 2008-06-10 19:17 ` Christoph Lameter
2008-06-10 19:37 ` Rik van Riel
2 siblings, 1 reply; 102+ messages in thread
From: Christoph Lameter @ 2008-06-10 19:17 UTC (permalink / raw)
To: Andrew Morton
Cc: Rik van Riel, linux-kernel, lee.schermerhorn, kosaki.motohiro,
linux-mm, eric.whitney
On Sun, 8 Jun 2008, Andrew Morton wrote:
> And it will take longer to get those problems sorted out if 32-bit
> machines aren't even compiling the new code in.
The problem would be smaller if we depended on
CONFIG_PAGEFLAGS_EXTENDED instead of 64 bit. Then only certain
32-bit NUMA/sparsemem configs could not do this, due to lack of page flags.
I did the pageflags rework in part because of Rik's project.
> ho hum. Can you remind us what problems this patchset actually
> addresses? Preferably in order of seriousness? (The [0/n] description
> told us about the implementation but forgot to tell us anything about
> what it was fixing). Because I guess we should have a think about
> alternative approaches.
It solves the livelock-while-reclaiming issues that we see more and more.
There are loads that have lots of unreclaimable pages. These are
frequently and uselessly scanned under memory pressure.
The larger the memory, the bigger the problem.
^ permalink raw reply [flat|nested] 102+ messages in thread
* Re: [PATCH -mm 13/25] Noreclaim LRU Infrastructure
2008-06-10 19:17 ` Christoph Lameter
@ 2008-06-10 19:37 ` Rik van Riel
2008-06-10 21:33 ` Andrew Morton
0 siblings, 1 reply; 102+ messages in thread
From: Rik van Riel @ 2008-06-10 19:37 UTC (permalink / raw)
To: Christoph Lameter
Cc: Andrew Morton, linux-kernel, lee.schermerhorn, kosaki.motohiro,
linux-mm, eric.whitney
On Tue, 10 Jun 2008 12:17:23 -0700 (PDT)
Christoph Lameter <clameter@sgi.com> wrote:
> On Sun, 8 Jun 2008, Andrew Morton wrote:
>
> > And it will take longer to get those problems sorted out if 32-bit
> > machines aren't even compiling the new code in.
>
> The problem is going to be less if we dependedn on
> CONFIG_PAGEFLAGS_EXTENDED instead of 64 bit. This means that only certain
> 32bit NUMA/sparsemem configs cannot do this due to lack of page flags.
>
> I did the pageflags rework in part because of Rik's project.
I think your pageflags work freed up a number of bits on 32
bit systems, unless someone compiles a 32 bit system with
support for 4 memory zones (2 bits ZONE_SHIFT) and 64 NUMA
nodes (6 bits NODE_SHIFT), in which case we should still
have 24 bits for flags.
Of course, having 64 NUMA nodes and a ZONE_SHIFT of 2 on
a 32 bit system is probably total insanity already. I
suspect very few people compile 32 bit with NUMA at all,
except if it is an architecture that uses DISCONTIGMEM
instead of zones, in which case ZONE_SHIFT is 0, which
will free up space too :)
--
All Rights Reversed
^ permalink raw reply [flat|nested] 102+ messages in thread
* Re: [PATCH -mm 13/25] Noreclaim LRU Infrastructure
2008-06-07 1:05 ` Andrew Morton
2008-06-08 20:34 ` Rik van Riel
@ 2008-06-10 20:09 ` Rik van Riel
1 sibling, 0 replies; 102+ messages in thread
From: Rik van Riel @ 2008-06-10 20:09 UTC (permalink / raw)
To: Andrew Morton
Cc: linux-kernel, lee.schermerhorn, kosaki.motohiro, linux-mm,
eric.whitney
On Fri, 6 Jun 2008 18:05:06 -0700
Andrew Morton <akpm@linux-foundation.org> wrote:
> > +config NORECLAIM_LRU
> > + bool "Add LRU list to track non-reclaimable pages (EXPERIMENTAL, 64BIT only)"
> > + depends on EXPERIMENTAL && 64BIT
> > + help
> > + Supports tracking of non-reclaimable pages off the [in]active lists
> > + to avoid excessive reclaim overhead on large memory systems. Pages
> > + may be non-reclaimable because: they are locked into memory, they
> > + are anonymous pages for which no swap space exists, or they are anon
> > + pages that are expensive to unmap [long anon_vma "related vma" list.]
>
> Aunt Tillie might be struggling with some of that.
I have now Aunt Tillified the description:
+++ linux-2.6.26-rc5-mm2/mm/Kconfig 2008-06-10 14:56:19.000000000 -0400
@@ -205,3 +205,13 @@ config NR_QUICK
config VIRT_TO_BUS
def_bool y
depends on !ARCH_NO_VIRT_TO_BUS
+
+config UNEVICTABLE_LRU
+ bool "Add LRU list to track non-evictable pages"
+ default y
+ help
+ Keeps unevictable pages off of the active and inactive pageout
+ lists, so kswapd will not waste CPU time or have its balancing
+ algorithms thrown off by scanning these pages. Selecting this
+ will use one page flag and increase the code size a little;
+ say Y unless you know what you are doing.
> Can we think of a new term which uniquely describes this new concept
> and use that, rather than flogging the old horse?
I have also switched to "unevictable".
> > +/**
> > + * add_page_to_noreclaim_list
> > + * @page: the page to be added to the noreclaim list
> > + *
> > + * Add page directly to its zone's noreclaim list. To avoid races with
> > + * tasks that might be making the page reclaimable while it's not on the
> > + * lru, we want to add the page while it's locked or otherwise "invisible"
> > + * to other tasks. This is difficult to do when using the pagevec cache,
> > + * so bypass that.
> > + */
>
> How does a task "make a page reclaimable"? munlock()? fsync()?
> exit()?
>
> Choice of terminology matters...
I have added a linuxdoc function description here and
amended the comment to specify the ways in which a task
can make a page evictable.
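Something along these lines (a sketch of the amended comment; the
exact wording is in the reposted patch):

	/**
	 * add_page_to_noreclaim_list - add a page directly to its zone's noreclaim list
	 * @page: the page to be added to the noreclaim list
	 *
	 * A task can make a page evictable again via munlock(), by unmapping
	 * the last VM_LOCKED vma through munmap()/exit(), via SHM_UNLOCK on
	 * a locked shm segment, or by adding swap space for anonymous pages.
	 * To avoid racing with those paths we add the page while it is locked
	 * or otherwise "invisible" to other tasks, bypassing the pagevec cache.
	 */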
> > + VM_BUG_ON(PageActive(page) || PageNoreclaim(page));
>
> If this ever triggers, you'll wish that it had been coded with two
> separate assertions.
Good catch. I separated these.
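That is, the combined assertion is now two:

	VM_BUG_ON(PageActive(page));
	VM_BUG_ON(PageNoreclaim(page));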
> > +/**
> > + * putback_lru_page
> > + * @page to be put back to appropriate lru list
> The kerneldoc function description is missing.
Added this one, as well as a few others that were missing.
> > + } else if (page_reclaimable(page, NULL)) {
> > + /*
> > + * For reclaimable pages, we can use the cache.
> > + * In event of a race, worst case is we end up with a
> > + * non-reclaimable page on [in]active list.
> > + * We know how to handle that.
> > + */
> > + lru += page_file_cache(page);
> > + lru_cache_add_lru(page, lru);
> > + mem_cgroup_move_lists(page, lru);
> <stares for a while>
>
> <penny drops>
>
> So THAT'S what the magical "return 2" is doing in page_file_cache()!
>
> <looks>
>
> OK, after all the patches are applied, the "2" becomes LRU_FILE and the
> enumeration of `enum lru_list' reflects that.
In most places I have turned this into a call to page_lru(page).
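For reference, a sketch of the enumeration and helper this converges
on (names follow the series; treat the exact layout as illustrative):

	#define LRU_BASE   0
	#define LRU_ACTIVE 1
	#define LRU_FILE   2	/* the former magic "2" from page_file_cache() */

	enum lru_list {
		LRU_INACTIVE_ANON = LRU_BASE,
		LRU_ACTIVE_ANON   = LRU_BASE + LRU_ACTIVE,
		LRU_INACTIVE_FILE = LRU_BASE + LRU_FILE,
		LRU_ACTIVE_FILE   = LRU_BASE + LRU_FILE + LRU_ACTIVE,
		NR_LRU_LISTS
	};

	/* page_lru() folds the active and anon/file state into a list index. */
	static inline enum lru_list page_lru(struct page *page)
	{
		enum lru_list lru = LRU_BASE;

		if (PageActive(page))
			lru += LRU_ACTIVE;
		lru += page_file_cache(page);	/* 0 for anon, LRU_FILE for file */
		return lru;
	}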
> > +static inline void cull_nonreclaimable_page(struct page *page)
> Did you check whether all these inlined functions really should have
> been inlined? Even ones like this are probably too large.
Turned this into just a "static void" and renamed it
to cull_unevictable_page.
> > + /*
> > + * Non-reclaimable pages shouldn't make it onto either the active
> > + * nor the inactive list. However, when doing lumpy reclaim of
> > + * higher order pages we can still run into them.
>
> I guess that something along the lines of "when this function is being
> called for lumpy reclaim we can still .." would be clearer.
+ /*
+ * When this function is being called for lumpy reclaim, we
+ * initially look into all LRU pages, active, inactive and
+ * unreclaimable; only give shrink_page_list evictable pages.
+ */
+ if (PageUnevictable(page))
+ return ret;
... on to the next patch!
--
All Rights Reversed
^ permalink raw reply [flat|nested] 102+ messages in thread
* Re: [PATCH -mm 16/25] SHM_LOCKED pages are non-reclaimable
2008-06-07 1:05 ` Andrew Morton
2008-06-07 5:21 ` KOSAKI Motohiro
@ 2008-06-10 21:03 ` Rik van Riel
2008-06-10 21:22 ` Lee Schermerhorn
1 sibling, 1 reply; 102+ messages in thread
From: Rik van Riel @ 2008-06-10 21:03 UTC (permalink / raw)
To: Andrew Morton; +Cc: linux-kernel, lee.schermerhorn, kosaki.motohiro
On Fri, 6 Jun 2008 18:05:14 -0700
Andrew Morton <akpm@linux-foundation.org> wrote:
> > While working with Nick Piggin's mlock patches,
>
> Change log refers to information which its reader has not got a hope
> of actually locating.
Fixed that, and renamed everything to "unevictable".
> > Use the AS_NORECLAIM flag to mark address_space of SHM_LOCKed
> > shared memory regions as non-reclaimable. Then these pages
> > will be culled off the normal LRU lists during vmscan.
>
> So I guess there's more justification for handling these pages in this
> manner, because someone could come along later and unlock them. But
> that isn't true of /dev/ram0 pages and ramfs pages, etc.
Bingo. Ramdisk and ramfs pages will never become evictable again,
while the pages in an SHM_LOCKED segment might.
> > +static void check_move_noreclaim_page(struct page *page, struct zone *zone)
> > +{
> > +
> > + ClearPageNoreclaim(page); /* for page_reclaimable() */
>
> Confused. Didn't we just lose track of our NR_NORECLAIM accounting?
>
> > + if (page_reclaimable(page, NULL)) {
> > + enum lru_list l = LRU_INACTIVE_ANON + page_file_cache(page);
> > + __dec_zone_state(zone, NR_NORECLAIM);
No, we decrement the zone's noreclaim count here if the page turns
out to be reclaimable after all.
> > + list_move(&page->lru, &zone->list[l]);
> > + __inc_zone_state(zone, NR_INACTIVE_ANON + l);
> > + } else {
> > + /*
> > + * rotate noreclaim list
> > + */
> > + SetPageNoreclaim(page);
> > + list_move(&page->lru, &zone->list[LRU_NORECLAIM]);
> > + }
> > +}
Or mark it unevictable again if it still is.
> > + * scan_mapping_noreclaim_pages - scan an address space for reclaimable pages
> > + * @mapping: struct address_space to scan for reclaimable pages
> > + *
> > + * Scan all pages in mapping. Check non-reclaimable pages for
> > + * reclaimability and move them to the appropriate zone lru list.
> > + */
> > +void scan_mapping_noreclaim_pages(struct address_space *mapping)
> > +{
> This function can spend fantastically large amounts of time under
> spin_lock_irq().
I'll leave it up to Lee and Kosaki-san to fix this, once
you have the cleaned up versions.
Fixing this now would just delay my other janitorial work on
this patch series...
--
All Rights Reversed
^ permalink raw reply [flat|nested] 102+ messages in thread
* Re: [PATCH -mm 17/25] Mlocked Pages are non-reclaimable
2008-06-10 3:31 ` Nick Piggin
2008-06-10 12:50 ` Rik van Riel
@ 2008-06-10 21:14 ` Rik van Riel
2008-06-10 21:43 ` Lee Schermerhorn
1 sibling, 1 reply; 102+ messages in thread
From: Rik van Riel @ 2008-06-10 21:14 UTC (permalink / raw)
To: Nick Piggin
Cc: Andrew Morton, linux-kernel, lee.schermerhorn, kosaki.motohiro,
linux-mm, eric.whitney
On Tue, 10 Jun 2008 05:31:30 +0200
Nick Piggin <npiggin@suse.de> wrote:
> If we eventually run out of page flags on 32 bit, then sure this might be
> one we could look at getting rid of. Once the code has proven itself.
Yes, after the code has proven stable, we can probably get
rid of the PG_mlocked bit and use only PG_unevictable to mark
these pages.
Lee, Kosaki-san, do you see any problem with that approach?
Is the PG_mlocked bit really necessary for non-debugging
purposes?
--
All Rights Reversed
^ permalink raw reply [flat|nested] 102+ messages in thread
* Re: [PATCH -mm 16/25] SHM_LOCKED pages are non-reclaimable
2008-06-10 21:03 ` Rik van Riel
@ 2008-06-10 21:22 ` Lee Schermerhorn
2008-06-10 21:49 ` Andrew Morton
0 siblings, 1 reply; 102+ messages in thread
From: Lee Schermerhorn @ 2008-06-10 21:22 UTC (permalink / raw)
To: Rik van Riel; +Cc: Andrew Morton, linux-kernel, kosaki.motohiro
On Tue, 2008-06-10 at 17:03 -0400, Rik van Riel wrote:
> On Fri, 6 Jun 2008 18:05:14 -0700
> Andrew Morton <akpm@linux-foundation.org> wrote:
>
> > > While working with Nick Piggin's mlock patches,
> >
> > Change log refers to information which its reader has not got a hope
> > of actually locating.
>
> Fixed that, and renamed everything to "unevictable".
So, we're making a global change of "[no[n]]reclaim[able]" =>
"[un]evictable"?
Shall I take a cut at renaming and updating the document once the code
renames are complete?
>
> > > Use the AS_NORECLAIM flag to mark address_space of SHM_LOCKed
> > > shared memory regions as non-reclaimable. Then these pages
> > > will be culled off the normal LRU lists during vmscan.
> >
> > So I guess there's more justification for handling these pages in this
> > manner, because someone could come along later and unlock them. But
> > that isn't true of /dev/ram0 pages and ramfs pages, etc.
>
> Bingo. Ramdisk and ramfs pages will never become evictable again,
> while the pages in an SHM_LOCKED segment might.
>
> > > +static void check_move_noreclaim_page(struct page *page, struct zone *zone)
> > > +{
> > > +
> > > + ClearPageNoreclaim(page); /* for page_reclaimable() */
> >
> > Confused. Didn't we just lose track of our NR_NORECLAIM accounting?
> >
> > > + if (page_reclaimable(page, NULL)) {
> > > + enum lru_list l = LRU_INACTIVE_ANON + page_file_cache(page);
> > > + __dec_zone_state(zone, NR_NORECLAIM);
>
> No, we decrement the zone count here if the page is indeed
> unevictable.
>
> > > + list_move(&page->lru, &zone->list[l]);
> > > + __inc_zone_state(zone, NR_INACTIVE_ANON + l);
> > > + } else {
> > > + /*
> > > + * rotate noreclaim list
> > > + */
> > > + SetPageNoreclaim(page);
> > > + list_move(&page->lru, &zone->list[LRU_NORECLAIM]);
> > > + }
> > > +}
>
> Or mark it unevictable again if it still is.
>
> > > + * scan_mapping_noreclaim_pages - scan an address space for reclaimable pages
> > > + * @mapping: struct address_space to scan for reclaimable pages
> > > + *
> > > + * Scan all pages in mapping. Check non-reclaimable pages for
> > > + * reclaimability and move them to the appropriate zone lru list.
> > > + */
> > > +void scan_mapping_noreclaim_pages(struct address_space *mapping)
> > > +{
>
> > This function can spend fantastically large amounts of time under
> > spin_lock_irq().
Yes, if we get a run of pages from the same zone [likely, I think],
we'll hold the lock over a full "batch" of PAGEVEC_SIZE [14] pages. I
haven't measured the hold time, but can do.
>
> I'll leave it up to Lee and Kosaki-san to fix this, once
> you have the cleaned up versions.
I could use some advice on the batch size. E.g., I could cycle the lock
for each page in the mapping, or choose a batch size somewhat less than
PAGEVEC_SIZE, but > 1. Thoughts? Is there a target "max hold time"?
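For concreteness, a minimal sketch of per-batch lock cycling (helper
names follow this series; the loop structure is illustrative):

	/*
	 * Scan one pagevec worth of pages, dropping the zone lru_lock
	 * whenever we cross a zone boundary, so no single lock hold spans
	 * more than one batch of PAGEVEC_SIZE pages.
	 */
	static void check_move_one_pvec(struct pagevec *pvec)
	{
		struct zone *zone = NULL;
		int i;

		for (i = 0; i < pagevec_count(pvec); i++) {
			struct page *page = pvec->pages[i];
			struct zone *pagezone = page_zone(page);

			if (pagezone != zone) {
				if (zone)
					spin_unlock_irq(&zone->lru_lock);
				zone = pagezone;
				spin_lock_irq(&zone->lru_lock);
			}
			check_move_noreclaim_page(page, zone);
		}
		if (zone)
			spin_unlock_irq(&zone->lru_lock);
	}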
Lee
^ permalink raw reply [flat|nested] 102+ messages in thread
* Re: [PATCH -mm 13/25] Noreclaim LRU Infrastructure
2008-06-10 19:37 ` Rik van Riel
@ 2008-06-10 21:33 ` Andrew Morton
2008-06-10 21:48 ` Andi Kleen
` (3 more replies)
0 siblings, 4 replies; 102+ messages in thread
From: Andrew Morton @ 2008-06-10 21:33 UTC (permalink / raw)
To: Rik van Riel
Cc: clameter, linux-kernel, lee.schermerhorn, kosaki.motohiro,
linux-mm, eric.whitney, Paul Mundt, Andi Kleen, Ingo Molnar,
Andy Whitcroft
On Tue, 10 Jun 2008 15:37:02 -0400
Rik van Riel <riel@redhat.com> wrote:
> On Tue, 10 Jun 2008 12:17:23 -0700 (PDT)
> Christoph Lameter <clameter@sgi.com> wrote:
>
> > On Sun, 8 Jun 2008, Andrew Morton wrote:
> >
> > > And it will take longer to get those problems sorted out if 32-bit
> > > machines aren't even compiling the new code in.
> >
> > The problem is going to be smaller if we depend on
> > CONFIG_PAGEFLAGS_EXTENDED instead of 64 bit. This means that only certain
> > 32-bit NUMA/sparsemem configs cannot do this due to lack of page flags.
> >
> > I did the pageflags rework in part because of Rik's project.
>
> I think your pageflags work freed up a number of bits on 32
> bit systems; even if someone compiles a 32 bit system with
> support for 4 memory zones (2 bits ZONE_SHIFT) and 64 NUMA
> nodes (6 bits NODE_SHIFT), we should still
> have 24 bits for flags.
>
> Of course, having 64 NUMA nodes and a ZONE_SHIFT of 2 on
> a 32 bit system is probably total insanity already. I
> suspect very few people compile 32 bit with NUMA at all,
> except if it is an architecture that uses DISCONTIGMEM
> instead of zones, in which case ZONE_SHIFT is 0, which
> will free up space too :)
Maybe it's time to bite the bullet and kill i386 NUMA support. afaik
it's just NUMAQ and a 2-node NUMAish machine which IBM made (as400?)
arch/sh uses NUMA for 32-bit, I believe. But I don't know what its
maximum node count is. The default for sh NODES_SHIFT is 3.
^ permalink raw reply [flat|nested] 102+ messages in thread
* Re: [PATCH -mm 17/25] Mlocked Pages are non-reclaimable
2008-06-10 21:14 ` Rik van Riel
@ 2008-06-10 21:43 ` Lee Schermerhorn
2008-06-10 21:57 ` Andrew Morton
2008-06-10 23:48 ` Rik van Riel
0 siblings, 2 replies; 102+ messages in thread
From: Lee Schermerhorn @ 2008-06-10 21:43 UTC (permalink / raw)
To: Rik van Riel
Cc: Nick Piggin, Andrew Morton, linux-kernel, kosaki.motohiro,
linux-mm, eric.whitney
On Tue, 2008-06-10 at 17:14 -0400, Rik van Riel wrote:
> On Tue, 10 Jun 2008 05:31:30 +0200
> Nick Piggin <npiggin@suse.de> wrote:
>
> > If we eventually run out of page flags on 32 bit, then sure this might be
> > one we could look at getting rid of. Once the code has proven itself.
>
> Yes, after the code has proven stable, we can probably get
> rid of the PG_mlocked bit and use only PG_unevictable to mark
> these pages.
>
> Lee, Kosaki-san, do you see any problem with that approach?
> Is the PG_mlocked bit really necessary for non-debugging
> purposes?
>
Well, it does speed up the check for mlocked pages in page_reclaimable()
[now page_evictable()?] as we don't have to walk the reverse map to
determine that a page is mlocked. In many places where we currently
test page_reclaimable(), we really don't want to and maybe can't walk
the reverse map.
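For reference, a minimal sketch of that fast path, assuming the
page_evictable() shape the series converges on:

	/*
	 * With PG_mlocked this is a couple of flag tests; without it,
	 * discovering an mlocked page would require a reverse-map walk
	 * over every vma mapping the page.
	 */
	static int page_evictable(struct page *page, struct vm_area_struct *vma)
	{
		if (mapping_unevictable(page_mapping(page)))
			return 0;	/* e.g. an SHM_LOCKED segment */
		if (PageMlocked(page) || (vma && is_mlocked_vma(vma, page)))
			return 0;
		return 1;
	}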
Unless you're envisioning even larger rework, the PG_unevictable flag
[formerly PG_noreclaim, right?] is analogous to PG_active. It's only
set when the page is on the corresponding lru list or being held
isolated from it, temporarily. See isolate_lru_page() and
putback_lru_page() and users thereof--such as mlock_vma_page(). Again,
I haven't seen the changes you're making here, so maybe that's all
changing. But, currently, PG_unevictable would not be a replacement for
PG_mlocked.
Anyway, let's see what you come up with before we tackle this.
Couple of related items:
+ 26-rc5-mm1 + a small fix to the double unlock_page() in
shrink_page_list() has been running for a couple of hours on my 32G,
16cpu ia64 numa platform w/o error. Seems to have survived the merge
into -mm, despite the issues Andrew has raised.
+ on same platform, Mel Gorman's mminit debug code is reporting that
we're using 22 page flags with Noreclaim, Mlock and PAGEFLAGS_EXTENDED
configured.
Lee
^ permalink raw reply [flat|nested] 102+ messages in thread
* Re: [PATCH -mm 13/25] Noreclaim LRU Infrastructure
2008-06-10 21:33 ` Andrew Morton
@ 2008-06-10 21:48 ` Andi Kleen
2008-06-10 22:05 ` Dave Hansen
` (2 subsequent siblings)
3 siblings, 0 replies; 102+ messages in thread
From: Andi Kleen @ 2008-06-10 21:48 UTC (permalink / raw)
To: Andrew Morton
Cc: Rik van Riel, clameter, linux-kernel, lee.schermerhorn,
kosaki.motohiro, linux-mm, eric.whitney, Paul Mundt, Ingo Molnar,
Andy Whitcroft
>
> Maybe it's time to bite the bullet and kill i386 NUMA support. afaik
> it's just NUMAQ and a 2-node NUMAish machine which IBM made (as400?)
Actually many more (most 64-bit NUMA systems can run 32-bit too); it just
doesn't work well because the code is not very good: undertested, many
bugs, weird design. In general, 32-bit NUMA has a lot of limitations
that make it a bad idea.
But you don't need to kill it only for this (although imho there are
lots of other good reasons). Just use a different way to look up the
node. Encoding it into the flags is just an optimization.
But a separate hash or similar would also work. It seemed like a good
idea back then.
In fact there's already a hash for this (the pa->node hash) that
can do it. It's just some more instructions and one more cache line
accessed, but since i386 NUMA is a fringe application
that doesn't seem like a big issue.
-Andi
^ permalink raw reply [flat|nested] 102+ messages in thread
* Re: [PATCH -mm 16/25] SHM_LOCKED pages are non-reclaimable
2008-06-10 21:22 ` Lee Schermerhorn
@ 2008-06-10 21:49 ` Andrew Morton
0 siblings, 0 replies; 102+ messages in thread
From: Andrew Morton @ 2008-06-10 21:49 UTC (permalink / raw)
To: Lee Schermerhorn; +Cc: riel, linux-kernel, kosaki.motohiro
On Tue, 10 Jun 2008 17:22:26 -0400
Lee Schermerhorn <Lee.Schermerhorn@hp.com> wrote:
> > Or mark it unevictable again if it still is.
> >
> > > > + * scan_mapping_noreclaim_pages - scan an address space for reclaimable pages
> > > > + * @mapping: struct address_space to scan for reclaimable pages
> > > > + *
> > > > + * Scan all pages in mapping. Check non-reclaimable pages for
> > > > + * reclaimability and move them to the appropriate zone lru list.
> > > > + */
> > > > +void scan_mapping_noreclaim_pages(struct address_space *mapping)
> > > > +{
> >
> > > This function can spend fantastically large amounts of time under
> > > spin_lock_irq().
>
> Yes, if we get a run of pages from the same zone [likely, I think],
> we'll hold the lock over a full "batch" of PAGEVEC_SIZE [14] pages. I
> haven't measured the hold time, but can do.
oh. I misread the code. Holding spin_lock_irq() across a single
pagevec should be OK. At least, we do that in lots of other places.
^ permalink raw reply [flat|nested] 102+ messages in thread
* Re: [PATCH -mm 17/25] Mlocked Pages are non-reclaimable
2008-06-10 21:43 ` Lee Schermerhorn
@ 2008-06-10 21:57 ` Andrew Morton
2008-06-11 16:01 ` Lee Schermerhorn
2008-06-10 23:48 ` Rik van Riel
1 sibling, 1 reply; 102+ messages in thread
From: Andrew Morton @ 2008-06-10 21:57 UTC (permalink / raw)
To: Lee Schermerhorn
Cc: riel, npiggin, linux-kernel, kosaki.motohiro, linux-mm,
eric.whitney
On Tue, 10 Jun 2008 17:43:17 -0400
Lee Schermerhorn <Lee.Schermerhorn@hp.com> wrote:
> Couple of related items:
>
> + 26-rc5-mm1 + a small fix to the double unlock_page() in
> shrink_page_list() has been running for a couple of hours on my 32G,
> 16cpu ia64 numa platform w/o error. Seems to have survived the merge
> into -mm, despite the issues Andrew has raised.
oh goody, thanks. Johannes's bootmem rewrite is holding up
surprisingly well.
gee test.kernel.org takes a long time.
> + on same platform, Mel Gorman's mminit debug code is reporting that
> we're using 22 page flags with Noreclaim, Mlock and PAGEFLAGS_EXTENDED
> configured.
what is "Mel Gorman's mminit debug code"?
^ permalink raw reply [flat|nested] 102+ messages in thread
* Re: [PATCH -mm 13/25] Noreclaim LRU Infrastructure
2008-06-10 21:33 ` Andrew Morton
2008-06-10 21:48 ` Andi Kleen
@ 2008-06-10 22:05 ` Dave Hansen
2008-06-11 5:09 ` Paul Mundt
2008-06-11 19:03 ` [PATCH -mm 13/25] Noreclaim LRU Infrastructure Andy Whitcroft
3 siblings, 0 replies; 102+ messages in thread
From: Dave Hansen @ 2008-06-10 22:05 UTC (permalink / raw)
To: Andrew Morton
Cc: Rik van Riel, clameter, linux-kernel, lee.schermerhorn,
kosaki.motohiro, linux-mm, eric.whitney, Paul Mundt, Andi Kleen,
Ingo Molnar, Andy Whitcroft
On Tue, 2008-06-10 at 14:33 -0700, Andrew Morton wrote:
> Maybe it's time to bite the bullet and kill i386 NUMA support. afaik
> it's just NUMAQ and a 2-node NUMAish machine which IBM made (as400?)
Yeah, IBM sold a couple of these "interesting" 32-bit NUMA machines:
https://www.redbooks.ibm.com/Redbooks.nsf/RedbookAbstracts/tips0267.html?Open
I think those maxed out at 8 nodes, ever. But, no distro ever turned
NUMA on for i386, so no one actually depends on it working. We do have
a bunch of systems that we use for testing and so forth. It'd be a
shame to make these suck *too* much. The NUMA-Q is probably also so
intertwined with CONFIG_NUMA that we'd likely never get it running
again.
I'd rather just bloat page->flags on these platforms or move the
sparsemem/zone/node bits elsewhere than kill NUMA support.
-- Dave
^ permalink raw reply [flat|nested] 102+ messages in thread
* Re: [PATCH -mm 17/25] Mlocked Pages are non-reclaimable
2008-06-10 21:43 ` Lee Schermerhorn
2008-06-10 21:57 ` Andrew Morton
@ 2008-06-10 23:48 ` Rik van Riel
2008-06-11 15:29 ` Lee Schermerhorn
1 sibling, 1 reply; 102+ messages in thread
From: Rik van Riel @ 2008-06-10 23:48 UTC (permalink / raw)
To: Lee Schermerhorn
Cc: Nick Piggin, Andrew Morton, linux-kernel, kosaki.motohiro,
linux-mm, eric.whitney
On Tue, 10 Jun 2008 17:43:17 -0400
Lee Schermerhorn <Lee.Schermerhorn@hp.com> wrote:
> On Tue, 2008-06-10 at 17:14 -0400, Rik van Riel wrote:
> > On Tue, 10 Jun 2008 05:31:30 +0200
> > Nick Piggin <npiggin@suse.de> wrote:
> >
> > > If we eventually run out of page flags on 32 bit, then sure this might be
> > > one we could look at getting rid of. Once the code has proven itself.
> >
> > Yes, after the code has proven stable, we can probably get
> > rid of the PG_mlocked bit and use only PG_unevictable to mark
> > these pages.
> >
> > Lee, Kosaki-san, do you see any problem with that approach?
> > Is the PG_mlocked bit really necessary for non-debugging
> > purposes?
>
> Well, it does speed up the check for mlocked pages in page_reclaimable()
> [now page_evictable()?] as we don't have to walk the reverse map to
> determine that a page is mlocked. In many places where we currently
> test page_reclaimable(), we really don't want to and maybe can't walk
> the reverse map.
There are a few places:
1) the pageout code, which calls page_referenced() anyway; we can
change page_referenced() to return PAGE_MLOCKED and do the right
thing from there
2) when the page is moved from a per-cpu pagevec onto an LRU list,
we may be able to simply skip the check there on the theory that
the pagevecs are small and the pageout code will eventually catch
these (few?) pages - actually, setting PG_noreclaim on a page
that is in a pagevec but not on an LRU list might catch that
Does that seem reasonable/possible?
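For 1), something like this (a hypothetical sketch; only PAGE_MLOCKED
is proposed above, the other names are purely illustrative):

	/* Let the rmap walk in page_referenced() report mlocked pages. */
	enum page_reference_result {
		PAGE_NOT_REFERENCED,
		PAGE_REFERENCED,
		PAGE_MLOCKED,	/* found a VM_LOCKED vma mapping the page */
	};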
--
All rights reversed.
^ permalink raw reply [flat|nested] 102+ messages in thread
* Re: [PATCH -mm 17/25] Mlocked Pages are non-reclaimable
2008-06-07 1:07 ` Andrew Morton
2008-06-07 5:38 ` KOSAKI Motohiro
2008-06-10 3:31 ` Nick Piggin
@ 2008-06-11 1:00 ` Rik van Riel
2 siblings, 0 replies; 102+ messages in thread
From: Rik van Riel @ 2008-06-11 1:00 UTC (permalink / raw)
To: Andrew Morton
Cc: linux-kernel, lee.schermerhorn, kosaki.motohiro, linux-mm,
eric.whitney, npiggin
On Fri, 6 Jun 2008 18:07:46 -0700
Andrew Morton <akpm@linux-foundation.org> wrote:
> On Fri, 06 Jun 2008 16:28:55 -0400
> Rik van Riel <riel@redhat.com> wrote:
> > Originally
> > From: Nick Piggin <npiggin@suse.de>
> >
> > Against: 2.6.26-rc2-mm1
> >
> > This patch:
> >
> > 1) defines the [CONFIG_]NORECLAIM_MLOCK sub-option and the
> > stub version of the mlock/noreclaim APIs when it's
> > not configured. Depends on [CONFIG_]NORECLAIM_LRU.
>
> Oh sob.
OK, I just removed CONFIG_NORECLAIM_MLOCK.
> > 2) add yet another page flag--PG_mlocked--to indicate that
> > the page is locked for efficient testing in vmscan and,
> > optionally, fault path. This allows early culling of
> > nonreclaimable pages, preventing them from getting to
> > page_referenced()/try_to_unmap(). Also allows separate
> > accounting of mlock'd pages, as Nick's original patch
> > did.
> >
> > Note: Nick's original mlock patch used a PG_mlocked
> > flag. I had removed this in favor of the PG_noreclaim
> > flag + an mlock_count [new page struct member]. I
> > restored the PG_mlocked flag to eliminate the new
> > count field.
>
> How many page flags are left?
Depends on what CONFIG_ZONE_SHIFT and CONFIG_NODE_SHIFT
are set to.
I suspect we'll be able to get rid of the PG_mlocked page
flag in the future, since mlock is just one reason for
the page being PG_noreclaim.
> > +/*
> > + * mlock all pages in this vma range. For mmap()/mremap()/...
> > + */
> > +extern int mlock_vma_pages_range(struct vm_area_struct *vma,
> > + unsigned long start, unsigned long end);
> > +
> > +/*
> > + * munlock all pages in vma. For munmap() and exit().
> > + */
> > +extern void munlock_vma_pages_all(struct vm_area_struct *vma);
>
> I don't think it's desirable that interfaces be documented in two
> places. The documentation which you have at the definition site is
> more complete than this, and is at the place where people will expect
> to find it.
I removed these comments.
> > + if (!isolate_lru_page(page)) {
> > + putback_lru_page(page);
> > + } else {
> > + /*
> > + * Try hard not to leak this page ...
> > + */
> > + lru_add_drain_all();
> > + if (!isolate_lru_page(page))
> > + putback_lru_page(page);
> > + }
> > +}
>
> When I review code I often come across stuff which I don't understand
> (at least, which I don't understand sufficiently easily). So I'll ask
> questions, and I do think the best way in which those questions should
> be answered is by adding a code comment to fix the problem for ever.
if (!isolate_lru_page(page)) {
putback_lru_page(page);
} else {
/*
* Page not on the LRU yet. Flush all pagevecs and retry.
*/
lru_add_drain_all();
if (!isolate_lru_page(page))
putback_lru_page(page);
}
> If I _am_ right, and if the isolate_lru_page() _did_ fail (and under
> what circumstances?) then... what? We now have a page which is on an
> inappropriate LRU? Why is this OK? Do we handle it elsewhere? How?
It is OK because we will run into the page later on in the pageout
code, detect that the page is unevictable and move it to the
unevictable LRU.
> > +/*
> > + * called from munlock()/munmap() path with page supposedly on the LRU.
> > + *
> > + * Note: unlike mlock_vma_page(), we can't just clear the PageMlocked
> > + * [in try_to_unlock()] and then attempt to isolate the page. We must
> > + * isolate the page() to keep others from messing with its noreclaim
>
> page()?
Fixed.
> > + * and mlocked state while trying to unlock. However, we pre-clear the
>
> "unlock"? (See exhasperated comment against try_to_unlock(), below)
Renamed that one to try_to_munlock() and adjusted all the callers and
comments.
> > +static int __mlock_vma_pages_range(struct vm_area_struct *vma,
> > + unsigned long start, unsigned long end)
> > +{
> > + ret = get_user_pages(current, mm, addr,
> > + min_t(int, nr_pages, ARRAY_SIZE(pages)),
> > + write, 0, pages, NULL);
>
> Doesn't mlock already do a make_pages_present(), or did that get
> removed and moved to here?
make_pages_present does not work right for PROT_NONE and does
not add pages to the unevictable LRU. Now that we have a
separate function for unlocking, we may be able to just add
a few lines to make_pages_present and use that again.
Also, make_pages_present works on some other types of VMAs
that this code does not work on. I do not know whether
merging this with make_pages_present would make things
cleaner or uglier.
Lee? Kosaki-san? Either of you interested in investigating
this after Andrew has the patches merged with the fast cleanups
that I'm doing now?
> > + if ((vma->vm_flags & (VM_DONTEXPAND | VM_RESERVED)) ||
> > + is_vm_hugetlb_page(vma) ||
> > + vma == get_gate_vma(current))
> > + goto make_present;
> > +
> > + return __mlock_vma_pages_range(vma, start, end);
>
> Invert the `if' expression, remove the goto?
Done, thanks.
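That is, roughly (a sketch of the inverted form):

	if (!((vma->vm_flags & (VM_DONTEXPAND | VM_RESERVED)) ||
	      is_vm_hugetlb_page(vma) ||
	      vma == get_gate_vma(current)))
		return __mlock_vma_pages_range(vma, start, end);

	/* fall through to make_pages_present() for the special vmas */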
> > +/**
> > + * try_to_unlock - Check page's rmap for other vma's holding page locked.
> > + * @page: the page to be unlocked. will be returned with PG_mlocked
> > + * cleared if no vmas are VM_LOCKED.
>
> I think kerneldoc will barf over the newline in @page's description.
Cleaned this up.
> > + * Return values are:
> > + *
> > + * SWAP_SUCCESS - no vma's holding page locked.
> > + * SWAP_AGAIN - page mapped in mlocked vma -- couldn't acquire mmap sem
> > + * SWAP_MLOCK - page is now mlocked.
> > + */
> > +int try_to_unlock(struct page *page)
> > +{
> > + VM_BUG_ON(!PageLocked(page) || PageLRU(page));
> > +
> > + if (PageAnon(page))
> > + return try_to_unmap_anon(page, 1, 0);
> > + else
> > + return try_to_unmap_file(page, 1, 0);
> > +}
> > +#endif
>
> OK, this function is clear as mud. My first reaction was "what's wrong
> with just doing unlock_page()?". The term "unlock" is waaaaaaaaaaay
> overloaded in this context and its use here was an awful decision.
>
> Can we please come up with a more specific name and add some comments
> which give the reader some chance of working out what it is that is
> actually being unlocked?
try_to_munlock - I have fixed the documentation for this function too
> > ...
> >
> > @@ -652,7 +652,6 @@ again: remove_next = 1 + (end > next->
> > * If the vma has a ->close operation then the driver probably needs to release
> > * per-vma resources, so we don't attempt to merge those.
> > */
> > -#define VM_SPECIAL (VM_IO | VM_DONTEXPAND | VM_RESERVED | VM_PFNMAP)
> >
> > static inline int is_mergeable_vma(struct vm_area_struct *vma,
> > struct file *file, unsigned long vm_flags)
>
> hm, so the old definition of VM_SPECIAL managed to wedge itself between
> is_mergeable_vma() and is_mergeable_vma()'s comment. Had me confused
> there.
>
> pls remove the blank line between the comment and the start of
> is_mergeable_vma() so people don't go sticking more things in there.
Done.
--
All rights reversed.
^ permalink raw reply [flat|nested] 102+ messages in thread
* Re: [PATCH -mm 13/25] Noreclaim LRU Infrastructure
2008-06-10 21:33 ` Andrew Morton
2008-06-10 21:48 ` Andi Kleen
2008-06-10 22:05 ` Dave Hansen
@ 2008-06-11 5:09 ` Paul Mundt
2008-06-11 6:16 ` Andrew Morton
2008-06-11 19:03 ` [PATCH -mm 13/25] Noreclaim LRU Infrastructure Andy Whitcroft
3 siblings, 1 reply; 102+ messages in thread
From: Paul Mundt @ 2008-06-11 5:09 UTC (permalink / raw)
To: Andrew Morton
Cc: Rik van Riel, clameter, linux-kernel, lee.schermerhorn,
kosaki.motohiro, linux-mm, eric.whitney, Andi Kleen, Ingo Molnar,
Andy Whitcroft
On Tue, Jun 10, 2008 at 02:33:34PM -0700, Andrew Morton wrote:
> On Tue, 10 Jun 2008 15:37:02 -0400
> Rik van Riel <riel@redhat.com> wrote:
>
> > On Tue, 10 Jun 2008 12:17:23 -0700 (PDT)
> > Christoph Lameter <clameter@sgi.com> wrote:
> >
> > > On Sun, 8 Jun 2008, Andrew Morton wrote:
> > >
> > > > And it will take longer to get those problems sorted out if 32-bit
> > > > machines aren't even compiling the new code in.
> > >
> > > The problem is going to be smaller if we depend on
> > > CONFIG_PAGEFLAGS_EXTENDED instead of 64 bit. This means that only certain
> > > 32-bit NUMA/sparsemem configs cannot do this due to lack of page flags.
> > >
> > > I did the pageflags rework in part because of Rik's project.
> >
> > I think your pageflags work freed up a number of bits on 32
> > bit systems; even if someone compiles a 32 bit system with
> > support for 4 memory zones (2 bits ZONE_SHIFT) and 64 NUMA
> > nodes (6 bits NODE_SHIFT), we should still
> > have 24 bits for flags.
> >
> > Of course, having 64 NUMA nodes and a ZONE_SHIFT of 2 on
> > a 32 bit system is probably total insanity already. I
> > suspect very few people compile 32 bit with NUMA at all,
> > except if it is an architecture that uses DISCONTIGMEM
> > instead of zones, in which case ZONE_SHIFT is 0, which
> > will free up space too :)
>
> Maybe it's time to bite the bullet and kill i386 NUMA support. afaik
> it's just NUMAQ and a 2-node NUMAish machine which IBM made (as400?)
>
> arch/sh uses NUMA for 32-bit, I believe. But I don't know what its
> maximum node count is. The default for sh NODES_SHIFT is 3.
In terms of memory nodes, systems vary from 2 up to 16 or so. It gets
gradually more complex in the SMP cases where we are 3-4 levels deep in
various types of memories that we expose as nodes (ie, 4-8 CPUs with a
dozen different memories or so at various interconnect levels).
As far as testing goes, it's part of the regular build and regression
testing for a number of boards, which we verify on a daily basis
(although admittedly -mm gets far less testing, even though that's where
most of the churn in this area tends to be).
^ permalink raw reply [flat|nested] 102+ messages in thread
* Re: [PATCH -mm 13/25] Noreclaim LRU Infrastructure
2008-06-11 5:09 ` Paul Mundt
@ 2008-06-11 6:16 ` Andrew Morton
2008-06-11 6:29 ` Paul Mundt
` (2 more replies)
0 siblings, 3 replies; 102+ messages in thread
From: Andrew Morton @ 2008-06-11 6:16 UTC (permalink / raw)
To: Paul Mundt
Cc: Rik van Riel, clameter, linux-kernel, lee.schermerhorn,
kosaki.motohiro, linux-mm, eric.whitney, Andi Kleen, Ingo Molnar,
Andy Whitcroft
On Wed, 11 Jun 2008 14:09:15 +0900 Paul Mundt <lethal@linux-sh.org> wrote:
> On Tue, Jun 10, 2008 at 02:33:34PM -0700, Andrew Morton wrote:
> > On Tue, 10 Jun 2008 15:37:02 -0400
> > Rik van Riel <riel@redhat.com> wrote:
> >
> > > On Tue, 10 Jun 2008 12:17:23 -0700 (PDT)
> > > Christoph Lameter <clameter@sgi.com> wrote:
> > >
> > > > On Sun, 8 Jun 2008, Andrew Morton wrote:
> > > >
> > > > > And it will take longer to get those problems sorted out if 32-bit
> > > > > machines aren't even compiling the new code in.
> > > >
> > > > The problem is going to be smaller if we depend on
> > > > CONFIG_PAGEFLAGS_EXTENDED instead of 64 bit. This means that only certain
> > > > 32-bit NUMA/sparsemem configs cannot do this due to lack of page flags.
> > > >
> > > > I did the pageflags rework in part because of Rik's project.
> > >
> > > I think your pageflags work freed up a number of bits on 32
> > > bit systems; even if someone compiles a 32 bit system with
> > > support for 4 memory zones (2 bits ZONE_SHIFT) and 64 NUMA
> > > nodes (6 bits NODE_SHIFT), we should still
> > > have 24 bits for flags.
> > >
> > > Of course, having 64 NUMA nodes and a ZONE_SHIFT of 2 on
> > > a 32 bit system is probably total insanity already. I
> > > suspect very few people compile 32 bit with NUMA at all,
> > > except if it is an architecture that uses DISCONTIGMEM
> > > instead of zones, in which case ZONE_SHIFT is 0, which
> > > will free up space too :)
> >
> > Maybe it's time to bite the bullet and kill i386 NUMA support. afaik
> > it's just NUMAQ and a 2-node NUMAish machine which IBM made (as400?)
> >
> > arch/sh uses NUMA for 32-bit, I believe. But I don't know what its
> > maximum node count is. The default for sh NODES_SHIFT is 3.
>
> In terms of memory nodes, systems vary from 2 up to 16 or so. It gets
> gradually more complex in the SMP cases where we are 3-4 levels deep in
> various types of memories that we expose as nodes (ie, 4-8 CPUs with a
> dozen different memories or so at various interconnect levels).
Thanks.
Andi has suggested that we can remove the node-ID encoding from
page.flags on x86 because that info is available elsewhere, although a
bit more slowly.
<looks at page_zone(), wonders whether we care about performance anyway>
There wouldn't be much point in doing that unless we did it for all
32-bit architectures. How much trouble would it cause sh?
> As far as testing goes, it's part of the regular build and regression
> testing for a number of boards, which we verify on a daily basis
> (although admittedly -mm gets far less testing, even though that's where
> most of the churn in this area tends to be).
Oh well, that's what -rc is for :(
It would be good if someone over there could start testing linux-next.
Once I get my act together that will include most-of-mm anyway.
^ permalink raw reply [flat|nested] 102+ messages in thread
* Re: [PATCH -mm 13/25] Noreclaim LRU Infrastructure
2008-06-11 6:16 ` Andrew Morton
@ 2008-06-11 6:29 ` Paul Mundt
2008-06-11 12:06 ` Andi Kleen
2008-06-11 14:09 ` Removing node flags from page->flags was Re: [PATCH -mm 13/25] Noreclaim LRU Infrastructure II Andi Kleen
2 siblings, 0 replies; 102+ messages in thread
From: Paul Mundt @ 2008-06-11 6:29 UTC (permalink / raw)
To: Andrew Morton
Cc: Rik van Riel, clameter, linux-kernel, lee.schermerhorn,
kosaki.motohiro, linux-mm, eric.whitney, Andi Kleen, Ingo Molnar,
Andy Whitcroft
On Tue, Jun 10, 2008 at 11:16:42PM -0700, Andrew Morton wrote:
> On Wed, 11 Jun 2008 14:09:15 +0900 Paul Mundt <lethal@linux-sh.org> wrote:
> > On Tue, Jun 10, 2008 at 02:33:34PM -0700, Andrew Morton wrote:
> > > Maybe it's time to bite the bullet and kill i386 NUMA support. afaik
> > > it's just NUMAQ and a 2-node NUMAish machine which IBM made (as400?)
> > >
> > > arch/sh uses NUMA for 32-bit, I believe. But I don't know what its
> > > maximum node count is. The default for sh NODES_SHIFT is 3.
> >
> > In terms of memory nodes, systems vary from 2 up to 16 or so. It gets
> > gradually more complex in the SMP cases where we are 3-4 levels deep in
> > various types of memories that we expose as nodes (ie, 4-8 CPUs with a
> > dozen different memories or so at various interconnect levels).
>
> Thanks.
>
> Andi has suggested that we can remove the node-ID encoding from
> page.flags on x86 because that info is available elsewhere, although a
> bit more slowly.
>
> <looks at page_zone(), wonders whether we care about performance anyway>
>
> There wouldn't be much point in doing that unless we did it for all
> 32-bit architectures. How much trouble would it cause sh?
>
At first glance I don't think that should be too bad. We only do NUMA
through sparsemem anyways, and we have pretty much no overlap in any of
the ranges, so simply setting NODE_NOT_IN_PAGE_FLAGS should be ok there.
Given the relatively small number of pages we have, the added cost of
page_to_nid() referencing section_to_node_table should still be
tolerable. I'll give it a go and see what the numbers look like.
> > As far as testing goes, it's part of the regular build and regression
> > testing for a number of boards, which we verify on a daily basis
> > (although admittedly -mm gets far less testing, even though that's where
> > most of the churn in this area tends to be).
>
> Oh well, that's what -rc is for :(
>
> It would be good if someone over there could start testing linux-next.
> Once I get my act together that will include most-of-mm anyway.
>
Agreed. This is something we're attempting to add in to our automated
testing at present.
^ permalink raw reply [flat|nested] 102+ messages in thread
* Re: [PATCH -mm 13/25] Noreclaim LRU Infrastructure
2008-06-11 6:16 ` Andrew Morton
2008-06-11 6:29 ` Paul Mundt
@ 2008-06-11 12:06 ` Andi Kleen
2008-06-11 14:09 ` Removing node flags from page->flags was Re: [PATCH -mm 13/25] Noreclaim LRU Infrastructure II Andi Kleen
2 siblings, 0 replies; 102+ messages in thread
From: Andi Kleen @ 2008-06-11 12:06 UTC (permalink / raw)
To: Andrew Morton
Cc: Paul Mundt, Rik van Riel, clameter, linux-kernel,
lee.schermerhorn, kosaki.motohiro, linux-mm, eric.whitney,
Ingo Molnar, Andy Whitcroft
> Andi has suggested that we can remove the node-ID encoding from
> page.flags on x86 because that info is available elsewhere, although a
> bit more slowly.
>
> <looks at page_zone(), wonders whether we care about performance anyway>
It would be just pfn_to_nid(page_pfn(page)) for 32bit && CONFIG_NUMA.
-sh should have that too.
Only trouble is that it needs some reordering because right now page_pfn
is not defined early enough.
> There wouldn't be much point in doing that unless we did it for all
> 32-bit architectures. How much trouble would it cause sh?
Probably very little from a quick look at the source.
-Andi
^ permalink raw reply [flat|nested] 102+ messages in thread
* Removing node flags from page->flags was Re: [PATCH -mm 13/25] Noreclaim LRU Infrastructure II
2008-06-11 6:16 ` Andrew Morton
2008-06-11 6:29 ` Paul Mundt
2008-06-11 12:06 ` Andi Kleen
@ 2008-06-11 14:09 ` Andi Kleen
2 siblings, 0 replies; 102+ messages in thread
From: Andi Kleen @ 2008-06-11 14:09 UTC (permalink / raw)
To: Andrew Morton
Cc: Paul Mundt, Rik van Riel, clameter, linux-kernel,
lee.schermerhorn, kosaki.motohiro, linux-mm, eric.whitney,
Ingo Molnar, Andy Whitcroft
After some contemplation I don't think we need to do anything for this.
Just add more page flags. The ifdef jungle in mm.h should handle it already.
#if SECTIONS_WIDTH+ZONES_WIDTH+NODES_SHIFT <= BITS_PER_LONG - NR_PAGEFLAGS
#define NODES_WIDTH NODES_SHIFT
#else
#ifdef CONFIG_SPARSEMEM_VMEMMAP
#error "Vmemmap: No space for nodes field in page flags"
#endif
#define NODES_WIDTH 0
#endif
[btw the vmemmap case could be handled easily too by going through
the zone, but it's not used on 32bit]
and then
#if !(NODES_WIDTH > 0 || NODES_SHIFT == 0)
#define NODE_NOT_IN_PAGE_FLAGS
#endif
and then
#ifdef NODE_NOT_IN_PAGE_FLAGS
extern int page_to_nid(struct page *page);
#else
static inline int page_to_nid(struct page *page)
{
return (page->flags >> NODES_PGSHIFT) & NODES_MASK;
}
#endif
and the sparse.c page_to_nid does a hash lookup.
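For reference, the sparse.c side amounts to something like this (a
sketch from memory):

	#ifdef NODE_NOT_IN_PAGE_FLAGS
	/* filled in as sections are initialized at boot */
	static u8 section_to_node_table[NR_MEM_SECTIONS];

	int page_to_nid(struct page *page)
	{
		return section_to_node_table[page_to_section(page)];
	}
	#endif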
So if NR_PAGEFLAGS is big enough it should work.
-Andi
^ permalink raw reply [flat|nested] 102+ messages in thread
* Re: [PATCH -mm 17/25] Mlocked Pages are non-reclaimable
2008-06-10 23:48 ` Rik van Riel
@ 2008-06-11 15:29 ` Lee Schermerhorn
0 siblings, 0 replies; 102+ messages in thread
From: Lee Schermerhorn @ 2008-06-11 15:29 UTC (permalink / raw)
To: Rik van Riel
Cc: Nick Piggin, Andrew Morton, linux-kernel, kosaki.motohiro,
linux-mm, eric.whitney
On Tue, 2008-06-10 at 19:48 -0400, Rik van Riel wrote:
> On Tue, 10 Jun 2008 17:43:17 -0400
> Lee Schermerhorn <Lee.Schermerhorn@hp.com> wrote:
>
> > On Tue, 2008-06-10 at 17:14 -0400, Rik van Riel wrote:
> > > On Tue, 10 Jun 2008 05:31:30 +0200
> > > Nick Piggin <npiggin@suse.de> wrote:
> > >
> > > > If we eventually run out of page flags on 32 bit, then sure this might be
> > > > one we could look at getting rid of. Once the code has proven itself.
> > >
> > > Yes, after the code has proven stable, we can probably get
> > > rid of the PG_mlocked bit and use only PG_unevictable to mark
> > > these pages.
> > >
> > > Lee, Kosaki-san, do you see any problem with that approach?
> > > Is the PG_mlocked bit really necessary for non-debugging
> > > purposes?
> >
> > Well, it does speed up the check for mlocked pages in page_reclaimable()
> > [now page_evictable()?] as we don't have to walk the reverse map to
> > determine that a page is mlocked. In many places where we currently
> > test page_reclaimable(), we really don't want to and maybe can't walk
> > the reverse map.
>
> There are a few places:
> 1) the pageout code, which calls page_referenced() anyway; we can
> change page_referenced() to return PAGE_MLOCKED and do the right
> thing from there
In vmscan, true. try_to_unmap() will catch it too. By then, we'll have
let the page ride through the active list to the inactive list and won't
catch it until shrink_page_list(). But, this only happens once per page
and then it's hidden on the nor^H^H^Hunevictable list.
We might want to kill the "cull in fault path" patch, tho'.
> 2) when the page is moved from a per-cpu pagevec onto an LRU list,
> we may be able to simply skip the check there on the theory that
> the pagevecs are small and the pageout code will eventually catch
> these (few?) pages - actually, setting PG_noreclaim on a page
> that is in a pagevec but not on an LRU list might catch that
>
> Does that seem reasonable/possible?
Not sure. The most recent patches that I posted do not use the pagevec
for the noreclaim/unevictable list. They put nonreclaimable/unevictable
pages directly onto the noreclaim/unevictable list to avoid race
conditions that could strand a page. Kosaki-san and I spent a lot of
time analyzing and testing the current code for potential page leaks
onto the noreclaim/unevictable list. It currently depends on the atomic
TestSet/TestClear of the PG_mlocked bit, along with page lock and lru
isolation/putback to resolve all of the potential races. I attempted to
describe this aspect in the doc. Have to rethink all of that.
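For the flavor of it, a sketch of mlock_vma_page() along the lines of
the posted patches (statistics and event counters omitted):

	/*
	 * Called with the page locked. The atomic TestSet is what keeps
	 * two mlockers, or an mlocker racing a munlocker, from double
	 * accounting or stranding the page off every LRU list.
	 */
	void mlock_vma_page(struct page *page)
	{
		BUG_ON(!PageLocked(page));

		if (!TestSetPageMlocked(page)) {
			if (!isolate_lru_page(page))
				putback_lru_page(page);	/* lands on the unevictable list */
		}
	}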
>
^ permalink raw reply [flat|nested] 102+ messages in thread
* Re: [PATCH -mm 17/25] Mlocked Pages are non-reclaimable
2008-06-10 21:57 ` Andrew Morton
@ 2008-06-11 16:01 ` Lee Schermerhorn
0 siblings, 0 replies; 102+ messages in thread
From: Lee Schermerhorn @ 2008-06-11 16:01 UTC (permalink / raw)
To: Andrew Morton
Cc: riel, npiggin, linux-kernel, kosaki.motohiro, linux-mm,
eric.whitney
On Tue, 2008-06-10 at 14:57 -0700, Andrew Morton wrote:
> On Tue, 10 Jun 2008 17:43:17 -0400
> Lee Schermerhorn <Lee.Schermerhorn@hp.com> wrote:
>
> > Couple of related items:
> >
> > + 26-rc5-mm1 + a small fix to the double unlock_page() in
> > shrink_page_list() has been running for a couple of hours on my 32G,
> > 16cpu ia64 numa platform w/o error. Seems to have survived the merge
> > into -mm, despite the issues Andrew has raised.
>
> oh goody, thanks.
I should have mentioned that it's running a fairly heavy stress load to
exercise the vm scalability changes. Lots of IO, page cache activity,
swapping, mlocking and shmlocking various sized regions, up to 16GB on
32GB machine, migrating of mlocked/shmlocked segments between
nodes, ... So far today, the load has been up for ~19.5 hours with no
errors, no softlockups, no oom-kills or such.
> Johannes's bootmem rewrite is holding up
> surprisingly well.
Well, I am seeing a lot of "potential offnode page_structs" messages for
our funky cache-line interleaved pseudo-node. I had to limit the prints
to get it to boot at all. Still investigating. Looks like slub can't allocate
its initial per node data on that node either.
>
> gee test.kernel.org takes a long time.
>
> > + on same platform, Mel Gorman's mminit debug code is reporting that
> > we're using 22 page flags with Noreclaim, Mlock and PAGEFLAGS_EXTENDED
> > configured.
>
> what is "Mel Gorman's mminit debug code"?
mminit_loglevel={0|1|2} [I use 3 :)] shows page flag layout, zone
lists, ...
^ permalink raw reply [flat|nested] 102+ messages in thread
* Re: [PATCH -mm 13/25] Noreclaim LRU Infrastructure
2008-06-10 21:33 ` Andrew Morton
` (2 preceding siblings ...)
2008-06-11 5:09 ` Paul Mundt
@ 2008-06-11 19:03 ` Andy Whitcroft
2008-06-11 20:52 ` Andi Kleen
2008-06-11 23:25 ` Christoph Lameter
3 siblings, 2 replies; 102+ messages in thread
From: Andy Whitcroft @ 2008-06-11 19:03 UTC (permalink / raw)
To: Andrew Morton
Cc: Rik van Riel, clameter, linux-kernel, lee.schermerhorn,
kosaki.motohiro, linux-mm, eric.whitney, Paul Mundt, Andi Kleen,
Ingo Molnar
On Tue, Jun 10, 2008 at 02:33:34PM -0700, Andrew Morton wrote:
> On Tue, 10 Jun 2008 15:37:02 -0400
> Rik van Riel <riel@redhat.com> wrote:
>
> > On Tue, 10 Jun 2008 12:17:23 -0700 (PDT)
> > Christoph Lameter <clameter@sgi.com> wrote:
> >
> > > On Sun, 8 Jun 2008, Andrew Morton wrote:
> > >
> > > > And it will take longer to get those problems sorted out if 32-bit
> > > > machines aren't even compiling the new code in.
> > >
> > > The problem is going to be smaller if we depend on
> > > CONFIG_PAGEFLAGS_EXTENDED instead of 64 bit. This means that only certain
> > > 32-bit NUMA/sparsemem configs cannot do this due to lack of page flags.
> > >
> > > I did the pageflags rework in part because of Rik's project.
> >
> > I think your pageflags work freed up a number of bits on 32
> > bit systems; even if someone compiles a 32 bit system with
> > support for 4 memory zones (2 bits ZONE_SHIFT) and 64 NUMA
> > nodes (6 bits NODE_SHIFT), we should still
> > have 24 bits for flags.
> >
> > Of course, having 64 NUMA nodes and a ZONE_SHIFT of 2 on
> > a 32 bit system is probably total insanity already. I
> > suspect very few people compile 32 bit with NUMA at all,
> > except if it is an architecture that uses DISCONTIGMEM
> > instead of zones, in which case ZONE_SHIFT is 0, which
> > will free up space too :)
>
> Maybe it's time to bite the bullet and kill i386 NUMA support. afaik
> it's just NUMAQ and a 2-node NUMAish machine which IBM made (as400?)
>
> arch/sh uses NUMA for 32-bit, I believe. But I don't know what its
> maximum node count is. The default for sh NODES_SHIFT is 3.
I think we can say that although NUMAQ can have up to 64 NUMA nodes,
I don't think any machines with more than 4 nodes are left. From
the other discussion it sounds like we have a maximum of 8 nodes on
other sub-arches. So it would not be unreasonable to reduce the shift
to 3, which might allow us to reduce the size of the reserve.
The problem will come with SPARSEMEM as that stores the section number
in the reserved field. Which can mean we need the whole reserve, and
there is currently no simple way to remove that.
I have been wondering whether we could make more use of the dynamic
nature of the page bits. As bits only need to exist when used, we
could consider letting the page flags grow to 64 bits if necessary.
However, at a quick count we are still only using about 19 bits, and if
memory serves we have 23/24 after the reserve on 32 bit.
-apw
^ permalink raw reply [flat|nested] 102+ messages in thread
* Re: [PATCH -mm 13/25] Noreclaim LRU Infrastructure
2008-06-11 19:03 ` [PATCH -mm 13/25] Noreclaim LRU Infrastructure Andy Whitcroft
@ 2008-06-11 20:52 ` Andi Kleen
2008-06-11 23:25 ` Christoph Lameter
1 sibling, 0 replies; 102+ messages in thread
From: Andi Kleen @ 2008-06-11 20:52 UTC (permalink / raw)
To: Andy Whitcroft
Cc: Andrew Morton, Rik van Riel, clameter, linux-kernel,
lee.schermerhorn, kosaki.motohiro, linux-mm, eric.whitney,
Paul Mundt, Ingo Molnar
> The problem will come with SPARSEMEM as that stores the section number
> in the reserved field. Which can mean we need the whole reserve, and
> there is currently no simple way to remove that.
Why do you need that many sections on i386?
-Andi
^ permalink raw reply [flat|nested] 102+ messages in thread
* Re: [PATCH -mm 13/25] Noreclaim LRU Infrastructure
2008-06-11 19:03 ` [PATCH -mm 13/25] Noreclaim LRU Infrastructure Andy Whitcroft
2008-06-11 20:52 ` Andi Kleen
@ 2008-06-11 23:25 ` Christoph Lameter
1 sibling, 0 replies; 102+ messages in thread
From: Christoph Lameter @ 2008-06-11 23:25 UTC (permalink / raw)
To: Andy Whitcroft
Cc: Andrew Morton, Rik van Riel, linux-kernel, lee.schermerhorn,
kosaki.motohiro, linux-mm, eric.whitney, Paul Mundt, Andi Kleen,
Ingo Molnar
On Wed, 11 Jun 2008, Andy Whitcroft wrote:
> I think we can say that although NUMAQ can have up to 64 NUMA nodes,
> I don't think any machines with more than 4 nodes are left. From
> the other discussion it sounds like we have a maximum of 8 nodes on
> other sub-arches. So it would not be unreasonable to reduce the shift
> to 3, which might allow us to reduce the size of the reserve.
>
> The problem will come with SPARSEMEM as that stores the section number
> in the reserved field. Which can mean we need the whole reserve, and
> there is currently no simple way to remove that.
But in that case we can use the section number to look up the node number.
That is done automatically if there are too many page flags to still
fit the node in page->flags.
^ permalink raw reply [flat|nested] 102+ messages in thread
end of thread
Thread overview: 102+ messages
2008-06-06 20:28 [PATCH -mm 00/25] VM pageout scalability improvements (V10) Rik van Riel, Rik van Riel
2008-06-06 20:28 ` [PATCH -mm 01/25] move isolate_lru_page() to vmscan.c Rik van Riel, Rik van Riel
2008-06-06 20:28 ` [PATCH -mm 02/25] Use an indexed array for LRU variables Rik van Riel, Rik van Riel
2008-06-07 1:04 ` Andrew Morton
2008-06-07 5:43 ` KOSAKI Motohiro
2008-06-07 14:47 ` Rik van Riel
2008-06-08 11:22 ` KOSAKI Motohiro
2008-06-07 18:42 ` Rik van Riel
2008-06-06 20:28 ` [PATCH -mm 03/25] use an array for the LRU pagevecs Rik van Riel, Rik van Riel
2008-06-06 20:28 ` [PATCH -mm 04/25] free swap space on swap-in/activation Rik van Riel, Rik van Riel
2008-06-07 1:04 ` Andrew Morton
2008-06-07 19:56 ` Rik van Riel
2008-06-09 2:14 ` MinChan Kim
2008-06-09 2:42 ` Rik van Riel
2008-06-09 13:38 ` KOSAKI Motohiro
2008-06-10 2:30 ` MinChan Kim
2008-06-06 20:28 ` [PATCH -mm 05/25] define page_file_cache() function Rik van Riel, Rik van Riel
2008-06-07 1:04 ` Andrew Morton
2008-06-07 23:38 ` Rik van Riel
2008-06-06 20:28 ` [PATCH -mm 06/25] split LRU lists into anon & file sets Rik van Riel, Rik van Riel
2008-06-07 1:04 ` Andrew Morton
2008-06-07 1:22 ` Rik van Riel
2008-06-07 1:52 ` Andrew Morton
2008-06-06 20:28 ` [PATCH -mm 07/25] second chance replacement for anonymous pages Rik van Riel, Rik van Riel
2008-06-07 1:04 ` Andrew Morton
2008-06-07 6:03 ` KOSAKI Motohiro
2008-06-07 6:43 ` Andrew Morton
2008-06-08 15:04 ` Rik van Riel
2008-06-06 20:28 ` [PATCH -mm 08/25] add some sanity checks to get_scan_ratio Rik van Riel, Rik van Riel
2008-06-07 1:04 ` Andrew Morton
2008-06-08 15:11 ` Rik van Riel
2008-06-06 20:28 ` [PATCH -mm 09/25] fix pagecache reclaim referenced bit check Rik van Riel, Rik van Riel
2008-06-07 1:04 ` Andrew Morton
2008-06-07 1:08 ` Rik van Riel
2008-06-08 10:02 ` Peter Zijlstra
2008-06-06 20:28 ` [PATCH -mm 10/25] add newly swapped in pages to the inactive list Rik van Riel, Rik van Riel
2008-06-07 1:04 ` Andrew Morton
2008-06-06 20:28 ` [PATCH -mm 11/25] more aggressively use lumpy reclaim Rik van Riel, Rik van Riel
2008-06-07 1:05 ` Andrew Morton
2008-06-06 20:28 ` [PATCH -mm 12/25] pageflag helpers for configed-out flags Rik van Riel, Rik van Riel
2008-06-06 20:28 ` [PATCH -mm 13/25] Noreclaim LRU Infrastructure Rik van Riel, Rik van Riel
2008-06-07 1:05 ` Andrew Morton
2008-06-08 20:34 ` Rik van Riel
2008-06-08 20:57 ` Andrew Morton
2008-06-08 21:32 ` Rik van Riel
2008-06-08 21:43 ` Ray Lee
2008-06-08 23:22 ` Andrew Morton
2008-06-08 23:34 ` Rik van Riel
2008-06-08 23:54 ` Andrew Morton
2008-06-09 0:56 ` Rik van Riel
2008-06-09 6:10 ` Andrew Morton
2008-06-09 13:44 ` Rik van Riel
2008-06-09 2:58 ` Rik van Riel
2008-06-09 5:44 ` Andrew Morton
2008-06-10 19:17 ` Christoph Lameter
2008-06-10 19:37 ` Rik van Riel
2008-06-10 21:33 ` Andrew Morton
2008-06-10 21:48 ` Andi Kleen
2008-06-10 22:05 ` Dave Hansen
2008-06-11 5:09 ` Paul Mundt
2008-06-11 6:16 ` Andrew Morton
2008-06-11 6:29 ` Paul Mundt
2008-06-11 12:06 ` Andi Kleen
2008-06-11 14:09 ` Removing node flags from page->flags was Re: [PATCH -mm 13/25] Noreclaim LRU Infrastructure II Andi Kleen
2008-06-11 19:03 ` [PATCH -mm 13/25] Noreclaim LRU Infrastructure Andy Whitcroft
2008-06-11 20:52 ` Andi Kleen
2008-06-11 23:25 ` Christoph Lameter
2008-06-08 22:03 ` Rik van Riel
2008-06-08 21:07 ` KOSAKI Motohiro
2008-06-10 20:09 ` Rik van Riel
2008-06-06 20:28 ` [PATCH -mm 14/25] Noreclaim LRU Page Statistics Rik van Riel, Rik van Riel
2008-06-06 20:28 ` [PATCH -mm 15/25] Ramfs and Ram Disk pages are non-reclaimable Rik van Riel, Rik van Riel
2008-06-07 1:05 ` Andrew Morton
2008-06-08 4:32 ` Greg KH
2008-06-06 20:28 ` [PATCH -mm 16/25] SHM_LOCKED " Rik van Riel, Rik van Riel
2008-06-07 1:05 ` Andrew Morton
2008-06-07 5:21 ` KOSAKI Motohiro
2008-06-10 21:03 ` Rik van Riel
2008-06-10 21:22 ` Lee Schermerhorn
2008-06-10 21:49 ` Andrew Morton
2008-06-06 20:28 ` [PATCH -mm 17/25] Mlocked Pages " Rik van Riel, Rik van Riel
2008-06-07 1:07 ` Andrew Morton
2008-06-07 5:38 ` KOSAKI Motohiro
2008-06-10 3:31 ` Nick Piggin
2008-06-10 12:50 ` Rik van Riel
2008-06-10 21:14 ` Rik van Riel
2008-06-10 21:43 ` Lee Schermerhorn
2008-06-10 21:57 ` Andrew Morton
2008-06-11 16:01 ` Lee Schermerhorn
2008-06-10 23:48 ` Rik van Riel
2008-06-11 15:29 ` Lee Schermerhorn
2008-06-11 1:00 ` Rik van Riel
2008-06-06 20:28 ` [PATCH -mm 18/25] Downgrade mmap sem while populating mlocked regions Rik van Riel, Rik van Riel
2008-06-06 20:28 ` [PATCH -mm 19/25] Handle mlocked pages during map, remap, unmap Rik van Riel, Rik van Riel
2008-06-06 20:28 ` [PATCH -mm 20/25] Mlocked Pages statistics Rik van Riel, Rik van Riel
2008-06-06 20:28 ` [PATCH -mm 21/25] Cull non-reclaimable pages in fault path Rik van Riel, Rik van Riel
2008-06-06 20:29 ` [PATCH -mm 22/25] Noreclaim and Mlocked pages vm events Rik van Riel, Rik van Riel
2008-06-06 20:29 ` [PATCH -mm 23/25] Noreclaim LRU scan sysctl Rik van Riel, Rik van Riel
2008-06-06 20:29 ` [PATCH -mm 24/25] Mlocked Pages: count attempts to free mlocked page Rik van Riel, Rik van Riel
2008-06-06 20:29 ` [PATCH -mm 25/25] Noreclaim LRU and Mlocked Pages Documentation Rik van Riel, Rik van Riel
2008-06-06 21:02 ` [PATCH -mm 00/25] VM pageout scalability improvements (V10) Andrew Morton
2008-06-06 21:08 ` Rik van Riel