* [PATCH -mm 00/15] VM pageout scalability improvements (V6)
@ 2008-04-28 18:18 Rik van Riel
  2008-04-28 18:18 ` [PATCH -mm 01/15] FYI: vmstats are "off-by-one" Rik van Riel
                   ` (16 more replies)
  0 siblings, 17 replies; 34+ messages in thread
From: Rik van Riel @ 2008-04-28 18:18 UTC (permalink / raw)
  To: linux-kernel; +Cc: lee.schermerhorn, akpm, kosaki.motohiro

On large memory systems, the VM can spend way too much time scanning
through pages that it cannot (or should not) evict from memory. Not
only does it use up CPU time, but it also provokes lock contention
and can leave large systems under memory pressure in a catatonic state.

Against 2.6.25-mm1

This patch series improves VM scalability by:

1) putting filesystem backed, swap backed and non-reclaimable pages
   onto their own LRUs, so the system only scans the pages that it
   can/should evict from memory

2) switching to SEQ replacement for the anonymous LRUs, so the
   number of pages that need to be scanned when the system
   starts swapping is bound to a reasonable number

3) keeping non-reclaimable pages off the LRU completely, so the
   VM does not waste CPU time scanning them.  Currently only
   ramfs and SHM_LOCKED pages are kept on the noreclaim list,
   mlock()ed VMAs will be added later
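
To make the effect of points 1-3 concrete, here is a rough conceptual
sketch (not code from this series) of what the per-list scan loop ends
up looking like.  for_each_lru() and is_file_lru() do appear later in
the series; LRU_NORECLAIM and demo_shrink_list() are illustrative
placeholders only:

	/*
	 * Conceptual sketch only.  LRU_NORECLAIM and demo_shrink_list()
	 * are placeholder names, not identifiers from the patches.
	 */
	static void demo_scan_zone(struct zone *zone)
	{
		enum lru_list l;

		for_each_lru(l) {
			/* ramfs/SHM_LOCKED pages sit on the noreclaim
			 * list and are never scanned at all */
			if (l == LRU_NORECLAIM)
				continue;
			/* no free swap: scanning anon pages is wasted work */
			if (!is_file_lru(l) && nr_swap_pages == 0)
				continue;
			demo_shrink_list(zone, l);
		}
	}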

More info on the overall design can be found at:

	http://linux-mm.org/PageReplacementDesign

An all-in-one patch can be found at:

	http://people.redhat.com/riel/splitvm/

Changelog:
- several code cleanups (MinChan Kim)
- noreclaim patch refactoring and improvements (Lee Schermerhorn)
- several PROT_NONE and vma merging fixes (KOSAKI Motohiro)
- SMP bugfixes and efficiency improvements (Rik van Riel, Lee Schermerhorn)
- fix NUMA node stats printing (Lee Schermerhorn)
- remove the mlocked-VMA-noreclaim code for now; it still has
  bugs on IA64 and is holding up the merge (Rik van Riel)

- make page_alloc.c compile without CONFIG_NORECLAIM_MLOCK (MinChan Kim)
- BUG() does not take an argument (MinChan Kim)
- clean up is_active_lru and is_file_lru (Andy Whitcroft)
- clean up shrink_active_list temp list names (KOSAKI Motohiro)
- add total active & inactive memory totals for vmstat -a (KOSAKI Motohiro)
- only try global anon page aging on global lru scans (KOSAKI Motohiro)
- make function descriptions follow the kernel-doc format (Rik van Riel)
- simplify mlock_vma_pages_range and munlock_vma_pages_range (Lee Schermerhorn)
- remove some more arguments, rename to mlock_vma_pages_all (Lee Schermerhorn)
- many code cleanups (Lee Schermerhorn)
- pass correct vma arg to mlock_vma_pages_range from do_brk (Rik van Riel)
- port to 2.6.25-rc3-mm1

- pull the memcontrol lru arrayification earlier into the patch series
- use a pagevec array similar to the lru array
- clean up the code in various places
- improved pageout balancing and reduced pageout cpu use

- fix compilation on PPC and without memcontrol
- make page_is_pagecache more readable
- replace get_scan_ratio with correct version

- merge memcontroller split LRU code into the main split LRU patch,
  since it is not functionally different (it was split up only to help
  people who had seen the last version of the patch series review it)
- drop the page_file_cache debugging patch, since it never triggered
- reintroduce code to not scan anon list if swap is full
- add code to scan anon list if page cache is very small already
- use lumpy reclaim more aggressively for smaller order > 1 allocations

-- 
All Rights Reversed



* [PATCH -mm 01/15] FYI: vmstats are "off-by-one"
  2008-04-28 18:18 [PATCH -mm 00/15] VM pageout scalability improvements (V6) Rik van Riel
@ 2008-04-28 18:18 ` Rik van Riel
  2008-04-28 18:18 ` [PATCH -mm 02/15] move isolate_lru_page() to vmscan.c Rik van Riel
                   ` (15 subsequent siblings)
  16 siblings, 0 replies; 34+ messages in thread
From: Rik van Riel @ 2008-04-28 18:18 UTC (permalink / raw)
  To: linux-kernel; +Cc: lee.schermerhorn, akpm, kosaki.motohiro

[-- Attachment #1: hotfix-vmstat-text.patch --]
[-- Type: text/plain, Size: 1382 bytes --]

PATCH - add vmstat_text entry for NR_WRITEBACK_TEMP

Against:  2.6.25-rc8-mm2 [also missing in '.25-mm1]

/proc/vmstat output after "nr_vmscan_write" is off by one
because the vmstat_text entry for NR_WRITEBACK_TEMP, the counter
added to mmzone.h, is missing.  Apparently missing from:

	mm-add-nr_writeback_temp-counter.patch

in -mm tree.
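
For reference, the reason one missing string shifts everything after it:
/proc/vmstat prints the zone counters by index, and vmstat_text[] must
stay in the same order as enum zone_stat_item in mmzone.h.  A minimal
sketch of that contract (demo names, not the real kernel arrays):

	enum zone_stat_item_demo {
		NR_VMSCAN_WRITE_DEMO,
		NR_WRITEBACK_TEMP_DEMO,	/* counter already added in mmzone.h */
		NR_DEMO_ITEMS
	};

	/* One name per counter, in enum order.  Without the second string,
	 * every counter after it is printed under the wrong name. */
	static const char * const vmstat_text_demo[NR_DEMO_ITEMS] = {
		"nr_vmscan_write",
		"nr_writeback_temp",	/* the entry this hotfix adds */
	};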

Signed-off-by: Rik van Riel <riel@redhat.com>
Signed-off-by: Lee Schermerhorn <lee.schermerhorn@hp.com>

---
Andrew,

you may already have this patch.  I am posting it in this
series to help other testers of the code.

Rik


Rik, Kosaki-san:

Here's a patch that I'm going to post to Andrew and the mailing lists.
I noticed that the noreclaim events were off by one: the vmstat_text
entry is missing in 25-rc8-mm* and 25-mm1.

I have this in my tree as a 25-*mm* "hotfix"--i.e., before any of our
patches.  It causes a couple of one-line offsets but no patch conflicts.

Lee


 mm/vmstat.c |    1 +
 1 file changed, 1 insertion(+)

Index: linux-2.6.25-rc8-mm2/mm/vmstat.c
===================================================================
--- linux-2.6.25-rc8-mm2.orig/mm/vmstat.c	2008-04-15 12:07:18.000000000 -0400
+++ linux-2.6.25-rc8-mm2/mm/vmstat.c	2008-04-15 12:11:48.000000000 -0400
@@ -699,6 +699,7 @@ static const char * const vmstat_text[] 
 	"nr_unstable",
 	"nr_bounce",
 	"nr_vmscan_write",
+	"nr_writeback_temp",
 
 #ifdef CONFIG_NUMA
 	"numa_hit",

-- 
All Rights Reversed



* [PATCH -mm 02/15] move isolate_lru_page() to vmscan.c
  2008-04-28 18:18 [PATCH -mm 00/15] VM pageout scalability improvements (V6) Rik van Riel
  2008-04-28 18:18 ` [PATCH -mm 01/15] FYI: vmstats are "off-by-one" Rik van Riel
@ 2008-04-28 18:18 ` Rik van Riel
  2008-04-28 18:18 ` [PATCH -mm 03/15] Use an indexed array for LRU variables Rik van Riel
                   ` (14 subsequent siblings)
  16 siblings, 0 replies; 34+ messages in thread
From: Rik van Riel @ 2008-04-28 18:18 UTC (permalink / raw)
  To: linux-kernel
  Cc: lee.schermerhorn, akpm, kosaki.motohiro, Nick Piggin,
	Lee Schermerhorn

[-- Attachment #1: np-01-move-and-rework-isolate_lru_page-v2.patch --]
[-- Type: text/plain, Size: 6987 bytes --]

V1 -> V2 [lts]:
+  fix botched merge -- add back "get_page_unless_zero()"

  From: Nick Piggin <npiggin@suse.de>
  To: Linux Memory Management <linux-mm@kvack.org>
  Subject: [patch 1/4] mm: move and rework isolate_lru_page
  Date:	Mon, 12 Mar 2007 07:38:44 +0100 (CET)

isolate_lru_page logically belongs in vmscan.c rather than migrate.c.

It is a tough call, because we don't need that function without memory
migration, so there is a valid argument for keeping it in migrate.c.
However, a subsequent patch needs to make use of it in the core mm, so
we can happily move it to vmscan.c.

Also, make the function a little more generic by not requiring that it
adds an isolated page to a given list. Callers can do that.
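
As a hedged sketch of the resulting calling convention (it mirrors what
the hunks below do in do_move_pages() and migrate_page_add()); the
wrapper name is illustrative:

	/* Illustrative caller: the caller already holds a reference,
	 * isolates the page, then puts it on its own private list. */
	static void demo_isolate_to_list(struct page *page,
					 struct list_head *pagelist)
	{
		if (!isolate_lru_page(page))	/* 0 on success, -EBUSY otherwise */
			list_add_tail(&page->lru, pagelist);
	}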

	Note that we now have '__isolate_lru_page()', which does
	something quite different and is visible outside of vmscan.c
	for use with the memory controller.  Methinks we need to
	rationalize these names/purposes.	--lts

Signed-off-by: Nick Piggin <npiggin@suse.de>
Signed-off-by: Rik van Riel <riel@redhat.com>
Signed-off-by: Lee Schermerhorn <Lee.Schermerhorn@hp.com>

Index: linux-2.6.25-mm1/include/linux/migrate.h
===================================================================
--- linux-2.6.25-mm1.orig/include/linux/migrate.h	2007-07-08 19:32:17.000000000 -0400
+++ linux-2.6.25-mm1/include/linux/migrate.h	2008-04-23 17:31:13.000000000 -0400
@@ -25,7 +25,6 @@ static inline int vma_migratable(struct 
 	return 1;
 }
 
-extern int isolate_lru_page(struct page *p, struct list_head *pagelist);
 extern int putback_lru_pages(struct list_head *l);
 extern int migrate_page(struct address_space *,
 			struct page *, struct page *);
@@ -42,8 +41,6 @@ extern int migrate_vmas(struct mm_struct
 static inline int vma_migratable(struct vm_area_struct *vma)
 					{ return 0; }
 
-static inline int isolate_lru_page(struct page *p, struct list_head *list)
-					{ return -ENOSYS; }
 static inline int putback_lru_pages(struct list_head *l) { return 0; }
 static inline int migrate_pages(struct list_head *l, new_page_t x,
 		unsigned long private) { return -ENOSYS; }
Index: linux-2.6.25-mm1/mm/internal.h
===================================================================
--- linux-2.6.25-mm1.orig/mm/internal.h	2008-04-22 10:33:45.000000000 -0400
+++ linux-2.6.25-mm1/mm/internal.h	2008-04-23 17:31:13.000000000 -0400
@@ -34,6 +34,8 @@ static inline void __put_page(struct pag
 	atomic_dec(&page->_count);
 }
 
+extern int isolate_lru_page(struct page *page);
+
 extern void __free_pages_bootmem(struct page *page, unsigned int order);
 
 /*
Index: linux-2.6.25-mm1/mm/migrate.c
===================================================================
--- linux-2.6.25-mm1.orig/mm/migrate.c	2008-04-22 10:33:26.000000000 -0400
+++ linux-2.6.25-mm1/mm/migrate.c	2008-04-23 17:31:13.000000000 -0400
@@ -36,36 +36,6 @@
 #define lru_to_page(_head) (list_entry((_head)->prev, struct page, lru))
 
 /*
- * Isolate one page from the LRU lists. If successful put it onto
- * the indicated list with elevated page count.
- *
- * Result:
- *  -EBUSY: page not on LRU list
- *  0: page removed from LRU list and added to the specified list.
- */
-int isolate_lru_page(struct page *page, struct list_head *pagelist)
-{
-	int ret = -EBUSY;
-
-	if (PageLRU(page)) {
-		struct zone *zone = page_zone(page);
-
-		spin_lock_irq(&zone->lru_lock);
-		if (PageLRU(page) && get_page_unless_zero(page)) {
-			ret = 0;
-			ClearPageLRU(page);
-			if (PageActive(page))
-				del_page_from_active_list(zone, page);
-			else
-				del_page_from_inactive_list(zone, page);
-			list_add_tail(&page->lru, pagelist);
-		}
-		spin_unlock_irq(&zone->lru_lock);
-	}
-	return ret;
-}
-
-/*
  * migrate_prep() needs to be called before we start compiling a list of pages
  * to be migrated using isolate_lru_page().
  */
@@ -879,7 +849,9 @@ static int do_move_pages(struct mm_struc
 				!migrate_all)
 			goto put_and_set;
 
-		err = isolate_lru_page(page, &pagelist);
+		err = isolate_lru_page(page);
+		if (!err)
+			list_add_tail(&page->lru, &pagelist);
 put_and_set:
 		/*
 		 * Either remove the duplicate refcount from
Index: linux-2.6.25-mm1/mm/vmscan.c
===================================================================
--- linux-2.6.25-mm1.orig/mm/vmscan.c	2008-04-22 10:33:45.000000000 -0400
+++ linux-2.6.25-mm1/mm/vmscan.c	2008-04-23 17:31:13.000000000 -0400
@@ -816,6 +816,51 @@ static unsigned long clear_active_flags(
 	return nr_active;
 }
 
+/**
+ * isolate_lru_page - tries to isolate a page from its LRU list
+ * @page: page to isolate from its LRU list
+ *
+ * Isolates a @page from an LRU list, clears PageLRU and adjusts the
+ * vmstat statistic corresponding to whatever LRU list the page was on.
+ *
+ * Returns 0 if the page was removed from an LRU list.
+ * Returns -EBUSY if the page was not on an LRU list.
+ *
+ * The returned page will have PageLRU() cleared.  If it was found on
+ * the active list, it will have PageActive set.  That flag may need
+ * to be cleared by the caller before letting the page go.
+ *
+ * The vmstat statistic corresponding to the list on which the page was
+ * found will be decremented.
+ *
+ * Restrictions:
+ * (1) Must be called with an elevated refcount on the page. This is a
+ *     fundamental difference from isolate_lru_pages (which is called
+ *     without a stable reference).
+ * (2) the lru_lock must not be held.
+ * (3) interrupts must be enabled.
+ */
+int isolate_lru_page(struct page *page)
+{
+	int ret = -EBUSY;
+
+	if (PageLRU(page)) {
+		struct zone *zone = page_zone(page);
+
+		spin_lock_irq(&zone->lru_lock);
+		if (PageLRU(page) && get_page_unless_zero(page)) {
+			ret = 0;
+			ClearPageLRU(page);
+			if (PageActive(page))
+				del_page_from_active_list(zone, page);
+			else
+				del_page_from_inactive_list(zone, page);
+		}
+		spin_unlock_irq(&zone->lru_lock);
+	}
+	return ret;
+}
+
 /*
  * shrink_inactive_list() is a helper for shrink_zone().  It returns the number
  * of reclaimed pages
Index: linux-2.6.25-mm1/mm/mempolicy.c
===================================================================
--- linux-2.6.25-mm1.orig/mm/mempolicy.c	2008-04-22 10:33:45.000000000 -0400
+++ linux-2.6.25-mm1/mm/mempolicy.c	2008-04-23 17:31:13.000000000 -0400
@@ -93,6 +93,8 @@
 #include <asm/tlbflush.h>
 #include <asm/uaccess.h>
 
+#include "internal.h"
+
 /* Internal flags */
 #define MPOL_MF_DISCONTIG_OK (MPOL_MF_INTERNAL << 0)	/* Skip checks for continuous vmas */
 #define MPOL_MF_INVERT (MPOL_MF_INTERNAL << 1)		/* Invert check for nodemask */
@@ -758,8 +760,11 @@ static void migrate_page_add(struct page
 	/*
 	 * Avoid migrating a page that is shared with others.
 	 */
-	if ((flags & MPOL_MF_MOVE_ALL) || page_mapcount(page) == 1)
-		isolate_lru_page(page, pagelist);
+	if ((flags & MPOL_MF_MOVE_ALL) || page_mapcount(page) == 1) {
+		if (!isolate_lru_page(page)) {
+			list_add_tail(&page->lru, pagelist);
+		}
+	}
 }
 
 static struct page *new_node_page(struct page *page, unsigned long node, int **x)

-- 
All Rights Reversed



* [PATCH -mm 03/15] Use an indexed array for LRU variables
  2008-04-28 18:18 [PATCH -mm 00/15] VM pageout scalability improvements (V6) Rik van Riel
  2008-04-28 18:18 ` [PATCH -mm 01/15] FYI: vmstats are "off-by-one" Rik van Riel
  2008-04-28 18:18 ` [PATCH -mm 02/15] move isolate_lru_page() to vmscan.c Rik van Riel
@ 2008-04-28 18:18 ` Rik van Riel
  2008-04-28 18:18 ` [PATCH -mm 04/15] use an array for the LRU pagevecs Rik van Riel
                   ` (13 subsequent siblings)
  16 siblings, 0 replies; 34+ messages in thread
From: Rik van Riel @ 2008-04-28 18:18 UTC (permalink / raw)
  To: linux-kernel; +Cc: lee.schermerhorn, akpm, kosaki.motohiro, Christoph Lameter

[-- Attachment #1: cl-use-indexed-array-of-lru-lists.patch --]
[-- Type: text/plain, Size: 22135 bytes --]

V3 [riel]: memcontrol LRU arrayification

V1 -> V2 [lts]:
+ Remove extraneous  __dec_zone_state(zone, NR_ACTIVE) pointed
  out by Mel G.

From clameter@sgi.com Wed Aug 29 11:39:51 2007

Currently we define explicit variables for the inactive
and active lists.  An indexed array is more generic and avoids
repeating similar code in several places in the reclaim code.

We are saving a few bytes in terms of code size:

Before:

   text    data     bss     dec     hex filename
4097753  573120 4092484 8763357  85b7dd vmlinux

After:

   text    data     bss     dec     hex filename
4097729  573120 4092484 8763333  85b7c5 vmlinux

Having an easy way to add new lru lists may ease future work on
the reclaim code.
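
As a rough illustration of why the array form helps, code that used to
special-case the two lists can simply loop over the index; the sketch
below mirrors the free_area_init_core() hunk further down (the function
name here is illustrative):

	/* Sketch only -- compare with the page_alloc.c hunk below. */
	static void demo_init_zone_lrus(struct zone *zone)
	{
		enum lru_list l;

		for_each_lru(l) {
			INIT_LIST_HEAD(&zone->list[l]);
			zone->nr_scan[l] = 0;
		}
	}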

Signed-off-by: Rik van Riel <riel@redhat.com>
Signed-off-by: Lee Schermerhorn <lee.schermerhorn@hp.com>
Signed-off-by: Christoph Lameter <clameter@sgi.com>

 include/linux/memcontrol.h |   17 +---
 include/linux/mm_inline.h  |   33 +++++---
 include/linux/mmzone.h     |   17 ++--
 mm/memcontrol.c            |  116 ++++++++++------------------
 mm/page_alloc.c            |    9 +-
 mm/swap.c                  |    2 
 mm/vmscan.c                |  141 +++++++++++++++++------------------
 mm/vmstat.c                |    3 
 8 files changed, 158 insertions(+), 180 deletions(-)

Index: linux-2.6.25-mm1/include/linux/mmzone.h
===================================================================
--- linux-2.6.25-mm1.orig/include/linux/mmzone.h	2008-04-22 10:33:44.000000000 -0400
+++ linux-2.6.25-mm1/include/linux/mmzone.h	2008-04-24 10:46:43.000000000 -0400
@@ -81,8 +81,8 @@ struct zone_padding {
 enum zone_stat_item {
 	/* First 128 byte cacheline (assuming 64 bit words) */
 	NR_FREE_PAGES,
-	NR_INACTIVE,
-	NR_ACTIVE,
+	NR_INACTIVE,	/* must match order of LRU_[IN]ACTIVE */
+	NR_ACTIVE,	/*  "     "     "   "       "         */
 	NR_ANON_PAGES,	/* Mapped anonymous pages */
 	NR_FILE_MAPPED,	/* pagecache pages mapped into pagetables.
 			   only modified from process context */
@@ -107,6 +107,13 @@ enum zone_stat_item {
 #endif
 	NR_VM_ZONE_STAT_ITEMS };
 
+enum lru_list {
+	LRU_INACTIVE,	/* must match order of NR_[IN]ACTIVE */
+	LRU_ACTIVE,	/*  "     "     "   "       "        */
+	NR_LRU_LISTS };
+
+#define for_each_lru(l) for (l = 0; l < NR_LRU_LISTS; l++)
+
 struct per_cpu_pages {
 	int count;		/* number of pages in the list */
 	int high;		/* high watermark, emptying needed */
@@ -251,10 +258,8 @@ struct zone {
 
 	/* Fields commonly accessed by the page reclaim scanner */
 	spinlock_t		lru_lock;	
-	struct list_head	active_list;
-	struct list_head	inactive_list;
-	unsigned long		nr_scan_active;
-	unsigned long		nr_scan_inactive;
+	struct list_head	list[NR_LRU_LISTS];
+	unsigned long		nr_scan[NR_LRU_LISTS];
 	unsigned long		pages_scanned;	   /* since last reclaim */
 	unsigned long		flags;		   /* zone flags, see below */
 
Index: linux-2.6.25-mm1/include/linux/mm_inline.h
===================================================================
--- linux-2.6.25-mm1.orig/include/linux/mm_inline.h	2007-07-08 19:32:17.000000000 -0400
+++ linux-2.6.25-mm1/include/linux/mm_inline.h	2008-04-24 10:46:43.000000000 -0400
@@ -1,40 +1,51 @@
 static inline void
+add_page_to_lru_list(struct zone *zone, struct page *page, enum lru_list l)
+{
+	list_add(&page->lru, &zone->list[l]);
+	__inc_zone_state(zone, NR_INACTIVE + l);
+}
+
+static inline void
+del_page_from_lru_list(struct zone *zone, struct page *page, enum lru_list l)
+{
+	list_del(&page->lru);
+	__dec_zone_state(zone, NR_INACTIVE + l);
+}
+
+static inline void
 add_page_to_active_list(struct zone *zone, struct page *page)
 {
-	list_add(&page->lru, &zone->active_list);
-	__inc_zone_state(zone, NR_ACTIVE);
+	add_page_to_lru_list(zone, page, LRU_ACTIVE);
 }
 
 static inline void
 add_page_to_inactive_list(struct zone *zone, struct page *page)
 {
-	list_add(&page->lru, &zone->inactive_list);
-	__inc_zone_state(zone, NR_INACTIVE);
+	add_page_to_lru_list(zone, page, LRU_INACTIVE);
 }
 
 static inline void
 del_page_from_active_list(struct zone *zone, struct page *page)
 {
-	list_del(&page->lru);
-	__dec_zone_state(zone, NR_ACTIVE);
+	del_page_from_lru_list(zone, page, LRU_ACTIVE);
 }
 
 static inline void
 del_page_from_inactive_list(struct zone *zone, struct page *page)
 {
-	list_del(&page->lru);
-	__dec_zone_state(zone, NR_INACTIVE);
+	del_page_from_lru_list(zone, page, LRU_INACTIVE);
 }
 
 static inline void
 del_page_from_lru(struct zone *zone, struct page *page)
 {
+	enum lru_list l = LRU_INACTIVE;
+
 	list_del(&page->lru);
 	if (PageActive(page)) {
 		__ClearPageActive(page);
-		__dec_zone_state(zone, NR_ACTIVE);
-	} else {
-		__dec_zone_state(zone, NR_INACTIVE);
+		l = LRU_ACTIVE;
 	}
+	__dec_zone_state(zone, NR_INACTIVE + l);
 }
 
Index: linux-2.6.25-mm1/mm/page_alloc.c
===================================================================
--- linux-2.6.25-mm1.orig/mm/page_alloc.c	2008-04-22 10:33:45.000000000 -0400
+++ linux-2.6.25-mm1/mm/page_alloc.c	2008-04-24 10:46:43.000000000 -0400
@@ -3447,6 +3447,7 @@ static void __paginginit free_area_init_
 	for (j = 0; j < MAX_NR_ZONES; j++) {
 		struct zone *zone = pgdat->node_zones + j;
 		unsigned long size, realsize, memmap_pages;
+		enum lru_list l;
 
 		size = zone_spanned_pages_in_node(nid, j, zones_size);
 		realsize = size - zone_absent_pages_in_node(nid, j,
@@ -3496,10 +3497,10 @@ static void __paginginit free_area_init_
 		zone->prev_priority = DEF_PRIORITY;
 
 		zone_pcp_init(zone);
-		INIT_LIST_HEAD(&zone->active_list);
-		INIT_LIST_HEAD(&zone->inactive_list);
-		zone->nr_scan_active = 0;
-		zone->nr_scan_inactive = 0;
+		for_each_lru(l) {
+			INIT_LIST_HEAD(&zone->list[l]);
+			zone->nr_scan[l] = 0;
+		}
 		zap_zone_vm_stats(zone);
 		zone->flags = 0;
 		if (!size)
Index: linux-2.6.25-mm1/mm/swap.c
===================================================================
--- linux-2.6.25-mm1.orig/mm/swap.c	2008-04-22 10:33:45.000000000 -0400
+++ linux-2.6.25-mm1/mm/swap.c	2008-04-24 10:46:43.000000000 -0400
@@ -117,7 +117,7 @@ static void pagevec_move_tail(struct pag
 			spin_lock(&zone->lru_lock);
 		}
 		if (PageLRU(page) && !PageActive(page)) {
-			list_move_tail(&page->lru, &zone->inactive_list);
+			list_move_tail(&page->lru, &zone->list[LRU_INACTIVE]);
 			pgmoved++;
 		}
 	}
Index: linux-2.6.25-mm1/mm/vmscan.c
===================================================================
--- linux-2.6.25-mm1.orig/mm/vmscan.c	2008-04-23 17:31:13.000000000 -0400
+++ linux-2.6.25-mm1/mm/vmscan.c	2008-04-24 10:46:43.000000000 -0400
@@ -791,10 +791,10 @@ static unsigned long isolate_pages_globa
 					int active)
 {
 	if (active)
-		return isolate_lru_pages(nr, &z->active_list, dst,
+		return isolate_lru_pages(nr, &z->list[LRU_ACTIVE], dst,
 						scanned, order, mode);
 	else
-		return isolate_lru_pages(nr, &z->inactive_list, dst,
+		return isolate_lru_pages(nr, &z->list[LRU_INACTIVE], dst,
 						scanned, order, mode);
 }
 
@@ -945,10 +945,7 @@ static unsigned long shrink_inactive_lis
 			VM_BUG_ON(PageLRU(page));
 			SetPageLRU(page);
 			list_del(&page->lru);
-			if (PageActive(page))
-				add_page_to_active_list(zone, page);
-			else
-				add_page_to_inactive_list(zone, page);
+			add_page_to_lru_list(zone, page, PageActive(page));
 			if (!pagevec_add(&pvec, page)) {
 				spin_unlock_irq(&zone->lru_lock);
 				__pagevec_release(&pvec);
@@ -1116,8 +1113,8 @@ static void shrink_active_list(unsigned 
 	int pgdeactivate = 0;
 	unsigned long pgscanned;
 	LIST_HEAD(l_hold);	/* The pages which were snipped off */
-	LIST_HEAD(l_inactive);	/* Pages to go onto the inactive_list */
-	LIST_HEAD(l_active);	/* Pages to go onto the active_list */
+	LIST_HEAD(l_active);
+	LIST_HEAD(l_inactive);
 	struct page *page;
 	struct pagevec pvec;
 	int reclaim_mapped = 0;
@@ -1169,7 +1166,7 @@ static void shrink_active_list(unsigned 
 		VM_BUG_ON(!PageActive(page));
 		ClearPageActive(page);
 
-		list_move(&page->lru, &zone->inactive_list);
+		list_move(&page->lru, &zone->list[LRU_INACTIVE]);
 		mem_cgroup_move_lists(page, false);
 		pgmoved++;
 		if (!pagevec_add(&pvec, page)) {
@@ -1199,7 +1196,7 @@ static void shrink_active_list(unsigned 
 		SetPageLRU(page);
 		VM_BUG_ON(!PageActive(page));
 
-		list_move(&page->lru, &zone->active_list);
+		list_move(&page->lru, &zone->list[LRU_ACTIVE]);
 		mem_cgroup_move_lists(page, true);
 		pgmoved++;
 		if (!pagevec_add(&pvec, page)) {
@@ -1219,65 +1216,64 @@ static void shrink_active_list(unsigned 
 	pagevec_release(&pvec);
 }
 
+static unsigned long shrink_list(enum lru_list l, unsigned long nr_to_scan,
+	struct zone *zone, struct scan_control *sc, int priority)
+{
+	if (l == LRU_ACTIVE) {
+		shrink_active_list(nr_to_scan, zone, sc, priority);
+		return 0;
+	}
+	return shrink_inactive_list(nr_to_scan, zone, sc);
+}
+
 /*
  * This is a basic per-zone page freer.  Used by both kswapd and direct reclaim.
  */
 static unsigned long shrink_zone(int priority, struct zone *zone,
 				struct scan_control *sc)
 {
-	unsigned long nr_active;
-	unsigned long nr_inactive;
+	unsigned long nr[NR_LRU_LISTS];
 	unsigned long nr_to_scan;
 	unsigned long nr_reclaimed = 0;
+	enum lru_list l;
 
 	if (scan_global_lru(sc)) {
 		/*
 		 * Add one to nr_to_scan just to make sure that the kernel
 		 * will slowly sift through the active list.
 		 */
-		zone->nr_scan_active +=
-			(zone_page_state(zone, NR_ACTIVE) >> priority) + 1;
-		nr_active = zone->nr_scan_active;
-		zone->nr_scan_inactive +=
-			(zone_page_state(zone, NR_INACTIVE) >> priority) + 1;
-		nr_inactive = zone->nr_scan_inactive;
-		if (nr_inactive >= sc->swap_cluster_max)
-			zone->nr_scan_inactive = 0;
-		else
-			nr_inactive = 0;
-
-		if (nr_active >= sc->swap_cluster_max)
-			zone->nr_scan_active = 0;
-		else
-			nr_active = 0;
+		for_each_lru(l) {
+			zone->nr_scan[l] += (zone_page_state(zone,
+					NR_INACTIVE + l)  >> priority) + 1;
+			nr[l] = zone->nr_scan[l];
+			if (nr[l] >= sc->swap_cluster_max)
+				zone->nr_scan[l] = 0;
+			else
+				nr[l] = 0;
+		}
 	} else {
 		/*
 		 * This reclaim occurs not because zone memory shortage but
 		 * because memory controller hits its limit.
 		 * Then, don't modify zone reclaim related data.
 		 */
-		nr_active = mem_cgroup_calc_reclaim_active(sc->mem_cgroup,
-					zone, priority);
+		nr[LRU_ACTIVE] = mem_cgroup_calc_reclaim(sc->mem_cgroup,
+					zone, priority, LRU_ACTIVE);
 
-		nr_inactive = mem_cgroup_calc_reclaim_inactive(sc->mem_cgroup,
-					zone, priority);
+		nr[LRU_INACTIVE] = mem_cgroup_calc_reclaim(sc->mem_cgroup,
+					zone, priority, LRU_INACTIVE);
 	}
 
-
-	while (nr_active || nr_inactive) {
-		if (nr_active) {
-			nr_to_scan = min(nr_active,
+	while (nr[LRU_ACTIVE] || nr[LRU_INACTIVE]) {
+		for_each_lru(l) {
+			if (nr[l]) {
+				nr_to_scan = min(nr[l],
 					(unsigned long)sc->swap_cluster_max);
-			nr_active -= nr_to_scan;
-			shrink_active_list(nr_to_scan, zone, sc, priority);
-		}
+				nr[l] -= nr_to_scan;
 
-		if (nr_inactive) {
-			nr_to_scan = min(nr_inactive,
-					(unsigned long)sc->swap_cluster_max);
-			nr_inactive -= nr_to_scan;
-			nr_reclaimed += shrink_inactive_list(nr_to_scan, zone,
-								sc);
+				nr_reclaimed += shrink_list(l, nr_to_scan,
+							zone, sc, priority);
+			}
 		}
 	}
 
@@ -1791,6 +1787,7 @@ static unsigned long shrink_all_zones(un
 {
 	struct zone *zone;
 	unsigned long nr_to_scan, ret = 0;
+	enum lru_list l;
 
 	for_each_zone(zone) {
 
@@ -1800,28 +1797,25 @@ static unsigned long shrink_all_zones(un
 		if (zone_is_all_unreclaimable(zone) && prio != DEF_PRIORITY)
 			continue;
 
-		/* For pass = 0 we don't shrink the active list */
-		if (pass > 0) {
-			zone->nr_scan_active +=
-				(zone_page_state(zone, NR_ACTIVE) >> prio) + 1;
-			if (zone->nr_scan_active >= nr_pages || pass > 3) {
-				zone->nr_scan_active = 0;
+		for_each_lru(l) {
+			/* For pass = 0 we don't shrink the active list */
+			if (pass == 0 && l == LRU_ACTIVE)
+				continue;
+
+			zone->nr_scan[l] +=
+				(zone_page_state(zone, NR_INACTIVE + l)
+								>> prio) + 1;
+			if (zone->nr_scan[l] >= nr_pages || pass > 3) {
+				zone->nr_scan[l] = 0;
 				nr_to_scan = min(nr_pages,
-					zone_page_state(zone, NR_ACTIVE));
-				shrink_active_list(nr_to_scan, zone, sc, prio);
+					zone_page_state(zone,
+							NR_INACTIVE + l));
+				ret += shrink_list(l, nr_to_scan, zone,
+								sc, prio);
+				if (ret >= nr_pages)
+					return ret;
 			}
 		}
-
-		zone->nr_scan_inactive +=
-			(zone_page_state(zone, NR_INACTIVE) >> prio) + 1;
-		if (zone->nr_scan_inactive >= nr_pages || pass > 3) {
-			zone->nr_scan_inactive = 0;
-			nr_to_scan = min(nr_pages,
-				zone_page_state(zone, NR_INACTIVE));
-			ret += shrink_inactive_list(nr_to_scan, zone, sc);
-			if (ret >= nr_pages)
-				return ret;
-		}
 	}
 
 	return ret;
Index: linux-2.6.25-mm1/mm/vmstat.c
===================================================================
--- linux-2.6.25-mm1.orig/mm/vmstat.c	2008-04-23 17:31:10.000000000 -0400
+++ linux-2.6.25-mm1/mm/vmstat.c	2008-04-24 10:46:43.000000000 -0400
@@ -764,7 +764,8 @@ static void zoneinfo_show_print(struct s
 		   zone->pages_low,
 		   zone->pages_high,
 		   zone->pages_scanned,
-		   zone->nr_scan_active, zone->nr_scan_inactive,
+		   zone->nr_scan[LRU_ACTIVE],
+		   zone->nr_scan[LRU_INACTIVE],
 		   zone->spanned_pages,
 		   zone->present_pages);
 
Index: linux-2.6.25-mm1/include/linux/memcontrol.h
===================================================================
--- linux-2.6.25-mm1.orig/include/linux/memcontrol.h	2008-04-22 10:33:44.000000000 -0400
+++ linux-2.6.25-mm1/include/linux/memcontrol.h	2008-04-24 10:46:43.000000000 -0400
@@ -66,10 +66,8 @@ extern void mem_cgroup_note_reclaim_prio
 extern void mem_cgroup_record_reclaim_priority(struct mem_cgroup *mem,
 							int priority);
 
-extern long mem_cgroup_calc_reclaim_active(struct mem_cgroup *mem,
-				struct zone *zone, int priority);
-extern long mem_cgroup_calc_reclaim_inactive(struct mem_cgroup *mem,
-				struct zone *zone, int priority);
+extern long mem_cgroup_calc_reclaim(struct mem_cgroup *mem, struct zone *zone,
+					int priority, enum lru_list lru);
 
 #else /* CONFIG_CGROUP_MEM_RES_CTLR */
 static inline void page_reset_bad_cgroup(struct page *page)
@@ -151,14 +149,9 @@ static inline void mem_cgroup_record_rec
 {
 }
 
-static inline long mem_cgroup_calc_reclaim_active(struct mem_cgroup *mem,
-					struct zone *zone, int priority)
-{
-	return 0;
-}
-
-static inline long mem_cgroup_calc_reclaim_inactive(struct mem_cgroup *mem,
-					struct zone *zone, int priority)
+static inline long mem_cgroup_calc_reclaim(struct mem_cgroup *mem,
+					struct zone *zone, int priority,
+					enum lru_list lru)
 {
 	return 0;
 }
Index: linux-2.6.25-mm1/mm/memcontrol.c
===================================================================
--- linux-2.6.25-mm1.orig/mm/memcontrol.c	2008-04-22 10:33:45.000000000 -0400
+++ linux-2.6.25-mm1/mm/memcontrol.c	2008-04-24 10:56:57.000000000 -0400
@@ -32,6 +32,7 @@
 #include <linux/fs.h>
 #include <linux/seq_file.h>
 #include <linux/vmalloc.h>
+#include <linux/mm_inline.h>
 
 #include <asm/uaccess.h>
 
@@ -83,22 +84,13 @@ static s64 mem_cgroup_read_stat(struct m
 /*
  * per-zone information in memory controller.
  */
-
-enum mem_cgroup_zstat_index {
-	MEM_CGROUP_ZSTAT_ACTIVE,
-	MEM_CGROUP_ZSTAT_INACTIVE,
-
-	NR_MEM_CGROUP_ZSTAT,
-};
-
 struct mem_cgroup_per_zone {
 	/*
 	 * spin_lock to protect the per cgroup LRU
 	 */
 	spinlock_t		lru_lock;
-	struct list_head	active_list;
-	struct list_head	inactive_list;
-	unsigned long count[NR_MEM_CGROUP_ZSTAT];
+	struct list_head	lists[NR_LRU_LISTS];
+	unsigned long		count[NR_LRU_LISTS];
 };
 /* Macro for accessing counter */
 #define MEM_CGROUP_ZSTAT(mz, idx)	((mz)->count[(idx)])
@@ -218,7 +210,7 @@ page_cgroup_zoneinfo(struct page_cgroup 
 }
 
 static unsigned long mem_cgroup_get_all_zonestat(struct mem_cgroup *mem,
-					enum mem_cgroup_zstat_index idx)
+					enum lru_list idx)
 {
 	int nid, zid;
 	struct mem_cgroup_per_zone *mz;
@@ -280,11 +272,9 @@ static void __mem_cgroup_remove_list(str
 			struct page_cgroup *pc)
 {
 	int from = pc->flags & PAGE_CGROUP_FLAG_ACTIVE;
+	int lru = !!from;
 
-	if (from)
-		MEM_CGROUP_ZSTAT(mz, MEM_CGROUP_ZSTAT_ACTIVE) -= 1;
-	else
-		MEM_CGROUP_ZSTAT(mz, MEM_CGROUP_ZSTAT_INACTIVE) -= 1;
+	MEM_CGROUP_ZSTAT(mz, lru) -= 1;
 
 	mem_cgroup_charge_statistics(pc->mem_cgroup, pc->flags, false);
 	list_del_init(&pc->lru);
@@ -293,37 +283,35 @@ static void __mem_cgroup_remove_list(str
 static void __mem_cgroup_add_list(struct mem_cgroup_per_zone *mz,
 				struct page_cgroup *pc)
 {
-	int to = pc->flags & PAGE_CGROUP_FLAG_ACTIVE;
+	int lru = LRU_INACTIVE;
+
+	if (pc->flags & PAGE_CGROUP_FLAG_ACTIVE)
+		lru += LRU_ACTIVE;
+
+	MEM_CGROUP_ZSTAT(mz, lru) += 1;
+	list_add(&pc->lru, &mz->lists[lru]);
 
-	if (!to) {
-		MEM_CGROUP_ZSTAT(mz, MEM_CGROUP_ZSTAT_INACTIVE) += 1;
-		list_add(&pc->lru, &mz->inactive_list);
-	} else {
-		MEM_CGROUP_ZSTAT(mz, MEM_CGROUP_ZSTAT_ACTIVE) += 1;
-		list_add(&pc->lru, &mz->active_list);
-	}
 	mem_cgroup_charge_statistics(pc->mem_cgroup, pc->flags, true);
 }
 
 static void __mem_cgroup_move_lists(struct page_cgroup *pc, bool active)
 {
-	int from = pc->flags & PAGE_CGROUP_FLAG_ACTIVE;
 	struct mem_cgroup_per_zone *mz = page_cgroup_zoneinfo(pc);
+	int lru = LRU_INACTIVE;
 
-	if (from)
-		MEM_CGROUP_ZSTAT(mz, MEM_CGROUP_ZSTAT_ACTIVE) -= 1;
-	else
-		MEM_CGROUP_ZSTAT(mz, MEM_CGROUP_ZSTAT_INACTIVE) -= 1;
+	if (pc->flags & PAGE_CGROUP_FLAG_ACTIVE)
+		lru += LRU_ACTIVE;
 
-	if (active) {
-		MEM_CGROUP_ZSTAT(mz, MEM_CGROUP_ZSTAT_ACTIVE) += 1;
+	MEM_CGROUP_ZSTAT(mz, lru) -= 1;
+
+	if (active)
 		pc->flags |= PAGE_CGROUP_FLAG_ACTIVE;
-		list_move(&pc->lru, &mz->active_list);
-	} else {
-		MEM_CGROUP_ZSTAT(mz, MEM_CGROUP_ZSTAT_INACTIVE) += 1;
+	else
 		pc->flags &= ~PAGE_CGROUP_FLAG_ACTIVE;
-		list_move(&pc->lru, &mz->inactive_list);
-	}
+
+	lru = !!active;
+	MEM_CGROUP_ZSTAT(mz, lru) += 1;
+	list_move(&pc->lru, &mz->lists[lru]);
 }
 
 int task_in_mem_cgroup(struct task_struct *task, const struct mem_cgroup *mem)
@@ -392,8 +380,8 @@ long mem_cgroup_reclaim_imbalance(struct
 {
 	unsigned long active, inactive;
 	/* active and inactive are the number of pages. 'long' is ok.*/
-	active = mem_cgroup_get_all_zonestat(mem, MEM_CGROUP_ZSTAT_ACTIVE);
-	inactive = mem_cgroup_get_all_zonestat(mem, MEM_CGROUP_ZSTAT_INACTIVE);
+	active = mem_cgroup_get_all_zonestat(mem, LRU_ACTIVE);
+	inactive = mem_cgroup_get_all_zonestat(mem, LRU_INACTIVE);
 	return (long) (active / (inactive + 1));
 }
 
@@ -424,28 +412,17 @@ void mem_cgroup_record_reclaim_priority(
  * (see include/linux/mmzone.h)
  */
 
-long mem_cgroup_calc_reclaim_active(struct mem_cgroup *mem,
-				   struct zone *zone, int priority)
+long mem_cgroup_calc_reclaim(struct mem_cgroup *mem, struct zone *zone,
+					int priority, enum lru_list lru)
 {
-	long nr_active;
+	long nr_pages;
 	int nid = zone->zone_pgdat->node_id;
 	int zid = zone_idx(zone);
 	struct mem_cgroup_per_zone *mz = mem_cgroup_zoneinfo(mem, nid, zid);
 
-	nr_active = MEM_CGROUP_ZSTAT(mz, MEM_CGROUP_ZSTAT_ACTIVE);
-	return (nr_active >> priority);
-}
-
-long mem_cgroup_calc_reclaim_inactive(struct mem_cgroup *mem,
-					struct zone *zone, int priority)
-{
-	long nr_inactive;
-	int nid = zone->zone_pgdat->node_id;
-	int zid = zone_idx(zone);
-	struct mem_cgroup_per_zone *mz = mem_cgroup_zoneinfo(mem, nid, zid);
+	nr_pages = MEM_CGROUP_ZSTAT(mz, lru);
 
-	nr_inactive = MEM_CGROUP_ZSTAT(mz, MEM_CGROUP_ZSTAT_INACTIVE);
-	return (nr_inactive >> priority);
+	return (nr_pages >> priority);
 }
 
 unsigned long mem_cgroup_isolate_pages(unsigned long nr_to_scan,
@@ -464,14 +441,11 @@ unsigned long mem_cgroup_isolate_pages(u
 	int nid = z->zone_pgdat->node_id;
 	int zid = zone_idx(z);
 	struct mem_cgroup_per_zone *mz;
+	int lru = !!active;
 
 	BUG_ON(!mem_cont);
 	mz = mem_cgroup_zoneinfo(mem_cont, nid, zid);
-	if (active)
-		src = &mz->active_list;
-	else
-		src = &mz->inactive_list;
-
+	src = &mz->lists[lru];
 
 	spin_lock(&mz->lru_lock);
 	scan = 0;
@@ -762,7 +736,7 @@ void mem_cgroup_page_migration(struct pa
 #define FORCE_UNCHARGE_BATCH	(128)
 static void mem_cgroup_force_empty_list(struct mem_cgroup *mem,
 			    struct mem_cgroup_per_zone *mz,
-			    int active)
+			    enum lru_list lru)
 {
 	struct page_cgroup *pc;
 	struct page *page;
@@ -770,10 +744,7 @@ static void mem_cgroup_force_empty_list(
 	unsigned long flags;
 	struct list_head *list;
 
-	if (active)
-		list = &mz->active_list;
-	else
-		list = &mz->inactive_list;
+	list = &mz->lists[lru];
 
 	spin_lock_irqsave(&mz->lru_lock, flags);
 	while (!list_empty(list)) {
@@ -816,11 +787,10 @@ static int mem_cgroup_force_empty(struct
 		for_each_node_state(node, N_POSSIBLE)
 			for (zid = 0; zid < MAX_NR_ZONES; zid++) {
 				struct mem_cgroup_per_zone *mz;
+				enum lru_list l;
 				mz = mem_cgroup_zoneinfo(mem, node, zid);
-				/* drop all page_cgroup in active_list */
-				mem_cgroup_force_empty_list(mem, mz, 1);
-				/* drop all page_cgroup in inactive_list */
-				mem_cgroup_force_empty_list(mem, mz, 0);
+				for_each_lru(l)
+					mem_cgroup_force_empty_list(mem, mz, l);
 			}
 	}
 	ret = 0;
@@ -905,9 +875,9 @@ static int mem_control_stat_show(struct 
 		unsigned long active, inactive;
 
 		inactive = mem_cgroup_get_all_zonestat(mem_cont,
-						MEM_CGROUP_ZSTAT_INACTIVE);
+						LRU_INACTIVE);
 		active = mem_cgroup_get_all_zonestat(mem_cont,
-						MEM_CGROUP_ZSTAT_ACTIVE);
+						LRU_ACTIVE);
 		cb->fill(cb, "active", (active) * PAGE_SIZE);
 		cb->fill(cb, "inactive", (inactive) * PAGE_SIZE);
 	}
@@ -952,6 +922,7 @@ static int alloc_mem_cgroup_per_zone_inf
 {
 	struct mem_cgroup_per_node *pn;
 	struct mem_cgroup_per_zone *mz;
+	enum lru_list l;
 	int zone, tmp = node;
 	/*
 	 * This routine is called against possible nodes.
@@ -972,9 +943,9 @@ static int alloc_mem_cgroup_per_zone_inf
 
 	for (zone = 0; zone < MAX_NR_ZONES; zone++) {
 		mz = &pn->zoneinfo[zone];
-		INIT_LIST_HEAD(&mz->active_list);
-		INIT_LIST_HEAD(&mz->inactive_list);
 		spin_lock_init(&mz->lru_lock);
+		for_each_lru(l)
+			INIT_LIST_HEAD(&mz->lists[l]);
 	}
 	return 0;
 }

-- 
All Rights Reversed



* [PATCH -mm 04/15] use an array for the LRU pagevecs
  2008-04-28 18:18 [PATCH -mm 00/15] VM pageout scalability improvements (V6) Rik van Riel
                   ` (2 preceding siblings ...)
  2008-04-28 18:18 ` [PATCH -mm 03/15] Use an indexed array for LRU variables Rik van Riel
@ 2008-04-28 18:18 ` Rik van Riel
  2008-04-28 18:18 ` [PATCH -mm 05/15] free swap space on swap-in/activation Rik van Riel
                   ` (12 subsequent siblings)
  16 siblings, 0 replies; 34+ messages in thread
From: Rik van Riel @ 2008-04-28 18:18 UTC (permalink / raw)
  To: linux-kernel; +Cc: lee.schermerhorn, akpm, kosaki.motohiro

[-- Attachment #1: pagevec-array.patch --]
[-- Type: text/plain, Size: 9149 bytes --]

Turn the pagevecs into an array just like the LRUs.  This significantly
cleans up the source code and reduces the size of the kernel by about
13kB after all the LRU lists have been created further down in the split
VM patch series.
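
For context, the point of the per-CPU pagevecs is batching: pages are
queued locally and zone->lru_lock is only taken once per full vector.
The sketch below just restates __lru_cache_add() from the diff, with an
illustrative name:

	/* Batching sketch; compare __lru_cache_add() in mm/swap.c below. */
	static void demo_queue_for_lru(struct page *page, enum lru_list lru)
	{
		struct pagevec *pvec = &get_cpu_var(lru_add_pvecs)[lru];

		page_cache_get(page);		/* hold a reference while queued */
		if (!pagevec_add(pvec, page))	/* 0 means the vector is now full */
			____pagevec_lru_add(pvec, lru);	/* flush the batch under lru_lock */
		put_cpu_var(lru_add_pvecs);
	}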

Signed-off-by: Rik van Riel <riel@redhat.com>
Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>

 include/linux/mmzone.h  |   15 +++++-
 include/linux/pagevec.h |   13 ++++-
 include/linux/swap.h    |   18 ++++++-
 mm/migrate.c            |   11 ----
 mm/swap.c               |   87 +++++++++++++++-----------------------
 5 files changed, 76 insertions(+), 68 deletions(-)

Index: linux-2.6.25-mm1/include/linux/mmzone.h
===================================================================
--- linux-2.6.25-mm1.orig/include/linux/mmzone.h	2008-04-24 10:46:43.000000000 -0400
+++ linux-2.6.25-mm1/include/linux/mmzone.h	2008-04-24 11:59:51.000000000 -0400
@@ -107,13 +107,22 @@ enum zone_stat_item {
 #endif
 	NR_VM_ZONE_STAT_ITEMS };
 
+#define LRU_BASE 0
+
 enum lru_list {
-	LRU_INACTIVE,	/* must match order of NR_[IN]ACTIVE */
-	LRU_ACTIVE,	/*  "     "     "   "       "        */
+	LRU_INACTIVE = LRU_BASE,	/* must match order of NR_[IN]ACTIVE */
+	LRU_ACTIVE,			/*  "     "     "   "       "        */
 	NR_LRU_LISTS };
 
 #define for_each_lru(l) for (l = 0; l < NR_LRU_LISTS; l++)
 
+static inline int is_active_lru(enum lru_list l)
+{
+	return (l == LRU_ACTIVE);
+}
+
+enum lru_list page_lru(struct page *page);
+
 struct per_cpu_pages {
 	int count;		/* number of pages in the list */
 	int high;		/* high watermark, emptying needed */
Index: linux-2.6.25-mm1/mm/swap.c
===================================================================
--- linux-2.6.25-mm1.orig/mm/swap.c	2008-04-24 10:46:43.000000000 -0400
+++ linux-2.6.25-mm1/mm/swap.c	2008-04-24 11:59:51.000000000 -0400
@@ -34,8 +34,7 @@
 /* How many pages do we try to swap or page in/out together? */
 int page_cluster;
 
-static DEFINE_PER_CPU(struct pagevec, lru_add_pvecs) = { 0, };
-static DEFINE_PER_CPU(struct pagevec, lru_add_active_pvecs) = { 0, };
+static DEFINE_PER_CPU(struct pagevec[NR_LRU_LISTS], lru_add_pvecs) = { {0,}, };
 static DEFINE_PER_CPU(struct pagevec, lru_rotate_pvecs) = { 0, };
 
 /*
@@ -96,6 +95,23 @@ void put_pages_list(struct list_head *pa
 }
 EXPORT_SYMBOL(put_pages_list);
 
+/**
+ * page_lru - which LRU list should a page be on?
+ * @page: the page to test
+ *
+ * Returns the LRU list a page should be on, as an index
+ * into the array of LRU lists.
+ */
+enum lru_list page_lru(struct page *page)
+{
+	enum lru_list lru = LRU_BASE;
+
+	if (PageActive(page))
+		lru += LRU_ACTIVE;
+
+	return lru;
+}
+
 /*
  * pagevec_move_tail() must be called with IRQ disabled.
  * Otherwise this may cause nasty races.
@@ -186,28 +202,29 @@ void mark_page_accessed(struct page *pag
 
 EXPORT_SYMBOL(mark_page_accessed);
 
-/**
- * lru_cache_add: add a page to the page lists
- * @page: the page to add
- */
-void lru_cache_add(struct page *page)
+void __lru_cache_add(struct page *page, enum lru_list lru)
 {
-	struct pagevec *pvec = &get_cpu_var(lru_add_pvecs);
+	struct pagevec *pvec = &get_cpu_var(lru_add_pvecs)[lru];
 
 	page_cache_get(page);
 	if (!pagevec_add(pvec, page))
-		__pagevec_lru_add(pvec);
+		____pagevec_lru_add(pvec, lru);
 	put_cpu_var(lru_add_pvecs);
 }
 
-void lru_cache_add_active(struct page *page)
+/**
+ * lru_cache_add_lru - add a page to a page list
+ * @page: the page to be added to the LRU.
+ * @lru: the LRU list to which the page is added.
+ */
+void lru_cache_add_lru(struct page *page, enum lru_list lru)
 {
-	struct pagevec *pvec = &get_cpu_var(lru_add_active_pvecs);
+	if (PageActive(page)) {
+		ClearPageActive(page);
+	}
 
-	page_cache_get(page);
-	if (!pagevec_add(pvec, page))
-		__pagevec_lru_add_active(pvec);
-	put_cpu_var(lru_add_active_pvecs);
+	VM_BUG_ON(PageLRU(page) || PageActive(page));
+	__lru_cache_add(page, lru);
 }
 
 /*
@@ -217,15 +234,15 @@ void lru_cache_add_active(struct page *p
  */
 static void drain_cpu_pagevecs(int cpu)
 {
+	struct pagevec *pvecs = per_cpu(lru_add_pvecs, cpu);
 	struct pagevec *pvec;
+	int lru;
 
-	pvec = &per_cpu(lru_add_pvecs, cpu);
-	if (pagevec_count(pvec))
-		__pagevec_lru_add(pvec);
-
-	pvec = &per_cpu(lru_add_active_pvecs, cpu);
-	if (pagevec_count(pvec))
-		__pagevec_lru_add_active(pvec);
+	for_each_lru(lru) {
+		pvec = &pvecs[lru - LRU_BASE];
+		if (pagevec_count(pvec))
+			____pagevec_lru_add(pvec, lru);
+	}
 
 	pvec = &per_cpu(lru_rotate_pvecs, cpu);
 	if (pagevec_count(pvec)) {
@@ -379,7 +396,7 @@ void __pagevec_release_nonlru(struct pag
  * Add the passed pages to the LRU, then drop the caller's refcount
  * on them.  Reinitialises the caller's pagevec.
  */
-void __pagevec_lru_add(struct pagevec *pvec)
+void ____pagevec_lru_add(struct pagevec *pvec, enum lru_list lru)
 {
 	int i;
 	struct zone *zone = NULL;
@@ -396,7 +413,9 @@ void __pagevec_lru_add(struct pagevec *p
 		}
 		VM_BUG_ON(PageLRU(page));
 		SetPageLRU(page);
-		add_page_to_inactive_list(zone, page);
+		if (is_active_lru(lru))
+			SetPageActive(page);
+		add_page_to_lru_list(zone, page, lru);
 	}
 	if (zone)
 		spin_unlock_irq(&zone->lru_lock);
@@ -404,34 +423,7 @@ void __pagevec_lru_add(struct pagevec *p
 	pagevec_reinit(pvec);
 }
 
-EXPORT_SYMBOL(__pagevec_lru_add);
-
-void __pagevec_lru_add_active(struct pagevec *pvec)
-{
-	int i;
-	struct zone *zone = NULL;
-
-	for (i = 0; i < pagevec_count(pvec); i++) {
-		struct page *page = pvec->pages[i];
-		struct zone *pagezone = page_zone(page);
-
-		if (pagezone != zone) {
-			if (zone)
-				spin_unlock_irq(&zone->lru_lock);
-			zone = pagezone;
-			spin_lock_irq(&zone->lru_lock);
-		}
-		VM_BUG_ON(PageLRU(page));
-		SetPageLRU(page);
-		VM_BUG_ON(PageActive(page));
-		SetPageActive(page);
-		add_page_to_active_list(zone, page);
-	}
-	if (zone)
-		spin_unlock_irq(&zone->lru_lock);
-	release_pages(pvec->pages, pvec->nr, pvec->cold);
-	pagevec_reinit(pvec);
-}
+EXPORT_SYMBOL(____pagevec_lru_add);
 
 /*
  * Try to drop buffers from the pages in a pagevec
Index: linux-2.6.25-mm1/include/linux/pagevec.h
===================================================================
--- linux-2.6.25-mm1.orig/include/linux/pagevec.h	2007-07-08 19:32:17.000000000 -0400
+++ linux-2.6.25-mm1/include/linux/pagevec.h	2008-04-24 11:59:51.000000000 -0400
@@ -23,8 +23,7 @@ struct pagevec {
 void __pagevec_release(struct pagevec *pvec);
 void __pagevec_release_nonlru(struct pagevec *pvec);
 void __pagevec_free(struct pagevec *pvec);
-void __pagevec_lru_add(struct pagevec *pvec);
-void __pagevec_lru_add_active(struct pagevec *pvec);
+void ____pagevec_lru_add(struct pagevec *pvec, enum lru_list lru);
 void pagevec_strip(struct pagevec *pvec);
 unsigned pagevec_lookup(struct pagevec *pvec, struct address_space *mapping,
 		pgoff_t start, unsigned nr_pages);
@@ -81,6 +80,16 @@ static inline void pagevec_free(struct p
 		__pagevec_free(pvec);
 }
 
+static inline void __pagevec_lru_add(struct pagevec *pvec)
+{
+	____pagevec_lru_add(pvec, LRU_INACTIVE);
+}
+
+static inline void __pagevec_lru_add_active(struct pagevec *pvec)
+{
+	____pagevec_lru_add(pvec, LRU_ACTIVE);
+}
+
 static inline void pagevec_lru_add(struct pagevec *pvec)
 {
 	if (pagevec_count(pvec))
Index: linux-2.6.25-mm1/include/linux/swap.h
===================================================================
--- linux-2.6.25-mm1.orig/include/linux/swap.h	2008-04-22 10:33:44.000000000 -0400
+++ linux-2.6.25-mm1/include/linux/swap.h	2008-04-24 11:59:51.000000000 -0400
@@ -171,8 +171,8 @@ extern unsigned int nr_free_pagecache_pa
 
 
 /* linux/mm/swap.c */
-extern void lru_cache_add(struct page *);
-extern void lru_cache_add_active(struct page *);
+extern void __lru_cache_add(struct page *, enum lru_list lru);
+extern void lru_cache_add_lru(struct page *, enum lru_list lru);
 extern void activate_page(struct page *);
 extern void mark_page_accessed(struct page *);
 extern void lru_add_drain(void);
@@ -180,6 +180,20 @@ extern int lru_add_drain_all(void);
 extern void rotate_reclaimable_page(struct page *page);
 extern void swap_setup(void);
 
+/**
+ * lru_cache_add: add a page to the page lists
+ * @page: the page to add
+ */
+static inline void lru_cache_add(struct page *page)
+{
+	__lru_cache_add(page, LRU_INACTIVE);
+}
+
+static inline void lru_cache_add_active(struct page *page)
+{
+	__lru_cache_add(page, LRU_ACTIVE);
+}
+
 /* linux/mm/vmscan.c */
 extern unsigned long try_to_free_pages(struct zonelist *zonelist, int order,
 					gfp_t gfp_mask);
Index: linux-2.6.25-mm1/mm/migrate.c
===================================================================
--- linux-2.6.25-mm1.orig/mm/migrate.c	2008-04-23 17:31:13.000000000 -0400
+++ linux-2.6.25-mm1/mm/migrate.c	2008-04-24 11:59:51.000000000 -0400
@@ -54,16 +54,7 @@ int migrate_prep(void)
 
 static inline void move_to_lru(struct page *page)
 {
-	if (PageActive(page)) {
-		/*
-		 * lru_cache_add_active checks that
-		 * the PG_active bit is off.
-		 */
-		ClearPageActive(page);
-		lru_cache_add_active(page);
-	} else {
-		lru_cache_add(page);
-	}
+	lru_cache_add_lru(page, page_lru(page));
 	put_page(page);
 }
 

-- 
All Rights Reversed



* [PATCH -mm 05/15] free swap space on swap-in/activation
  2008-04-28 18:18 [PATCH -mm 00/15] VM pageout scalability improvements (V6) Rik van Riel
                   ` (3 preceding siblings ...)
  2008-04-28 18:18 ` [PATCH -mm 04/15] use an array for the LRU pagevecs Rik van Riel
@ 2008-04-28 18:18 ` Rik van Riel
  2008-05-12 11:21   ` Daisuke Nishimura
  2008-04-28 18:18 ` [PATCH -mm 06/15] define page_file_cache() function Rik van Riel
                   ` (11 subsequent siblings)
  16 siblings, 1 reply; 34+ messages in thread
From: Rik van Riel @ 2008-04-28 18:18 UTC (permalink / raw)
  To: linux-kernel; +Cc: lee.schermerhorn, akpm, kosaki.motohiro, Lee Schermerhorn

[-- Attachment #1: rvr-00-linux-2.6-swapfree.patch --]
[-- Type: text/plain, Size: 2852 bytes --]

Free swap cache entries when swapping in pages if vm_swap_full()
[more than 1/2 of swap space is used].  Uses a new pagevec helper
to reduce pressure on locks.
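
For reference, vm_swap_full() is (roughly) the "more than half of swap
in use" check referred to above; the real definition lives in
include/linux/swap.h, so the helper below is only a hedged restatement:

	/* Approximate restatement of vm_swap_full(); nr_swap_pages counts
	 * free swap slots, total_swap_pages the total size of all swap. */
	static inline int demo_vm_swap_full(void)
	{
		return nr_swap_pages * 2 < total_swap_pages;
	}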

Signed-off-by: Rik van Riel <riel@redhat.com>
Signed-off-by: Lee Schermerhorn <Lee.Schermerhorn@hp.com>

Index: linux-2.6.25-mm1/mm/vmscan.c
===================================================================
--- linux-2.6.25-mm1.orig/mm/vmscan.c	2008-04-24 10:46:43.000000000 -0400
+++ linux-2.6.25-mm1/mm/vmscan.c	2008-04-24 11:59:56.000000000 -0400
@@ -619,6 +619,9 @@ free_it:
 		continue;
 
 activate_locked:
+		/* Not a candidate for swapping, so reclaim swap space. */
+		if (PageSwapCache(page) && vm_swap_full())
+			remove_exclusive_swap_page(page);
 		SetPageActive(page);
 		pgactivate++;
 keep_locked:
@@ -1203,6 +1206,8 @@ static void shrink_active_list(unsigned 
 			__mod_zone_page_state(zone, NR_ACTIVE, pgmoved);
 			pgmoved = 0;
 			spin_unlock_irq(&zone->lru_lock);
+			if (vm_swap_full())
+				pagevec_swap_free(&pvec);
 			__pagevec_release(&pvec);
 			spin_lock_irq(&zone->lru_lock);
 		}
@@ -1212,6 +1217,8 @@ static void shrink_active_list(unsigned 
 	__count_zone_vm_events(PGREFILL, zone, pgscanned);
 	__count_vm_events(PGDEACTIVATE, pgdeactivate);
 	spin_unlock_irq(&zone->lru_lock);
+	if (vm_swap_full())
+		pagevec_swap_free(&pvec);
 
 	pagevec_release(&pvec);
 }
Index: linux-2.6.25-mm1/mm/swap.c
===================================================================
--- linux-2.6.25-mm1.orig/mm/swap.c	2008-04-24 11:59:51.000000000 -0400
+++ linux-2.6.25-mm1/mm/swap.c	2008-04-24 11:59:56.000000000 -0400
@@ -443,6 +443,24 @@ void pagevec_strip(struct pagevec *pvec)
 	}
 }
 
+/*
+ * Try to free swap space from the pages in a pagevec
+ */
+void pagevec_swap_free(struct pagevec *pvec)
+{
+	int i;
+
+	for (i = 0; i < pagevec_count(pvec); i++) {
+		struct page *page = pvec->pages[i];
+
+		if (PageSwapCache(page) && !TestSetPageLocked(page)) {
+			if (PageSwapCache(page))
+				remove_exclusive_swap_page(page);
+			unlock_page(page);
+		}
+	}
+}
+
 /**
  * pagevec_lookup - gang pagecache lookup
  * @pvec:	Where the resulting pages are placed
Index: linux-2.6.25-mm1/include/linux/pagevec.h
===================================================================
--- linux-2.6.25-mm1.orig/include/linux/pagevec.h	2008-04-24 11:59:51.000000000 -0400
+++ linux-2.6.25-mm1/include/linux/pagevec.h	2008-04-24 11:59:56.000000000 -0400
@@ -25,6 +25,7 @@ void __pagevec_release_nonlru(struct pag
 void __pagevec_free(struct pagevec *pvec);
 void ____pagevec_lru_add(struct pagevec *pvec, enum lru_list lru);
 void pagevec_strip(struct pagevec *pvec);
+void pagevec_swap_free(struct pagevec *pvec);
 unsigned pagevec_lookup(struct pagevec *pvec, struct address_space *mapping,
 		pgoff_t start, unsigned nr_pages);
 unsigned pagevec_lookup_tag(struct pagevec *pvec,

-- 
All Rights Reversed



* [PATCH -mm 06/15] define page_file_cache() function
  2008-04-28 18:18 [PATCH -mm 00/15] VM pageout scalability improvements (V6) Rik van Riel
                   ` (4 preceding siblings ...)
  2008-04-28 18:18 ` [PATCH -mm 05/15] free swap space on swap-in/activation Rik van Riel
@ 2008-04-28 18:18 ` Rik van Riel
  2008-04-28 18:18 ` [PATCH -mm 07/15] split LRU lists into anon & file sets Rik van Riel
                   ` (10 subsequent siblings)
  16 siblings, 0 replies; 34+ messages in thread
From: Rik van Riel @ 2008-04-28 18:18 UTC (permalink / raw)
  To: linux-kernel; +Cc: lee.schermerhorn, akpm, kosaki.motohiro

[-- Attachment #1: rvr-01-linux-2.6-page_file_cache.patch --]
[-- Type: text/plain, Size: 6641 bytes --]

Define page_file_cache() function to answer the question:
	is page backed by a file?

Originally part of Rik van Riel's split-lru patch.  Extracted
to make available for other, independent reclaim patches.

Moved inline function to linux/mm_inline.h where it will
be needed by subsequent "split LRU" and "noreclaim" patches.  

Unfortunately this needs to use a page flag, since the
PG_swapbacked state needs to be preserved all the way
to the point where the page is last removed from the
LRU.  Trying to derive the status from other info in
the page resulted in wrong VM statistics in earlier
split VM patchsets.
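
A hedged sketch of the intended use; the caller below is hypothetical,
and the _ANON/_FILE list indices are only introduced by the next patch
in the series:

	/* Hypothetical caller -- not a hunk from this patch. */
	static void demo_add_to_proper_lru(struct zone *zone, struct page *page)
	{
		enum lru_list lru = LRU_INACTIVE_ANON;

		if (page_file_cache(page))	/* !0: regular filesystem page cache */
			lru = LRU_INACTIVE_FILE;

		add_page_to_lru_list(zone, page, lru);
	}

The return value of 2 (rather than 1) looks chosen so that it can also
be added directly to an LRU index once the file lists sit two slots
above the anon lists, but only the 0/!0 contract is documented in the
kernel-doc above.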


Signed-off-by:  Rik van Riel <riel@redhat.com>
Signed-off-by:  Lee Schermerhorn <lee.schermerhorn@hp.com>


Index: linux-2.6.25-mm1/include/linux/mm_inline.h
===================================================================
--- linux-2.6.25-mm1.orig/include/linux/mm_inline.h	2008-04-24 10:46:43.000000000 -0400
+++ linux-2.6.25-mm1/include/linux/mm_inline.h	2008-04-24 12:00:01.000000000 -0400
@@ -1,3 +1,26 @@
+#ifndef LINUX_MM_INLINE_H
+#define LINUX_MM_INLINE_H
+
+/**
+ * page_file_cache - should the page be on a file LRU or anon LRU?
+ * @page: the page to test
+ *
+ * Returns !0 if @page is page cache page backed by a regular filesystem,
+ * or 0 if @page is anonymous, tmpfs or otherwise ram or swap backed.
+ *
+ * We would like to get this info without a page flag, but the state
+ * needs to survive until the page is last deleted from the LRU, which
+ * could be as far down as __page_cache_release.
+ */
+static inline int page_file_cache(struct page *page)
+{
+	if (PageSwapBacked(page))
+		return 0;
+
+	/* The page is page cache backed by a normal filesystem. */
+	return 2;
+}
+
 static inline void
 add_page_to_lru_list(struct zone *zone, struct page *page, enum lru_list l)
 {
@@ -49,3 +72,4 @@ del_page_from_lru(struct zone *zone, str
 	__dec_zone_state(zone, NR_INACTIVE + l);
 }
 
+#endif
Index: linux-2.6.25-mm1/mm/shmem.c
===================================================================
--- linux-2.6.25-mm1.orig/mm/shmem.c	2008-04-22 10:33:45.000000000 -0400
+++ linux-2.6.25-mm1/mm/shmem.c	2008-04-24 12:00:01.000000000 -0400
@@ -1378,6 +1378,7 @@ repeat:
 				goto failed;
 			}
 
+			SetPageSwapBacked(filepage);
 			spin_lock(&info->lock);
 			entry = shmem_swp_alloc(info, idx, sgp);
 			if (IS_ERR(entry))
Index: linux-2.6.25-mm1/include/linux/page-flags.h
===================================================================
--- linux-2.6.25-mm1.orig/include/linux/page-flags.h	2008-04-22 10:33:44.000000000 -0400
+++ linux-2.6.25-mm1/include/linux/page-flags.h	2008-04-24 12:00:01.000000000 -0400
@@ -93,6 +93,7 @@ enum pageflags {
 	PG_mappedtodisk,	/* Has blocks allocated on-disk */
 	PG_reclaim,		/* To be reclaimed asap */
 	PG_buddy,		/* Page is free, on buddy lists */
+	PG_swapbacked,		/* Page is backed by RAM/swap */
 #ifdef CONFIG_IA64_UNCACHED_ALLOCATOR
 	PG_uncached,		/* Page has been mapped as uncached */
 #endif
@@ -160,6 +161,7 @@ PAGEFLAG(Pinned, owner_priv_1) TESTSCFLA
 PAGEFLAG(Reserved, reserved) __CLEARPAGEFLAG(Reserved, reserved)
 PAGEFLAG(Private, private) __CLEARPAGEFLAG(Private, private)
 	__SETPAGEFLAG(Private, private)
+PAGEFLAG(SwapBacked, swapbacked) __CLEARPAGEFLAG(SwapBacked, swapbacked)
 
 /*
  * Only test-and-set exist for PG_writeback.  The unconditional operators are
Index: linux-2.6.25-mm1/mm/memory.c
===================================================================
--- linux-2.6.25-mm1.orig/mm/memory.c	2008-04-22 10:33:45.000000000 -0400
+++ linux-2.6.25-mm1/mm/memory.c	2008-04-24 12:00:01.000000000 -0400
@@ -1756,6 +1756,7 @@ gotten:
 		ptep_clear_flush(vma, address, page_table);
 		set_pte_at(mm, address, page_table, entry);
 		update_mmu_cache(vma, address, entry);
+		SetPageSwapBacked(new_page);
 		lru_cache_add_active(new_page);
 		page_add_new_anon_rmap(new_page, vma, address);
 
@@ -2224,6 +2225,7 @@ static int do_anonymous_page(struct mm_s
 	if (!pte_none(*page_table))
 		goto release;
 	inc_mm_counter(mm, anon_rss);
+	SetPageSwapBacked(page);
 	lru_cache_add_active(page);
 	page_add_new_anon_rmap(page, vma, address);
 	set_pte_at(mm, address, page_table, entry);
@@ -2367,6 +2369,7 @@ static int __do_fault(struct mm_struct *
 		set_pte_at(mm, address, page_table, entry);
 		if (anon) {
                         inc_mm_counter(mm, anon_rss);
+			SetPageSwapBacked(page);
                         lru_cache_add_active(page);
                         page_add_new_anon_rmap(page, vma, address);
 		} else {
Index: linux-2.6.25-mm1/mm/swap_state.c
===================================================================
--- linux-2.6.25-mm1.orig/mm/swap_state.c	2008-04-22 10:33:45.000000000 -0400
+++ linux-2.6.25-mm1/mm/swap_state.c	2008-04-24 12:00:01.000000000 -0400
@@ -82,6 +82,7 @@ int add_to_swap_cache(struct page *page,
 		if (!error) {
 			page_cache_get(page);
 			SetPageSwapCache(page);
+			SetPageSwapBacked(page);
 			set_page_private(page, entry.val);
 			total_swapcache_pages++;
 			__inc_zone_page_state(page, NR_FILE_PAGES);
Index: linux-2.6.25-mm1/mm/migrate.c
===================================================================
--- linux-2.6.25-mm1.orig/mm/migrate.c	2008-04-24 11:59:51.000000000 -0400
+++ linux-2.6.25-mm1/mm/migrate.c	2008-04-24 12:00:01.000000000 -0400
@@ -546,6 +546,8 @@ static int move_to_new_page(struct page 
 	/* Prepare mapping for the new page.*/
 	newpage->index = page->index;
 	newpage->mapping = page->mapping;
+	if (PageSwapBacked(page))
+		SetPageSwapBacked(newpage);
 
 	mapping = page_mapping(page);
 	if (!mapping)
Index: linux-2.6.25-mm1/mm/page_alloc.c
===================================================================
--- linux-2.6.25-mm1.orig/mm/page_alloc.c	2008-04-24 10:46:43.000000000 -0400
+++ linux-2.6.25-mm1/mm/page_alloc.c	2008-04-24 12:00:01.000000000 -0400
@@ -261,6 +261,7 @@ static void bad_page(struct page *page)
 			1 << PG_slab    |
 			1 << PG_swapcache |
 			1 << PG_writeback |
+			1 << PG_swapbacked |
 			1 << PG_buddy );
 	set_page_count(page, 0);
 	reset_page_mapcount(page);
@@ -494,6 +495,8 @@ static inline int free_pages_check(struc
 		bad_page(page);
 	if (PageDirty(page))
 		__ClearPageDirty(page);
+	if (PageSwapBacked(page))
+		__ClearPageSwapBacked(page);
 	/*
 	 * For now, we report if PG_reserved was found set, but do not
 	 * clear it, and do not free the page.  But we shall soon need
@@ -644,6 +647,7 @@ static int prep_new_page(struct page *pa
 			1 << PG_swapcache |
 			1 << PG_writeback |
 			1 << PG_reserved |
+			1 << PG_swapbacked |
 			1 << PG_buddy ))))
 		bad_page(page);
 

-- 
All Rights Reversed




* [PATCH -mm 07/15] split LRU lists into anon & file sets
  2008-04-28 18:18 [PATCH -mm 00/15] VM pageout scalability improvements (V6) Rik van Riel
                   ` (5 preceding siblings ...)
  2008-04-28 18:18 ` [PATCH -mm 06/15] define page_file_cache() function Rik van Riel
@ 2008-04-28 18:18 ` Rik van Riel
  2008-05-10  7:50   ` MinChan Kim
  2008-04-28 18:18 ` [PATCH -mm 08/15] SEQ replacement for anonymous pages Rik van Riel
                   ` (9 subsequent siblings)
  16 siblings, 1 reply; 34+ messages in thread
From: Rik van Riel @ 2008-04-28 18:18 UTC (permalink / raw)
  To: linux-kernel; +Cc: lee.schermerhorn, akpm, kosaki.motohiro, Lee Schermerhorn

[-- Attachment #1: rvr-02-linux-2.6-vm-split-lrus.patch --]
[-- Type: text/plain, Size: 58555 bytes --]

Split the LRU lists in two, one set for pages that are backed by
real file systems ("file") and one for pages that are backed by
memory and swap ("anon").  The latter includes tmpfs.
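
As an aid to reading the diff, here is a small standalone sketch
(not part of the patch) of how the active and file bits select one
of the four LRU lists.  The LRU_* values match the ones this patch
adds to include/linux/mmzone.h; which_lru() is a hypothetical
stand-in for the kernel's page_lru():

#include <stdio.h>

enum lru_list {
	LRU_INACTIVE_ANON = 0,	/* LRU_BASE */
	LRU_ACTIVE_ANON   = 1,	/* LRU_BASE + LRU_ACTIVE */
	LRU_INACTIVE_FILE = 2,	/* LRU_BASE + LRU_FILE */
	LRU_ACTIVE_FILE   = 3,	/* LRU_BASE + LRU_FILE + LRU_ACTIVE */
	NR_LRU_LISTS
};

/* PageActive() adds LRU_ACTIVE (1), page_file_cache() adds LRU_FILE (2) */
static enum lru_list which_lru(int active, int file)
{
	return (file ? 2 : 0) + (active ? 1 : 0);
}

int main(void)
{
	printf("inactive anon -> %d\n", which_lru(0, 0));
	printf("active anon   -> %d\n", which_lru(1, 0));
	printf("inactive file -> %d\n", which_lru(0, 1));
	printf("active file   -> %d\n", which_lru(1, 1));
	return 0;
}

The same ordering is why the NR_*_ANON/NR_*_FILE items in
enum zone_stat_item must stay in step with enum lru_list:
helpers like add_page_to_lru_list() simply do
__inc_zone_state(zone, NR_INACTIVE_ANON + l).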

Eventually mlocked pages will be taken off the LRUs altogether.
A patch for that already exists and just needs to be integrated
into this series.

This patch mostly adds the infrastructure and a basic policy for
balancing how much we scan the anon lists against how much we scan
the file lists.  The big policy changes are in separate patches.
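
A similarly simplified sketch (again not from the patch; the helper
name and the example numbers are made up) of the balancing policy
that get_scan_ratio() below implements: pageout pressure is split
between the anon and file sets according to their size, how many of
their pages were recently rotated (found referenced) and swappiness:

#include <stdio.h>

/*
 * Simplified model of get_scan_ratio(): no locking and no decay of
 * the recent_rotated counters.  Inputs are example numbers only.
 */
static void scan_percentages(unsigned long anon, unsigned long file,
			     unsigned long rotated_anon,
			     unsigned long rotated_file,
			     unsigned long swappiness,
			     unsigned long percent[2])
{
	unsigned long anon_prio = swappiness;	    /* scan cost of anon */
	unsigned long file_prio = 200 - swappiness; /* scan cost of file */
	unsigned long rotate_sum = rotated_anon + rotated_file;
	unsigned long ap, fp;

	ap = (anon_prio * anon) / (anon + file + 1);
	ap *= rotate_sum / (rotated_anon + 1);
	percent[0] = ap < 1 ? 1 : (ap > 100 ? 100 : ap);

	fp = (file_prio * file) / (anon + file + 1);
	fp *= rotate_sum / (rotated_file + 1);
	percent[1] = fp < 1 ? 1 : (fp > 100 ? 100 : fp);
}

int main(void)
{
	unsigned long percent[2];

	/* mostly file pages; file pages were referenced the most */
	scan_percentages(1000, 9000, 50, 450, 60, percent);
	printf("scan anon %lu%%, file %lu%%\n", percent[0], percent[1]);
	return 0;
}

shrink_zone() then uses these percentages to scale how many pages
are scanned from each list (nr[l] = nr_scan[l] * percent[file] / 100
in the diff), so the set whose pages keep getting rotated back gets
proportionally less pressure.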

Signed-off-by: Rik van Riel <riel@redhat.com>
Signed-off-by: Lee Schermerhorn <Lee.Schermerhorn@hp.com>

Index: linux-2.6.25-mm1/fs/proc/proc_misc.c
===================================================================
--- linux-2.6.25-mm1.orig/fs/proc/proc_misc.c	2008-04-22 10:33:44.000000000 -0400
+++ linux-2.6.25-mm1/fs/proc/proc_misc.c	2008-04-24 12:01:36.000000000 -0400
@@ -132,6 +132,10 @@ static int meminfo_read_proc(char *page,
 	unsigned long allowed;
 	struct vmalloc_info vmi;
 	long cached;
+	unsigned long inactive_anon;
+	unsigned long active_anon;
+	unsigned long inactive_file;
+	unsigned long active_file;
 
 /*
  * display in kilobytes.
@@ -150,48 +154,61 @@ static int meminfo_read_proc(char *page,
 
 	get_vmalloc_info(&vmi);
 
+	inactive_anon = global_page_state(NR_INACTIVE_ANON);
+	active_anon   = global_page_state(NR_ACTIVE_ANON);
+	inactive_file = global_page_state(NR_INACTIVE_FILE);
+	active_file   = global_page_state(NR_ACTIVE_FILE);
+
 	/*
 	 * Tagged format, for easy grepping and expansion.
 	 */
 	len = sprintf(page,
-		"MemTotal:     %8lu kB\n"
-		"MemFree:      %8lu kB\n"
-		"Buffers:      %8lu kB\n"
-		"Cached:       %8lu kB\n"
-		"SwapCached:   %8lu kB\n"
-		"Active:       %8lu kB\n"
-		"Inactive:     %8lu kB\n"
+		"MemTotal:       %8lu kB\n"
+		"MemFree:        %8lu kB\n"
+		"Buffers:        %8lu kB\n"
+		"Cached:         %8lu kB\n"
+		"SwapCached:     %8lu kB\n"
+		"Active:         %8lu kB\n"
+		"Inactive:       %8lu kB\n"
+		"Active(anon):   %8lu kB\n"
+		"Inactive(anon): %8lu kB\n"
+		"Active(file):   %8lu kB\n"
+		"Inactive(file): %8lu kB\n"
 #ifdef CONFIG_HIGHMEM
-		"HighTotal:    %8lu kB\n"
-		"HighFree:     %8lu kB\n"
-		"LowTotal:     %8lu kB\n"
-		"LowFree:      %8lu kB\n"
-#endif
-		"SwapTotal:    %8lu kB\n"
-		"SwapFree:     %8lu kB\n"
-		"Dirty:        %8lu kB\n"
-		"Writeback:    %8lu kB\n"
-		"AnonPages:    %8lu kB\n"
-		"Mapped:       %8lu kB\n"
-		"Slab:         %8lu kB\n"
-		"SReclaimable: %8lu kB\n"
-		"SUnreclaim:   %8lu kB\n"
-		"PageTables:   %8lu kB\n"
-		"NFS_Unstable: %8lu kB\n"
-		"Bounce:       %8lu kB\n"
-		"WritebackTmp: %8lu kB\n"
-		"CommitLimit:  %8lu kB\n"
-		"Committed_AS: %8lu kB\n"
-		"VmallocTotal: %8lu kB\n"
-		"VmallocUsed:  %8lu kB\n"
-		"VmallocChunk: %8lu kB\n",
+		"HighTotal:      %8lu kB\n"
+		"HighFree:       %8lu kB\n"
+		"LowTotal:       %8lu kB\n"
+		"LowFree:        %8lu kB\n"
+#endif
+		"SwapTotal:      %8lu kB\n"
+		"SwapFree:       %8lu kB\n"
+		"Dirty:          %8lu kB\n"
+		"Writeback:      %8lu kB\n"
+		"AnonPages:      %8lu kB\n"
+		"Mapped:         %8lu kB\n"
+		"Slab:           %8lu kB\n"
+		"SReclaimable:   %8lu kB\n"
+		"SUnreclaim:     %8lu kB\n"
+		"PageTables:     %8lu kB\n"
+		"NFS_Unstable:   %8lu kB\n"
+		"Bounce:         %8lu kB\n"
+		"WritebackTmp:   %8lu kB\n"
+		"CommitLimit:    %8lu kB\n"
+		"Committed_AS:   %8lu kB\n"
+		"VmallocTotal:   %8lu kB\n"
+		"VmallocUsed:    %8lu kB\n"
+		"VmallocChunk:   %8lu kB\n",
 		K(i.totalram),
 		K(i.freeram),
 		K(i.bufferram),
 		K(cached),
 		K(total_swapcache_pages),
-		K(global_page_state(NR_ACTIVE)),
-		K(global_page_state(NR_INACTIVE)),
+		K(active_anon   + active_file),
+		K(inactive_anon + inactive_file),
+		K(active_anon),
+		K(inactive_anon),
+		K(active_file),
+		K(inactive_file),
 #ifdef CONFIG_HIGHMEM
 		K(i.totalhigh),
 		K(i.freehigh),
Index: linux-2.6.25-mm1/fs/cifs/file.c
===================================================================
--- linux-2.6.25-mm1.orig/fs/cifs/file.c	2008-04-22 10:33:24.000000000 -0400
+++ linux-2.6.25-mm1/fs/cifs/file.c	2008-04-24 12:01:36.000000000 -0400
@@ -1778,7 +1778,7 @@ static void cifs_copy_cache_pages(struct
 		SetPageUptodate(page);
 		unlock_page(page);
 		if (!pagevec_add(plru_pvec, page))
-			__pagevec_lru_add(plru_pvec);
+			__pagevec_lru_add_file(plru_pvec);
 		data += PAGE_CACHE_SIZE;
 	}
 	return;
@@ -1912,7 +1912,7 @@ static int cifs_readpages(struct file *f
 		bytes_read = 0;
 	}
 
-	pagevec_lru_add(&lru_pvec);
+	pagevec_lru_add_file(&lru_pvec);
 
 /* need to free smb_read_data buf before exit */
 	if (smb_read_data) {
Index: linux-2.6.25-mm1/fs/ntfs/file.c
===================================================================
--- linux-2.6.25-mm1.orig/fs/ntfs/file.c	2008-04-22 10:33:25.000000000 -0400
+++ linux-2.6.25-mm1/fs/ntfs/file.c	2008-04-24 12:01:36.000000000 -0400
@@ -439,7 +439,7 @@ static inline int __ntfs_grab_cache_page
 			pages[nr] = *cached_page;
 			page_cache_get(*cached_page);
 			if (unlikely(!pagevec_add(lru_pvec, *cached_page)))
-				__pagevec_lru_add(lru_pvec);
+				__pagevec_lru_add_file(lru_pvec);
 			*cached_page = NULL;
 		}
 		index++;
@@ -2084,7 +2084,7 @@ err_out:
 						OSYNC_METADATA|OSYNC_DATA);
 		}
   	}
-	pagevec_lru_add(&lru_pvec);
+	pagevec_lru_add_file(&lru_pvec);
 	ntfs_debug("Done.  Returning %s (written 0x%lx, status %li).",
 			written ? "written" : "status", (unsigned long)written,
 			(long)status);
Index: linux-2.6.25-mm1/fs/nfs/dir.c
===================================================================
--- linux-2.6.25-mm1.orig/fs/nfs/dir.c	2008-04-22 10:33:44.000000000 -0400
+++ linux-2.6.25-mm1/fs/nfs/dir.c	2008-04-24 12:01:36.000000000 -0400
@@ -1524,7 +1524,7 @@ static int nfs_symlink(struct inode *dir
 	if (!add_to_page_cache(page, dentry->d_inode->i_mapping, 0,
 							GFP_KERNEL)) {
 		pagevec_add(&lru_pvec, page);
-		pagevec_lru_add(&lru_pvec);
+		pagevec_lru_add_file(&lru_pvec);
 		SetPageUptodate(page);
 		unlock_page(page);
 	} else
Index: linux-2.6.25-mm1/fs/ramfs/file-nommu.c
===================================================================
--- linux-2.6.25-mm1.orig/fs/ramfs/file-nommu.c	2008-04-22 10:33:00.000000000 -0400
+++ linux-2.6.25-mm1/fs/ramfs/file-nommu.c	2008-04-24 12:01:36.000000000 -0400
@@ -111,12 +111,12 @@ static int ramfs_nommu_expand_for_mappin
 			goto add_error;
 
 		if (!pagevec_add(&lru_pvec, page))
-			__pagevec_lru_add(&lru_pvec);
+			__pagevec_lru_add_file(&lru_pvec);
 
 		unlock_page(page);
 	}
 
-	pagevec_lru_add(&lru_pvec);
+	pagevec_lru_add_file(&lru_pvec);
 	return 0;
 
  fsize_exceeded:
Index: linux-2.6.25-mm1/drivers/base/node.c
===================================================================
--- linux-2.6.25-mm1.orig/drivers/base/node.c	2008-04-22 10:33:42.000000000 -0400
+++ linux-2.6.25-mm1/drivers/base/node.c	2008-04-24 12:01:36.000000000 -0400
@@ -45,34 +45,44 @@ static ssize_t node_read_meminfo(struct 
 	si_meminfo_node(&i, nid);
 
 	n = sprintf(buf, "\n"
-		       "Node %d MemTotal:     %8lu kB\n"
-		       "Node %d MemFree:      %8lu kB\n"
-		       "Node %d MemUsed:      %8lu kB\n"
-		       "Node %d Active:       %8lu kB\n"
-		       "Node %d Inactive:     %8lu kB\n"
+		       "Node %d MemTotal:       %8lu kB\n"
+		       "Node %d MemFree:        %8lu kB\n"
+		       "Node %d MemUsed:        %8lu kB\n"
+		       "Node %d Active:         %8lu kB\n"
+		       "Node %d Inactive:       %8lu kB\n"
+		       "Node %d Active(anon):   %8lu kB\n"
+		       "Node %d Inactive(anon): %8lu kB\n"
+		       "Node %d Active(file):   %8lu kB\n"
+		       "Node %d Inactive(file): %8lu kB\n"
 #ifdef CONFIG_HIGHMEM
-		       "Node %d HighTotal:    %8lu kB\n"
-		       "Node %d HighFree:     %8lu kB\n"
-		       "Node %d LowTotal:     %8lu kB\n"
-		       "Node %d LowFree:      %8lu kB\n"
+		       "Node %d HighTotal:      %8lu kB\n"
+		       "Node %d HighFree:       %8lu kB\n"
+		       "Node %d LowTotal:       %8lu kB\n"
+		       "Node %d LowFree:        %8lu kB\n"
 #endif
-		       "Node %d Dirty:        %8lu kB\n"
-		       "Node %d Writeback:    %8lu kB\n"
-		       "Node %d FilePages:    %8lu kB\n"
-		       "Node %d Mapped:       %8lu kB\n"
-		       "Node %d AnonPages:    %8lu kB\n"
-		       "Node %d PageTables:   %8lu kB\n"
-		       "Node %d NFS_Unstable: %8lu kB\n"
-		       "Node %d Bounce:       %8lu kB\n"
-		       "Node %d WritebackTmp: %8lu kB\n"
-		       "Node %d Slab:         %8lu kB\n"
-		       "Node %d SReclaimable: %8lu kB\n"
-		       "Node %d SUnreclaim:   %8lu kB\n",
+		       "Node %d Dirty:          %8lu kB\n"
+		       "Node %d Writeback:      %8lu kB\n"
+		       "Node %d FilePages:      %8lu kB\n"
+		       "Node %d Mapped:         %8lu kB\n"
+		       "Node %d AnonPages:      %8lu kB\n"
+		       "Node %d PageTables:     %8lu kB\n"
+		       "Node %d NFS_Unstable:   %8lu kB\n"
+		       "Node %d Bounce:         %8lu kB\n"
+		       "Node %d WritebackTmp:   %8lu kB\n"
+		       "Node %d Slab:           %8lu kB\n"
+		       "Node %d SReclaimable:   %8lu kB\n"
+		       "Node %d SUnreclaim:     %8lu kB\n",
 		       nid, K(i.totalram),
 		       nid, K(i.freeram),
 		       nid, K(i.totalram - i.freeram),
-		       nid, node_page_state(nid, NR_ACTIVE),
-		       nid, node_page_state(nid, NR_INACTIVE),
+		       nid, node_page_state(nid, NR_ACTIVE_ANON) +
+				node_page_state(nid, NR_ACTIVE_FILE),
+		       nid, node_page_state(nid, NR_INACTIVE_ANON) +
+				node_page_state(nid, NR_INACTIVE_FILE),
+		       nid, node_page_state(nid, NR_ACTIVE_ANON),
+		       nid, node_page_state(nid, NR_INACTIVE_ANON),
+		       nid, node_page_state(nid, NR_ACTIVE_FILE),
+		       nid, node_page_state(nid, NR_INACTIVE_FILE),
 #ifdef CONFIG_HIGHMEM
 		       nid, K(i.totalhigh),
 		       nid, K(i.freehigh),
Index: linux-2.6.25-mm1/mm/memory.c
===================================================================
--- linux-2.6.25-mm1.orig/mm/memory.c	2008-04-24 12:00:01.000000000 -0400
+++ linux-2.6.25-mm1/mm/memory.c	2008-04-24 12:01:36.000000000 -0400
@@ -1757,7 +1757,7 @@ gotten:
 		set_pte_at(mm, address, page_table, entry);
 		update_mmu_cache(vma, address, entry);
 		SetPageSwapBacked(new_page);
-		lru_cache_add_active(new_page);
+		lru_cache_add_active_anon(new_page);
 		page_add_new_anon_rmap(new_page, vma, address);
 
 		/* Free the old page.. */
@@ -2226,7 +2226,7 @@ static int do_anonymous_page(struct mm_s
 		goto release;
 	inc_mm_counter(mm, anon_rss);
 	SetPageSwapBacked(page);
-	lru_cache_add_active(page);
+	lru_cache_add_active_anon(page);
 	page_add_new_anon_rmap(page, vma, address);
 	set_pte_at(mm, address, page_table, entry);
 
@@ -2370,7 +2370,7 @@ static int __do_fault(struct mm_struct *
 		if (anon) {
                         inc_mm_counter(mm, anon_rss);
 			SetPageSwapBacked(page);
-                        lru_cache_add_active(page);
+                        lru_cache_add_active_anon(page);
                         page_add_new_anon_rmap(page, vma, address);
 		} else {
 			inc_mm_counter(mm, file_rss);
Index: linux-2.6.25-mm1/mm/page_alloc.c
===================================================================
--- linux-2.6.25-mm1.orig/mm/page_alloc.c	2008-04-24 12:00:01.000000000 -0400
+++ linux-2.6.25-mm1/mm/page_alloc.c	2008-04-24 12:01:36.000000000 -0400
@@ -1923,10 +1923,13 @@ void show_free_areas(void)
 		}
 	}
 
-	printk("Active:%lu inactive:%lu dirty:%lu writeback:%lu unstable:%lu\n"
+	printk("Active_anon:%lu active_file:%lu inactive_anon:%lu\n"
+		" inactive_file:%lu dirty:%lu writeback:%lu unstable:%lu\n"
 		" free:%lu slab:%lu mapped:%lu pagetables:%lu bounce:%lu\n",
-		global_page_state(NR_ACTIVE),
-		global_page_state(NR_INACTIVE),
+		global_page_state(NR_ACTIVE_ANON),
+		global_page_state(NR_ACTIVE_FILE),
+		global_page_state(NR_INACTIVE_ANON),
+		global_page_state(NR_INACTIVE_FILE),
 		global_page_state(NR_FILE_DIRTY),
 		global_page_state(NR_WRITEBACK),
 		global_page_state(NR_UNSTABLE_NFS),
@@ -1949,8 +1952,10 @@ void show_free_areas(void)
 			" min:%lukB"
 			" low:%lukB"
 			" high:%lukB"
-			" active:%lukB"
-			" inactive:%lukB"
+			" active_anon:%lukB"
+			" inactive_anon:%lukB"
+			" active_file:%lukB"
+			" inactive_file:%lukB"
 			" present:%lukB"
 			" pages_scanned:%lu"
 			" all_unreclaimable? %s"
@@ -1960,8 +1965,10 @@ void show_free_areas(void)
 			K(zone->pages_min),
 			K(zone->pages_low),
 			K(zone->pages_high),
-			K(zone_page_state(zone, NR_ACTIVE)),
-			K(zone_page_state(zone, NR_INACTIVE)),
+			K(zone_page_state(zone, NR_ACTIVE_ANON)),
+			K(zone_page_state(zone, NR_INACTIVE_ANON)),
+			K(zone_page_state(zone, NR_ACTIVE_FILE)),
+			K(zone_page_state(zone, NR_INACTIVE_FILE)),
 			K(zone->present_pages),
 			zone->pages_scanned,
 			(zone_is_all_unreclaimable(zone) ? "yes" : "no")
@@ -3505,6 +3512,9 @@ static void __paginginit free_area_init_
 			INIT_LIST_HEAD(&zone->list[l]);
 			zone->nr_scan[l] = 0;
 		}
+		zone->recent_rotated_anon = 0;
+		zone->recent_rotated_file = 0;
+//TODO recent_scanned_* ???
 		zap_zone_vm_stats(zone);
 		zone->flags = 0;
 		if (!size)
Index: linux-2.6.25-mm1/mm/swap.c
===================================================================
--- linux-2.6.25-mm1.orig/mm/swap.c	2008-04-24 11:59:56.000000000 -0400
+++ linux-2.6.25-mm1/mm/swap.c	2008-04-24 12:01:36.000000000 -0400
@@ -108,6 +108,7 @@ enum lru_list page_lru(struct page *page
 
 	if (PageActive(page))
 		lru += LRU_ACTIVE;
+	lru += page_file_cache(page);
 
 	return lru;
 }
@@ -133,7 +134,8 @@ static void pagevec_move_tail(struct pag
 			spin_lock(&zone->lru_lock);
 		}
 		if (PageLRU(page) && !PageActive(page)) {
-			list_move_tail(&page->lru, &zone->list[LRU_INACTIVE]);
+			int lru = page_file_cache(page);
+			list_move_tail(&page->lru, &zone->list[lru]);
 			pgmoved++;
 		}
 	}
@@ -174,9 +176,13 @@ void activate_page(struct page *page)
 
 	spin_lock_irq(&zone->lru_lock);
 	if (PageLRU(page) && !PageActive(page)) {
-		del_page_from_inactive_list(zone, page);
+		int lru = LRU_BASE;
+		lru += page_file_cache(page);
+		del_page_from_lru_list(zone, page, lru);
+
 		SetPageActive(page);
-		add_page_to_active_list(zone, page);
+		lru += LRU_ACTIVE;
+		add_page_to_lru_list(zone, page, lru);
 		__count_vm_event(PGACTIVATE);
 		mem_cgroup_move_lists(page, true);
 	}
Index: linux-2.6.25-mm1/mm/readahead.c
===================================================================
--- linux-2.6.25-mm1.orig/mm/readahead.c	2008-04-22 10:33:45.000000000 -0400
+++ linux-2.6.25-mm1/mm/readahead.c	2008-04-24 12:01:36.000000000 -0400
@@ -229,7 +229,7 @@ int do_page_cache_readahead(struct addre
  */
 unsigned long max_sane_readahead(unsigned long nr)
 {
-	return min(nr, (node_page_state(numa_node_id(), NR_INACTIVE)
+	return min(nr, (node_page_state(numa_node_id(), NR_INACTIVE_FILE)
 		+ node_page_state(numa_node_id(), NR_FREE_PAGES)) / 2);
 }
 
Index: linux-2.6.25-mm1/mm/filemap.c
===================================================================
--- linux-2.6.25-mm1.orig/mm/filemap.c	2008-04-22 10:33:45.000000000 -0400
+++ linux-2.6.25-mm1/mm/filemap.c	2008-04-24 12:01:36.000000000 -0400
@@ -33,6 +33,7 @@
 #include <linux/cpuset.h>
 #include <linux/hardirq.h> /* for BUG_ON(!in_atomic()) only */
 #include <linux/memcontrol.h>
+#include <linux/mm_inline.h> /* for page_file_cache() */
 #include "internal.h"
 
 /*
@@ -492,8 +493,12 @@ int add_to_page_cache_lru(struct page *p
 				pgoff_t offset, gfp_t gfp_mask)
 {
 	int ret = add_to_page_cache(page, mapping, offset, gfp_mask);
-	if (ret == 0)
-		lru_cache_add(page);
+	if (ret == 0) {
+		if (page_file_cache(page))
+			lru_cache_add_file(page);
+		else
+			lru_cache_add_active_anon(page);
+	}
 	return ret;
 }
 
Index: linux-2.6.25-mm1/mm/vmstat.c
===================================================================
--- linux-2.6.25-mm1.orig/mm/vmstat.c	2008-04-24 10:46:43.000000000 -0400
+++ linux-2.6.25-mm1/mm/vmstat.c	2008-04-24 12:01:36.000000000 -0400
@@ -687,8 +687,10 @@ const struct seq_operations pagetypeinfo
 static const char * const vmstat_text[] = {
 	/* Zoned VM counters */
 	"nr_free_pages",
-	"nr_inactive",
-	"nr_active",
+	"nr_inactive_anon",
+	"nr_active_anon",
+	"nr_inactive_file",
+	"nr_active_file",
 	"nr_anon_pages",
 	"nr_mapped",
 	"nr_file_pages",
@@ -756,7 +758,7 @@ static void zoneinfo_show_print(struct s
 		   "\n        min      %lu"
 		   "\n        low      %lu"
 		   "\n        high     %lu"
-		   "\n        scanned  %lu (a: %lu i: %lu)"
+		   "\n        scanned  %lu (aa: %lu ia: %lu af: %lu if: %lu)"
 		   "\n        spanned  %lu"
 		   "\n        present  %lu",
 		   zone_page_state(zone, NR_FREE_PAGES),
@@ -764,8 +766,10 @@ static void zoneinfo_show_print(struct s
 		   zone->pages_low,
 		   zone->pages_high,
 		   zone->pages_scanned,
-		   zone->nr_scan[LRU_ACTIVE],
-		   zone->nr_scan[LRU_INACTIVE],
+		   zone->nr_scan[LRU_ACTIVE_ANON],
+		   zone->nr_scan[LRU_INACTIVE_ANON],
+		   zone->nr_scan[LRU_ACTIVE_FILE],
+		   zone->nr_scan[LRU_INACTIVE_FILE],
 		   zone->spanned_pages,
 		   zone->present_pages);
 
Index: linux-2.6.25-mm1/mm/vmscan.c
===================================================================
--- linux-2.6.25-mm1.orig/mm/vmscan.c	2008-04-24 11:59:56.000000000 -0400
+++ linux-2.6.25-mm1/mm/vmscan.c	2008-04-24 12:02:16.000000000 -0400
@@ -78,7 +78,7 @@ struct scan_control {
 	unsigned long (*isolate_pages)(unsigned long nr, struct list_head *dst,
 			unsigned long *scanned, int order, int mode,
 			struct zone *z, struct mem_cgroup *mem_cont,
-			int active);
+			int active, int file);
 };
 
 #define lru_to_page(_head) (list_entry((_head)->prev, struct page, lru))
@@ -236,27 +236,6 @@ unsigned long shrink_slab(unsigned long 
 	return ret;
 }
 
-/* Called without lock on whether page is mapped, so answer is unstable */
-static inline int page_mapping_inuse(struct page *page)
-{
-	struct address_space *mapping;
-
-	/* Page is in somebody's page tables. */
-	if (page_mapped(page))
-		return 1;
-
-	/* Be more reluctant to reclaim swapcache than pagecache */
-	if (PageSwapCache(page))
-		return 1;
-
-	mapping = page_mapping(page);
-	if (!mapping)
-		return 0;
-
-	/* File is mmap'd by somebody? */
-	return mapping_mapped(mapping);
-}
-
 static inline int is_page_cache_freeable(struct page *page)
 {
 	return page_count(page) - !!PagePrivate(page) == 2;
@@ -518,8 +497,7 @@ static unsigned long shrink_page_list(st
 
 		referenced = page_referenced(page, 1, sc->mem_cgroup);
 		/* In active use or really unfreeable?  Activate it. */
-		if (sc->order <= PAGE_ALLOC_COSTLY_ORDER &&
-					referenced && page_mapping_inuse(page))
+		if (sc->order <= PAGE_ALLOC_COSTLY_ORDER && referenced)
 			goto activate_locked;
 
 #ifdef CONFIG_SWAP
@@ -550,8 +528,6 @@ static unsigned long shrink_page_list(st
 		}
 
 		if (PageDirty(page)) {
-			if (sc->order <= PAGE_ALLOC_COSTLY_ORDER && referenced)
-				goto keep_locked;
 			if (!may_enter_fs)
 				goto keep_locked;
 			if (!sc->may_writepage)
@@ -652,7 +628,7 @@ keep:
  *
  * returns 0 on success, -ve errno on failure.
  */
-int __isolate_lru_page(struct page *page, int mode)
+int __isolate_lru_page(struct page *page, int mode, int file)
 {
 	int ret = -EINVAL;
 
@@ -668,6 +644,9 @@ int __isolate_lru_page(struct page *page
 	if (mode != ISOLATE_BOTH && (!PageActive(page) != !mode))
 		return ret;
 
+	if (mode != ISOLATE_BOTH && (!page_file_cache(page) != !file))
+		return ret;
+
 	ret = -EBUSY;
 	if (likely(get_page_unless_zero(page))) {
 		/*
@@ -698,12 +677,13 @@ int __isolate_lru_page(struct page *page
  * @scanned:	The number of pages that were scanned.
  * @order:	The caller's attempted allocation order
  * @mode:	One of the LRU isolation modes
+ * @file:	True [1] if isolating file [!anon] pages
  *
  * returns how many pages were moved onto *@dst.
  */
 static unsigned long isolate_lru_pages(unsigned long nr_to_scan,
 		struct list_head *src, struct list_head *dst,
-		unsigned long *scanned, int order, int mode)
+		unsigned long *scanned, int order, int mode, int file)
 {
 	unsigned long nr_taken = 0;
 	unsigned long scan;
@@ -720,7 +700,7 @@ static unsigned long isolate_lru_pages(u
 
 		VM_BUG_ON(!PageLRU(page));
 
-		switch (__isolate_lru_page(page, mode)) {
+		switch (__isolate_lru_page(page, mode, file)) {
 		case 0:
 			list_move(&page->lru, dst);
 			nr_taken++;
@@ -763,10 +743,11 @@ static unsigned long isolate_lru_pages(u
 				break;
 
 			cursor_page = pfn_to_page(pfn);
+
 			/* Check that we have not crossed a zone boundary. */
 			if (unlikely(page_zone_id(cursor_page) != zone_id))
 				continue;
-			switch (__isolate_lru_page(cursor_page, mode)) {
+			switch (__isolate_lru_page(cursor_page, mode, file)) {
 			case 0:
 				list_move(&cursor_page->lru, dst);
 				nr_taken++;
@@ -791,30 +772,37 @@ static unsigned long isolate_pages_globa
 					unsigned long *scanned, int order,
 					int mode, struct zone *z,
 					struct mem_cgroup *mem_cont,
-					int active)
+					int active, int file)
 {
+	int lru = LRU_BASE;
 	if (active)
-		return isolate_lru_pages(nr, &z->list[LRU_ACTIVE], dst,
-						scanned, order, mode);
-	else
-		return isolate_lru_pages(nr, &z->list[LRU_INACTIVE], dst,
-						scanned, order, mode);
+		lru += LRU_ACTIVE;
+	if (file)
+		lru += LRU_FILE;
+	return isolate_lru_pages(nr, &z->list[lru], dst, scanned, order,
+								mode, !!file);
 }
 
 /*
  * clear_active_flags() is a helper for shrink_active_list(), clearing
  * any active bits from the pages in the list.
  */
-static unsigned long clear_active_flags(struct list_head *page_list)
+static unsigned long clear_active_flags(struct list_head *page_list,
+					unsigned int *count)
 {
 	int nr_active = 0;
+	int lru;
 	struct page *page;
 
-	list_for_each_entry(page, page_list, lru)
+	list_for_each_entry(page, page_list, lru) {
+		lru = page_file_cache(page);
 		if (PageActive(page)) {
+			lru += LRU_ACTIVE;
 			ClearPageActive(page);
 			nr_active++;
 		}
+		count[lru]++;
+	}
 
 	return nr_active;
 }
@@ -852,12 +840,12 @@ int isolate_lru_page(struct page *page)
 
 		spin_lock_irq(&zone->lru_lock);
 		if (PageLRU(page) && get_page_unless_zero(page)) {
+			int lru = LRU_BASE;
 			ret = 0;
 			ClearPageLRU(page);
-			if (PageActive(page))
-				del_page_from_active_list(zone, page);
-			else
-				del_page_from_inactive_list(zone, page);
+
+			lru += page_file_cache(page) + !!PageActive(page);
+			del_page_from_lru_list(zone, page, lru);
 		}
 		spin_unlock_irq(&zone->lru_lock);
 	}
@@ -869,7 +857,7 @@ int isolate_lru_page(struct page *page)
  * of reclaimed pages
  */
 static unsigned long shrink_inactive_list(unsigned long max_scan,
-				struct zone *zone, struct scan_control *sc)
+			struct zone *zone, struct scan_control *sc, int file)
 {
 	LIST_HEAD(page_list);
 	struct pagevec pvec;
@@ -886,18 +874,25 @@ static unsigned long shrink_inactive_lis
 		unsigned long nr_scan;
 		unsigned long nr_freed;
 		unsigned long nr_active;
+		unsigned int count[NR_LRU_LISTS] = { 0, };
+		int mode = (sc->order > PAGE_ALLOC_COSTLY_ORDER) ?
+					ISOLATE_BOTH : ISOLATE_INACTIVE;
 
 		nr_taken = sc->isolate_pages(sc->swap_cluster_max,
-			     &page_list, &nr_scan, sc->order,
-			     (sc->order > PAGE_ALLOC_COSTLY_ORDER)?
-					     ISOLATE_BOTH : ISOLATE_INACTIVE,
-				zone, sc->mem_cgroup, 0);
-		nr_active = clear_active_flags(&page_list);
+			     &page_list, &nr_scan, sc->order, mode,
+				zone, sc->mem_cgroup, 0, file);
+		nr_active = clear_active_flags(&page_list, count);
 		__count_vm_events(PGDEACTIVATE, nr_active);
 
-		__mod_zone_page_state(zone, NR_ACTIVE, -nr_active);
-		__mod_zone_page_state(zone, NR_INACTIVE,
-						-(nr_taken - nr_active));
+		__mod_zone_page_state(zone, NR_ACTIVE_FILE,
+						-count[LRU_ACTIVE_FILE]);
+		__mod_zone_page_state(zone, NR_INACTIVE_FILE,
+						-count[LRU_INACTIVE_FILE]);
+		__mod_zone_page_state(zone, NR_ACTIVE_ANON,
+						-count[LRU_ACTIVE_ANON]);
+		__mod_zone_page_state(zone, NR_INACTIVE_ANON,
+						-count[LRU_INACTIVE_ANON]);
+
 		if (scan_global_lru(sc))
 			zone->pages_scanned += nr_scan;
 		spin_unlock_irq(&zone->lru_lock);
@@ -919,7 +914,7 @@ static unsigned long shrink_inactive_lis
 			 * The attempt at page out may have made some
 			 * of the pages active, mark them inactive again.
 			 */
-			nr_active = clear_active_flags(&page_list);
+			nr_active = clear_active_flags(&page_list, count);
 			count_vm_events(PGDEACTIVATE, nr_active);
 
 			nr_freed += shrink_page_list(&page_list, sc,
@@ -944,11 +939,20 @@ static unsigned long shrink_inactive_lis
 		 * Put back any unfreeable pages.
 		 */
 		while (!list_empty(&page_list)) {
+			int lru = LRU_BASE;
 			page = lru_to_page(&page_list);
 			VM_BUG_ON(PageLRU(page));
 			SetPageLRU(page);
 			list_del(&page->lru);
-			add_page_to_lru_list(zone, page, PageActive(page));
+			if (page_file_cache(page)) {
+				lru += LRU_FILE;
+				zone->recent_rotated_file++;
+			} else {
+				zone->recent_rotated_anon++;
+			}
+			if (PageActive(page))
+				lru += LRU_ACTIVE;
+			add_page_to_lru_list(zone, page, lru);
 			if (!pagevec_add(&pvec, page)) {
 				spin_unlock_irq(&zone->lru_lock);
 				__pagevec_release(&pvec);
@@ -979,115 +983,7 @@ static inline void note_zone_scanning_pr
 
 static inline int zone_is_near_oom(struct zone *zone)
 {
-	return zone->pages_scanned >= (zone_page_state(zone, NR_ACTIVE)
-				+ zone_page_state(zone, NR_INACTIVE))*3;
-}
-
-/*
- * Determine we should try to reclaim mapped pages.
- * This is called only when sc->mem_cgroup is NULL.
- */
-static int calc_reclaim_mapped(struct scan_control *sc, struct zone *zone,
-				int priority)
-{
-	long mapped_ratio;
-	long distress;
-	long swap_tendency;
-	long imbalance;
-	int reclaim_mapped = 0;
-	int prev_priority;
-
-	if (scan_global_lru(sc) && zone_is_near_oom(zone))
-		return 1;
-	/*
-	 * `distress' is a measure of how much trouble we're having
-	 * reclaiming pages.  0 -> no problems.  100 -> great trouble.
-	 */
-	if (scan_global_lru(sc))
-		prev_priority = zone->prev_priority;
-	else
-		prev_priority = mem_cgroup_get_reclaim_priority(sc->mem_cgroup);
-
-	distress = 100 >> min(prev_priority, priority);
-
-	/*
-	 * The point of this algorithm is to decide when to start
-	 * reclaiming mapped memory instead of just pagecache.  Work out
-	 * how much memory
-	 * is mapped.
-	 */
-	if (scan_global_lru(sc))
-		mapped_ratio = ((global_page_state(NR_FILE_MAPPED) +
-				global_page_state(NR_ANON_PAGES)) * 100) /
-					vm_total_pages;
-	else
-		mapped_ratio = mem_cgroup_calc_mapped_ratio(sc->mem_cgroup);
-
-	/*
-	 * Now decide how much we really want to unmap some pages.  The
-	 * mapped ratio is downgraded - just because there's a lot of
-	 * mapped memory doesn't necessarily mean that page reclaim
-	 * isn't succeeding.
-	 *
-	 * The distress ratio is important - we don't want to start
-	 * going oom.
-	 *
-	 * A 100% value of vm_swappiness overrides this algorithm
-	 * altogether.
-	 */
-	swap_tendency = mapped_ratio / 2 + distress + sc->swappiness;
-
-	/*
-	 * If there's huge imbalance between active and inactive
-	 * (think active 100 times larger than inactive) we should
-	 * become more permissive, or the system will take too much
-	 * cpu before it start swapping during memory pressure.
-	 * Distress is about avoiding early-oom, this is about
-	 * making swappiness graceful despite setting it to low
-	 * values.
-	 *
-	 * Avoid div by zero with nr_inactive+1, and max resulting
-	 * value is vm_total_pages.
-	 */
-	if (scan_global_lru(sc)) {
-		imbalance  = zone_page_state(zone, NR_ACTIVE);
-		imbalance /= zone_page_state(zone, NR_INACTIVE) + 1;
-	} else
-		imbalance = mem_cgroup_reclaim_imbalance(sc->mem_cgroup);
-
-	/*
-	 * Reduce the effect of imbalance if swappiness is low,
-	 * this means for a swappiness very low, the imbalance
-	 * must be much higher than 100 for this logic to make
-	 * the difference.
-	 *
-	 * Max temporary value is vm_total_pages*100.
-	 */
-	imbalance *= (vm_swappiness + 1);
-	imbalance /= 100;
-
-	/*
-	 * If not much of the ram is mapped, makes the imbalance
-	 * less relevant, it's high priority we refill the inactive
-	 * list with mapped pages only in presence of high ratio of
-	 * mapped pages.
-	 *
-	 * Max temporary value is vm_total_pages*100.
-	 */
-	imbalance *= mapped_ratio;
-	imbalance /= 100;
-
-	/* apply imbalance feedback to swap_tendency */
-	swap_tendency += imbalance;
-
-	/*
-	 * Now use this metric to decide whether to start moving mapped
-	 * memory onto the inactive list.
-	 */
-	if (swap_tendency >= 100)
-		reclaim_mapped = 1;
-
-	return reclaim_mapped;
+	return zone->pages_scanned >= (zone_lru_pages(zone) * 3);
 }
 
 /*
@@ -1110,7 +1006,7 @@ static int calc_reclaim_mapped(struct sc
 
 
 static void shrink_active_list(unsigned long nr_pages, struct zone *zone,
-				struct scan_control *sc, int priority)
+			struct scan_control *sc, int priority, int file)
 {
 	unsigned long pgmoved;
 	int pgdeactivate = 0;
@@ -1120,16 +1016,13 @@ static void shrink_active_list(unsigned 
 	LIST_HEAD(l_inactive);
 	struct page *page;
 	struct pagevec pvec;
-	int reclaim_mapped = 0;
-
-	if (sc->may_swap)
-		reclaim_mapped = calc_reclaim_mapped(sc, zone, priority);
+	enum lru_list lru;
 
 	lru_add_drain();
 	spin_lock_irq(&zone->lru_lock);
 	pgmoved = sc->isolate_pages(nr_pages, &l_hold, &pgscanned, sc->order,
 					ISOLATE_ACTIVE, zone,
-					sc->mem_cgroup, 1);
+					sc->mem_cgroup, 1, file);
 	/*
 	 * zone->pages_scanned is used for detect zone's oom
 	 * mem_cgroup remembers nr_scan by itself.
@@ -1137,29 +1030,29 @@ static void shrink_active_list(unsigned 
 	if (scan_global_lru(sc))
 		zone->pages_scanned += pgscanned;
 
-	__mod_zone_page_state(zone, NR_ACTIVE, -pgmoved);
+	if (file)
+		__mod_zone_page_state(zone, NR_ACTIVE_FILE, -pgmoved);
+	else
+		__mod_zone_page_state(zone, NR_ACTIVE_ANON, -pgmoved);
 	spin_unlock_irq(&zone->lru_lock);
 
 	while (!list_empty(&l_hold)) {
 		cond_resched();
 		page = lru_to_page(&l_hold);
 		list_del(&page->lru);
-		if (page_mapped(page)) {
-			if (!reclaim_mapped ||
-			    (total_swap_pages == 0 && PageAnon(page)) ||
-			    page_referenced(page, 0, sc->mem_cgroup)) {
-				list_add(&page->lru, &l_active);
-				continue;
-			}
-		} else if (TestClearPageReferenced(page)) {
+		if (page_referenced(page, 0, sc->mem_cgroup))
 			list_add(&page->lru, &l_active);
-			continue;
-		}
-		list_add(&page->lru, &l_inactive);
+		else
+			list_add(&page->lru, &l_inactive);
 	}
 
+	/*
+	 * Now put the pages back on the appropriate [file or anon] inactive
+	 * and active lists.
+	 */
 	pagevec_init(&pvec, 1);
 	pgmoved = 0;
+	lru = LRU_BASE + file * LRU_FILE;
 	spin_lock_irq(&zone->lru_lock);
 	while (!list_empty(&l_inactive)) {
 		page = lru_to_page(&l_inactive);
@@ -1169,11 +1062,12 @@ static void shrink_active_list(unsigned 
 		VM_BUG_ON(!PageActive(page));
 		ClearPageActive(page);
 
-		list_move(&page->lru, &zone->list[LRU_INACTIVE]);
+		list_move(&page->lru, &zone->list[lru]);
 		mem_cgroup_move_lists(page, false);
 		pgmoved++;
 		if (!pagevec_add(&pvec, page)) {
-			__mod_zone_page_state(zone, NR_INACTIVE, pgmoved);
+			__mod_zone_page_state(zone, NR_INACTIVE_ANON + lru,
+								pgmoved);
 			spin_unlock_irq(&zone->lru_lock);
 			pgdeactivate += pgmoved;
 			pgmoved = 0;
@@ -1183,7 +1077,7 @@ static void shrink_active_list(unsigned 
 			spin_lock_irq(&zone->lru_lock);
 		}
 	}
-	__mod_zone_page_state(zone, NR_INACTIVE, pgmoved);
+	__mod_zone_page_state(zone, NR_INACTIVE_ANON + lru, pgmoved);
 	pgdeactivate += pgmoved;
 	if (buffer_heads_over_limit) {
 		spin_unlock_irq(&zone->lru_lock);
@@ -1192,6 +1086,7 @@ static void shrink_active_list(unsigned 
 	}
 
 	pgmoved = 0;
+	lru = LRU_ACTIVE + file * LRU_FILE;
 	while (!list_empty(&l_active)) {
 		page = lru_to_page(&l_active);
 		prefetchw_prev_lru_page(page, &l_active, flags);
@@ -1199,11 +1094,12 @@ static void shrink_active_list(unsigned 
 		SetPageLRU(page);
 		VM_BUG_ON(!PageActive(page));
 
-		list_move(&page->lru, &zone->list[LRU_ACTIVE]);
+		list_move(&page->lru, &zone->list[lru]);
 		mem_cgroup_move_lists(page, true);
 		pgmoved++;
 		if (!pagevec_add(&pvec, page)) {
-			__mod_zone_page_state(zone, NR_ACTIVE, pgmoved);
+			__mod_zone_page_state(zone, NR_INACTIVE_ANON + lru,
+								pgmoved);
 			pgmoved = 0;
 			spin_unlock_irq(&zone->lru_lock);
 			if (vm_swap_full())
@@ -1212,7 +1108,12 @@ static void shrink_active_list(unsigned 
 			spin_lock_irq(&zone->lru_lock);
 		}
 	}
-	__mod_zone_page_state(zone, NR_ACTIVE, pgmoved);
+	__mod_zone_page_state(zone, NR_INACTIVE_ANON + lru, pgmoved);
+	if (file) {
+		zone->recent_rotated_file += pgmoved;
+	} else {
+		zone->recent_rotated_anon += pgmoved;
+	}
 
 	__count_zone_vm_events(PGREFILL, zone, pgscanned);
 	__count_vm_events(PGDEACTIVATE, pgdeactivate);
@@ -1223,16 +1124,82 @@ static void shrink_active_list(unsigned 
 	pagevec_release(&pvec);
 }
 
-static unsigned long shrink_list(enum lru_list l, unsigned long nr_to_scan,
+static unsigned long shrink_list(enum lru_list lru, unsigned long nr_to_scan,
 	struct zone *zone, struct scan_control *sc, int priority)
 {
-	if (l == LRU_ACTIVE) {
-		shrink_active_list(nr_to_scan, zone, sc, priority);
+	int file = is_file_lru(lru);
+
+	if (lru == LRU_ACTIVE_ANON || lru == LRU_ACTIVE_FILE) {
+		shrink_active_list(nr_to_scan, zone, sc, priority, file);
 		return 0;
 	}
-	return shrink_inactive_list(nr_to_scan, zone, sc);
+	return shrink_inactive_list(nr_to_scan, zone, sc, file);
+}
+
+/*
+ * The utility of the anon and file memory corresponds to the fraction
+ * of pages that were recently referenced in each category.  Pageout
+ * pressure is distributed according to the size of each set, the fraction
+ * of recently referenced pages (except used-once file pages) and the
+ * swappiness parameter.
+ *
+ * We return the relative pressures as percentages so shrink_zone can
+ * easily use them.
+ */
+static void get_scan_ratio(struct zone *zone, struct scan_control * sc,
+					unsigned long *percent)
+{
+	unsigned long anon, file;
+	unsigned long anon_prio, file_prio;
+	unsigned long rotate_sum;
+	unsigned long ap, fp;
+
+	anon  = zone_page_state(zone, NR_ACTIVE_ANON) +
+		zone_page_state(zone, NR_INACTIVE_ANON);
+	file  = zone_page_state(zone, NR_ACTIVE_FILE) +
+		zone_page_state(zone, NR_INACTIVE_FILE);
+
+	rotate_sum = zone->recent_rotated_file + zone->recent_rotated_anon;
+
+	/* Keep a floating average of RECENT references. */
+	if (unlikely(rotate_sum > min(anon, file))) {
+		spin_lock_irq(&zone->lru_lock);
+		zone->recent_rotated_file /= 2;
+		zone->recent_rotated_anon /= 2;
+		spin_unlock_irq(&zone->lru_lock);
+		rotate_sum /= 2;
+	}
+
+	/*
+	 * With swappiness at 100, anonymous and file have the same priority.
+	 * This scanning priority is essentially the inverse of IO cost.
+	 */
+	anon_prio = sc->swappiness;
+	file_prio = 200 - sc->swappiness;
+
+	/*
+	 *                  anon       recent_rotated_anon
+	 * %anon = 100 * ----------- / ------------------- * IO cost
+	 *               anon + file       rotate_sum
+	 */
+	ap = (anon_prio * anon) / (anon + file + 1);
+	ap *= rotate_sum / (zone->recent_rotated_anon + 1);
+	if (ap == 0)
+		ap = 1;
+	else if (ap > 100)
+		ap = 100;
+	percent[0] = ap;
+
+	fp = (file_prio * file) / (anon + file + 1);
+	fp *= rotate_sum / (zone->recent_rotated_file + 1);
+	if (fp == 0)
+		fp = 1;
+	else if (fp > 100)
+		fp = 100;
+	percent[1] = fp;
 }
 
+
 /*
  * This is a basic per-zone page freer.  Used by both kswapd and direct reclaim.
  */
@@ -1242,36 +1209,38 @@ static unsigned long shrink_zone(int pri
 	unsigned long nr[NR_LRU_LISTS];
 	unsigned long nr_to_scan;
 	unsigned long nr_reclaimed = 0;
+	unsigned long percent[2];	/* anon @ 0; file @ 1 */
 	enum lru_list l;
 
-	if (scan_global_lru(sc)) {
-		/*
-		 * Add one to nr_to_scan just to make sure that the kernel
-		 * will slowly sift through the active list.
-		 */
-		for_each_lru(l) {
+	get_scan_ratio(zone, sc, percent);
+
+	for_each_lru(l) {
+		if (scan_global_lru(sc)) {
+			int file = is_file_lru(l);
+			/*
+			 * Add one to nr_to_scan just to make sure that the
+			 * kernel will slowly sift through the active list.
+			 */
 			zone->nr_scan[l] += (zone_page_state(zone,
-					NR_INACTIVE + l)  >> priority) + 1;
-			nr[l] = zone->nr_scan[l];
+				NR_INACTIVE_ANON + l) >> priority) + 1;
+			nr[l] = zone->nr_scan[l] * percent[file] / 100;
 			if (nr[l] >= sc->swap_cluster_max)
 				zone->nr_scan[l] = 0;
 			else
 				nr[l] = 0;
+		} else {
+			/*
+			 * This reclaim occurs not because zone memory shortage
+			 * but because memory controller hits its limit.
+			 * Then, don't modify zone reclaim related data.
+			 */
+			nr[l] = mem_cgroup_calc_reclaim(sc->mem_cgroup, zone,
+								priority, l);
 		}
-	} else {
-		/*
-		 * This reclaim occurs not because zone memory shortage but
-		 * because memory controller hits its limit.
-		 * Then, don't modify zone reclaim related data.
-		 */
-		nr[LRU_ACTIVE] = mem_cgroup_calc_reclaim(sc->mem_cgroup,
-					zone, priority, LRU_ACTIVE);
-
-		nr[LRU_INACTIVE] = mem_cgroup_calc_reclaim(sc->mem_cgroup,
-					zone, priority, LRU_INACTIVE);
 	}
 
-	while (nr[LRU_ACTIVE] || nr[LRU_INACTIVE]) {
+	while (nr[LRU_ACTIVE_ANON] || nr[LRU_INACTIVE_ANON] ||
+				nr[LRU_ACTIVE_FILE] || nr[LRU_INACTIVE_FILE]) {
 		for_each_lru(l) {
 			if (nr[l]) {
 				nr_to_scan = min(nr[l],
@@ -1344,7 +1313,7 @@ static unsigned long shrink_zones(int pr
 
 	return nr_reclaimed;
 }
- 
+
 /*
  * This is the main entry point to direct page reclaim.
  *
@@ -1385,8 +1354,7 @@ static unsigned long do_try_to_free_page
 			if (!cpuset_zone_allowed_hardwall(zone, GFP_KERNEL))
 				continue;
 
-			lru_pages += zone_page_state(zone, NR_ACTIVE)
-					+ zone_page_state(zone, NR_INACTIVE);
+			lru_pages += zone_lru_pages(zone);
 		}
 	}
 
@@ -1586,8 +1554,7 @@ loop_again:
 		for (i = 0; i <= end_zone; i++) {
 			struct zone *zone = pgdat->node_zones + i;
 
-			lru_pages += zone_page_state(zone, NR_ACTIVE)
-					+ zone_page_state(zone, NR_INACTIVE);
+			lru_pages += zone_lru_pages(zone);
 		}
 
 		/*
@@ -1631,8 +1598,7 @@ loop_again:
 			if (zone_is_all_unreclaimable(zone))
 				continue;
 			if (nr_slab == 0 && zone->pages_scanned >=
-				(zone_page_state(zone, NR_ACTIVE)
-				+ zone_page_state(zone, NR_INACTIVE)) * 6)
+						(zone_lru_pages(zone) * 6))
 					zone_set_flag(zone,
 						      ZONE_ALL_UNRECLAIMABLE);
 			/*
@@ -1686,7 +1652,7 @@ out:
 
 /*
  * The background pageout daemon, started as a kernel thread
- * from the init process. 
+ * from the init process.
  *
  * This basically trickles out pages so that we have _some_
  * free memory available even if there is no other activity
@@ -1781,6 +1747,14 @@ void wakeup_kswapd(struct zone *zone, in
 	wake_up_interruptible(&pgdat->kswapd_wait);
 }
 
+unsigned long global_lru_pages(void)
+{
+	return global_page_state(NR_ACTIVE_ANON)
+		+ global_page_state(NR_ACTIVE_FILE)
+		+ global_page_state(NR_INACTIVE_ANON)
+		+ global_page_state(NR_INACTIVE_FILE);
+}
+
 #ifdef CONFIG_PM
 /*
  * Helper function for shrink_all_memory().  Tries to reclaim 'nr_pages' pages
@@ -1806,17 +1780,18 @@ static unsigned long shrink_all_zones(un
 
 		for_each_lru(l) {
 			/* For pass = 0 we don't shrink the active list */
-			if (pass == 0 && l == LRU_ACTIVE)
+			if (pass == 0 &&
+				(l == LRU_ACTIVE_ANON || l == LRU_ACTIVE_FILE))
 				continue;
 
 			zone->nr_scan[l] +=
-				(zone_page_state(zone, NR_INACTIVE + l)
+				(zone_page_state(zone, NR_INACTIVE_ANON + l)
 								>> prio) + 1;
 			if (zone->nr_scan[l] >= nr_pages || pass > 3) {
 				zone->nr_scan[l] = 0;
 				nr_to_scan = min(nr_pages,
 					zone_page_state(zone,
-							NR_INACTIVE + l));
+							NR_INACTIVE_ANON + l));
 				ret += shrink_list(l, nr_to_scan, zone,
 								sc, prio);
 				if (ret >= nr_pages)
@@ -1828,11 +1803,6 @@ static unsigned long shrink_all_zones(un
 	return ret;
 }
 
-static unsigned long count_lru_pages(void)
-{
-	return global_page_state(NR_ACTIVE) + global_page_state(NR_INACTIVE);
-}
-
 /*
  * Try to free `nr_pages' of memory, system-wide, and return the number of
  * freed pages.
@@ -1858,7 +1828,7 @@ unsigned long shrink_all_memory(unsigned
 
 	current->reclaim_state = &reclaim_state;
 
-	lru_pages = count_lru_pages();
+	lru_pages = global_lru_pages();
 	nr_slab = global_page_state(NR_SLAB_RECLAIMABLE);
 	/* If slab caches are huge, it's better to hit them first */
 	while (nr_slab >= lru_pages) {
@@ -1901,7 +1871,7 @@ unsigned long shrink_all_memory(unsigned
 
 			reclaim_state.reclaimed_slab = 0;
 			shrink_slab(sc.nr_scanned, sc.gfp_mask,
-					count_lru_pages());
+					global_lru_pages());
 			ret += reclaim_state.reclaimed_slab;
 			if (ret >= nr_pages)
 				goto out;
@@ -1918,7 +1888,7 @@ unsigned long shrink_all_memory(unsigned
 	if (!ret) {
 		do {
 			reclaim_state.reclaimed_slab = 0;
-			shrink_slab(nr_pages, sc.gfp_mask, count_lru_pages());
+			shrink_slab(nr_pages, sc.gfp_mask, global_lru_pages());
 			ret += reclaim_state.reclaimed_slab;
 		} while (ret < nr_pages && reclaim_state.reclaimed_slab > 0);
 	}
Index: linux-2.6.25-mm1/mm/swap_state.c
===================================================================
--- linux-2.6.25-mm1.orig/mm/swap_state.c	2008-04-24 12:00:01.000000000 -0400
+++ linux-2.6.25-mm1/mm/swap_state.c	2008-04-24 12:01:36.000000000 -0400
@@ -301,7 +301,7 @@ struct page *read_swap_cache_async(swp_e
 			/*
 			 * Initiate read into locked page and return.
 			 */
-			lru_cache_add_active(new_page);
+			lru_cache_add_active_anon(new_page);
 			swap_readpage(NULL, new_page);
 			return new_page;
 		}
Index: linux-2.6.25-mm1/include/linux/mmzone.h
===================================================================
--- linux-2.6.25-mm1.orig/include/linux/mmzone.h	2008-04-24 11:59:51.000000000 -0400
+++ linux-2.6.25-mm1/include/linux/mmzone.h	2008-04-24 12:01:36.000000000 -0400
@@ -81,21 +81,23 @@ struct zone_padding {
 enum zone_stat_item {
 	/* First 128 byte cacheline (assuming 64 bit words) */
 	NR_FREE_PAGES,
-	NR_INACTIVE,	/* must match order of LRU_[IN]ACTIVE */
-	NR_ACTIVE,	/*  "     "     "   "       "         */
+	NR_INACTIVE_ANON,	/* must match order of LRU_[IN]ACTIVE_* */
+	NR_ACTIVE_ANON,		/*  "     "     "   "       "           */
+	NR_INACTIVE_FILE,	/*  "     "     "   "       "           */
+	NR_ACTIVE_FILE,		/*  "     "     "   "       "           */
 	NR_ANON_PAGES,	/* Mapped anonymous pages */
 	NR_FILE_MAPPED,	/* pagecache pages mapped into pagetables.
 			   only modified from process context */
 	NR_FILE_PAGES,
 	NR_FILE_DIRTY,
 	NR_WRITEBACK,
-	/* Second 128 byte cacheline */
 	NR_SLAB_RECLAIMABLE,
 	NR_SLAB_UNRECLAIMABLE,
 	NR_PAGETABLE,		/* used for pagetables */
 	NR_UNSTABLE_NFS,	/* NFS unstable pages */
 	NR_BOUNCE,
 	NR_VMSCAN_WRITE,
+	/* Second 128 byte cacheline */
 	NR_WRITEBACK_TEMP,	/* Writeback using temporary buffers */
 #ifdef CONFIG_NUMA
 	NUMA_HIT,		/* allocated in intended node */
@@ -107,18 +109,33 @@ enum zone_stat_item {
 #endif
 	NR_VM_ZONE_STAT_ITEMS };
 
+/*
+ * We do arithmetic on the LRU lists in various places in the code,
+ * so it is important to keep the active lists LRU_ACTIVE higher in
+ * the array than the corresponding inactive lists, and to keep
+ * the *_FILE lists LRU_FILE higher than the corresponding _ANON lists.
+ */
 #define LRU_BASE 0
+#define LRU_ACTIVE 1
+#define LRU_FILE 2
 
 enum lru_list {
-	LRU_INACTIVE = LRU_BASE,	/* must match order of NR_[IN]ACTIVE */
-	LRU_ACTIVE,			/*  "     "     "   "       "        */
+	LRU_INACTIVE_ANON = LRU_BASE,
+	LRU_ACTIVE_ANON = LRU_BASE + LRU_ACTIVE,
+	LRU_INACTIVE_FILE = LRU_BASE + LRU_FILE,
+	LRU_ACTIVE_FILE = LRU_BASE + LRU_FILE + LRU_ACTIVE,
 	NR_LRU_LISTS };
 
 #define for_each_lru(l) for (l = 0; l < NR_LRU_LISTS; l++)
 
+static inline int is_file_lru(enum lru_list l)
+{
+	return (l == LRU_INACTIVE_FILE || l == LRU_ACTIVE_FILE);
+}
+
 static inline int is_active_lru(enum lru_list l)
 {
-	return (l == LRU_ACTIVE);
+	return (l == LRU_ACTIVE_ANON || l == LRU_ACTIVE_FILE);
 }
 
 enum lru_list page_lru(struct page *page);
@@ -269,6 +286,10 @@ struct zone {
 	spinlock_t		lru_lock;	
 	struct list_head	list[NR_LRU_LISTS];
 	unsigned long		nr_scan[NR_LRU_LISTS];
+
+	unsigned long		recent_rotated_anon;
+	unsigned long		recent_rotated_file;
+
 	unsigned long		pages_scanned;	   /* since last reclaim */
 	unsigned long		flags;		   /* zone flags, see below */
 
Index: linux-2.6.25-mm1/include/linux/mm_inline.h
===================================================================
--- linux-2.6.25-mm1.orig/include/linux/mm_inline.h	2008-04-24 12:00:01.000000000 -0400
+++ linux-2.6.25-mm1/include/linux/mm_inline.h	2008-04-24 12:01:36.000000000 -0400
@@ -5,7 +5,7 @@
  * page_file_cache - should the page be on a file LRU or anon LRU?
  * @page: the page to test
  *
- * Returns !0 if @page is page cache page backed by a regular filesystem,
+ * Returns LRU_FILE if @page is page cache page backed by a regular filesystem,
  * or 0 if @page is anonymous, tmpfs or otherwise ram or swap backed.
  *
  * We would like to get this info without a page flag, but the state
@@ -18,58 +18,83 @@ static inline int page_file_cache(struct
 		return 0;
 
 	/* The page is page cache backed by a normal filesystem. */
-	return 2;
+	return LRU_FILE;
 }
 
 static inline void
 add_page_to_lru_list(struct zone *zone, struct page *page, enum lru_list l)
 {
 	list_add(&page->lru, &zone->list[l]);
-	__inc_zone_state(zone, NR_INACTIVE + l);
+	__inc_zone_state(zone, NR_INACTIVE_ANON + l);
 }
 
 static inline void
 del_page_from_lru_list(struct zone *zone, struct page *page, enum lru_list l)
 {
 	list_del(&page->lru);
-	__dec_zone_state(zone, NR_INACTIVE + l);
+	__dec_zone_state(zone, NR_INACTIVE_ANON + l);
 }
 
 static inline void
-add_page_to_active_list(struct zone *zone, struct page *page)
+add_page_to_inactive_anon_list(struct zone *zone, struct page *page)
 {
-	add_page_to_lru_list(zone, page, LRU_ACTIVE);
+	add_page_to_lru_list(zone, page, LRU_INACTIVE_ANON);
 }
 
 static inline void
-add_page_to_inactive_list(struct zone *zone, struct page *page)
+add_page_to_active_anon_list(struct zone *zone, struct page *page)
 {
-	add_page_to_lru_list(zone, page, LRU_INACTIVE);
+	add_page_to_lru_list(zone, page, LRU_ACTIVE_ANON);
 }
 
 static inline void
-del_page_from_active_list(struct zone *zone, struct page *page)
+add_page_to_inactive_file_list(struct zone *zone, struct page *page)
 {
-	del_page_from_lru_list(zone, page, LRU_ACTIVE);
+	add_page_to_lru_list(zone, page, LRU_INACTIVE_FILE);
 }
 
 static inline void
-del_page_from_inactive_list(struct zone *zone, struct page *page)
+add_page_to_active_file_list(struct zone *zone, struct page *page)
 {
-	del_page_from_lru_list(zone, page, LRU_INACTIVE);
+	add_page_to_lru_list(zone, page, LRU_ACTIVE_FILE);
+}
+
+static inline void
+del_page_from_inactive_anon_list(struct zone *zone, struct page *page)
+{
+	del_page_from_lru_list(zone, page, LRU_INACTIVE_ANON);
+}
+
+static inline void
+del_page_from_active_anon_list(struct zone *zone, struct page *page)
+{
+	del_page_from_lru_list(zone, page, LRU_ACTIVE_ANON);
+}
+
+static inline void
+del_page_from_inactive_file_list(struct zone *zone, struct page *page)
+{
+	del_page_from_lru_list(zone, page, LRU_INACTIVE_FILE);
+}
+
+static inline void
+del_page_from_active_file_list(struct zone *zone, struct page *page)
+{
+	del_page_from_lru_list(zone, page, LRU_ACTIVE_FILE);
 }
 
 static inline void
 del_page_from_lru(struct zone *zone, struct page *page)
 {
-	enum lru_list l = LRU_INACTIVE;
+	enum lru_list l = LRU_INACTIVE_ANON;
 
 	list_del(&page->lru);
 	if (PageActive(page)) {
 		__ClearPageActive(page);
-		l = LRU_ACTIVE;
+		l += LRU_ACTIVE;
 	}
-	__dec_zone_state(zone, NR_INACTIVE + l);
+	l += page_file_cache(page);
+	__dec_zone_state(zone, NR_INACTIVE_ANON + l);
 }
 
 #endif
Index: linux-2.6.25-mm1/include/linux/pagevec.h
===================================================================
--- linux-2.6.25-mm1.orig/include/linux/pagevec.h	2008-04-24 11:59:56.000000000 -0400
+++ linux-2.6.25-mm1/include/linux/pagevec.h	2008-04-24 12:01:36.000000000 -0400
@@ -81,20 +81,37 @@ static inline void pagevec_free(struct p
 		__pagevec_free(pvec);
 }
 
-static inline void __pagevec_lru_add(struct pagevec *pvec)
+static inline void __pagevec_lru_add_anon(struct pagevec *pvec)
 {
-	____pagevec_lru_add(pvec, LRU_INACTIVE);
+	____pagevec_lru_add(pvec, LRU_INACTIVE_ANON);
 }
 
-static inline void __pagevec_lru_add_active(struct pagevec *pvec)
+static inline void __pagevec_lru_add_active_anon(struct pagevec *pvec)
 {
-	____pagevec_lru_add(pvec, LRU_ACTIVE);
+	____pagevec_lru_add(pvec, LRU_ACTIVE_ANON);
 }
 
-static inline void pagevec_lru_add(struct pagevec *pvec)
+static inline void __pagevec_lru_add_file(struct pagevec *pvec)
+{
+	____pagevec_lru_add(pvec, LRU_INACTIVE_FILE);
+}
+
+static inline void __pagevec_lru_add_active_file(struct pagevec *pvec)
+{
+	____pagevec_lru_add(pvec, LRU_ACTIVE_FILE);
+}
+
+
+static inline void pagevec_lru_add_file(struct pagevec *pvec)
+{
+	if (pagevec_count(pvec))
+		__pagevec_lru_add_file(pvec);
+}
+
+static inline void pagevec_lru_add_anon(struct pagevec *pvec)
 {
 	if (pagevec_count(pvec))
-		__pagevec_lru_add(pvec);
+		__pagevec_lru_add_anon(pvec);
 }
 
 #endif /* _LINUX_PAGEVEC_H */
Index: linux-2.6.25-mm1/include/linux/vmstat.h
===================================================================
--- linux-2.6.25-mm1.orig/include/linux/vmstat.h	2008-04-22 10:33:44.000000000 -0400
+++ linux-2.6.25-mm1/include/linux/vmstat.h	2008-04-24 12:01:36.000000000 -0400
@@ -153,6 +153,16 @@ static inline unsigned long zone_page_st
 	return x;
 }
 
+extern unsigned long global_lru_pages(void);
+
+static inline unsigned long zone_lru_pages(struct zone *zone)
+{
+	return (zone_page_state(zone, NR_ACTIVE_ANON)
+		+ zone_page_state(zone, NR_ACTIVE_FILE)
+		+ zone_page_state(zone, NR_INACTIVE_ANON)
+		+ zone_page_state(zone, NR_INACTIVE_FILE));
+}
+
 #ifdef CONFIG_NUMA
 /*
  * Determine the per node value of a stat item. This function
Index: linux-2.6.25-mm1/mm/page-writeback.c
===================================================================
--- linux-2.6.25-mm1.orig/mm/page-writeback.c	2008-04-22 10:33:45.000000000 -0400
+++ linux-2.6.25-mm1/mm/page-writeback.c	2008-04-24 12:01:36.000000000 -0400
@@ -331,9 +331,7 @@ static unsigned long highmem_dirtyable_m
 		struct zone *z =
 			&NODE_DATA(node)->node_zones[ZONE_HIGHMEM];
 
-		x += zone_page_state(z, NR_FREE_PAGES)
-			+ zone_page_state(z, NR_INACTIVE)
-			+ zone_page_state(z, NR_ACTIVE);
+		x += zone_page_state(z, NR_FREE_PAGES) + zone_lru_pages(z);
 	}
 	/*
 	 * Make sure that the number of highmem pages is never larger
@@ -351,9 +349,7 @@ static unsigned long determine_dirtyable
 {
 	unsigned long x;
 
-	x = global_page_state(NR_FREE_PAGES)
-		+ global_page_state(NR_INACTIVE)
-		+ global_page_state(NR_ACTIVE);
+	x = global_page_state(NR_FREE_PAGES) + global_lru_pages();
 
 	if (!vm_highmem_is_dirtyable)
 		x -= highmem_dirtyable_memory(x);
Index: linux-2.6.25-mm1/include/linux/swap.h
===================================================================
--- linux-2.6.25-mm1.orig/include/linux/swap.h	2008-04-24 11:59:51.000000000 -0400
+++ linux-2.6.25-mm1/include/linux/swap.h	2008-04-24 12:01:36.000000000 -0400
@@ -184,14 +184,24 @@ extern void swap_setup(void);
  * lru_cache_add: add a page to the page lists
  * @page: the page to add
  */
-static inline void lru_cache_add(struct page *page)
+static inline void lru_cache_add_anon(struct page *page)
 {
-	__lru_cache_add(page, LRU_INACTIVE);
+	__lru_cache_add(page, LRU_INACTIVE_ANON);
 }
 
-static inline void lru_cache_add_active(struct page *page)
+static inline void lru_cache_add_active_anon(struct page *page)
 {
-	__lru_cache_add(page, LRU_ACTIVE);
+	__lru_cache_add(page, LRU_ACTIVE_ANON);
+}
+
+static inline void lru_cache_add_file(struct page *page)
+{
+	__lru_cache_add(page, LRU_INACTIVE_FILE);
+}
+
+static inline void lru_cache_add_active_file(struct page *page)
+{
+	__lru_cache_add(page, LRU_ACTIVE_FILE);
 }
 
 /* linux/mm/vmscan.c */
@@ -199,7 +209,7 @@ extern unsigned long try_to_free_pages(s
 					gfp_t gfp_mask);
 extern unsigned long try_to_free_mem_cgroup_pages(struct mem_cgroup *mem,
 							gfp_t gfp_mask);
-extern int __isolate_lru_page(struct page *page, int mode);
+extern int __isolate_lru_page(struct page *page, int mode, int file);
 extern unsigned long shrink_all_memory(unsigned long nr_pages);
 extern int vm_swappiness;
 extern int remove_mapping(struct address_space *mapping, struct page *page);
Index: linux-2.6.25-mm1/include/linux/memcontrol.h
===================================================================
--- linux-2.6.25-mm1.orig/include/linux/memcontrol.h	2008-04-24 10:46:43.000000000 -0400
+++ linux-2.6.25-mm1/include/linux/memcontrol.h	2008-04-24 12:01:36.000000000 -0400
@@ -41,7 +41,7 @@ extern unsigned long mem_cgroup_isolate_
 					unsigned long *scanned, int order,
 					int mode, struct zone *z,
 					struct mem_cgroup *mem_cont,
-					int active);
+					int active, int file);
 extern void mem_cgroup_out_of_memory(struct mem_cgroup *mem, gfp_t gfp_mask);
 int task_in_mem_cgroup(struct task_struct *task, const struct mem_cgroup *mem);
 
Index: linux-2.6.25-mm1/mm/memcontrol.c
===================================================================
--- linux-2.6.25-mm1.orig/mm/memcontrol.c	2008-04-24 10:56:57.000000000 -0400
+++ linux-2.6.25-mm1/mm/memcontrol.c	2008-04-24 12:01:36.000000000 -0400
@@ -161,6 +161,7 @@ struct page_cgroup {
 };
 #define PAGE_CGROUP_FLAG_CACHE	(0x1)	/* charged as cache */
 #define PAGE_CGROUP_FLAG_ACTIVE (0x2)	/* page is active in this cgroup */
+#define PAGE_CGROUP_FLAG_FILE	(0x4)	/* page is file system backed */
 
 static int page_cgroup_nid(struct page_cgroup *pc)
 {
@@ -271,8 +272,12 @@ static void unlock_page_cgroup(struct pa
 static void __mem_cgroup_remove_list(struct mem_cgroup_per_zone *mz,
 			struct page_cgroup *pc)
 {
-	int from = pc->flags & PAGE_CGROUP_FLAG_ACTIVE;
-	int lru = !!from;
+	int lru = LRU_BASE;
+
+	if (pc->flags & PAGE_CGROUP_FLAG_ACTIVE)
+		lru += LRU_ACTIVE;
+	if (pc->flags & PAGE_CGROUP_FLAG_FILE)
+		lru += LRU_FILE;
 
 	MEM_CGROUP_ZSTAT(mz, lru) -= 1;
 
@@ -283,10 +288,12 @@ static void __mem_cgroup_remove_list(str
 static void __mem_cgroup_add_list(struct mem_cgroup_per_zone *mz,
 				struct page_cgroup *pc)
 {
-	int lru = LRU_INACTIVE;
+	int lru = LRU_BASE;
 
 	if (pc->flags & PAGE_CGROUP_FLAG_ACTIVE)
 		lru += LRU_ACTIVE;
+	if (pc->flags & PAGE_CGROUP_FLAG_FILE)
+		lru += LRU_FILE;
 
 	MEM_CGROUP_ZSTAT(mz, lru) += 1;
 	list_add(&pc->lru, &mz->lists[lru]);
@@ -297,10 +304,9 @@ static void __mem_cgroup_add_list(struct
 static void __mem_cgroup_move_lists(struct page_cgroup *pc, bool active)
 {
 	struct mem_cgroup_per_zone *mz = page_cgroup_zoneinfo(pc);
-	int lru = LRU_INACTIVE;
-
-	if (pc->flags & PAGE_CGROUP_FLAG_ACTIVE)
-		lru += LRU_ACTIVE;
+	int from = pc->flags & PAGE_CGROUP_FLAG_ACTIVE;
+	int file = pc->flags & PAGE_CGROUP_FLAG_FILE;
+	int lru = LRU_FILE * !!file + !!from;
 
 	MEM_CGROUP_ZSTAT(mz, lru) -= 1;
 
@@ -309,7 +315,7 @@ static void __mem_cgroup_move_lists(stru
 	else
 		pc->flags &= ~PAGE_CGROUP_FLAG_ACTIVE;
 
-	lru = !!active;
+	lru = LRU_FILE * !!file + !!active;
 	MEM_CGROUP_ZSTAT(mz, lru) += 1;
 	list_move(&pc->lru, &mz->lists[lru]);
 }
@@ -371,21 +377,6 @@ int mem_cgroup_calc_mapped_ratio(struct 
 }
 
 /*
- * This function is called from vmscan.c. In page reclaiming loop. balance
- * between active and inactive list is calculated. For memory controller
- * page reclaiming, we should use using mem_cgroup's imbalance rather than
- * zone's global lru imbalance.
- */
-long mem_cgroup_reclaim_imbalance(struct mem_cgroup *mem)
-{
-	unsigned long active, inactive;
-	/* active and inactive are the number of pages. 'long' is ok.*/
-	active = mem_cgroup_get_all_zonestat(mem, LRU_ACTIVE);
-	inactive = mem_cgroup_get_all_zonestat(mem, LRU_INACTIVE);
-	return (long) (active / (inactive + 1));
-}
-
-/*
  * prev_priority control...this will be used in memory reclaim path.
  */
 int mem_cgroup_get_reclaim_priority(struct mem_cgroup *mem)
@@ -430,7 +421,7 @@ unsigned long mem_cgroup_isolate_pages(u
 					unsigned long *scanned, int order,
 					int mode, struct zone *z,
 					struct mem_cgroup *mem_cont,
-					int active)
+					int active, int file)
 {
 	unsigned long nr_taken = 0;
 	struct page *page;
@@ -441,7 +432,7 @@ unsigned long mem_cgroup_isolate_pages(u
 	int nid = z->zone_pgdat->node_id;
 	int zid = zone_idx(z);
 	struct mem_cgroup_per_zone *mz;
-	int lru = !!active;
+	int lru = LRU_FILE * !!file + !!active;
 
 	BUG_ON(!mem_cont);
 	mz = mem_cgroup_zoneinfo(mem_cont, nid, zid);
@@ -457,6 +448,9 @@ unsigned long mem_cgroup_isolate_pages(u
 		if (unlikely(!PageLRU(page)))
 			continue;
 
+		/*
+		 * TODO: play better with lumpy reclaim, grabbing anything.
+		 */
 		if (PageActive(page) && !active) {
 			__mem_cgroup_move_lists(pc, true);
 			continue;
@@ -469,7 +463,7 @@ unsigned long mem_cgroup_isolate_pages(u
 		scan++;
 		list_move(&pc->lru, &pc_list);
 
-		if (__isolate_lru_page(page, mode) == 0) {
+		if (__isolate_lru_page(page, mode, file) == 0) {
 			list_move(&page->lru, dst);
 			nr_taken++;
 		}
@@ -574,6 +568,8 @@ retry:
 	pc->flags = PAGE_CGROUP_FLAG_ACTIVE;
 	if (ctype == MEM_CGROUP_CHARGE_TYPE_CACHE)
 		pc->flags = PAGE_CGROUP_FLAG_CACHE;
+	if (page_file_cache(page))
+		pc->flags |= PAGE_CGROUP_FLAG_FILE;
 
 	lock_page_cgroup(page);
 	if (page_get_page_cgroup(page)) {
@@ -872,14 +868,21 @@ static int mem_control_stat_show(struct 
 	}
 	/* showing # of active pages */
 	{
-		unsigned long active, inactive;
+		unsigned long active_anon, inactive_anon;
+		unsigned long active_file, inactive_file;
 
-		inactive = mem_cgroup_get_all_zonestat(mem_cont,
-						LRU_INACTIVE);
-		active = mem_cgroup_get_all_zonestat(mem_cont,
-						LRU_ACTIVE);
-		cb->fill(cb, "active", (active) * PAGE_SIZE);
-		cb->fill(cb, "inactive", (inactive) * PAGE_SIZE);
+		inactive_anon = mem_cgroup_get_all_zonestat(mem_cont,
+						LRU_INACTIVE_ANON);
+		active_anon = mem_cgroup_get_all_zonestat(mem_cont,
+						LRU_ACTIVE_ANON);
+		inactive_file = mem_cgroup_get_all_zonestat(mem_cont,
+						LRU_INACTIVE_FILE);
+		active_file = mem_cgroup_get_all_zonestat(mem_cont,
+						LRU_ACTIVE_FILE);
+		cb->fill(cb, "active_anon", (active_anon) * PAGE_SIZE);
+		cb->fill(cb, "inactive_anon", (inactive_anon) * PAGE_SIZE);
+		cb->fill(cb, "active_file", (active_file) * PAGE_SIZE);
+		cb->fill(cb, "inactive_file", (inactive_file) * PAGE_SIZE);
 	}
 	return 0;
 }

-- 
All Rights Reversed


^ permalink raw reply	[flat|nested] 34+ messages in thread

* [PATCH -mm 08/15] SEQ replacement for anonymous pages
  2008-04-28 18:18 [PATCH -mm 00/15] VM pageout scalability improvements (V6) Rik van Riel
                   ` (6 preceding siblings ...)
  2008-04-28 18:18 ` [PATCH -mm 07/15] split LRU lists into anon & file sets Rik van Riel
@ 2008-04-28 18:18 ` Rik van Riel
  2008-04-28 18:18 ` [PATCH -mm 09/15] add some sanity checks to get_scan_ratio Rik van Riel
                   ` (8 subsequent siblings)
  16 siblings, 0 replies; 34+ messages in thread
From: Rik van Riel @ 2008-04-28 18:18 UTC (permalink / raw)
  To: linux-kernel; +Cc: lee.schermerhorn, akpm, kosaki.motohiro

[-- Attachment #1: rvr-03-linux-2.6-vm-anon-seq.patch --]
[-- Type: text/plain, Size: 7270 bytes --]

We avoid evicting and scanning anonymous pages for the most part, but
under some workloads we can end up with most of memory filled with
anonymous pages.  At that point, we suddenly need to clear the referenced
bits on all of memory, which can take ages on very large memory systems.

We can reduce the maximum number of pages that need to be scanned by
not taking the referenced state into account when deactivating an
anonymous page.  After all, every anonymous page starts out referenced,
so why check?

If an anonymous page gets referenced again before it reaches the end
of the inactive list, we move it back to the active list.

To keep the maximum amount of necessary work reasonable, we scale the
active to inactive ratio with the size of memory, using the formula
active:inactive ratio = sqrt(memory in GB * 10).

Kswapd CPU use now seems to scale with the amount of pageout bandwidth,
instead of with the amount of memory present in the system.
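
For illustration, here is a minimal userspace sketch of that ratio
calculation (the real kernel version is setup_per_zone_inactive_ratio()
in the page_alloc.c hunk below, which uses int_sqrt() on
zone->present_pages; isqrt() and the sample sizes here exist only for
the demonstration):

	#include <stdio.h>

	/* crude integer square root, same idea as the kernel's int_sqrt() */
	static unsigned long isqrt(unsigned long x)
	{
		unsigned long r = 0;

		while ((r + 1) * (r + 1) <= x)
			r++;
		return r;
	}

	/* active:inactive target ratio = sqrt(10 * zone size in GB) */
	static unsigned int inactive_ratio(unsigned long zone_gb)
	{
		unsigned int ratio = isqrt(10 * zone_gb);

		return ratio ? ratio : 1;
	}

	int main(void)
	{
		unsigned long sizes_gb[] = { 1, 10, 100, 1024, 10240 };
		int i;

		for (i = 0; i < 5; i++)
			printf("%6lu GB -> ratio %u:1\n",
			       sizes_gb[i], inactive_ratio(sizes_gb[i]));
		return 0;
	}

For a 16GB zone, for example, the ratio works out to 12, so only about
1/13th of the anonymous pages (roughly 1.2GB) need to sit on the
inactive list and be rescanned.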

Signed-off-by: Rik van Riel <riel@redhat.com>
Reviewed-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>

Index: linux-2.6.25-mm1/include/linux/mm_inline.h
===================================================================
--- linux-2.6.25-mm1.orig/include/linux/mm_inline.h	2008-04-24 12:01:36.000000000 -0400
+++ linux-2.6.25-mm1/include/linux/mm_inline.h	2008-04-24 12:03:35.000000000 -0400
@@ -97,4 +97,16 @@ del_page_from_lru(struct zone *zone, str
 	__dec_zone_state(zone, NR_INACTIVE_ANON + l);
 }
 
+static inline int inactive_anon_low(struct zone *zone)
+{
+	unsigned long active, inactive;
+
+	active = zone_page_state(zone, NR_ACTIVE_ANON);
+	inactive = zone_page_state(zone, NR_INACTIVE_ANON);
+
+	if (inactive * zone->inactive_ratio < active)
+		return 1;
+
+	return 0;
+}
 #endif
Index: linux-2.6.25-mm1/include/linux/mmzone.h
===================================================================
--- linux-2.6.25-mm1.orig/include/linux/mmzone.h	2008-04-24 12:01:36.000000000 -0400
+++ linux-2.6.25-mm1/include/linux/mmzone.h	2008-04-24 12:03:35.000000000 -0400
@@ -311,6 +311,11 @@ struct zone {
 	 */
 	int prev_priority;
 
+	/*
+	 * The ratio of active to inactive pages.
+	 */
+	unsigned int inactive_ratio;
+
 
 	ZONE_PADDING(_pad2_)
 	/* Rarely used or read-mostly fields */
Index: linux-2.6.25-mm1/mm/page_alloc.c
===================================================================
--- linux-2.6.25-mm1.orig/mm/page_alloc.c	2008-04-24 12:01:36.000000000 -0400
+++ linux-2.6.25-mm1/mm/page_alloc.c	2008-04-24 12:03:35.000000000 -0400
@@ -4268,6 +4268,45 @@ void setup_per_zone_pages_min(void)
 	calculate_totalreserve_pages();
 }
 
+/**
+ * setup_per_zone_inactive_ratio - called when min_free_kbytes changes.
+ *
+ * The inactive anon list should be small enough that the VM never has to
+ * do too much work, but large enough that each inactive page has a chance
+ * to be referenced again before it is swapped out.
+ *
+ * The inactive_anon ratio is the ratio of active to inactive anonymous
+ * pages.  I.e. a ratio of 3 means 3:1 or 25% of the anonymous pages are
+ * on the inactive list.
+ *
+ * total     return    max
+ * memory    value     inactive anon
+ * -------------------------------------
+ *   10MB       1         5MB
+ *  100MB       1        50MB
+ *    1GB       3       250MB
+ *   10GB      10       0.9GB
+ *  100GB      31         3GB
+ *    1TB     101        10GB
+ *   10TB     320        32GB
+ */
+void setup_per_zone_inactive_ratio(void)
+{
+	struct zone *zone;
+
+	for_each_zone(zone) {
+		unsigned int gb, ratio;
+
+		/* Zone size in gigabytes */
+		gb = zone->present_pages >> (30 - PAGE_SHIFT);
+		ratio = int_sqrt(10 * gb);
+		if (!ratio)
+			ratio = 1;
+
+		zone->inactive_ratio = ratio;
+	}
+}
+
 /*
  * Initialise min_free_kbytes.
  *
@@ -4305,6 +4344,7 @@ static int __init init_per_zone_pages_mi
 		min_free_kbytes = 65536;
 	setup_per_zone_pages_min();
 	setup_per_zone_lowmem_reserve();
+	setup_per_zone_inactive_ratio();
 	return 0;
 }
 module_init(init_per_zone_pages_min)
Index: linux-2.6.25-mm1/mm/vmscan.c
===================================================================
--- linux-2.6.25-mm1.orig/mm/vmscan.c	2008-04-24 12:02:16.000000000 -0400
+++ linux-2.6.25-mm1/mm/vmscan.c	2008-04-24 12:03:35.000000000 -0400
@@ -1008,7 +1008,7 @@ static inline int zone_is_near_oom(struc
 static void shrink_active_list(unsigned long nr_pages, struct zone *zone,
 			struct scan_control *sc, int priority, int file)
 {
-	unsigned long pgmoved;
+	unsigned long pgmoved = 0;
 	int pgdeactivate = 0;
 	unsigned long pgscanned;
 	LIST_HEAD(l_hold);	/* The pages which were snipped off */
@@ -1040,13 +1040,25 @@ static void shrink_active_list(unsigned 
 		cond_resched();
 		page = lru_to_page(&l_hold);
 		list_del(&page->lru);
-		if (page_referenced(page, 0, sc->mem_cgroup))
+		if (page_referenced(page, 0, sc->mem_cgroup) && file) {
+			/* Referenced file pages stay active. */
 			list_add(&page->lru, &l_active);
-		else
+		} else {
 			list_add(&page->lru, &l_inactive);
+			if (!file)
+				/* Anonymous pages always get deactivated. */
+				pgmoved++;
+		}
 	}
 
 	/*
+	 * Count the referenced anon pages as rotated, to balance pageout
+	 * scan pressure between file and anonymous pages in get_scan_ratio.
+	 */
+	if (!file)
+		zone->recent_rotated_anon += pgmoved;
+
+	/*
 	 * Now put the pages back on the appropriate [file or anon] inactive
 	 * and active lists.
 	 */
@@ -1129,7 +1141,11 @@ static unsigned long shrink_list(enum lr
 {
 	int file = is_file_lru(lru);
 
-	if (lru == LRU_ACTIVE_ANON || lru == LRU_ACTIVE_FILE) {
+	if (lru == LRU_ACTIVE_FILE) {
+		shrink_active_list(nr_to_scan, zone, sc, priority, file);
+		return 0;
+	}
+	if (lru == LRU_ACTIVE_ANON && inactive_anon_low(zone)) {
 		shrink_active_list(nr_to_scan, zone, sc, priority, file);
 		return 0;
 	}
@@ -1239,8 +1255,8 @@ static unsigned long shrink_zone(int pri
 		}
 	}
 
-	while (nr[LRU_ACTIVE_ANON] || nr[LRU_INACTIVE_ANON] ||
-				nr[LRU_ACTIVE_FILE] || nr[LRU_INACTIVE_FILE]) {
+	while (nr[LRU_INACTIVE_ANON] || nr[LRU_ACTIVE_FILE] ||
+						 nr[LRU_INACTIVE_FILE]) {
 		for_each_lru(l) {
 			if (nr[l]) {
 				nr_to_scan = min(nr[l],
@@ -1542,6 +1558,14 @@ loop_again:
 			    priority != DEF_PRIORITY)
 				continue;
 
+			/*
+			 * Do some background aging of the anon list, to give
+			 * pages a chance to be referenced before reclaiming.
+			 */
+			if (inactive_anon_low(zone))
+				shrink_active_list(SWAP_CLUSTER_MAX, zone,
+							&sc, priority, 0);
+
 			if (!zone_watermark_ok(zone, order, zone->pages_high,
 					       0, 0)) {
 				end_zone = i;
Index: linux-2.6.25-mm1/mm/vmstat.c
===================================================================
--- linux-2.6.25-mm1.orig/mm/vmstat.c	2008-04-24 12:01:36.000000000 -0400
+++ linux-2.6.25-mm1/mm/vmstat.c	2008-04-24 12:03:35.000000000 -0400
@@ -806,10 +806,12 @@ static void zoneinfo_show_print(struct s
 	seq_printf(m,
 		   "\n  all_unreclaimable: %u"
 		   "\n  prev_priority:     %i"
-		   "\n  start_pfn:         %lu",
+		   "\n  start_pfn:         %lu"
+		   "\n  inactive_ratio:    %u",
 			   zone_is_all_unreclaimable(zone),
 		   zone->prev_priority,
-		   zone->zone_start_pfn);
+		   zone->zone_start_pfn,
+		   zone->inactive_ratio);
 	seq_putc(m, '\n');
 }
 

-- 
All Rights Reversed


^ permalink raw reply	[flat|nested] 34+ messages in thread

* [PATCH -mm 09/15] add some sanity checks to get_scan_ratio
  2008-04-28 18:18 [PATCH -mm 00/15] VM pageout scalability improvements (V6) Rik van Riel
                   ` (7 preceding siblings ...)
  2008-04-28 18:18 ` [PATCH -mm 08/15] SEQ replacement for anonymous pages Rik van Riel
@ 2008-04-28 18:18 ` Rik van Riel
  2008-05-15  6:34   ` MinChan Kim
  2008-04-28 18:18 ` [PATCH -mm 10/15] add newly swapped in pages to the inactive list Rik van Riel
                   ` (7 subsequent siblings)
  16 siblings, 1 reply; 34+ messages in thread
From: Rik van Riel @ 2008-04-28 18:18 UTC (permalink / raw)
  To: linux-kernel; +Cc: lee.schermerhorn, akpm, kosaki.motohiro

[-- Attachment #1: rvr-04-linux-2.6-scan-ratio-fixes.patch --]
[-- Type: text/plain, Size: 8367 bytes --]

The access-ratio-based scan rate determination in get_scan_ratio
works well in most situations, but needs to be corrected in some
corner cases:
- if we run out of swap space, do not bother scanning the anon LRUs
- if we have already freed all of the page cache, we need to scan
  the anon LRUs
- restore the *actual* access-ratio-based scan rate algorithm; the
  previous versions of this patch series had the wrong version
  (a standalone sketch of the restored calculation follows below)
- scale the number of pages added to zone->nr_scan[l]
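
A standalone restatement of the restored calculation (this mirrors the
get_scan_ratio() changes in the vmscan.c hunk below; the function and
parameter names here are made up for the sketch, and the code that
produces the anon/file priorities is not shown):

	/*
	 * scanned/rotated is large when most pages scanned off a list were
	 * evicted, and small when most of them were rotated back, so each
	 * list gets scanned in proportion to how productive scanning it has
	 * recently been, weighted by its priority.
	 */
	static void scan_percent(unsigned long anon_prio, unsigned long file_prio,
				 unsigned long scanned_anon, unsigned long rotated_anon,
				 unsigned long scanned_file, unsigned long rotated_file,
				 unsigned long percent[2])
	{
		unsigned long ap, fp;

		ap = (anon_prio + 1) * (scanned_anon + 1) / (rotated_anon + 1);
		fp = (file_prio + 1) * (scanned_file + 1) / (rotated_file + 1);

		/* normalize to percentages */
		percent[0] = 100 * ap / (ap + fp + 1);	/* anon */
		percent[1] = 100 - percent[0];		/* file */
	}

For example, with priorities of 60 (anon) and 140 (file), and recent
counters of 1000 scanned / 900 rotated anon pages versus 1000 scanned /
100 rotated file pages, this comes out to roughly 4% anon and 96% file
scanning: pressure shifts toward the list whose pages are actually
being evicted.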

Signed-off-by: Rik van Riel <riel@redhat.com>

Index: linux-2.6.25-mm1/mm/vmscan.c
===================================================================
--- linux-2.6.25-mm1.orig/mm/vmscan.c	2008-04-24 12:03:35.000000000 -0400
+++ linux-2.6.25-mm1/mm/vmscan.c	2008-04-24 12:03:40.000000000 -0400
@@ -893,8 +893,13 @@ static unsigned long shrink_inactive_lis
 		__mod_zone_page_state(zone, NR_INACTIVE_ANON,
 						-count[LRU_INACTIVE_ANON]);
 
-		if (scan_global_lru(sc))
+		if (scan_global_lru(sc)) {
 			zone->pages_scanned += nr_scan;
+			zone->recent_scanned_anon += count[LRU_ACTIVE_ANON] +
+						     count[LRU_INACTIVE_ANON];
+			zone->recent_scanned_file += count[LRU_ACTIVE_FILE] +
+						     count[LRU_INACTIVE_FILE];
+		}
 		spin_unlock_irq(&zone->lru_lock);
 
 		nr_scanned += nr_scan;
@@ -944,11 +949,13 @@ static unsigned long shrink_inactive_lis
 			VM_BUG_ON(PageLRU(page));
 			SetPageLRU(page);
 			list_del(&page->lru);
-			if (page_file_cache(page)) {
+			if (page_file_cache(page))
 				lru += LRU_FILE;
-				zone->recent_rotated_file++;
-			} else {
-				zone->recent_rotated_anon++;
+			if (scan_global_lru(sc)) {
+				if (page_file_cache(page))
+					zone->recent_rotated_file++;
+				else
+					zone->recent_rotated_anon++;
 			}
 			if (PageActive(page))
 				lru += LRU_ACTIVE;
@@ -1027,8 +1034,13 @@ static void shrink_active_list(unsigned 
 	 * zone->pages_scanned is used for detect zone's oom
 	 * mem_cgroup remembers nr_scan by itself.
 	 */
-	if (scan_global_lru(sc))
+	if (scan_global_lru(sc)) {
 		zone->pages_scanned += pgscanned;
+		if (file)
+			zone->recent_scanned_file += pgscanned;
+		else
+			zone->recent_scanned_anon += pgscanned;
+	}
 
 	if (file)
 		__mod_zone_page_state(zone, NR_ACTIVE_FILE, -pgmoved);
@@ -1165,9 +1177,8 @@ static unsigned long shrink_list(enum lr
 static void get_scan_ratio(struct zone *zone, struct scan_control * sc,
 					unsigned long *percent)
 {
-	unsigned long anon, file;
+	unsigned long anon, file, free;
 	unsigned long anon_prio, file_prio;
-	unsigned long rotate_sum;
 	unsigned long ap, fp;
 
 	anon  = zone_page_state(zone, NR_ACTIVE_ANON) +
@@ -1175,15 +1186,19 @@ static void get_scan_ratio(struct zone *
 	file  = zone_page_state(zone, NR_ACTIVE_FILE) +
 		zone_page_state(zone, NR_INACTIVE_FILE);
 
-	rotate_sum = zone->recent_rotated_file + zone->recent_rotated_anon;
-
 	/* Keep a floating average of RECENT references. */
-	if (unlikely(rotate_sum > min(anon, file))) {
+	if (unlikely(zone->recent_scanned_anon > anon / zone->inactive_ratio)) {
 		spin_lock_irq(&zone->lru_lock);
-		zone->recent_rotated_file /= 2;
+		zone->recent_scanned_anon /= 2;
 		zone->recent_rotated_anon /= 2;
 		spin_unlock_irq(&zone->lru_lock);
-		rotate_sum /= 2;
+	}
+
+	if (unlikely(zone->recent_scanned_file > file / 4)) {
+		spin_lock_irq(&zone->lru_lock);
+		zone->recent_scanned_file /= 2;
+		zone->recent_rotated_file /= 2;
+		spin_unlock_irq(&zone->lru_lock);
 	}
 
 	/*
@@ -1196,23 +1211,33 @@ static void get_scan_ratio(struct zone *
 	/*
 	 *                  anon       recent_rotated_anon
 	 * %anon = 100 * ----------- / ------------------- * IO cost
-	 *               anon + file       rotate_sum
+	 *               anon + file   recent_scanned_anon
 	 */
-	ap = (anon_prio * anon) / (anon + file + 1);
-	ap *= rotate_sum / (zone->recent_rotated_anon + 1);
-	if (ap == 0)
-		ap = 1;
-	else if (ap > 100)
-		ap = 100;
-	percent[0] = ap;
-
-	fp = (file_prio * file) / (anon + file + 1);
-	fp *= rotate_sum / (zone->recent_rotated_file + 1);
-	if (fp == 0)
-		fp = 1;
-	else if (fp > 100)
-		fp = 100;
-	percent[1] = fp;
+	ap = (anon_prio + 1) * (zone->recent_scanned_anon + 1);
+	ap /= zone->recent_rotated_anon + 1;
+
+	fp = (file_prio + 1) * (zone->recent_scanned_file + 1);
+	fp /= zone->recent_rotated_file + 1;
+
+	/* Normalize to percentages */
+	percent[0] = 100 * ap / (ap + fp + 1);
+	percent[1] = 100 - percent[0];
+
+	free = zone_page_state(zone, NR_FREE_PAGES);
+
+	/*
+	 * If we have no swap space, do not bother scanning anon pages.
+	 */
+	if (nr_swap_pages <= 0) {
+		percent[0] = 0;
+		percent[1] = 100;
+	}
+	/*
+	 * If we already freed most file pages, scan the anon pages
+	 * regardless of the page access ratios or swappiness setting.
+	 */
+	else if (file + free <= zone->pages_high)
+		percent[0] = 100;
 }
 
 
@@ -1233,13 +1258,17 @@ static unsigned long shrink_zone(int pri
 	for_each_lru(l) {
 		if (scan_global_lru(sc)) {
 			int file = is_file_lru(l);
+			int scan;
 			/*
 			 * Add one to nr_to_scan just to make sure that the
-			 * kernel will slowly sift through the active list.
+			 * kernel will slowly sift through each list.
 			 */
-			zone->nr_scan[l] += (zone_page_state(zone,
-				NR_INACTIVE_ANON + l) >> priority) + 1;
-			nr[l] = zone->nr_scan[l] * percent[file] / 100;
+			scan = zone_page_state(zone, NR_INACTIVE_ANON + l);
+			scan >>= priority;
+			scan = (scan * percent[file]) / 100;
+
+			zone->nr_scan[l] += scan + 1;
+			nr[l] = zone->nr_scan[l];
 			if (nr[l] >= sc->swap_cluster_max)
 				zone->nr_scan[l] = 0;
 			else
@@ -1256,7 +1285,7 @@ static unsigned long shrink_zone(int pri
 	}
 
 	while (nr[LRU_INACTIVE_ANON] || nr[LRU_ACTIVE_FILE] ||
-						 nr[LRU_INACTIVE_FILE]) {
+					nr[LRU_INACTIVE_FILE]) {
 		for_each_lru(l) {
 			if (nr[l]) {
 				nr_to_scan = min(nr[l],
@@ -1269,6 +1298,14 @@ static unsigned long shrink_zone(int pri
 		}
 	}
 
+	/*
+	 * Even if we did not try to evict anon pages at all, we want to
+	 * rebalance the anon lru active/inactive ratio.
+	 */
+	if (scan_global_lru(sc) && inactive_anon_low(zone))
+		shrink_list(NR_ACTIVE_ANON, SWAP_CLUSTER_MAX, zone, sc,
+								priority);
+
 	throttle_vm_writeout(sc->gfp_mask);
 	return nr_reclaimed;
 }
Index: linux-2.6.25-mm1/include/linux/mmzone.h
===================================================================
--- linux-2.6.25-mm1.orig/include/linux/mmzone.h	2008-04-24 12:03:35.000000000 -0400
+++ linux-2.6.25-mm1/include/linux/mmzone.h	2008-04-24 12:03:40.000000000 -0400
@@ -289,6 +289,8 @@ struct zone {
 
 	unsigned long		recent_rotated_anon;
 	unsigned long		recent_rotated_file;
+	unsigned long		recent_scanned_anon;
+	unsigned long		recent_scanned_file;
 
 	unsigned long		pages_scanned;	   /* since last reclaim */
 	unsigned long		flags;		   /* zone flags, see below */
Index: linux-2.6.25-mm1/mm/page_alloc.c
===================================================================
--- linux-2.6.25-mm1.orig/mm/page_alloc.c	2008-04-24 12:03:35.000000000 -0400
+++ linux-2.6.25-mm1/mm/page_alloc.c	2008-04-24 12:03:40.000000000 -0400
@@ -3514,7 +3514,8 @@ static void __paginginit free_area_init_
 		}
 		zone->recent_rotated_anon = 0;
 		zone->recent_rotated_file = 0;
-//TODO recent_scanned_* ???
+		zone->recent_scanned_anon = 0;
+		zone->recent_scanned_file = 0;
 		zap_zone_vm_stats(zone);
 		zone->flags = 0;
 		if (!size)
Index: linux-2.6.25-mm1/mm/swap.c
===================================================================
--- linux-2.6.25-mm1.orig/mm/swap.c	2008-04-24 12:01:36.000000000 -0400
+++ linux-2.6.25-mm1/mm/swap.c	2008-04-24 12:03:40.000000000 -0400
@@ -176,8 +176,8 @@ void activate_page(struct page *page)
 
 	spin_lock_irq(&zone->lru_lock);
 	if (PageLRU(page) && !PageActive(page)) {
-		int lru = LRU_BASE;
-		lru += page_file_cache(page);
+		int file = page_file_cache(page);
+		int lru = LRU_BASE + file;
 		del_page_from_lru_list(zone, page, lru);
 
 		SetPageActive(page);
@@ -185,6 +185,15 @@ void activate_page(struct page *page)
 		add_page_to_lru_list(zone, page, lru);
 		__count_vm_event(PGACTIVATE);
 		mem_cgroup_move_lists(page, true);
+
+		if (file) {
+			zone->recent_scanned_file++;
+			zone->recent_rotated_file++;
+		} else {
+			/* Can this happen?  Maybe through tmpfs... */
+			zone->recent_scanned_anon++;
+			zone->recent_rotated_anon++;
+		}
 	}
 	spin_unlock_irq(&zone->lru_lock);
 }

-- 
All Rights Reversed


^ permalink raw reply	[flat|nested] 34+ messages in thread

* [PATCH -mm 10/15] add newly swapped in pages to the inactive list
  2008-04-28 18:18 [PATCH -mm 00/15] VM pageout scalability improvements (V6) Rik van Riel
                   ` (8 preceding siblings ...)
  2008-04-28 18:18 ` [PATCH -mm 09/15] add some sanity checks to get_scan_ratio Rik van Riel
@ 2008-04-28 18:18 ` Rik van Riel
  2008-04-28 18:18 ` [PATCH -mm 11/15] more aggressively use lumpy reclaim Rik van Riel
                   ` (6 subsequent siblings)
  16 siblings, 0 replies; 34+ messages in thread
From: Rik van Riel @ 2008-04-28 18:18 UTC (permalink / raw)
  To: linux-kernel; +Cc: lee.schermerhorn, akpm, kosaki.motohiro

[-- Attachment #1: rvr-swapin-inactive.patch --]
[-- Type: text/plain, Size: 984 bytes --]

Swapin_readahead can read in a lot of data that the processes in
memory never need.  Adding swap cache pages to the inactive list
prevents them from putting too much pressure on the working set.

This has the potential to help the programs that are already in
memory, but it could also be a disadvantage to processes that
are trying to get swapped in.

In short, this patch needs testing.

Signed-off-by: Rik van Riel <riel@redhat.com>

Index: linux-2.6.25-mm1/mm/swap_state.c
===================================================================
--- linux-2.6.25-mm1.orig/mm/swap_state.c	2008-04-24 12:01:36.000000000 -0400
+++ linux-2.6.25-mm1/mm/swap_state.c	2008-04-24 12:03:45.000000000 -0400
@@ -301,7 +301,7 @@ struct page *read_swap_cache_async(swp_e
 			/*
 			 * Initiate read into locked page and return.
 			 */
-			lru_cache_add_active_anon(new_page);
+			lru_cache_add_anon(new_page);
 			swap_readpage(NULL, new_page);
 			return new_page;
 		}

-- 
All Rights Reversed


^ permalink raw reply	[flat|nested] 34+ messages in thread

* [PATCH -mm 11/15] more aggressively use lumpy reclaim
  2008-04-28 18:18 [PATCH -mm 00/15] VM pageout scalability improvements (V6) Rik van Riel
                   ` (9 preceding siblings ...)
  2008-04-28 18:18 ` [PATCH -mm 10/15] add newly swapped in pages to the inactive list Rik van Riel
@ 2008-04-28 18:18 ` Rik van Riel
  2008-04-28 18:18 ` [PATCH -mm 12/15] No Reclaim LRU Infrastructure Rik van Riel
                   ` (5 subsequent siblings)
  16 siblings, 0 replies; 34+ messages in thread
From: Rik van Riel @ 2008-04-28 18:18 UTC (permalink / raw)
  To: linux-kernel; +Cc: lee.schermerhorn, akpm, kosaki.motohiro

[-- Attachment #1: lumpy-reclaim-lower-order.patch --]
[-- Type: text/plain, Size: 2212 bytes --]

During an AIM7 run on a 16GB system, fork started failing around
32000 threads, despite the system having plenty of free swap and
15GB of pageable memory.

If normal pageout does not result in contiguous free pages for
kernel stacks, fall back to lumpy reclaim instead of failing fork
or doing excessive pageout IO.

I do not know whether this change is needed due to the extreme
stress test or because the inactive list is a smaller fraction
of system memory on huge systems.
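
Condensed into a standalone helper, the isolation-mode choice this
patch makes looks roughly like the following (the constants here are
the usual kernel values, repeated only for the sketch; the real change
is in the shrink_inactive_list() hunk below):

	#define PAGE_ALLOC_COSTLY_ORDER	3	/* illustrative */
	#define DEF_PRIORITY		12	/* illustrative */

	/* toy values; the kernel defines the real ISOLATE_* constants */
	enum isolate_mode { ISOLATE_INACTIVE, ISOLATE_BOTH };

	/*
	 * Use lumpy reclaim (take active and inactive pages around the
	 * target page) for costly high-order allocations, or for any
	 * order > 0 allocation once the scan priority has dropped a
	 * couple of steps, i.e. normal pageout is struggling to free
	 * contiguous pages.
	 */
	static enum isolate_mode isolation_mode(int order, int priority)
	{
		if (order > PAGE_ALLOC_COSTLY_ORDER)
			return ISOLATE_BOTH;
		if (order && priority < DEF_PRIORITY - 2)
			return ISOLATE_BOTH;
		return ISOLATE_INACTIVE;
	}

So an order-1 allocation (e.g. an 8kB kernel stack) starts out with
plain ISOLATE_INACTIVE and only falls back to lumpy reclaim once the
reclaim priority has dropped below DEF_PRIORITY - 2.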

Signed-off-by: Rik van Riel <riel@redhat.com>

Index: linux-2.6.25-mm1/mm/vmscan.c
===================================================================
--- linux-2.6.25-mm1.orig/mm/vmscan.c	2008-04-24 12:03:40.000000000 -0400
+++ linux-2.6.25-mm1/mm/vmscan.c	2008-04-24 12:03:49.000000000 -0400
@@ -857,7 +857,8 @@ int isolate_lru_page(struct page *page)
  * of reclaimed pages
  */
 static unsigned long shrink_inactive_list(unsigned long max_scan,
-			struct zone *zone, struct scan_control *sc, int file)
+			struct zone *zone, struct scan_control *sc,
+			int priority, int file)
 {
 	LIST_HEAD(page_list);
 	struct pagevec pvec;
@@ -875,8 +876,19 @@ static unsigned long shrink_inactive_lis
 		unsigned long nr_freed;
 		unsigned long nr_active;
 		unsigned int count[NR_LRU_LISTS] = { 0, };
-		int mode = (sc->order > PAGE_ALLOC_COSTLY_ORDER) ?
-					ISOLATE_BOTH : ISOLATE_INACTIVE;
+		int mode = ISOLATE_INACTIVE;
+
+		/*
+		 * If we need a large contiguous chunk of memory, or have
+		 * trouble getting a small set of contiguous pages, we
+		 * will reclaim both active and inactive pages.
+		 *
+		 * We use the same threshold as pageout congestion_wait below.
+		 */
+		if (sc->order > PAGE_ALLOC_COSTLY_ORDER)
+			mode = ISOLATE_BOTH;
+		else if (sc->order && priority < DEF_PRIORITY - 2)
+			mode = ISOLATE_BOTH;
 
 		nr_taken = sc->isolate_pages(sc->swap_cluster_max,
 			     &page_list, &nr_scan, sc->order, mode,
@@ -1161,7 +1173,7 @@ static unsigned long shrink_list(enum lr
 		shrink_active_list(nr_to_scan, zone, sc, priority, file);
 		return 0;
 	}
-	return shrink_inactive_list(nr_to_scan, zone, sc, file);
+	return shrink_inactive_list(nr_to_scan, zone, sc, priority, file);
 }
 
 /*

-- 
All Rights Reversed


^ permalink raw reply	[flat|nested] 34+ messages in thread

* [PATCH -mm 12/15] No Reclaim LRU Infrastructure
  2008-04-28 18:18 [PATCH -mm 00/15] VM pageout scalability improvements (V6) Rik van Riel
                   ` (10 preceding siblings ...)
  2008-04-28 18:18 ` [PATCH -mm 11/15] more aggressively use lumpy reclaim Rik van Riel
@ 2008-04-28 18:18 ` Rik van Riel
  2008-04-28 18:18 ` [PATCH -mm 13/15] Non-reclaimable page statistics Rik van Riel
                   ` (4 subsequent siblings)
  16 siblings, 0 replies; 34+ messages in thread
From: Rik van Riel @ 2008-04-28 18:18 UTC (permalink / raw)
  To: linux-kernel
  Cc: lee.schermerhorn, akpm, kosaki.motohiro, linux-mm,
	Lee Schermerhorn

[-- Attachment #1: rvr-11-lts-noreclaim-lru-infrastructure.patch --]
[-- Type: text/plain, Size: 23248 bytes --]

V3 -> V6:
+ remove lru_cache_add_active_or_noreclaim().  Only used by
  optional patch to cull nonreclaimable pages in fault path.
  Will add back to that patch.
+ misc cleanup pointed out by review of V5

V1 -> V3:
+ rebase to 23-mm1 atop RvR's split LRU series
+ define NR_NORECLAIM and LRU_NORECLAIM to avoid errors when not
  configured.

V1 -> V2:
+  handle review comments -- various typos and errors.
+  extract "putback_all_noreclaim_pages()" into a separate patch
   and rework as "scan_all_zones_noreclaim_pages()".

Infrastructure to manage pages excluded from reclaim--i.e., hidden
from vmscan.  Based on a patch by Larry Woodman of Red Hat. Reworked
to maintain "nonreclaimable" pages on a separate per-zone LRU list,
to "hide" them from vmscan.  A separate noreclaim pagevec is provided
for shrink_active_list() to move nonreclaimable pages to the noreclaim
list without overburdening the zone lru_lock.

Pages on the noreclaim list have both PG_noreclaim and PG_lru set.
Thus, PG_noreclaim is analogous to and mutually exclusive with
PG_active--it specifies which LRU list the page is on.  
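
A toy model of that flag-to-list mapping (the real code is the
page_lru() change in mm/swap.c below; the struct and enum here are
simplified stand-ins, not the kernel's definitions):

	#include <stdio.h>

	enum lru_list {
		LRU_INACTIVE_ANON,
		LRU_ACTIVE_ANON,
		LRU_INACTIVE_FILE,
		LRU_ACTIVE_FILE,
		LRU_NORECLAIM,		/* the new list */
		NR_LRU_LISTS
	};

	/* only the page bits that decide list placement */
	struct toy_page {
		int active;		/* PG_active */
		int file;		/* file backed vs swap backed */
		int noreclaim;		/* PG_noreclaim */
	};

	static enum lru_list toy_page_lru(struct toy_page p)
	{
		if (p.noreclaim)	/* mutually exclusive with active */
			return LRU_NORECLAIM;
		if (p.file)
			return p.active ? LRU_ACTIVE_FILE : LRU_INACTIVE_FILE;
		return p.active ? LRU_ACTIVE_ANON : LRU_INACTIVE_ANON;
	}

	int main(void)
	{
		struct toy_page shm_locked = { .active = 0, .file = 1, .noreclaim = 1 };
		struct toy_page mapped_anon = { .active = 1, .file = 0, .noreclaim = 0 };

		printf("SHM_LOCKED page  -> list %d\n", toy_page_lru(shm_locked));
		printf("active anon page -> list %d\n", toy_page_lru(mapped_anon));
		return 0;
	}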

The noreclaim infrastructure is enabled by a new mm Kconfig option
[CONFIG_]NORECLAIM_LRU.

A new function 'page_reclaimable(page, vma)' in vmscan.c tests whether
or not a page is reclaimable.  Subsequent patches will add the various
!reclaimable tests.  We'll want to keep these tests light-weight for
use in shrink_active_list() and, possibly, the fault path.

Notes:

1.  for now, use bit 30 in page flags.  This restricts the no reclaim
    infrastructure to 64-bit systems.  [The mlock patch, later in this
    series, uses another of these 64-bit-system-only flags.]

    Rationale:  32-bit systems have no free page flags and are less
    likely to have the large amounts of memory that exhibit the problems
    this series attempts to solve.  [I'm sure someone will disabuse me
    of this notion.]

    Thus, NORECLAIM_LRU currently depends on [CONFIG_]64BIT.

!!! We will need to revisit this if/when Christoph Lameter's page
    flag cleanup goes in.

2.  The pagevec to move pages to the noreclaim list results in another
    loop at the end of shrink_active_list().  If we ultimately adopt Rik
    van Riel's split lru approach, I think we'll need to find a way to
    factor all of these loops into some common code.

3.  TODO:  Memory Controllers maintain separate active and inactive lists,
    now split for anon and file pages.
    Need to consider whether they should also maintain a noreclaim list.  

4.  TODO:  more factoring of lru list handling?  But, I want to get this
    as close to functionally correct as possible before introducing those
    perturbations.

Signed-off-by:  Lee Schermerhorn <lee.schermerhorn@hp.com>

 include/linux/mm_inline.h  |   33 +++++++++++-
 include/linux/mmzone.h     |   24 +++++++++
 include/linux/page-flags.h |   17 ++++++
 include/linux/pagevec.h    |    6 ++
 include/linux/swap.h       |   22 ++++++++
 mm/Kconfig                 |   10 +++
 mm/mempolicy.c             |    2 
 mm/migrate.c               |    6 +-
 mm/page_alloc.c            |    9 +++
 mm/swap.c                  |   30 ++++++++---
 mm/vmscan.c                |  115 ++++++++++++++++++++++++++++++++++++++-------
 11 files changed, 242 insertions(+), 32 deletions(-)

Index: linux-2.6.25-mm1/mm/Kconfig
===================================================================
--- linux-2.6.25-mm1.orig/mm/Kconfig	2008-04-22 10:33:45.000000000 -0400
+++ linux-2.6.25-mm1/mm/Kconfig	2008-04-24 12:03:54.000000000 -0400
@@ -205,3 +205,13 @@ config NR_QUICK
 config VIRT_TO_BUS
 	def_bool y
 	depends on !ARCH_NO_VIRT_TO_BUS
+
+config NORECLAIM_LRU
+	bool "Add LRU list to track non-reclaimable pages (EXPERIMENTAL, 64BIT only)"
+	depends on EXPERIMENTAL && 64BIT
+	help
+	  Supports tracking of non-reclaimable pages off the [in]active lists
+	  to avoid excessive reclaim overhead on large memory systems.  Pages
+	  may be non-reclaimable because:  they are locked into memory, they
+	  are anonymous pages for which no swap space exists, or they are anon
+	  pages that are expensive to unmap [long anon_vma "related vma" list.]
Index: linux-2.6.25-mm1/include/linux/page-flags.h
===================================================================
--- linux-2.6.25-mm1.orig/include/linux/page-flags.h	2008-04-24 12:00:01.000000000 -0400
+++ linux-2.6.25-mm1/include/linux/page-flags.h	2008-04-24 12:03:54.000000000 -0400
@@ -94,6 +94,9 @@ enum pageflags {
 	PG_reclaim,		/* To be reclaimed asap */
 	PG_buddy,		/* Page is free, on buddy lists */
 	PG_swapbacked,		/* Page is backed by RAM/swap */
+#ifdef CONFIG_NORECLAIM_LRU
+	PG_noreclaim,		/* Page is "non-reclaimable"  */
+#endif
 #ifdef CONFIG_IA64_UNCACHED_ALLOCATOR
 	PG_uncached,		/* Page has been mapped as uncached */
 #endif
@@ -155,6 +158,7 @@ PAGEFLAG(Referenced, referenced) TESTCLE
 PAGEFLAG(Dirty, dirty) TESTSCFLAG(Dirty, dirty) __CLEARPAGEFLAG(Dirty, dirty)
 PAGEFLAG(LRU, lru) __CLEARPAGEFLAG(LRU, lru)
 PAGEFLAG(Active, active) __CLEARPAGEFLAG(Active, active)
+	TESTCLEARFLAG(Active, active)
 __PAGEFLAG(Slab, slab)
 PAGEFLAG(Checked, owner_priv_1)		/* Used by some filesystems */
 PAGEFLAG(Pinned, owner_priv_1) TESTSCFLAG(Pinned, owner_priv_1) /* Xen */
@@ -191,6 +195,17 @@ PAGEFLAG(SwapCache, swapcache)
 PAGEFLAG_FALSE(SwapCache)
 #endif
 
+#ifdef CONFIG_NORECLAIM_LRU
+PAGEFLAG(Noreclaim, noreclaim) __CLEARPAGEFLAG(Noreclaim, noreclaim)
+TESTCLEARFLAG(Noreclaim, noreclaim)
+#else
+PAGEFLAG_FALSE(Noreclaim)
+#define SetPageNoreclaim(page)
+#define ClearPageNoreclaim(page)
+#define __ClearPageNoreclaim(page)
+#define TestClearPageNoreclaim(page) 0
+#endif
+
 #ifdef CONFIG_IA64_UNCACHED_ALLOCATOR
 PAGEFLAG(Uncached, uncached)
 #else
Index: linux-2.6.25-mm1/include/linux/mmzone.h
===================================================================
--- linux-2.6.25-mm1.orig/include/linux/mmzone.h	2008-04-24 12:03:40.000000000 -0400
+++ linux-2.6.25-mm1/include/linux/mmzone.h	2008-04-24 12:03:54.000000000 -0400
@@ -85,6 +85,11 @@ enum zone_stat_item {
 	NR_ACTIVE_ANON,		/*  "     "     "   "       "           */
 	NR_INACTIVE_FILE,	/*  "     "     "   "       "           */
 	NR_ACTIVE_FILE,		/*  "     "     "   "       "           */
+#ifdef CONFIG_NORECLAIM_LRU
+	NR_NORECLAIM,	/*  "     "     "   "       "         */
+#else
+	NR_NORECLAIM = NR_ACTIVE_FILE, /* avoid compiler errors in dead code */
+#endif
 	NR_ANON_PAGES,	/* Mapped anonymous pages */
 	NR_FILE_MAPPED,	/* pagecache pages mapped into pagetables.
 			   only modified from process context */
@@ -124,10 +129,18 @@ enum lru_list {
 	LRU_ACTIVE_ANON = LRU_BASE + LRU_ACTIVE,
 	LRU_INACTIVE_FILE = LRU_BASE + LRU_FILE,
 	LRU_ACTIVE_FILE = LRU_BASE + LRU_FILE + LRU_ACTIVE,
-	NR_LRU_LISTS };
+#ifdef CONFIG_NORECLAIM_LRU
+	LRU_NORECLAIM,
+#else
+	LRU_NORECLAIM = LRU_ACTIVE_FILE, /* avoid compiler errors in dead code */
+#endif
+	NR_LRU_LISTS
+};
 
 #define for_each_lru(l) for (l = 0; l < NR_LRU_LISTS; l++)
 
+#define for_each_reclaimable_lru(l) for (l = 0; l <= LRU_ACTIVE_FILE; l++)
+
 static inline int is_file_lru(enum lru_list l)
 {
 	return (l == LRU_INACTIVE_FILE || l == LRU_ACTIVE_FILE);
@@ -138,6 +151,15 @@ static inline int is_active_lru(enum lru
 	return (l == LRU_ACTIVE_ANON || l == LRU_ACTIVE_FILE);
 }
 
+static inline int is_noreclaim_lru(enum lru_list l)
+{
+#ifdef CONFIG_NORECLAIM_LRU
+	return (l == LRU_NORECLAIM);
+#else
+	return 0;
+#endif
+}
+
 enum lru_list page_lru(struct page *page);
 
 struct per_cpu_pages {
Index: linux-2.6.25-mm1/mm/page_alloc.c
===================================================================
--- linux-2.6.25-mm1.orig/mm/page_alloc.c	2008-04-24 12:03:40.000000000 -0400
+++ linux-2.6.25-mm1/mm/page_alloc.c	2008-04-24 12:03:54.000000000 -0400
@@ -256,6 +256,9 @@ static void bad_page(struct page *page)
 			1 << PG_private |
 			1 << PG_locked	|
 			1 << PG_active	|
+#ifdef CONFIG_NORECLAIM_LRU
+			1 << PG_noreclaim	|
+#endif
 			1 << PG_dirty	|
 			1 << PG_reclaim |
 			1 << PG_slab    |
@@ -491,6 +494,9 @@ static inline int free_pages_check(struc
 			1 << PG_swapcache |
 			1 << PG_writeback |
 			1 << PG_reserved |
+#ifdef CONFIG_NORECLAIM_LRU
+			1 << PG_noreclaim |
+#endif
 			1 << PG_buddy ))))
 		bad_page(page);
 	if (PageDirty(page))
@@ -642,6 +648,9 @@ static int prep_new_page(struct page *pa
 			1 << PG_private	|
 			1 << PG_locked	|
 			1 << PG_active	|
+#ifdef CONFIG_NORECLAIM_LRU
+			1 << PG_noreclaim	|
+#endif
 			1 << PG_dirty	|
 			1 << PG_slab    |
 			1 << PG_swapcache |
Index: linux-2.6.25-mm1/include/linux/mm_inline.h
===================================================================
--- linux-2.6.25-mm1.orig/include/linux/mm_inline.h	2008-04-24 12:03:35.000000000 -0400
+++ linux-2.6.25-mm1/include/linux/mm_inline.h	2008-04-24 12:03:54.000000000 -0400
@@ -83,17 +83,42 @@ del_page_from_active_file_list(struct zo
 	del_page_from_lru_list(zone, page, LRU_INACTIVE_FILE);
 }
 
+#ifdef CONFIG_NORECLAIM_LRU
+static inline void
+add_page_to_noreclaim_list(struct zone *zone, struct page *page)
+{
+	add_page_to_lru_list(zone, page, LRU_NORECLAIM);
+}
+
+static inline void
+del_page_from_noreclaim_list(struct zone *zone, struct page *page)
+{
+	del_page_from_lru_list(zone, page, LRU_NORECLAIM);
+}
+#else
+static inline void
+add_page_to_noreclaim_list(struct zone *zone, struct page *page) { }
+
+static inline void
+del_page_from_noreclaim_list(struct zone *zone, struct page *page) { }
+#endif
+
 static inline void
 del_page_from_lru(struct zone *zone, struct page *page)
 {
 	enum lru_list l = LRU_INACTIVE_ANON;
 
 	list_del(&page->lru);
-	if (PageActive(page)) {
-		__ClearPageActive(page);
-		l += LRU_ACTIVE;
+	if (PageNoreclaim(page)) {
+		__ClearPageNoreclaim(page);
+		l = LRU_NORECLAIM;
+	} else {
+		if (PageActive(page)) {
+			__ClearPageActive(page);
+			l += LRU_ACTIVE;
+		}
+		l += page_file_cache(page);
 	}
-	l += page_file_cache(page);
 	__dec_zone_state(zone, NR_INACTIVE_ANON + l);
 }
 
Index: linux-2.6.25-mm1/include/linux/swap.h
===================================================================
--- linux-2.6.25-mm1.orig/include/linux/swap.h	2008-04-24 12:01:36.000000000 -0400
+++ linux-2.6.25-mm1/include/linux/swap.h	2008-04-24 12:03:54.000000000 -0400
@@ -204,6 +204,18 @@ static inline void lru_cache_add_active_
 	__lru_cache_add(page, LRU_ACTIVE_FILE);
 }
 
+#ifdef CONFIG_NORECLAIM_LRU
+static inline void lru_cache_add_noreclaim(struct page *page)
+{
+	__lru_cache_add(page, LRU_NORECLAIM);
+}
+#else
+static inline void lru_cache_add_noreclaim(struct page *page)
+{
+	BUG();
+}
+#endif
+
 /* linux/mm/vmscan.c */
 extern unsigned long try_to_free_pages(struct zonelist *zonelist, int order,
 					gfp_t gfp_mask);
@@ -228,6 +240,16 @@ static inline int zone_reclaim(struct zo
 }
 #endif
 
+#ifdef CONFIG_NORECLAIM_LRU
+extern int page_reclaimable(struct page *page, struct vm_area_struct *vma);
+#else
+static inline int page_reclaimable(struct page *page,
+						struct vm_area_struct *vma)
+{
+	return 1;
+}
+#endif
+
 extern int kswapd_run(int nid);
 
 #ifdef CONFIG_MMU
Index: linux-2.6.25-mm1/include/linux/pagevec.h
===================================================================
--- linux-2.6.25-mm1.orig/include/linux/pagevec.h	2008-04-24 12:01:36.000000000 -0400
+++ linux-2.6.25-mm1/include/linux/pagevec.h	2008-04-24 12:03:54.000000000 -0400
@@ -101,6 +101,12 @@ static inline void __pagevec_lru_add_act
 	____pagevec_lru_add(pvec, LRU_ACTIVE_FILE);
 }
 
+#ifdef CONFIG_NORECLAIM_LRU
+static inline void __pagevec_lru_add_noreclaim(struct pagevec *pvec)
+{
+	____pagevec_lru_add(pvec, LRU_NORECLAIM);
+}
+#endif
 
 static inline void pagevec_lru_add_file(struct pagevec *pvec)
 {
Index: linux-2.6.25-mm1/mm/swap.c
===================================================================
--- linux-2.6.25-mm1.orig/mm/swap.c	2008-04-24 12:03:40.000000000 -0400
+++ linux-2.6.25-mm1/mm/swap.c	2008-04-24 12:03:54.000000000 -0400
@@ -106,9 +106,13 @@ enum lru_list page_lru(struct page *page
 {
 	enum lru_list lru = LRU_BASE;
 
-	if (PageActive(page))
-		lru += LRU_ACTIVE;
-	lru += page_file_cache(page);
+	if (PageNoreclaim(page))
+		lru = LRU_NORECLAIM;
+	else {
+		if (PageActive(page))
+			lru += LRU_ACTIVE;
+		lru += page_file_cache(page);
+	}
 
 	return lru;
 }
@@ -133,7 +137,8 @@ static void pagevec_move_tail(struct pag
 			zone = pagezone;
 			spin_lock(&zone->lru_lock);
 		}
-		if (PageLRU(page) && !PageActive(page)) {
+		if (PageLRU(page) && !PageActive(page) &&
+					!PageNoreclaim(page)) {
 			int lru = page_file_cache(page);
 			list_move_tail(&page->lru, &zone->list[lru]);
 			pgmoved++;
@@ -154,7 +159,7 @@ static void pagevec_move_tail(struct pag
 void  rotate_reclaimable_page(struct page *page)
 {
 	if (!PageLocked(page) && !PageDirty(page) && !PageActive(page) &&
-	    PageLRU(page)) {
+	    !PageNoreclaim(page) && PageLRU(page)) {
 		struct pagevec *pvec;
 		unsigned long flags;
 
@@ -175,7 +180,7 @@ void activate_page(struct page *page)
 	struct zone *zone = page_zone(page);
 
 	spin_lock_irq(&zone->lru_lock);
-	if (PageLRU(page) && !PageActive(page)) {
+	if (PageLRU(page) && !PageActive(page) && !PageNoreclaim(page)) {
 		int file = page_file_cache(page);
 		int lru = LRU_BASE + file;
 		del_page_from_lru_list(zone, page, lru);
@@ -207,7 +212,8 @@ void activate_page(struct page *page)
  */
 void mark_page_accessed(struct page *page)
 {
-	if (!PageActive(page) && PageReferenced(page) && PageLRU(page)) {
+	if (!PageActive(page) && !PageNoreclaim(page) &&
+			PageReferenced(page) && PageLRU(page)) {
 		activate_page(page);
 		ClearPageReferenced(page);
 	} else if (!PageReferenced(page)) {
@@ -235,10 +241,14 @@ void __lru_cache_add(struct page *page, 
 void lru_cache_add_lru(struct page *page, enum lru_list lru)
 {
 	if (PageActive(page)) {
+		VM_BUG_ON(PageNoreclaim(page));
 		ClearPageActive(page);
+	} else if (PageNoreclaim(page)) {
+		VM_BUG_ON(PageActive(page));
+		ClearPageNoreclaim(page);
 	}
 
-	VM_BUG_ON(PageLRU(page) || PageActive(page));
+	VM_BUG_ON(PageLRU(page) || PageActive(page) || PageNoreclaim(page));
 	__lru_cache_add(page, lru);
 }
 
@@ -339,6 +349,7 @@ void release_pages(struct page **pages, 
 
 		if (PageLRU(page)) {
 			struct zone *pagezone = page_zone(page);
+
 			if (pagezone != zone) {
 				if (zone)
 					spin_unlock_irqrestore(&zone->lru_lock,
@@ -426,10 +437,13 @@ void ____pagevec_lru_add(struct pagevec 
 			zone = pagezone;
 			spin_lock_irq(&zone->lru_lock);
 		}
+		VM_BUG_ON(PageActive(page) || PageNoreclaim(page));
 		VM_BUG_ON(PageLRU(page));
 		SetPageLRU(page);
 		if (is_active_lru(lru))
 			SetPageActive(page);
+		else if (is_noreclaim_lru(lru))
+			SetPageNoreclaim(page);
 		add_page_to_lru_list(zone, page, lru);
 	}
 	if (zone)
Index: linux-2.6.25-mm1/mm/migrate.c
===================================================================
--- linux-2.6.25-mm1.orig/mm/migrate.c	2008-04-24 12:00:01.000000000 -0400
+++ linux-2.6.25-mm1/mm/migrate.c	2008-04-24 12:03:54.000000000 -0400
@@ -335,8 +335,11 @@ static void migrate_page_copy(struct pag
 		SetPageReferenced(newpage);
 	if (PageUptodate(page))
 		SetPageUptodate(newpage);
-	if (PageActive(page))
+	if (TestClearPageActive(page)) {
+		VM_BUG_ON(PageNoreclaim(page));
 		SetPageActive(newpage);
+	} else if (TestClearPageNoreclaim(page))
+		SetPageNoreclaim(newpage);
 	if (PageChecked(page))
 		SetPageChecked(newpage);
 	if (PageMappedToDisk(page))
@@ -350,7 +353,6 @@ static void migrate_page_copy(struct pag
 #ifdef CONFIG_SWAP
 	ClearPageSwapCache(page);
 #endif
-	ClearPageActive(page);
 	ClearPagePrivate(page);
 	set_page_private(page, 0);
 	page->mapping = NULL;
Index: linux-2.6.25-mm1/mm/vmscan.c
===================================================================
--- linux-2.6.25-mm1.orig/mm/vmscan.c	2008-04-24 12:03:49.000000000 -0400
+++ linux-2.6.25-mm1/mm/vmscan.c	2008-04-24 12:03:54.000000000 -0400
@@ -470,6 +470,11 @@ static unsigned long shrink_page_list(st
 
 		sc->nr_scanned++;
 
+		if (!page_reclaimable(page, NULL)) {
+			SetPageNoreclaim(page);
+			goto keep_locked;
+		}
+
 		if (!sc->may_swap && page_mapped(page))
 			goto keep_locked;
 
@@ -566,7 +571,7 @@ static unsigned long shrink_page_list(st
 		 * possible for a page to have PageDirty set, but it is actually
 		 * clean (all its buffers are clean).  This happens if the
 		 * buffers were written out directly, with submit_bh(). ext3
-		 * will do this, as well as the blockdev mapping. 
+		 * will do this, as well as the blockdev mapping.
 		 * try_to_release_page() will discover that cleanness and will
 		 * drop the buffers and mark the page clean - it can be freed.
 		 *
@@ -598,6 +603,7 @@ activate_locked:
 		/* Not a candidate for swapping, so reclaim swap space. */
 		if (PageSwapCache(page) && vm_swap_full())
 			remove_exclusive_swap_page(page);
+		VM_BUG_ON(PageActive(page));
 		SetPageActive(page);
 		pgactivate++;
 keep_locked:
@@ -647,6 +653,14 @@ int __isolate_lru_page(struct page *page
 	if (mode != ISOLATE_BOTH && (!page_file_cache(page) != !file))
 		return ret;
 
+	/*
+	 * Non-reclaimable pages shouldn't make it onto either the active
+	 * or the inactive list. However, when doing lumpy reclaim of
+	 * higher order pages we can still run into them.
+	 */
+	if (PageNoreclaim(page))
+		return ret;
+
 	ret = -EBUSY;
 	if (likely(get_page_unless_zero(page))) {
 		/*
@@ -758,7 +772,7 @@ static unsigned long isolate_lru_pages(u
 				/* else it is being freed elsewhere */
 				list_move(&cursor_page->lru, src);
 			default:
-				break;
+				break;	/* ! on LRU or wrong list */
 			}
 		}
 	}
@@ -818,8 +832,9 @@ static unsigned long clear_active_flags(
  * Returns -EBUSY if the page was not on an LRU list.
  *
  * The returned page will have PageLRU() cleared.  If it was found on
- * the active list, it will have PageActive set.  That flag may need
- * to be cleared by the caller before letting the page go.
+ * the active list, it will have PageActive set.  If it was found on
+ * the noreclaim list, it will have the PageNoreclaim bit set. That flag
+ * may need to be cleared by the caller before letting the page go.
  *
  * The vmstat statistic corresponding to the list on which the page was
  * found will be decremented.
@@ -844,7 +859,13 @@ int isolate_lru_page(struct page *page)
 			ret = 0;
 			ClearPageLRU(page);
 
+			/* Calculate the LRU list for normal pages ... */
 			lru += page_file_cache(page) + !!PageActive(page);
+
+			/* ... except NoReclaim, which has its own list. */
+			if (PageNoreclaim(page))
+				lru = LRU_NORECLAIM;
+
 			del_page_from_lru_list(zone, page, lru);
 		}
 		spin_unlock_irq(&zone->lru_lock);
@@ -961,16 +982,21 @@ static unsigned long shrink_inactive_lis
 			VM_BUG_ON(PageLRU(page));
 			SetPageLRU(page);
 			list_del(&page->lru);
-			if (page_file_cache(page))
-				lru += LRU_FILE;
-			if (scan_global_lru(sc)) {
+			if (PageNoreclaim(page)) {
+				VM_BUG_ON(PageActive(page));
+				lru = LRU_NORECLAIM;
+			} else {
 				if (page_file_cache(page))
-					zone->recent_rotated_file++;
-				else
-					zone->recent_rotated_anon++;
+					lru += LRU_FILE;
+				if (scan_global_lru(sc)) {
+					if (page_file_cache(page))
+						zone->recent_rotated_file++;
+					else
+						zone->recent_rotated_anon++;
+				}
+				if (PageActive(page))
+					lru += LRU_ACTIVE;
 			}
-			if (PageActive(page))
-				lru += LRU_ACTIVE;
 			add_page_to_lru_list(zone, page, lru);
 			if (!pagevec_add(&pvec, page)) {
 				spin_unlock_irq(&zone->lru_lock);
@@ -1033,6 +1059,7 @@ static void shrink_active_list(unsigned 
 	LIST_HEAD(l_hold);	/* The pages which were snipped off */
 	LIST_HEAD(l_active);
 	LIST_HEAD(l_inactive);
+	LIST_HEAD(l_noreclaim);
 	struct page *page;
 	struct pagevec pvec;
 	enum lru_list lru;
@@ -1064,6 +1091,13 @@ static void shrink_active_list(unsigned 
 		cond_resched();
 		page = lru_to_page(&l_hold);
 		list_del(&page->lru);
+
+		if (!page_reclaimable(page, NULL)) {
+			/* Non-reclaimable pages go onto their own list. */
+			list_add(&page->lru, &l_noreclaim);
+			continue;
+		}
+
 		if (page_referenced(page, 0, sc->mem_cgroup) && file) {
 			/* Referenced file pages stay active. */
 			list_add(&page->lru, &l_active);
@@ -1151,6 +1185,32 @@ static void shrink_active_list(unsigned 
 		zone->recent_rotated_anon += pgmoved;
 	}
 
+#ifdef CONFIG_NORECLAIM_LRU
+	pgmoved = 0;
+	while (!list_empty(&l_noreclaim)) {
+		page = lru_to_page(&l_noreclaim);
+		prefetchw_prev_lru_page(page, &l_noreclaim, flags);
+
+		VM_BUG_ON(PageLRU(page));
+		SetPageLRU(page);
+		VM_BUG_ON(!PageActive(page));
+		ClearPageActive(page);
+		VM_BUG_ON(PageNoreclaim(page));
+		SetPageNoreclaim(page);
+
+		list_move(&page->lru, &zone->list[LRU_NORECLAIM]);
+		pgmoved++;
+		if (!pagevec_add(&pvec, page)) {
+			__mod_zone_page_state(zone, NR_NORECLAIM, pgmoved);
+			pgmoved = 0;
+			spin_unlock_irq(&zone->lru_lock);
+			__pagevec_release(&pvec);
+			spin_lock_irq(&zone->lru_lock);
+		}
+	}
+	__mod_zone_page_state(zone, NR_NORECLAIM, pgmoved);
+#endif
+
 	__count_zone_vm_events(PGREFILL, zone, pgscanned);
 	__count_vm_events(PGDEACTIVATE, pgdeactivate);
 	spin_unlock_irq(&zone->lru_lock);
@@ -1267,7 +1327,7 @@ static unsigned long shrink_zone(int pri
 
 	get_scan_ratio(zone, sc, percent);
 
-	for_each_lru(l) {
+	for_each_reclaimable_lru(l) {
 		if (scan_global_lru(sc)) {
 			int file = is_file_lru(l);
 			int scan;
@@ -1298,7 +1358,7 @@ static unsigned long shrink_zone(int pri
 
 	while (nr[LRU_INACTIVE_ANON] || nr[LRU_ACTIVE_FILE] ||
 					nr[LRU_INACTIVE_FILE]) {
-		for_each_lru(l) {
+		for_each_reclaimable_lru(l) {
 			if (nr[l]) {
 				nr_to_scan = min(nr[l],
 					(unsigned long)sc->swap_cluster_max);
@@ -1851,8 +1911,8 @@ static unsigned long shrink_all_zones(un
 		if (zone_is_all_unreclaimable(zone) && prio != DEF_PRIORITY)
 			continue;
 
-		for_each_lru(l) {
-			/* For pass = 0 we don't shrink the active list */
+		for_each_reclaimable_lru(l) {
+			/* For pass = 0, we don't shrink the active list */
 			if (pass == 0 &&
 				(l == LRU_ACTIVE_ANON || l == LRU_ACTIVE_FILE))
 				continue;
@@ -2190,3 +2250,26 @@ int zone_reclaim(struct zone *zone, gfp_
 	return ret;
 }
 #endif
+
+#ifdef CONFIG_NORECLAIM_LRU
+/*
+ * page_reclaimable - test whether a page is reclaimable
+ * @page: the page to test
+ * @vma: the VMA in which the page is or will be mapped, may be NULL
+ *
+ * Test whether page is reclaimable--i.e., should be placed on active/inactive
+ * lists vs noreclaim list.
+ *
+ * Reasons page might not be reclaimable:
+ * TODO - later patches
+ */
+int page_reclaimable(struct page *page, struct vm_area_struct *vma)
+{
+
+	VM_BUG_ON(PageNoreclaim(page));
+
+	/* TODO:  test page [!]reclaimable conditions */
+
+	return 1;
+}
+#endif
Index: linux-2.6.25-mm1/mm/mempolicy.c
===================================================================
--- linux-2.6.25-mm1.orig/mm/mempolicy.c	2008-04-23 17:31:13.000000000 -0400
+++ linux-2.6.25-mm1/mm/mempolicy.c	2008-04-24 12:03:54.000000000 -0400
@@ -2200,7 +2200,7 @@ static void gather_stats(struct page *pa
 	if (PageSwapCache(page))
 		md->swapcache++;
 
-	if (PageActive(page))
+	if (PageActive(page) || PageNoreclaim(page))
 		md->active++;
 
 	if (PageWriteback(page))

-- 
All Rights Reversed


^ permalink raw reply	[flat|nested] 34+ messages in thread

* [PATCH -mm 13/15] Non-reclaimable page statistics
  2008-04-28 18:18 [PATCH -mm 00/15] VM pageout scalability improvements (V6) Rik van Riel
                   ` (11 preceding siblings ...)
  2008-04-28 18:18 ` [PATCH -mm 12/15] No Reclaim LRU Infrastructure Rik van Riel
@ 2008-04-28 18:18 ` Rik van Riel
  2008-04-28 18:18 ` [PATCH -mm 14/15] ramfs pages are non-reclaimable Rik van Riel
                   ` (3 subsequent siblings)
  16 siblings, 0 replies; 34+ messages in thread
From: Rik van Riel @ 2008-04-28 18:18 UTC (permalink / raw)
  To: linux-kernel
  Cc: lee.schermerhorn, akpm, kosaki.motohiro, linux-mm,
	Lee Schermerhorn

[-- Attachment #1: rvr-12-lts-noreclaim-non-reclaimable-page-statistics.patch --]
[-- Type: text/plain, Size: 5114 bytes --]

V2 -> V3:
+ rebase to 23-mm1 atop RvR's split LRU series

V1 -> V2:
	no changes

Report non-reclaimable pages per zone and system wide.

Note:  may want to track/report some specific reasons for 
nonreclaimability for deciding when to splice the noreclaim
lists back to the normal lru.  That will be tricky,
especially in shrink_active_list(), where we'd need someplace
to save the per page reason for non-reclaimability until the
pages are dumped back onto the noreclaim list from the pagevec.

Note:  my tests indicate that NR_NORECLAIM and probably the
other LRU stats aren't being maintained properly--especially
with large amounts of mlocked memory and the mlock patch in
this series installed.  Can't be sure of this, as I don't 
know why the pages are on the noreclaim list. Needs further
investigation.

Signed-off-by: Lee Schermerhorn <lee.schermerhorn@hp.com>

 drivers/base/node.c |    6 ++++++
 fs/proc/proc_misc.c |    6 ++++++
 mm/page_alloc.c     |   16 +++++++++++++++-
 mm/vmstat.c         |    3 +++
 4 files changed, 30 insertions(+), 1 deletion(-)

Index: linux-2.6.25-mm1/mm/page_alloc.c
===================================================================
--- linux-2.6.25-mm1.orig/mm/page_alloc.c	2008-04-24 12:03:54.000000000 -0400
+++ linux-2.6.25-mm1/mm/page_alloc.c	2008-04-24 12:04:01.000000000 -0400
@@ -1933,12 +1933,20 @@ void show_free_areas(void)
 	}
 
 	printk("Active_anon:%lu active_file:%lu inactive_anon%lu\n"
-		" inactive_file:%lu dirty:%lu writeback:%lu unstable:%lu\n"
+		" inactive_file:%lu"
+//TODO:  check/adjust line lengths
+#ifdef CONFIG_NORECLAIM_LRU
+		" noreclaim:%lu"
+#endif
+		" dirty:%lu writeback:%lu unstable:%lu\n"
 		" free:%lu slab:%lu mapped:%lu pagetables:%lu bounce:%lu\n",
 		global_page_state(NR_ACTIVE_ANON),
 		global_page_state(NR_ACTIVE_FILE),
 		global_page_state(NR_INACTIVE_ANON),
 		global_page_state(NR_INACTIVE_FILE),
+#ifdef CONFIG_NORECLAIM_LRU
+		global_page_state(NR_NORECLAIM),
+#endif
 		global_page_state(NR_FILE_DIRTY),
 		global_page_state(NR_WRITEBACK),
 		global_page_state(NR_UNSTABLE_NFS),
@@ -1965,6 +1973,9 @@ void show_free_areas(void)
 			" inactive_anon:%lukB"
 			" active_file:%lukB"
 			" inactive_file:%lukB"
+#ifdef CONFIG_NORECLAIM_LRU
+			" noreclaim:%lukB"
+#endif
 			" present:%lukB"
 			" pages_scanned:%lu"
 			" all_unreclaimable? %s"
@@ -1978,6 +1989,9 @@ void show_free_areas(void)
 			K(zone_page_state(zone, NR_INACTIVE_ANON)),
 			K(zone_page_state(zone, NR_ACTIVE_FILE)),
 			K(zone_page_state(zone, NR_INACTIVE_FILE)),
+#ifdef CONFIG_NORECLAIM_LRU
+			K(zone_page_state(zone, NR_NORECLAIM)),
+#endif
 			K(zone->present_pages),
 			zone->pages_scanned,
 			(zone_is_all_unreclaimable(zone) ? "yes" : "no")
Index: linux-2.6.25-mm1/mm/vmstat.c
===================================================================
--- linux-2.6.25-mm1.orig/mm/vmstat.c	2008-04-24 12:03:35.000000000 -0400
+++ linux-2.6.25-mm1/mm/vmstat.c	2008-04-24 12:04:01.000000000 -0400
@@ -691,6 +691,9 @@ static const char * const vmstat_text[] 
 	"nr_active_anon",
 	"nr_inactive_file",
 	"nr_active_file",
+#ifdef CONFIG_NORECLAIM_LRU
+	"nr_noreclaim",
+#endif
 	"nr_anon_pages",
 	"nr_mapped",
 	"nr_file_pages",
Index: linux-2.6.25-mm1/drivers/base/node.c
===================================================================
--- linux-2.6.25-mm1.orig/drivers/base/node.c	2008-04-24 12:01:36.000000000 -0400
+++ linux-2.6.25-mm1/drivers/base/node.c	2008-04-24 12:04:01.000000000 -0400
@@ -54,6 +54,9 @@ static ssize_t node_read_meminfo(struct 
 		       "Node %d Inactive(anon): %8lu kB\n"
 		       "Node %d Active(file):   %8lu kB\n"
 		       "Node %d Inactive(file): %8lu kB\n"
+#ifdef CONFIG_NORECLAIM_LRU
+		       "Node %d Noreclaim:    %8lu kB\n"
+#endif
 #ifdef CONFIG_HIGHMEM
 		       "Node %d HighTotal:      %8lu kB\n"
 		       "Node %d HighFree:       %8lu kB\n"
@@ -83,6 +86,9 @@ static ssize_t node_read_meminfo(struct 
 		       nid, node_page_state(nid, NR_INACTIVE_ANON),
 		       nid, node_page_state(nid, NR_ACTIVE_FILE),
 		       nid, node_page_state(nid, NR_INACTIVE_FILE),
+#ifdef CONFIG_NORECLAIM_LRU
+		       nid, node_page_state(nid, NR_NORECLAIM),
+#endif
 #ifdef CONFIG_HIGHMEM
 		       nid, K(i.totalhigh),
 		       nid, K(i.freehigh),
Index: linux-2.6.25-mm1/fs/proc/proc_misc.c
===================================================================
--- linux-2.6.25-mm1.orig/fs/proc/proc_misc.c	2008-04-24 12:01:36.000000000 -0400
+++ linux-2.6.25-mm1/fs/proc/proc_misc.c	2008-04-24 12:04:01.000000000 -0400
@@ -174,6 +174,9 @@ static int meminfo_read_proc(char *page,
 		"Inactive(anon): %8lu kB\n"
 		"Active(file):   %8lu kB\n"
 		"Inactive(file): %8lu kB\n"
+#ifdef CONFIG_NORECLAIM_LRU
+		"Noreclaim:    %8lu kB\n"
+#endif
 #ifdef CONFIG_HIGHMEM
 		"HighTotal:      %8lu kB\n"
 		"HighFree:       %8lu kB\n"
@@ -209,6 +212,9 @@ static int meminfo_read_proc(char *page,
 		K(inactive_anon),
 		K(active_file),
 		K(inactive_file),
+#ifdef CONFIG_NORECLAIM_LRU
+		K(global_page_state(NR_NORECLAIM)),
+#endif
 #ifdef CONFIG_HIGHMEM
 		K(i.totalhigh),
 		K(i.freehigh),

-- 
All Rights Reversed


^ permalink raw reply	[flat|nested] 34+ messages in thread

* [PATCH -mm 14/15] ramfs pages are non-reclaimable
  2008-04-28 18:18 [PATCH -mm 00/15] VM pageout scalability improvements (V6) Rik van Riel
                   ` (12 preceding siblings ...)
  2008-04-28 18:18 ` [PATCH -mm 13/15] Non-reclaimable page statistics Rik van Riel
@ 2008-04-28 18:18 ` Rik van Riel
  2008-04-28 18:18 ` [PATCH -mm 15/15] SHM_LOCKED pages are nonreclaimable Rik van Riel
                   ` (2 subsequent siblings)
  16 siblings, 0 replies; 34+ messages in thread
From: Rik van Riel @ 2008-04-28 18:18 UTC (permalink / raw)
  To: linux-kernel
  Cc: lee.schermerhorn, akpm, kosaki.motohiro, linux-mm,
	Lee Schermerhorn

[-- Attachment #1: rvr-13-lts-noreclaim-ramfs-pages-are-non-reclaimable.patch --]
[-- Type: text/plain, Size: 4894 bytes --]

V3 -> V4:
+ drivers/block/rd.c was replaced by brd.c in 24-rc4-mm1.
  Update patch to add brd_open() to mark mapping as nonreclaimable

V2 -> V3:
+  rebase to 23-mm1 atop RvR's split LRU series [no changes]

V1 -> V2:
+  add ramfs pages to this class of non-reclaimable pages by
   marking ramfs address_space [mapping] as non-reclaimable.

Christoph Lameter pointed out that ram disk pages also clutter the
LRU lists.  When vmscan finds them dirty and tries to clean them,
the ram disk writeback function just redirties the page so that it
goes back onto the active list.  Round and round she goes...

Define new address_space flag [shares address_space flags member
with mapping's gfp mask] to indicate that the address space contains
all non-reclaimable pages.  This will provide for efficient testing
of ramdisk pages in page_reclaimable().

Also provide wrapper functions to set/test the noreclaim state to
minimize #ifdefs in ramdisk driver and any other users of this
facility.

Set the noreclaim state on address_space structures for new
ramdisk inodes.  Test the noreclaim state in page_reclaimable()
to cull non-reclaimable pages.

Similarly, ramfs pages are non-reclaimable.  Set the 'noreclaim'
address_space flag for new ramfs inodes.

These changes depend on [CONFIG_]NORECLAIM_LRU.


Signed-off-by:  Lee Schermerhorn <lee.schermerhorn@hp.com>
Signed-off-by:  Rik van Riel <riel@redhat.com>

 drivers/block/brd.c     |   13 +++++++++++++
 fs/ramfs/inode.c        |    1 +
 include/linux/pagemap.h |   22 ++++++++++++++++++++++
 mm/vmscan.c             |    5 +++++
 4 files changed, 41 insertions(+)
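
For illustration only (not part of the patch): a minimal sketch of how a
hypothetical in-memory filesystem could use the new wrappers when setting
up an inode, mirroring the ramfs hunk below.  example_get_inode() is a
made-up name; the new_inode() and mapping_set_gfp_mask() calls follow the
existing ramfs code.

#include <linux/fs.h>
#include <linux/pagemap.h>

/* sketch: mark every page of this mapping non-reclaimable up front */
static struct inode *example_get_inode(struct super_block *sb)
{
	struct inode *inode = new_inode(sb);

	if (inode) {
		/* the low bits of mapping->flags carry the gfp mask ... */
		mapping_set_gfp_mask(inode->i_mapping, GFP_HIGHUSER);
		/* ... AS_NORECLAIM sits above __GFP_BITS_SHIFT, so both fit */
		mapping_set_noreclaim(inode->i_mapping);
	}
	return inode;
}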

Index: linux-2.6.25-mm1/include/linux/pagemap.h
===================================================================
--- linux-2.6.25-mm1.orig/include/linux/pagemap.h	2008-04-22 10:33:26.000000000 -0400
+++ linux-2.6.25-mm1/include/linux/pagemap.h	2008-04-24 12:04:06.000000000 -0400
@@ -30,6 +30,28 @@ static inline void mapping_set_error(str
 	}
 }
 
+#ifdef CONFIG_NORECLAIM_LRU
+#define AS_NORECLAIM	(__GFP_BITS_SHIFT + 2)	/* e.g., ramdisk, SHM_LOCK */
+
+static inline void mapping_set_noreclaim(struct address_space *mapping)
+{
+	set_bit(AS_NORECLAIM, &mapping->flags);
+}
+
+static inline int mapping_non_reclaimable(struct address_space *mapping)
+{
+	if (mapping && (mapping->flags & AS_NORECLAIM))
+		return 1;
+	return 0;
+}
+#else
+static inline void mapping_set_noreclaim(struct address_space *mapping) { }
+static inline int mapping_non_reclaimable(struct address_space *mapping)
+{
+	return 0;
+}
+#endif
+
 static inline gfp_t mapping_gfp_mask(struct address_space * mapping)
 {
 	return (__force gfp_t)mapping->flags & __GFP_BITS_MASK;
Index: linux-2.6.25-mm1/mm/vmscan.c
===================================================================
--- linux-2.6.25-mm1.orig/mm/vmscan.c	2008-04-24 12:03:54.000000000 -0400
+++ linux-2.6.25-mm1/mm/vmscan.c	2008-04-24 12:04:06.000000000 -0400
@@ -2261,6 +2261,8 @@ int zone_reclaim(struct zone *zone, gfp_
  * lists vs noreclaim list.
  *
  * Reasons page might not be reclaimable:
+ * (1) page's mapping marked non-reclaimable
+ *
  * TODO - later patches
  */
 int page_reclaimable(struct page *page, struct vm_area_struct *vma)
@@ -2268,6 +2270,9 @@ int page_reclaimable(struct page *page, 
 
 	VM_BUG_ON(PageNoreclaim(page));
 
+	if (mapping_non_reclaimable(page_mapping(page)))
+		return 0;
+
 	/* TODO:  test page [!]reclaimable conditions */
 
 	return 1;
Index: linux-2.6.25-mm1/fs/ramfs/inode.c
===================================================================
--- linux-2.6.25-mm1.orig/fs/ramfs/inode.c	2008-04-22 10:33:44.000000000 -0400
+++ linux-2.6.25-mm1/fs/ramfs/inode.c	2008-04-24 12:04:06.000000000 -0400
@@ -61,6 +61,7 @@ struct inode *ramfs_get_inode(struct sup
 		inode->i_mapping->a_ops = &ramfs_aops;
 		inode->i_mapping->backing_dev_info = &ramfs_backing_dev_info;
 		mapping_set_gfp_mask(inode->i_mapping, GFP_HIGHUSER);
+		mapping_set_noreclaim(inode->i_mapping);
 		inode->i_atime = inode->i_mtime = inode->i_ctime = CURRENT_TIME;
 		switch (mode & S_IFMT) {
 		default:
Index: linux-2.6.25-mm1/drivers/block/brd.c
===================================================================
--- linux-2.6.25-mm1.orig/drivers/block/brd.c	2008-04-22 10:33:42.000000000 -0400
+++ linux-2.6.25-mm1/drivers/block/brd.c	2008-04-24 12:04:06.000000000 -0400
@@ -374,8 +374,21 @@ static int brd_ioctl(struct inode *inode
 	return error;
 }
 
+/*
+ * brd_open():
+ * Just mark the mapping as containing non-reclaimable pages
+ */
+static int brd_open(struct inode *inode, struct file *filp)
+{
+	struct address_space *mapping = inode->i_mapping;
+
+	mapping_set_noreclaim(mapping);
+	return 0;
+}
+
 static struct block_device_operations brd_fops = {
 	.owner =		THIS_MODULE,
+	.open  =		brd_open,
 	.ioctl =		brd_ioctl,
 #ifdef CONFIG_BLK_DEV_XIP
 	.direct_access =	brd_direct_access,

-- 
All Rights Reversed


^ permalink raw reply	[flat|nested] 34+ messages in thread

* [PATCH -mm 15/15] SHM_LOCKED pages are nonreclaimable
  2008-04-28 18:18 [PATCH -mm 00/15] VM pageout scalability improvements (V6) Rik van Riel
                   ` (13 preceding siblings ...)
  2008-04-28 18:18 ` [PATCH -mm 14/15] ramfs pages are non-reclaimable Rik van Riel
@ 2008-04-28 18:18 ` Rik van Riel
  2008-04-28 19:55 ` [PATCH -mm 00/15] VM pageout scalability improvements (V6) Rik van Riel
  2008-05-08  4:54 ` Rik van Riel
  16 siblings, 0 replies; 34+ messages in thread
From: Rik van Riel @ 2008-04-28 18:18 UTC (permalink / raw)
  To: linux-kernel
  Cc: lee.schermerhorn, akpm, kosaki.motohiro, linux-mm,
	Lee Schermerhorn

[-- Attachment #1: rvr-14-lts-noreclaim-SHM_LOCKED-pages-are-nonreclaimable.patch --]
[-- Type: text/plain, Size: 6473 bytes --]

V2 -> V3:
+ rebase to 23-mm1 atop RvR's split LRU series.
+ Use scan_mapping_noreclaim_pages() on unlock.  See below.

V1 -> V2:
+  modify to use reworked 'scan_all_zones_noreclaim_pages()'
   See 'TODO' below - still pending.

While working with Nick Piggin's mlock patches, I noticed that
shmem segments locked via shmctl(SHM_LOCK) were not being handled.
SHM_LOCKed pages work like ramdisk pages--the writeback function
just redirties the page so that it can't be reclaimed.  Deal with
these using the same approach as for ram disk pages.

Use the AS_NORECLAIM flag to mark the address_space of SHM_LOCKed
shared memory regions as non-reclaimable.  Then these pages
will be culled off the normal LRU lists during vmscan.

Add a new wrapper function to clear the mapping's noreclaim state
when/if the shared memory segment is unlocked.

Add 'scan_mapping_noreclaim_pages()' to mm/vmscan.c to scan all
pages in the shmem segment's mapping [struct address_space] for
reclaimability now that they're no longer locked.  If a page is
reclaimable, move it to the appropriate zone lru list.

Changes depend on [CONFIG_]NORECLAIM_LRU.

Signed-off-by:  Lee Schermerhorn <lee.schermerhorn@hp.com>
Signed-off-by:  Rik van Riel <riel@redhat.com>

 include/linux/pagemap.h |   10 ++++-
 include/linux/swap.h    |    4 ++
 mm/shmem.c              |    3 +
 mm/vmscan.c             |   85 ++++++++++++++++++++++++++++++++++++++++++++++++
 4 files changed, 100 insertions(+), 2 deletions(-)
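
For context, a small user-space illustration (not part of the patch) of the
operations that exercise this path.  SHM_LOCK marks the segment's mapping
AS_NORECLAIM via shmem_lock(); SHM_UNLOCK clears it and triggers the
scan_mapping_noreclaim_pages() rescan added below.  The program itself is
only a sketch using the Linux-specific SHM_LOCK/SHM_UNLOCK commands.

#include <sys/ipc.h>
#include <sys/shm.h>

int main(void)
{
	/* create a 1 MB SysV shared memory segment */
	int id = shmget(IPC_PRIVATE, 1 << 20, IPC_CREAT | 0600);

	if (id < 0)
		return 1;

	shmctl(id, SHM_LOCK, NULL);	/* pages become non-reclaimable    */
	/* ... use the segment; its pages sit on the noreclaim list ...   */
	shmctl(id, SHM_UNLOCK, NULL);	/* mapping rescanned, reclaimable
					   pages return to the normal LRUs */
	shmctl(id, IPC_RMID, NULL);
	return 0;
}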

Index: linux-2.6.25-mm1/mm/shmem.c
===================================================================
--- linux-2.6.25-mm1.orig/mm/shmem.c	2008-04-24 12:00:01.000000000 -0400
+++ linux-2.6.25-mm1/mm/shmem.c	2008-04-24 12:04:11.000000000 -0400
@@ -1469,10 +1469,13 @@ int shmem_lock(struct file *file, int lo
 		if (!user_shm_lock(inode->i_size, user))
 			goto out_nomem;
 		info->flags |= VM_LOCKED;
+		mapping_set_noreclaim(file->f_mapping);
 	}
 	if (!lock && (info->flags & VM_LOCKED) && user) {
 		user_shm_unlock(inode->i_size, user);
 		info->flags &= ~VM_LOCKED;
+		mapping_clear_noreclaim(file->f_mapping);
+		scan_mapping_noreclaim_pages(file->f_mapping);
 	}
 	retval = 0;
 out_nomem:
Index: linux-2.6.25-mm1/include/linux/pagemap.h
===================================================================
--- linux-2.6.25-mm1.orig/include/linux/pagemap.h	2008-04-24 12:04:06.000000000 -0400
+++ linux-2.6.25-mm1/include/linux/pagemap.h	2008-04-24 12:04:11.000000000 -0400
@@ -38,14 +38,20 @@ static inline void mapping_set_noreclaim
 	set_bit(AS_NORECLAIM, &mapping->flags);
 }
 
+static inline void mapping_clear_noreclaim(struct address_space *mapping)
+{
+	clear_bit(AS_NORECLAIM, &mapping->flags);
+}
+
 static inline int mapping_non_reclaimable(struct address_space *mapping)
 {
-	if (mapping && (mapping->flags & AS_NORECLAIM))
-		return 1;
+	if (mapping)
+		return test_bit(AS_NORECLAIM, &mapping->flags);
 	return 0;
 }
 #else
 static inline void mapping_set_noreclaim(struct address_space *mapping) { }
+static inline void mapping_clear_noreclaim(struct address_space *mapping) { }
 static inline int mapping_non_reclaimable(struct address_space *mapping)
 {
 	return 0;
Index: linux-2.6.25-mm1/mm/vmscan.c
===================================================================
--- linux-2.6.25-mm1.orig/mm/vmscan.c	2008-04-24 12:04:06.000000000 -0400
+++ linux-2.6.25-mm1/mm/vmscan.c	2008-04-24 12:04:11.000000000 -0400
@@ -2277,4 +2277,89 @@ int page_reclaimable(struct page *page, 
 
 	return 1;
 }
+
+/**
+ * check_move_noreclaim_page - check page for reclaimability and move to appropriate zone lru list
+ * @page: page to check reclaimability and move to appropriate lru list
+ * @zone: zone page is in
+ *
+ * Checks a page for reclaimability and moves the page to the appropriate
+ * zone lru list.
+ *
+ * Restrictions: zone->lru_lock must be held, page must be on LRU and must
+ * have PageNoreclaim set.
+ */
+static void check_move_noreclaim_page(struct page *page, struct zone *zone)
+{
+
+	ClearPageNoreclaim(page); /* for page_reclaimable() */
+	if (page_reclaimable(page, NULL)) {
+		enum lru_list l = LRU_INACTIVE_ANON + page_file_cache(page);
+		__dec_zone_state(zone, NR_NORECLAIM);
+		list_move(&page->lru, &zone->list[l]);
+		__inc_zone_state(zone, NR_INACTIVE_ANON + l);
+	} else {
+		/*
+		 * rotate noreclaim list
+		 */
+		SetPageNoreclaim(page);
+		list_move(&page->lru, &zone->list[LRU_NORECLAIM]);
+	}
+}
+
+/**
+ * scan_mapping_noreclaim_pages - scan an address space for reclaimable pages
+ * @mapping: struct address_space to scan for reclaimable pages
+ *
+ * Scan all pages in mapping.  Check non-reclaimable pages for
+ * reclaimability and move them to the appropriate zone lru list.
+ */
+void scan_mapping_noreclaim_pages(struct address_space *mapping)
+{
+	pgoff_t next = 0;
+	pgoff_t end   = i_size_read(mapping->host);
+	struct zone *zone;
+	struct pagevec pvec;
+
+	if (mapping->nrpages == 0)
+		return;
+
+	pagevec_init(&pvec, 0);
+	while (next < end &&
+		pagevec_lookup(&pvec, mapping, next, PAGEVEC_SIZE)) {
+		int i;
+
+		zone = NULL;
+
+		for (i = 0; i < pagevec_count(&pvec); i++) {
+			struct page *page = pvec.pages[i];
+			pgoff_t page_index = page->index;
+			struct zone *pagezone = page_zone(page);
+
+			if (page_index > next)
+				next = page_index;
+			next++;
+
+			if (TestSetPageLocked(page))
+				continue;
+
+			if (pagezone != zone) {
+				if (zone)
+					spin_unlock(&zone->lru_lock);
+				zone = pagezone;
+				spin_lock(&zone->lru_lock);
+			}
+
+			if (PageLRU(page) && PageNoreclaim(page))
+				check_move_noreclaim_page(page, zone);
+
+			unlock_page(page);
+
+		}
+		if (zone)
+			spin_unlock(&zone->lru_lock);
+		pagevec_release(&pvec);
+	}
+
+}
 #endif
Index: linux-2.6.25-mm1/include/linux/swap.h
===================================================================
--- linux-2.6.25-mm1.orig/include/linux/swap.h	2008-04-24 12:03:54.000000000 -0400
+++ linux-2.6.25-mm1/include/linux/swap.h	2008-04-24 12:04:11.000000000 -0400
@@ -242,12 +242,16 @@ static inline int zone_reclaim(struct zo
 
 #ifdef CONFIG_NORECLAIM_LRU
 extern int page_reclaimable(struct page *page, struct vm_area_struct *vma);
+extern void scan_mapping_noreclaim_pages(struct address_space *);
 #else
 static inline int page_reclaimable(struct page *page,
 						struct vm_area_struct *vma)
 {
 	return 1;
 }
+static inline void scan_mapping_noreclaim_pages(struct address_space *mapping)
+{
+}
 #endif
 
 extern int kswapd_run(int nid);

-- 
All Rights Reversed


^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH -mm 00/15] VM pageout scalability improvements (V6)
  2008-04-28 18:18 [PATCH -mm 00/15] VM pageout scalability improvements (V6) Rik van Riel
                   ` (14 preceding siblings ...)
  2008-04-28 18:18 ` [PATCH -mm 15/15] SHM_LOCKED pages are nonreclaimable Rik van Riel
@ 2008-04-28 19:55 ` Rik van Riel
  2008-05-08  4:54 ` Rik van Riel
  16 siblings, 0 replies; 34+ messages in thread
From: Rik van Riel @ 2008-04-28 19:55 UTC (permalink / raw)
  To: linux-kernel; +Cc: lee.schermerhorn, akpm, kosaki.motohiro

On Mon, 28 Apr 2008 14:18:35 -0400
Rik van Riel <riel@redhat.com> wrote:

> This patch series improves VM scalability 

Here is some (absolutely minimal) performance data.

Running "fillmem 16000" on my 16 GB test system puts the system about 800MB
into swap.  Both run time and CPU use are improved with this patch series.
The numbers are an average of about a dozen runs.

2.6.25-mm1:
real	1m31
user	0m11
sys	0m32

2.6.25-mm1-splitvm:
real	1m02
user	0m11
sys	0m22

This does not account for kswapd CPU use, which accumulated to about 56
seconds total in the splitvm kernel, but around 5 minutes on 2.6.25-mm1.

I will make a kernel RPM with the split LRU patch set available so other
people can easily do performance tests.

-- 
All Rights Reversed

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH -mm 00/15] VM pageout scalability improvements (V6)
  2008-04-28 18:18 [PATCH -mm 00/15] VM pageout scalability improvements (V6) Rik van Riel
                   ` (15 preceding siblings ...)
  2008-04-28 19:55 ` [PATCH -mm 00/15] VM pageout scalability improvements (V6) Rik van Riel
@ 2008-05-08  4:54 ` Rik van Riel
  16 siblings, 0 replies; 34+ messages in thread
From: Rik van Riel @ 2008-05-08  4:54 UTC (permalink / raw)
  To: linux-kernel; +Cc: lee.schermerhorn, akpm, kosaki.motohiro

On Mon, 28 Apr 2008 14:18:35 -0400
Rik van Riel <riel@redhat.com> wrote:

> On large memory systems, the VM can spend way too much time scanning
> through pages that it cannot (or should not) evict from memory. Not
> only does it use up CPU time, but it also provokes lock contention
> and can leave large systems under memory presure in a catatonic state.

> This patch series improves VM scalability by:

> An all-in-one patch can be found at:
> 
> 	http://people.redhat.com/riel/splitvm/

I have prepared kernel RPMs with the split LRU patch series and
made them available at the above URL.  The kernel packages are
based on the 2.6.25 Fedora 9 kernel RPM.

-- 
All rights reversed.

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH -mm 07/15] split LRU lists into anon & file sets
  2008-04-28 18:18 ` [PATCH -mm 07/15] split LRU lists into anon & file sets Rik van Riel
@ 2008-05-10  7:50   ` MinChan Kim
  0 siblings, 0 replies; 34+ messages in thread
From: MinChan Kim @ 2008-05-10  7:50 UTC (permalink / raw)
  To: Rik van Riel; +Cc: linux-kernel, lee.schermerhorn, akpm, kosaki.motohiro

> Index: linux-2.6.25-mm1/mm/page_alloc.c
> ===================================================================
> --- linux-2.6.25-mm1.orig/mm/page_alloc.c       2008-04-24 12:00:01.000000000 -0400
> +++ linux-2.6.25-mm1/mm/page_alloc.c    2008-04-24 12:01:36.000000000 -0400
> @@ -1923,10 +1923,13 @@ void show_free_areas(void)
>                }
>        }
>
> -       printk("Active:%lu inactive:%lu dirty:%lu writeback:%lu unstable:%lu\n"
> +       printk("Active_anon:%lu active_file:%lu inactive_anon%lu\n"
> +               " inactive_file:%lu dirty:%lu writeback:%lu unstable:%lu\n"
>                " free:%lu slab:%lu mapped:%lu pagetables:%lu bounce:%lu\n",

I found a typo.

index 2b2e205..ad84c99 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -1932,7 +1932,7 @@ void show_free_areas(void)
                }
        }

-       printk("Active_anon:%lu active_file:%lu inactive_anon%lu\n"
+       printk("Active_anon:%lu active_file:%lu inactive_anon:%lu\n"
                " inactive_file:%lu"
 //TODO:  check/adjust line lengths
 #ifdef CONFIG_NORECLAIM_LRU


-- 
Kinds regards,
MinChan Kim

^ permalink raw reply related	[flat|nested] 34+ messages in thread

* Re: [PATCH -mm 05/15] free swap space on swap-in/activation
  2008-04-28 18:18 ` [PATCH -mm 05/15] free swap space on swap-in/activation Rik van Riel
@ 2008-05-12 11:21   ` Daisuke Nishimura
  2008-05-12 13:33     ` Rik van Riel
       [not found]     ` <1210600296.7300.23.camel@lts-notebook>
  0 siblings, 2 replies; 34+ messages in thread
From: Daisuke Nishimura @ 2008-05-12 11:21 UTC (permalink / raw)
  To: Rik van Riel; +Cc: lee.schermerhorn, akpm, kosaki.motohiro, linux-kernel

Hi.

I have a question about the intention of this patch.

I think all the remove_exclusive_swap_page() you have added
are called while the _count of the page is incremented by
__isolate_lru_page().

So, IMHO, swap caches that will be freed by this patch should have
page counts from __isolate_lru_page() and the swap cache itself.

They are different from, for example, those freed by do_swap_page()
-> remove_exclusive_swap_page(), that is, swap caches
that have just been mapped and have page counts from
the process (the only user of the page) and the swap cache itself.

So, the intention of this patch is
not to free the swap space of a swap cache page that has already been
swapped in and is used by only one process, as do_swap_page() does,
but to free the swap space of a page that is only used as a swap cache
(not used by any process), right?


Regards,
Daisuke Nishimura.

Rik van Riel wrote:
> Free swap cache entries when swapping in pages if vm_swap_full()
> [swap space > 1/2 used].  Uses new pagevec to reduce pressure
> on locks.
> 
> Signed-off-by: Rik van Riel <riel@redhat.com>
> Signed-off-by: Lee Schermerhorn <Lee.Schermerhorn@hp.com>
> 
> Index: linux-2.6.25-mm1/mm/vmscan.c
> ===================================================================
> --- linux-2.6.25-mm1.orig/mm/vmscan.c	2008-04-24 10:46:43.000000000 -0400
> +++ linux-2.6.25-mm1/mm/vmscan.c	2008-04-24 11:59:56.000000000 -0400
> @@ -619,6 +619,9 @@ free_it:
>  		continue;
>  
>  activate_locked:
> +		/* Not a candidate for swapping, so reclaim swap space. */
> +		if (PageSwapCache(page) && vm_swap_full())
> +			remove_exclusive_swap_page(page);
>  		SetPageActive(page);
>  		pgactivate++;
>  keep_locked:
> @@ -1203,6 +1206,8 @@ static void shrink_active_list(unsigned 
>  			__mod_zone_page_state(zone, NR_ACTIVE, pgmoved);
>  			pgmoved = 0;
>  			spin_unlock_irq(&zone->lru_lock);
> +			if (vm_swap_full())
> +				pagevec_swap_free(&pvec);
>  			__pagevec_release(&pvec);
>  			spin_lock_irq(&zone->lru_lock);
>  		}
> @@ -1212,6 +1217,8 @@ static void shrink_active_list(unsigned 
>  	__count_zone_vm_events(PGREFILL, zone, pgscanned);
>  	__count_vm_events(PGDEACTIVATE, pgdeactivate);
>  	spin_unlock_irq(&zone->lru_lock);
> +	if (vm_swap_full())
> +		pagevec_swap_free(&pvec);
>  
>  	pagevec_release(&pvec);
>  }
> Index: linux-2.6.25-mm1/mm/swap.c
> ===================================================================
> --- linux-2.6.25-mm1.orig/mm/swap.c	2008-04-24 11:59:51.000000000 -0400
> +++ linux-2.6.25-mm1/mm/swap.c	2008-04-24 11:59:56.000000000 -0400
> @@ -443,6 +443,24 @@ void pagevec_strip(struct pagevec *pvec)
>  	}
>  }
>  
> +/*
> + * Try to free swap space from the pages in a pagevec
> + */
> +void pagevec_swap_free(struct pagevec *pvec)
> +{
> +	int i;
> +
> +	for (i = 0; i < pagevec_count(pvec); i++) {
> +		struct page *page = pvec->pages[i];
> +
> +		if (PageSwapCache(page) && !TestSetPageLocked(page)) {
> +			if (PageSwapCache(page))
> +				remove_exclusive_swap_page(page);
> +			unlock_page(page);
> +		}
> +	}
> +}
> +
>  /**
>   * pagevec_lookup - gang pagecache lookup
>   * @pvec:	Where the resulting pages are placed
> Index: linux-2.6.25-mm1/include/linux/pagevec.h
> ===================================================================
> --- linux-2.6.25-mm1.orig/include/linux/pagevec.h	2008-04-24 11:59:51.000000000 -0400
> +++ linux-2.6.25-mm1/include/linux/pagevec.h	2008-04-24 11:59:56.000000000 -0400
> @@ -25,6 +25,7 @@ void __pagevec_release_nonlru(struct pag
>  void __pagevec_free(struct pagevec *pvec);
>  void ____pagevec_lru_add(struct pagevec *pvec, enum lru_list lru);
>  void pagevec_strip(struct pagevec *pvec);
> +void pagevec_swap_free(struct pagevec *pvec);
>  unsigned pagevec_lookup(struct pagevec *pvec, struct address_space *mapping,
>  		pgoff_t start, unsigned nr_pages);
>  unsigned pagevec_lookup_tag(struct pagevec *pvec,
> 


^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH -mm 05/15] free swap space on swap-in/activation
  2008-05-12 11:21   ` Daisuke Nishimura
@ 2008-05-12 13:33     ` Rik van Riel
  2008-05-13 13:00       ` Daisuke Nishimura
       [not found]     ` <1210600296.7300.23.camel@lts-notebook>
  1 sibling, 1 reply; 34+ messages in thread
From: Rik van Riel @ 2008-05-12 13:33 UTC (permalink / raw)
  To: Daisuke Nishimura; +Cc: lee.schermerhorn, akpm, kosaki.motohiro, linux-kernel

On Mon, 12 May 2008 20:21:32 +0900
Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp> wrote:

> So, the intention of this patch is
> not to free the swap space of a swap cache page that has already been
> swapped in and is used by only one process, as do_swap_page() does,
> but to free the swap space of a page that is only used as a swap cache
> (not used by any process), right?

No.  The code should also free the swap space of pages that are
mapped by processes, in order to free up swap space for other
uses.  This only happens if vm_swap_full() is true.
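
(As a rough sketch, not the kernel's exact definition, the "swap space >
1/2 used" condition described in the patch changelog amounts to something
like the following; the helper name is made up.)

/* sketch only: "vm_swap_full" as described above -- less than half of
 * the configured swap space is still free */
static inline int vm_swap_full_sketch(long free_swap_pages,
				      long total_swap_pages)
{
	return free_swap_pages * 2 < total_swap_pages;
}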

-- 
All rights reversed.

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH -mm 05/15] free swap space on swap-in/activation
       [not found]     ` <1210600296.7300.23.camel@lts-notebook>
@ 2008-05-13 12:39       ` Daisuke Nishimura
  0 siblings, 0 replies; 34+ messages in thread
From: Daisuke Nishimura @ 2008-05-13 12:39 UTC (permalink / raw)
  To: Lee Schermerhorn
  Cc: Rik van Riel, akpm, kosaki.motohiro, linux-kernel, Hugh Dickins

On 2008/05/12 22:51 +0900, Lee Schermerhorn wrote:
> On Mon, 2008-05-12 at 20:21 +0900, Daisuke Nishimura wrote:
>> Hi.
>>
>> I have a question about the intention of this patch.
>>
>> I think all the remove_exclusive_swap_page() you have added
>> are called while the _count of the page is incremented by
>> __isolate_lru_page().
>>
>> So, IMHO, swap caches that will be freed by this patch should have
>> page counts from __isolate_lru_page() and the swap cache itself.
>>
>> They are different from, for example, those freed by do_swap_page()
>> -> remove_exclusive_swap_page(), that is, swap caches
>> that have just been mapped and have page counts from
>> the process (the only user of the page) and the swap cache itself.
>>
>> So, the intention of this patch is
>> not to free the swap space of a swap cache page that has already been
>> swapped in and is used by only one process, as do_swap_page() does,
>> but to free the swap space of a page that is only used as a swap cache
>> (not used by any process), right?
> 
> I agree.  I had noticed this when I wanted to use
> remove_exclusive_swap_page() when a page in the swap cache became
> mlocked, but I hadn't got to ask about it yet.  As you note, we hold an
> extra reference in that case.  Further more, an anonymous page, in the
> swap cache, can be shared, read-only by multiple tasks--ancestors and
> descendants.  So, I think the check could be changed to something like:
> 
> (page_count(page) - page_mapcount(page) - extra_ref) == 1  to remove, or
> '!= 1' to reject removal. 
> 
> "extra_ref" would be a new argument to remove_exclusive_swap_page() that
> would be passed as 0 in do_swap_page() and, I think, free_swap_cache()
> and swap_writepage()--current callers.  In Rik's patch and when mlocking
> a page, we'd pass extra_ref as 1 or 2 because we hold the extra ref from
> isolate_lru_page and maybe one from get_user_pages() in the mlock case.
> 
> Thoughts?
> 

I agree that the check in remove_exclusive_swap_page() should be changed,
and your new check seems reasonable to me.

But I have one comment.

I think swap_writepage(), which is one of the current callers, should not
call remove_exclusive_swap_page() with extra_ref=0, because swap_writepage()
can be called via shrink_page_list() -> pageout(), that is, while the
page count is incremented by page isolation.
If there is a swap cache page that would be freed in the current
shrink_page_list() -> pageout() -> swap_writepage() ->
remove_exclusive_swap_page() sequence, the page count of that page is 2
(page isolation and swap cache).
But if you modify remove_exclusive_swap_page() and call it with extra_ref=0,
that condition isn't met.
I don't know whether swap_writepage() can be called via some other sequence,
but it should not simply call remove_exclusive_swap_page() with extra_ref=0.
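
For reference, a rough sketch (illustration only, not a proposed patch) of
the check Lee describes above, with the caller supplying the number of
transient references it holds; swap_page_is_exclusive() is a made-up name.

#include <linux/mm.h>

/*
 * extra_ref: transient references held by the caller, e.g. 1 from
 * isolate_lru_page(), plus 1 more from get_user_pages() in the mlock case.
 * Returns true when only the ptes, the swap cache and the caller hold
 * references to the page.
 */
static inline int swap_page_is_exclusive(struct page *page, int extra_ref)
{
	return page_count(page) - page_mapcount(page) - extra_ref == 1;
}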


Thanks,
Daisuke Nishimura.


^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH -mm 05/15] free swap space on swap-in/activation
  2008-05-12 13:33     ` Rik van Riel
@ 2008-05-13 13:00       ` Daisuke Nishimura
  2008-05-13 13:11         ` Rik van Riel
  2008-05-13 16:04         ` [PATCH] take pageout refcount into account for remove_exclusive_swap_page Rik van Riel
  0 siblings, 2 replies; 34+ messages in thread
From: Daisuke Nishimura @ 2008-05-13 13:00 UTC (permalink / raw)
  To: Rik van Riel; +Cc: lee.schermerhorn, akpm, kosaki.motohiro, linux-kernel

On 2008/05/12 22:33 +0900, Rik van Riel wrote:
> On Mon, 12 May 2008 20:21:32 +0900
> Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp> wrote:
> 
>> So, the intention of this patch is
>> not to free the swap space of a swap cache page that has already been
>> swapped in and is used by only one process, as do_swap_page() does,
>> but to free the swap space of a page that is only used as a swap cache
>> (not used by any process), right?
> 
> No.  The code should also free the swap space of pages that are
> mapped by processes, in order to free up swap space for other
> uses.  This only happens if vm_swap_full() is true.
> 

OK.

I thought that the current code cannot free the swap space
of pages that are mapped by processes (and I think it cannot),
so I asked about the intention of this patch for clarification.


Thanks,
Daisuke Nishimura.


^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH -mm 05/15] free swap space on swap-in/activation
  2008-05-13 13:00       ` Daisuke Nishimura
@ 2008-05-13 13:11         ` Rik van Riel
  2008-05-13 16:04         ` [PATCH] take pageout refcount into account for remove_exclusive_swap_page Rik van Riel
  1 sibling, 0 replies; 34+ messages in thread
From: Rik van Riel @ 2008-05-13 13:11 UTC (permalink / raw)
  To: Daisuke Nishimura; +Cc: lee.schermerhorn, akpm, kosaki.motohiro, linux-kernel

On Tue, 13 May 2008 22:00:44 +0900
Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp> wrote:

> On 2008/05/12 22:33 +0900, Rik van Riel wrote:
> > On Mon, 12 May 2008 20:21:32 +0900
> > Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp> wrote:
> > 
> >> So, the intention of this patch is
> >> not to free the swap space of a swap cache page that has already been
> >> swapped in and is used by only one process, as do_swap_page() does,
> >> but to free the swap space of a page that is only used as a swap cache
> >> (not used by any process), right?
> > 
> > No.  The code should also free the swap space of pages that are
> > mapped by processes, in order to free up swap space for other
> > uses.  This only happens if vm_swap_full() is true.
> > 
> 
> OK.
> 
> I thought that the current code cannot free the swap space
> of pages that are mapped by processes (and I think it cannot),
> so I asked about the intention of this patch for clarification.

You may be right, due to the refcounting magic that is going
on.  I will verify the code and send in a fix if it turns out
to be needed.

-- 
All rights reversed.

^ permalink raw reply	[flat|nested] 34+ messages in thread

* [PATCH] take pageout refcount into account for remove_exclusive_swap_page
  2008-05-13 13:00       ` Daisuke Nishimura
  2008-05-13 13:11         ` Rik van Riel
@ 2008-05-13 16:04         ` Rik van Riel
  2008-05-13 17:43           ` Lee Schermerhorn
  1 sibling, 1 reply; 34+ messages in thread
From: Rik van Riel @ 2008-05-13 16:04 UTC (permalink / raw)
  To: Daisuke Nishimura; +Cc: lee.schermerhorn, akpm, kosaki.motohiro, linux-kernel

The pageout code takes a reference count to the page, which means
that remove_exclusive_swap_page (when called from the pageout code)
needs to take that extra refcount into account for mapped pages.

Introduces a remove_exclusive_swap_page_ref function to avoid
exposing too much magic to the rest of the VM.

Signed-off-by: Rik van Riel <riel@redhat.com>
Debugged-by: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp>

---
Daisuke: thank you for pointing out the problem.  Does this patch look like
a reasonable fix to you?

Andrew: this patch is incremental to my patch
	"[PATCH -mm 05/15] free swap space on swap-in/activation"


Index: linux-2.6.25-mm1/mm/vmscan.c
===================================================================
--- linux-2.6.25-mm1.orig/mm/vmscan.c	2008-05-13 11:59:34.000000000 -0400
+++ linux-2.6.25-mm1/mm/vmscan.c	2008-05-13 11:54:03.000000000 -0400
@@ -621,7 +621,7 @@ free_it:
 activate_locked:
 		/* Not a candidate for swapping, so reclaim swap space. */
 		if (PageSwapCache(page) && vm_swap_full())
-			remove_exclusive_swap_page(page);
+			remove_exclusive_swap_page_ref(page);
 		SetPageActive(page);
 		pgactivate++;
 keep_locked:
Index: linux-2.6.25-mm1/mm/swap.c
===================================================================
--- linux-2.6.25-mm1.orig/mm/swap.c	2008-05-13 11:59:34.000000000 -0400
+++ linux-2.6.25-mm1/mm/swap.c	2008-05-13 11:53:47.000000000 -0400
@@ -455,7 +455,7 @@ void pagevec_swap_free(struct pagevec *p
 
 		if (PageSwapCache(page) && !TestSetPageLocked(page)) {
 			if (PageSwapCache(page))
-				remove_exclusive_swap_page(page);
+				remove_exclusive_swap_page_ref(page);
 			unlock_page(page);
 		}
 	}
Index: linux-2.6.25-mm1/include/linux/swap.h
===================================================================
--- linux-2.6.25-mm1.orig/include/linux/swap.h	2008-05-13 11:20:46.000000000 -0400
+++ linux-2.6.25-mm1/include/linux/swap.h	2008-05-13 11:47:31.000000000 -0400
@@ -266,6 +266,7 @@ extern sector_t swapdev_block(int, pgoff
 extern struct swap_info_struct *get_swap_info_struct(unsigned);
 extern int can_share_swap_page(struct page *);
 extern int remove_exclusive_swap_page(struct page *);
+extern int remove_exclusive_swap_page_ref(struct page *);
 struct backing_dev_info;
 
 extern spinlock_t swap_lock;
Index: linux-2.6.25-mm1/mm/swapfile.c
===================================================================
--- linux-2.6.25-mm1.orig/mm/swapfile.c	2008-04-22 10:33:45.000000000 -0400
+++ linux-2.6.25-mm1/mm/swapfile.c	2008-05-13 11:53:32.000000000 -0400
@@ -343,7 +343,7 @@ int can_share_swap_page(struct page *pag
  * Work out if there are any other processes sharing this
  * swap cache page. Free it if you can. Return success.
  */
-int remove_exclusive_swap_page(struct page *page)
+static int remove_exclusive_swap_page_count(struct page *page, int count)
 {
 	int retval;
 	struct swap_info_struct * p;
@@ -356,7 +356,7 @@ int remove_exclusive_swap_page(struct pa
 		return 0;
 	if (PageWriteback(page))
 		return 0;
-	if (page_count(page) != 2) /* 2: us + cache */
+	if (page_count(page) != count) /* 2: us + cache */
 		return 0;
 
 	entry.val = page_private(page);
@@ -369,7 +369,7 @@ int remove_exclusive_swap_page(struct pa
 	if (p->swap_map[swp_offset(entry)] == 1) {
 		/* Recheck the page count with the swapcache lock held.. */
 		write_lock_irq(&swapper_space.tree_lock);
-		if ((page_count(page) == 2) && !PageWriteback(page)) {
+		if ((page_count(page) == count) && !PageWriteback(page)) {
 			__delete_from_swap_cache(page);
 			SetPageDirty(page);
 			retval = 1;
@@ -387,6 +387,25 @@ int remove_exclusive_swap_page(struct pa
 }
 
 /*
+ * Most of the time the page should have two references: one for the
+ * process and one for the swap cache.
+ */
+int remove_exclusive_swap_page(struct page *page)
+{
+	return remove_exclusive_swap_page_count(page, 2);
+}
+
+/*
+ * The pageout code holds an extra reference to the page.  That raises
+ * the reference count to test for to 2 for a page that is only in the
+ * swap cache and 3 for a page that is mapped by a process.
+ */
+int remove_exclusive_swap_page_ref(struct page *page)
+{
+	return remove_exclusive_swap_page_count(page, 2 + page_mapped(page));
+}
+
+/*
  * Free the swap entry like above, but also try to
  * free the page cache entry if it is the last user.
  */

All Rights Reversed

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH] take pageout refcount into account for remove_exclusive_swap_page
  2008-05-13 16:04         ` [PATCH] take pageout refcount into account for remove_exclusive_swap_page Rik van Riel
@ 2008-05-13 17:43           ` Lee Schermerhorn
  2008-05-13 18:09             ` Rik van Riel
  0 siblings, 1 reply; 34+ messages in thread
From: Lee Schermerhorn @ 2008-05-13 17:43 UTC (permalink / raw)
  To: Rik van Riel; +Cc: Daisuke Nishimura, akpm, kosaki.motohiro, linux-kernel

On Tue, 2008-05-13 at 12:04 -0400, Rik van Riel wrote:
> The pageout code takes a reference count to the page, which means
> that remove_exclusive_swap_page (when called from the pageout code)
> needs to take that extra refcount into account for mapped pages.
> 
> Introduces a remove_exclusive_swap_page_ref function to avoid
> exposing too much magic to the rest of the VM.
> 
> Signed-off-by: Rik van Riel <riel@redhat.com>
> Debugged-by: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp>
> 
> ---
> Daisuke: thank you for pointing out the problem.  Does this patch look like
> a reasonable fix to you?
> 
> Andrew: this patch is incremental to my patch
> 	"[PATCH -mm 05/15] free swap space on swap-in/activation"
> 
> 
> Index: linux-2.6.25-mm1/mm/vmscan.c
> ===================================================================
> --- linux-2.6.25-mm1.orig/mm/vmscan.c	2008-05-13 11:59:34.000000000 -0400
> +++ linux-2.6.25-mm1/mm/vmscan.c	2008-05-13 11:54:03.000000000 -0400
> @@ -621,7 +621,7 @@ free_it:
>  activate_locked:
>  		/* Not a candidate for swapping, so reclaim swap space. */
>  		if (PageSwapCache(page) && vm_swap_full())
> -			remove_exclusive_swap_page(page);
> +			remove_exclusive_swap_page_ref(page);
>  		SetPageActive(page);
>  		pgactivate++;
>  keep_locked:
> Index: linux-2.6.25-mm1/mm/swap.c
> ===================================================================
> --- linux-2.6.25-mm1.orig/mm/swap.c	2008-05-13 11:59:34.000000000 -0400
> +++ linux-2.6.25-mm1/mm/swap.c	2008-05-13 11:53:47.000000000 -0400
> @@ -455,7 +455,7 @@ void pagevec_swap_free(struct pagevec *p
>  
>  		if (PageSwapCache(page) && !TestSetPageLocked(page)) {
>  			if (PageSwapCache(page))
> -				remove_exclusive_swap_page(page);
> +				remove_exclusive_swap_page_ref(page);
>  			unlock_page(page);
>  		}
>  	}
> Index: linux-2.6.25-mm1/include/linux/swap.h
> ===================================================================
> --- linux-2.6.25-mm1.orig/include/linux/swap.h	2008-05-13 11:20:46.000000000 -0400
> +++ linux-2.6.25-mm1/include/linux/swap.h	2008-05-13 11:47:31.000000000 -0400
> @@ -266,6 +266,7 @@ extern sector_t swapdev_block(int, pgoff
>  extern struct swap_info_struct *get_swap_info_struct(unsigned);
>  extern int can_share_swap_page(struct page *);
>  extern int remove_exclusive_swap_page(struct page *);
> +extern int remove_exclusive_swap_page_ref(struct page *);
>  struct backing_dev_info;
>  
>  extern spinlock_t swap_lock;
> Index: linux-2.6.25-mm1/mm/swapfile.c
> ===================================================================
> --- linux-2.6.25-mm1.orig/mm/swapfile.c	2008-04-22 10:33:45.000000000 -0400
> +++ linux-2.6.25-mm1/mm/swapfile.c	2008-05-13 11:53:32.000000000 -0400
> @@ -343,7 +343,7 @@ int can_share_swap_page(struct page *pag
>   * Work out if there are any other processes sharing this
>   * swap cache page. Free it if you can. Return success.
>   */
> -int remove_exclusive_swap_page(struct page *page)
> +static int remove_exclusive_swap_page_count(struct page *page, int count)
>  {
>  	int retval;
>  	struct swap_info_struct * p;
> @@ -356,7 +356,7 @@ int remove_exclusive_swap_page(struct pa
>  		return 0;
>  	if (PageWriteback(page))
>  		return 0;
> -	if (page_count(page) != 2) /* 2: us + cache */
> +	if (page_count(page) != count) /* 2: us + cache */

Maybe change comment to "/* count:  us + ptes + cache */" ???

>  		return 0;
>  
>  	entry.val = page_private(page);
> @@ -369,7 +369,7 @@ int remove_exclusive_swap_page(struct pa
>  	if (p->swap_map[swp_offset(entry)] == 1) {
>  		/* Recheck the page count with the swapcache lock held.. */
>  		write_lock_irq(&swapper_space.tree_lock);
> -		if ((page_count(page) == 2) && !PageWriteback(page)) {
> +		if ((page_count(page) == count) && !PageWriteback(page)) {
>  			__delete_from_swap_cache(page);
>  			SetPageDirty(page);
>  			retval = 1;
> @@ -387,6 +387,25 @@ int remove_exclusive_swap_page(struct pa
>  }
>  
>  /*
> + * Most of the time the page should have two references: one for the
> + * process and one for the swap cache.
> + */
> +int remove_exclusive_swap_page(struct page *page)
> +{
> +	return remove_exclusive_swap_page_count(page, 2);
> +}
> +
> +/*
> + * The pageout code holds an extra reference to the page.  That raises
> + * the reference count to test for to 2 for a page that is only in the
> + * swap cache and 3 for a page that is mapped by a process.

Or, more generally, 2 + N, for an anon page that is mapped [must be
read-only, right?] by N processes.  This can happen after, e.g., fork().
Looks like this patch handles the condition just fine, but you might
want to reflect this in the comment.

Now, I think I can use this to try to remove anon pages from the swap
cache when they're mlocked.

> + */
> +int remove_exclusive_swap_page_ref(struct page *page)
> +{
> +	return remove_exclusive_swap_page_count(page, 2 + page_mapped(page));
> +}
> +
> +/*
>   * Free the swap entry like above, but also try to
>   * free the page cache entry if it is the last user.
>   */
> 
> All Rights Reversed


^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH] take pageout refcount into account for remove_exclusive_swap_page
  2008-05-13 17:43           ` Lee Schermerhorn
@ 2008-05-13 18:09             ` Rik van Riel
  2008-05-13 19:02               ` Lee Schermerhorn
  0 siblings, 1 reply; 34+ messages in thread
From: Rik van Riel @ 2008-05-13 18:09 UTC (permalink / raw)
  To: Lee Schermerhorn; +Cc: Daisuke Nishimura, akpm, kosaki.motohiro, linux-kernel

On Tue, 13 May 2008 13:43:55 -0400
Lee Schermerhorn <Lee.Schermerhorn@hp.com> wrote:

> Or, more generally, 2 + N, for an anon page that is mapped [must be
> read-only, right?] by N processes.  This can happen after, e.g., fork().
> Looks like this patch handles the condition just fine, but you might
> want to reflect this in the comment.

No, this patch only removes a page from the swap cache that is mapped
by one process.  The function page_mapped() returns either 1 or 0, not
the same as page_mapcount().
 
I am not sure if trying to handle swap cache pages that are mapped by
multiple processes could get us into other corner cases and think that
we should probably try to stick to the safe thing for now.

Besides, shouldn't anonymous shared pages be COW and relatively rare?
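
(A toy user-space illustration of that distinction, not kernel code: since
page_mapped() saturates at 1, the "2 + page_mapped(page)" threshold never
accounts for more than one mapper.)

#include <stdio.h>

/* toy model: page_mapped() is 0 or 1, page_mapcount() is the real count */
static int expected_count(int mapcount)
{
	int mapped = (mapcount > 0);	/* what page_mapped() would report */

	return 2 + mapped;		/* pageout ref + swap cache + at most one pte */
}

int main(void)
{
	int n;

	for (n = 0; n <= 3; n++)
		printf("mapcount=%d -> threshold=%d\n", n, expected_count(n));
	return 0;
}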

> Now, I think I can use this to try to remove anon pages from the swap
> cache when they're mlocked.
> 
> > + */
> > +int remove_exclusive_swap_page_ref(struct page *page)
> > +{
> > +	return remove_exclusive_swap_page_count(page, 2 + page_mapped(page));
> > +}
> > +
> > +/*
> >   * Free the swap entry like above, but also try to
> >   * free the page cache entry if it is the last user.
> >   */
> > 
> > All Rights Reversed


-- 
All Rights Reversed

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH] take pageout refcount into account for remove_exclusive_swap_page
  2008-05-13 18:09             ` Rik van Riel
@ 2008-05-13 19:02               ` Lee Schermerhorn
  2008-05-15  2:15                 ` Daisuke Nishimura
  0 siblings, 1 reply; 34+ messages in thread
From: Lee Schermerhorn @ 2008-05-13 19:02 UTC (permalink / raw)
  To: Rik van Riel; +Cc: Daisuke Nishimura, akpm, kosaki.motohiro, linux-kernel

On Tue, 2008-05-13 at 14:09 -0400, Rik van Riel wrote:
> On Tue, 13 May 2008 13:43:55 -0400
> Lee Schermerhorn <Lee.Schermerhorn@hp.com> wrote:
> 
> > Or, more generally, 2 + N, for an anon page that is mapped [must be
> > read-only, right?] by N processes.  This can happen after, e.g., fork().
> > Looks like this patch handles the condition just fine, but you might
> > want to reflect this in the comment.
> 
> No, this patch only removes a page from the swap cache that is mapped
> by one process.  The function page_mapped() returns either 1 or 0, not
> the same as page_mapcount().

Duh!  I was reading "page_mapcount()", 'cause that's what I've been
considering using for this purpose.

>  
> I am not sure if trying to handle swap cache pages that are mapped by
> multiple processes could get us into other corner cases and think that
> we should probably try to stick to the safe thing for now.

OK.  I can test the more general case down the road.

> 
> Besides, shouldn't anonymous shared pages be COW and relatively rare?

Well, all anon pages are shared right after fork(), right?  They only
become private once they've been written to.  I don't have a feel for
the relative numbers of shared anon vs COWed anon--either in general or
in the swap cache.

> 
> > Now, I think I can use this to try to remove anon pages from the swap
> > cache when they're mlocked.

I suppose I can just go ahead and use this version with my stress load
and count the times when I could have freed the swap cache entry, but
didn't because it's mapped in multiple address spaces.  Later ... :)

> > 
> > > + */
> > > +int remove_exclusive_swap_page_ref(struct page *page)
> > > +{
> > > +	return remove_exclusive_swap_page_count(page, 2 + page_mapped(page));
> > > +}
> > > +
> > > +/*
> > >   * Free the swap entry like above, but also try to
> > >   * free the page cache entry if it is the last user.
> > >   */
> > > 
> > > All Rights Reversed
> 
> 


^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH] take pageout refcount into account for remove_exclusive_swap_page
  2008-05-13 19:02               ` Lee Schermerhorn
@ 2008-05-15  2:15                 ` Daisuke Nishimura
  2008-05-15 17:55                   ` Lee Schermerhorn
  0 siblings, 1 reply; 34+ messages in thread
From: Daisuke Nishimura @ 2008-05-15  2:15 UTC (permalink / raw)
  To: Lee Schermerhorn; +Cc: Rik van Riel, akpm, kosaki.motohiro, linux-kernel

On 2008/05/14 4:02 +0900, Lee Schermerhorn wrote:
> On Tue, 2008-05-13 at 14:09 -0400, Rik van Riel wrote:
>> On Tue, 13 May 2008 13:43:55 -0400
>> Lee Schermerhorn <Lee.Schermerhorn@hp.com> wrote:
>>
>>> Or, more generally, 2 + N, for an anon page that is mapped [must be
>>> read-only, right?] by N processes.  This can happen after, e.g., fork().
>>> Looks like this patch handles the condition just fine, but you might
>>> want to reflect this in the comment.
>> No, this patch only removes a page from the swap cache that is mapped
>> by one process.  The function page_mapped() returns either 1 or 0, not
>> the same as page_mapcount().
> 
> Duh!  I was reading "page_mapcount()", 'cause that's what I've been
> considering using for this purpose.
> 
>>  
>> I am not sure if trying to handle swap cache pages that are mapped by
>> multiple processes could get us into other corner cases and think that
>> we should probably try to stick to the safe thing for now.

I think it would be better to add a comment to
remove_exclusive_swap_page_ref() that it doesn't handle
swap caches that are mapped by multiple processes, for safety.

> 
> OK.  I can test the more general case down the road.
> 
>> Besides, shouldn't anonymous shared pages be COW and relatively rare?
> 
> Well, all anon pages are shared right after fork(), right?  They only
> become private once they've been written to.  I don't have a feel for
> the relative numbers of shared anon vs COWed anon--either in general or
> in the swap cache.
> 
>>> Now, I think I can use this to try to remove anon pages from the swap
>>> cache when they're mlocked.
> 
> I suppose I can just go ahead and use this version with my stress load
> and count the times when I could have freed the swap cache entry, but
> didn't because it's mapped in multiple address spaces.  Later ... :)
> 

Lee, Rik's version looks good to me except for the mlocked case, but
do you have any plan to handle that case?


>>>> + */
>>>> +int remove_exclusive_swap_page_ref(struct page *page)
>>>> +{
>>>> +	return remove_exclusive_swap_page_count(page, 2 + page_mapped(page));
>>>> +}
>>>> +
>>>> +/*
>>>>   * Free the swap entry like above, but also try to
>>>>   * free the page cache entry if it is the last user.
>>>>   */
>>>>
>>>> All Rights Reversed
>>
> 
> --
> To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> Please read the FAQ at  http://www.tux.org/lkml/
> 

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH -mm 09/15] add some sanity checks to get_scan_ratio
  2008-04-28 18:18 ` [PATCH -mm 09/15] add some sanity checks to get_scan_ratio Rik van Riel
@ 2008-05-15  6:34   ` MinChan Kim
  2008-05-15 13:12     ` Rik van Riel
  0 siblings, 1 reply; 34+ messages in thread
From: MinChan Kim @ 2008-05-15  6:34 UTC (permalink / raw)
  To: Rik van Riel; +Cc: linux-kernel, lee.schermerhorn, akpm, kosaki.motohiro

> @@ -1256,7 +1285,7 @@ static unsigned long shrink_zone(int pri
>        }
>
>        while (nr[LRU_INACTIVE_ANON] || nr[LRU_ACTIVE_FILE] ||
> -                                                nr[LRU_INACTIVE_FILE]) {
> +                                       nr[LRU_INACTIVE_FILE]) {
>                for_each_lru(l) {
>                        if (nr[l]) {
>                                nr_to_scan = min(nr[l],
> @@ -1269,6 +1298,14 @@ static unsigned long shrink_zone(int pri
>                }
>        }
>
> +       /*
> +        * Even if we did not try to evict anon pages at all, we want to
> +        * rebalance the anon lru active/inactive ratio.
> +        */
> +       if (scan_global_lru(sc) && inactive_anon_low(zone))
> +               shrink_list(NR_ACTIVE_ANON, SWAP_CLUSTER_MAX, zone, sc,
> +                                                               priority);
> +
>        throttle_vm_writeout(sc->gfp_mask);
>        return nr_reclaimed;
>  }

I think it's a typo.
If it is an error, it will cause the wrong algorithm to be used.

--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -1375,7 +1375,7 @@ static unsigned long shrink_zone(int priority, struct zon
         * rebalance the anon lru active/inactive ratio.
         */
        if (scan_global_lru(sc) && inactive_anon_low(zone))
-               shrink_list(NR_ACTIVE_ANON, SWAP_CLUSTER_MAX, zone, sc,
+               shrink_list(LRU_ACTIVE_ANON, SWAP_CLUSTER_MAX, zone, sc,
                                                                priority);

        throttle_vm_writeout(sc->gfp_mask);



shrink_list() is called twice for LRU_ACTIVE_ANON if
(nr[LRU_INACTIVE_ANON] != 0 && inactive_anon_low(zone)).
Is that your intention?  Do you want to put pressure on the active anon
list twice under that condition?

If that is your intention, I think the following code is more readable
than the old one.

--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -1375,8 +1375,7 @@ static unsigned long shrink_zone(int priority, struct zon
         * rebalance the anon lru active/inactive ratio.
         */
        if (scan_global_lru(sc) && inactive_anon_low(zone))
-               shrink_list(NR_ACTIVE_ANON, SWAP_CLUSTER_MAX, zone, sc,
-                                                               priority);
+               shrink_inactive_list(SWAP_CLUSTER_MAX, zone, sc, priority, 0);

        throttle_vm_writeout(sc->gfp_mask);
        return nr_reclaimed;


My Gmail client will mangle the patch.  It is just for the purpose of review.

-- 
Kinds regards,
MinChan Kim

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH -mm 09/15] add some sanity checks to get_scan_ratio
  2008-05-15  6:34   ` MinChan Kim
@ 2008-05-15 13:12     ` Rik van Riel
  2008-05-16  6:42       ` MinChan Kim
  0 siblings, 1 reply; 34+ messages in thread
From: Rik van Riel @ 2008-05-15 13:12 UTC (permalink / raw)
  To: MinChan Kim; +Cc: linux-kernel, lee.schermerhorn, akpm, kosaki.motohiro

On Thu, 15 May 2008 15:34:51 +0900
"MinChan Kim" <minchan.kim@gmail.com> wrote:

> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
> @@ -1375,8 +1375,7 @@ static unsigned long shrink_zone(int priority, struct zon
>          * rebalance the anon lru active/inactive ratio.
>          */
>         if (scan_global_lru(sc) && inactive_anon_low(zone))
> -               shrink_list(NR_ACTIVE_ANON, SWAP_CLUSTER_MAX, zone, sc,
> -                                                               priority);
> +               shrink_inactive_list(SWAP_CLUSTER_MAX, zone, sc, priority, 0);
> 
>         throttle_vm_writeout(sc->gfp_mask);
>         return nr_reclaimed;

shrink_active_list, but yes, that is the idea.

-- 
All rights reversed.

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH] take pageout refcount into account for remove_exclusive_swap_page
  2008-05-15  2:15                 ` Daisuke Nishimura
@ 2008-05-15 17:55                   ` Lee Schermerhorn
  0 siblings, 0 replies; 34+ messages in thread
From: Lee Schermerhorn @ 2008-05-15 17:55 UTC (permalink / raw)
  To: Daisuke Nishimura; +Cc: Rik van Riel, akpm, kosaki.motohiro, linux-kernel

On Thu, 2008-05-15 at 11:15 +0900, Daisuke Nishimura wrote:
> On 2008/05/14 4:02 +0900, Lee Schermerhorn wrote:
> > On Tue, 2008-05-13 at 14:09 -0400, Rik van Riel wrote:
> >> On Tue, 13 May 2008 13:43:55 -0400
> >> Lee Schermerhorn <Lee.Schermerhorn@hp.com> wrote:
> >>
> >>> Or, more generally, 2 + N, for an anon page that is mapped [must be
> >>> read-only, right?] by N processes.  This can happen after, e.g., fork().
> >>> Looks like this patch handles the condition just fine, but you might
> >>> want to reflect this in the comment.
> >> No, this patch only removes a page from the swap cache that is mapped
> >> by one process.  The function page_mapped() returns either 1 or 0, not
> >> the same as page_mapcount().
> > 
> > Duh!  I was reading "page_mapcount()", 'cause that's what I've been
> > considering using for this purpose.
> > 
> >>  
> >> I am not sure if trying to handle swap cache pages that are mapped by
> >> multiple processes could get us into other corner cases and think that
> >> we should probably try to stick to the safe thing for now.
> 
> I think it would be better to add a comment to
> remove_exclusive_swap_page_ref() that it doesn't handle
> swap caches that are mapped by multiple processes, for safety.
> 
> > 
> > OK.  I can test the more general case down the road.
> > 
> >> Besides, shouldn't anonymous shared pages be COW and relatively rare?
> > 
> > Well, all anon pages are shared right after fork(), right?  They only
> > become private once they've been written to.  I don't have a feel for
> > the relative numbers of shared anon vs COWed anon--either in general or
> > in the swap cache.
> > 
> >>> Now, I think I can use this to try to remove anon pages from the swap
> >>> cache when they're mlocked.
> > 
> > I suppose I can just go ahead and use this version with my stress load
> > and count the times when I could have freed the swap cache entry, but
> > didn't because it's mapped in multiple address spaces.  Later ... :)
> > 
> 
> Lee, Rik's version looks good to me except for the mlocked case, but
> do you have any plan to handle that case?

Well, it's really just an optimization--freeing swap cache entry, if
possible, when mlocking a page.  I'll patch Rik's version to use
page_mapcount() and test heavily before proposing to do this.  I suppose
we can also discuss whether we want a separate version that frees swap
mapped by multiple tasks for the mlock case, and keep Rik's version for
vmscan.  However, I think that if it's safe in the mlock case, it should
be safe for vmscan.  We'll see.

Lee

> 
> 
> >>>> + */
> >>>> +int remove_exclusive_swap_page_ref(struct page *page)
> >>>> +{
> >>>> +	return remove_exclusive_swap_page_count(page, 2 + page_mapped(page));
> >>>> +}
> >>>> +
> >>>> +/*
> >>>>   * Free the swap entry like above, but also try to
> >>>>   * free the page cache entry if it is the last user.
> >>>>   */
> >>>>
> >>>> All Rights Reversed
> >>
> > 
> > --
> > To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
> > the body of a message to majordomo@vger.kernel.org
> > More majordomo info at  http://vger.kernel.org/majordomo-info.html
> > Please read the FAQ at  http://www.tux.org/lkml/
> > 


^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH -mm 09/15] add some sanity checks to get_scan_ratio
  2008-05-15 13:12     ` Rik van Riel
@ 2008-05-16  6:42       ` MinChan Kim
  2008-05-16 16:47         ` Rik van Riel
  0 siblings, 1 reply; 34+ messages in thread
From: MinChan Kim @ 2008-05-16  6:42 UTC (permalink / raw)
  To: Rik van Riel; +Cc: linux-kernel, lee.schermerhorn, akpm, kosaki.motohiro

On Thu, May 15, 2008 at 10:12 PM, Rik van Riel <riel@redhat.com> wrote:
> On Thu, 15 May 2008 15:34:51 +0900
> "MinChan Kim" <minchan.kim@gmail.com> wrote:
>
>> --- a/mm/vmscan.c
>> +++ b/mm/vmscan.c
>> @@ -1375,8 +1375,7 @@ static unsigned long shrink_zone(int priority, struct zon
>>          * rebalance the anon lru active/inactive ratio.
>>          */
>>         if (scan_global_lru(sc) && inactive_anon_low(zone))
>> -               shrink_list(NR_ACTIVE_ANON, SWAP_CLUSTER_MAX, zone, sc,
>> -                                                               priority);
>> +               shrink_inactive_list(SWAP_CLUSTER_MAX, zone, sc, priority, 0);
>>
>>         throttle_vm_writeout(sc->gfp_mask);
>>         return nr_reclaimed;
>
> shrink_active_list, but yes, that is the idea.

But you used NR_ACTIVE_ANON as the argument when calling shrink_list().
I think that is not your intention.
So, you have to change it to LRU_ACTIVE_ANON or call shrink_active_list() directly.




-- 
Kinds regards,
MinChan Kim

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH -mm 09/15] add some sanity checks to get_scan_ratio
  2008-05-16  6:42       ` MinChan Kim
@ 2008-05-16 16:47         ` Rik van Riel
  0 siblings, 0 replies; 34+ messages in thread
From: Rik van Riel @ 2008-05-16 16:47 UTC (permalink / raw)
  To: MinChan Kim; +Cc: linux-kernel, lee.schermerhorn, akpm, kosaki.motohiro

On Fri, 16 May 2008 15:42:59 +0900
"MinChan Kim" <minchan.kim@gmail.com> wrote:

> But you used NR_ACTIVE_ANON as the argument when calling shrink_list().
> I think that is not your intention.
> So, you have to change it to LRU_ACTIVE_ANON or call shrink_active_list() directly.

I applied your fix to my tree.  Thank you.

(now to finish forward porting the whole thing and resubmitting it...)

-- 
All Rights Reversed

^ permalink raw reply	[flat|nested] 34+ messages in thread

end of thread

Thread overview: 34+ messages
2008-04-28 18:18 [PATCH -mm 00/15] VM pageout scalability improvements (V6) Rik van Riel
2008-04-28 18:18 ` [PATCH -mm 01/15] FYI: vmstats are "off-by-one" Rik van Riel
2008-04-28 18:18 ` [PATCH -mm 02/15] move isolate_lru_page() to vmscan.c Rik van Riel
2008-04-28 18:18 ` [PATCH -mm 03/15] Use an indexed array for LRU variables Rik van Riel
2008-04-28 18:18 ` [PATCH -mm 04/15] use an array for the LRU pagevecs Rik van Riel
2008-04-28 18:18 ` [PATCH -mm 05/15] free swap space on swap-in/activation Rik van Riel
2008-05-12 11:21   ` Daisuke Nishimura
2008-05-12 13:33     ` Rik van Riel
2008-05-13 13:00       ` Daisuke Nishimura
2008-05-13 13:11         ` Rik van Riel
2008-05-13 16:04         ` [PATCH] take pageout refcount into account for remove_exclusive_swap_page Rik van Riel
2008-05-13 17:43           ` Lee Schermerhorn
2008-05-13 18:09             ` Rik van Riel
2008-05-13 19:02               ` Lee Schermerhorn
2008-05-15  2:15                 ` Daisuke Nishimura
2008-05-15 17:55                   ` Lee Schermerhorn
     [not found]     ` <1210600296.7300.23.camel@lts-notebook>
2008-05-13 12:39       ` [PATCH -mm 05/15] free swap space on swap-in/activation Daisuke Nishimura
2008-04-28 18:18 ` [PATCH -mm 06/15] define page_file_cache() function Rik van Riel
2008-04-28 18:18 ` [PATCH -mm 07/15] split LRU lists into anon & file sets Rik van Riel
2008-05-10  7:50   ` MinChan Kim
2008-04-28 18:18 ` [PATCH -mm 08/15] SEQ replacement for anonymous pages Rik van Riel
2008-04-28 18:18 ` [PATCH -mm 09/15] add some sanity checks to get_scan_ratio Rik van Riel
2008-05-15  6:34   ` MinChan Kim
2008-05-15 13:12     ` Rik van Riel
2008-05-16  6:42       ` MinChan Kim
2008-05-16 16:47         ` Rik van Riel
2008-04-28 18:18 ` [PATCH -mm 10/15] add newly swapped in pages to the inactive list Rik van Riel
2008-04-28 18:18 ` [PATCH -mm 11/15] more aggressively use lumpy reclaim Rik van Riel
2008-04-28 18:18 ` [PATCH -mm 12/15] No Reclaim LRU Infrastructure Rik van Riel
2008-04-28 18:18 ` [PATCH -mm 13/15] Non-reclaimable page statistics Rik van Riel
2008-04-28 18:18 ` [PATCH -mm 14/15] ramfs pages are non-reclaimable Rik van Riel
2008-04-28 18:18 ` [PATCH -mm 15/15] SHM_LOCKED pages are nonreclaimable Rik van Riel
2008-04-28 19:55 ` [PATCH -mm 00/15] VM pageout scalability improvements (V6) Rik van Riel
2008-05-08  4:54 ` Rik van Riel
