* [PATCH/RFC 1/14] Reclaim Scalability: Convert anon_vma lock to read/write lock
2007-09-14 20:53 [PATCH/RFC 0/14] Page Reclaim Scalability Lee Schermerhorn
@ 2007-09-14 20:54 ` Lee Schermerhorn
2007-09-17 11:02 ` Mel Gorman
2007-09-14 20:54 ` [PATCH/RFC 2/14] Reclaim Scalability: convert inode i_mmap_lock to reader/writer lock Lee Schermerhorn
` (14 subsequent siblings)
15 siblings, 1 reply; 77+ messages in thread
From: Lee Schermerhorn @ 2007-09-14 20:54 UTC (permalink / raw)
To: linux-mm
Cc: akpm, mel, clameter, riel, balbir, andrea, a.p.zijlstra,
eric.whitney, npiggin
[PATCH/RFC] 01/14 Reclaim Scalability: Convert anon_vma list lock to a read/write lock
Against 2.6.23-rc4-mm1
Make the anon_vma list lock a read/write lock. Heaviest use of this
lock is in the page_referenced()/try_to_unmap() calls from vmscan
[shrink_page_list()]. These functions can use a read lock to allow
some parallelism for different cpus trying to reclaim pages mapped
via the same set of vmas.
This change should not change the footprint of the anon_vma in the
non-debug case.
Note: I have seen systems livelock with all cpus in reclaim, down
in page_referenced_anon() or try_to_unmap_anon() spinning on the
anon_vma lock. I have only seen this with the AIM7 benchmark with
workloads of 10s of thousands of tasks. All of these tasks are
children of a single ancestor, so they all share the same anon_vma
for each vm area in their respective mm's. I'm told that Apache
can fork thousands of children to handle incoming connections, and
I've seen similar livelocks--albeit on the i_mmap_lock [next patch]--
running 1000s of Oracle users on a large ia64 platform.
With this patch [along with Rik van Riel's split LRU patch] we were
able to see the AIM7 workload start swapping, instead of hanging,
for the first time. Same workload DID hang with just Rik's patch,
so this patch is apparently useful.
Signed-off-by: Lee Schermerhorn <lee.schermerhorn@hp.com>
include/linux/rmap.h | 9 ++++++---
mm/migrate.c | 4 ++--
mm/mmap.c | 4 ++--
mm/rmap.c | 20 ++++++++++----------
4 files changed, 20 insertions(+), 17 deletions(-)
Index: Linux/include/linux/rmap.h
===================================================================
--- Linux.orig/include/linux/rmap.h 2007-09-10 10:09:47.000000000 -0400
+++ Linux/include/linux/rmap.h 2007-09-10 11:43:11.000000000 -0400
@@ -25,7 +25,7 @@
* pointing to this anon_vma once its vma list is empty.
*/
struct anon_vma {
- spinlock_t lock; /* Serialize access to vma list */
+ rwlock_t rwlock; /* Serialize access to vma list */
struct list_head head; /* List of private "related" vmas */
};
@@ -43,18 +43,21 @@ static inline void anon_vma_free(struct
kmem_cache_free(anon_vma_cachep, anon_vma);
}
+/*
+ * This needs to be a write lock for __vma_link()
+ */
static inline void anon_vma_lock(struct vm_area_struct *vma)
{
struct anon_vma *anon_vma = vma->anon_vma;
if (anon_vma)
- spin_lock(&anon_vma->lock);
+ write_lock(&anon_vma->rwlock);
}
static inline void anon_vma_unlock(struct vm_area_struct *vma)
{
struct anon_vma *anon_vma = vma->anon_vma;
if (anon_vma)
- spin_unlock(&anon_vma->lock);
+ write_unlock(&anon_vma->rwlock);
}
/*
Index: Linux/mm/rmap.c
===================================================================
--- Linux.orig/mm/rmap.c 2007-09-10 10:09:47.000000000 -0400
+++ Linux/mm/rmap.c 2007-09-10 11:43:11.000000000 -0400
@@ -25,7 +25,7 @@
* mm->mmap_sem
* page->flags PG_locked (lock_page)
* mapping->i_mmap_lock
- * anon_vma->lock
+ * anon_vma->rwlock
* mm->page_table_lock or pte_lock
* zone->lru_lock (in mark_page_accessed, isolate_lru_page)
* swap_lock (in swap_duplicate, swap_info_get)
@@ -68,7 +68,7 @@ int anon_vma_prepare(struct vm_area_stru
if (anon_vma) {
allocated = NULL;
locked = anon_vma;
- spin_lock(&locked->lock);
+ write_lock(&locked->rwlock);
} else {
anon_vma = anon_vma_alloc();
if (unlikely(!anon_vma))
@@ -87,7 +87,7 @@ int anon_vma_prepare(struct vm_area_stru
spin_unlock(&mm->page_table_lock);
if (locked)
- spin_unlock(&locked->lock);
+ write_unlock(&locked->rwlock);
if (unlikely(allocated))
anon_vma_free(allocated);
}
@@ -113,9 +113,9 @@ void anon_vma_link(struct vm_area_struct
struct anon_vma *anon_vma = vma->anon_vma;
if (anon_vma) {
- spin_lock(&anon_vma->lock);
+ write_lock(&anon_vma->rwlock);
list_add_tail(&vma->anon_vma_node, &anon_vma->head);
- spin_unlock(&anon_vma->lock);
+ write_unlock(&anon_vma->rwlock);
}
}
@@ -127,12 +127,12 @@ void anon_vma_unlink(struct vm_area_stru
if (!anon_vma)
return;
- spin_lock(&anon_vma->lock);
+ write_lock(&anon_vma->rwlock);
list_del(&vma->anon_vma_node);
/* We must garbage collect the anon_vma if it's empty */
empty = list_empty(&anon_vma->head);
- spin_unlock(&anon_vma->lock);
+ write_unlock(&anon_vma->rwlock);
if (empty)
anon_vma_free(anon_vma);
@@ -143,7 +143,7 @@ static void anon_vma_ctor(void *data, st
{
struct anon_vma *anon_vma = data;
- spin_lock_init(&anon_vma->lock);
+ rwlock_init(&anon_vma->rwlock);
INIT_LIST_HEAD(&anon_vma->head);
}
@@ -170,7 +170,7 @@ static struct anon_vma *page_lock_anon_v
goto out;
anon_vma = (struct anon_vma *) (anon_mapping - PAGE_MAPPING_ANON);
- spin_lock(&anon_vma->lock);
+ read_lock(&anon_vma->rwlock);
return anon_vma;
out:
rcu_read_unlock();
@@ -179,7 +179,7 @@ out:
static void page_unlock_anon_vma(struct anon_vma *anon_vma)
{
- spin_unlock(&anon_vma->lock);
+ read_unlock(&anon_vma->rwlock);
rcu_read_unlock();
}
Index: Linux/mm/mmap.c
===================================================================
--- Linux.orig/mm/mmap.c 2007-09-10 10:09:47.000000000 -0400
+++ Linux/mm/mmap.c 2007-09-10 11:43:11.000000000 -0400
@@ -570,7 +570,7 @@ again: remove_next = 1 + (end > next->
if (vma->anon_vma)
anon_vma = vma->anon_vma;
if (anon_vma) {
- spin_lock(&anon_vma->lock);
+ write_lock(&anon_vma->rwlock);
/*
* Easily overlooked: when mprotect shifts the boundary,
* make sure the expanding vma has anon_vma set if the
@@ -624,7 +624,7 @@ again: remove_next = 1 + (end > next->
}
if (anon_vma)
- spin_unlock(&anon_vma->lock);
+ write_unlock(&anon_vma->rwlock);
if (mapping)
spin_unlock(&mapping->i_mmap_lock);
Index: Linux/mm/migrate.c
===================================================================
--- Linux.orig/mm/migrate.c 2007-09-10 10:09:47.000000000 -0400
+++ Linux/mm/migrate.c 2007-09-10 11:43:11.000000000 -0400
@@ -234,12 +234,12 @@ static void remove_anon_migration_ptes(s
* We hold the mmap_sem lock. So no need to call page_lock_anon_vma.
*/
anon_vma = (struct anon_vma *) (mapping - PAGE_MAPPING_ANON);
- spin_lock(&anon_vma->lock);
+ read_lock(&anon_vma->rwlock);
list_for_each_entry(vma, &anon_vma->head, anon_vma_node)
remove_migration_pte(vma, old, new);
- spin_unlock(&anon_vma->lock);
+ read_unlock(&anon_vma->rwlock);
}
/*
* Re: [PATCH/RFC 1/14] Reclaim Scalability: Convert anon_vma lock to read/write lock
2007-09-14 20:54 ` [PATCH/RFC 1/14] Reclaim Scalability: Convert anon_vma lock to read/write lock Lee Schermerhorn
@ 2007-09-17 11:02 ` Mel Gorman
2007-09-18 2:41 ` KAMEZAWA Hiroyuki
2007-09-18 20:17 ` Lee Schermerhorn
0 siblings, 2 replies; 77+ messages in thread
From: Mel Gorman @ 2007-09-17 11:02 UTC (permalink / raw)
To: Lee Schermerhorn
Cc: linux-mm, akpm, clameter, riel, balbir, andrea, a.p.zijlstra,
eric.whitney, npiggin
On (14/09/07 16:54), Lee Schermerhorn didst pronounce:
> [PATCH/RFC] 01/14 Reclaim Scalability: Convert anon_vma list lock a read/write lock
>
> Against 2.6.23-rc4-mm1
>
> Make the anon_vma list lock a read/write lock. Heaviest use of this
> lock is in the page_referenced()/try_to_unmap() calls from vmscan
> [shrink_page_list()]. These functions can use a read lock to allow
> some parallelism for different cpus trying to reclaim pages mapped
> via the same set of vmas.
>
> This change should not change the footprint of the anon_vma in the
> non-debug case.
>
> Note: I have seen systems livelock with all cpus in reclaim, down
> in page_referenced_anon() or try_to_unmap_anon() spinning on the
> anon_vma lock. I have only seen this with the AIM7 benchmark with
> workloads of 10s of thousands of tasks. All of these tasks are
> children of a single ancestor, so they all share the same anon_vma
> for each vm area in their respective mm's. I'm told that Apache
> can fork thousands of children to handle incoming connections, and
> I've seen similar livelocks--albeit on the i_mmap_lock [next patch]
> running 1000s of Oracle users on a large ia64 platform.
>
> With this patch [along with Rik van Riel's split LRU patch] we were
> able to see the AIM7 workload start swapping, instead of hanging,
> for the first time. Same workload DID hang with just Rik's patch,
> so this patch is apparently useful.
>
In light of what Peter and Linus said about rw-locks being more expensive
than spinlocks, we'll need to measure this with some benchmark. The plus
side is that this patch can be handled in isolation because it's either a
scalability fix or it isn't. It's worth investigating because you say it
fixed a real problem where under load the job was able to complete with
this patch and live-locked without it.
kernbench is unlikely to show up anything useful here although it might be
worth running anyway just in case. brk_test from aim9 might be useful as it's
a micro-benchmark that uses brk(), which is a path affected by this patch. As
aim7 exercises this path, it would be interesting to see whether it shows
performance differences in the normal, non-stressed case. Other suggestions?
For this one, it'll be important to test on single-CPU systems with an
SMP kernel as well, since this feels like one of those fixes that helps large
machines but punishes small ones.
When you decide on a test-case, I can test just this patch and see what
results I find.
Otherwise the patch looks ok.
> Signed-off-by: Lee Schermerhorn <lee.schermerhorn@hp.com>
>
> include/linux/rmap.h | 9 ++++++---
> mm/migrate.c | 4 ++--
> mm/mmap.c | 4 ++--
> mm/rmap.c | 20 ++++++++++----------
> 4 files changed, 20 insertions(+), 17 deletions(-)
>
> Index: Linux/include/linux/rmap.h
> ===================================================================
> --- Linux.orig/include/linux/rmap.h 2007-09-10 10:09:47.000000000 -0400
> +++ Linux/include/linux/rmap.h 2007-09-10 11:43:11.000000000 -0400
> @@ -25,7 +25,7 @@
> * pointing to this anon_vma once its vma list is empty.
> */
> struct anon_vma {
> - spinlock_t lock; /* Serialize access to vma list */
> + rwlock_t rwlock; /* Serialize access to vma list */
> struct list_head head; /* List of private "related" vmas */
> };
>
> @@ -43,18 +43,21 @@ static inline void anon_vma_free(struct
> kmem_cache_free(anon_vma_cachep, anon_vma);
> }
>
> +/*
> + * This needs to be a write lock for __vma_link()
> + */
> static inline void anon_vma_lock(struct vm_area_struct *vma)
> {
> struct anon_vma *anon_vma = vma->anon_vma;
> if (anon_vma)
> - spin_lock(&anon_vma->lock);
> + write_lock(&anon_vma->rwlock);
> }
>
> static inline void anon_vma_unlock(struct vm_area_struct *vma)
> {
> struct anon_vma *anon_vma = vma->anon_vma;
> if (anon_vma)
> - spin_unlock(&anon_vma->lock);
> + write_unlock(&anon_vma->rwlock);
> }
>
> /*
> Index: Linux/mm/rmap.c
> ===================================================================
> --- Linux.orig/mm/rmap.c 2007-09-10 10:09:47.000000000 -0400
> +++ Linux/mm/rmap.c 2007-09-10 11:43:11.000000000 -0400
> @@ -25,7 +25,7 @@
> * mm->mmap_sem
> * page->flags PG_locked (lock_page)
> * mapping->i_mmap_lock
> - * anon_vma->lock
> + * anon_vma->rwlock
> * mm->page_table_lock or pte_lock
> * zone->lru_lock (in mark_page_accessed, isolate_lru_page)
> * swap_lock (in swap_duplicate, swap_info_get)
> @@ -68,7 +68,7 @@ int anon_vma_prepare(struct vm_area_stru
> if (anon_vma) {
> allocated = NULL;
> locked = anon_vma;
> - spin_lock(&locked->lock);
> + write_lock(&locked->rwlock);
> } else {
> anon_vma = anon_vma_alloc();
> if (unlikely(!anon_vma))
> @@ -87,7 +87,7 @@ int anon_vma_prepare(struct vm_area_stru
> spin_unlock(&mm->page_table_lock);
>
> if (locked)
> - spin_unlock(&locked->lock);
> + write_unlock(&locked->rwlock);
> if (unlikely(allocated))
> anon_vma_free(allocated);
> }
> @@ -113,9 +113,9 @@ void anon_vma_link(struct vm_area_struct
> struct anon_vma *anon_vma = vma->anon_vma;
>
> if (anon_vma) {
> - spin_lock(&anon_vma->lock);
> + write_lock(&anon_vma->rwlock);
> list_add_tail(&vma->anon_vma_node, &anon_vma->head);
> - spin_unlock(&anon_vma->lock);
> + write_unlock(&anon_vma->rwlock);
> }
> }
>
> @@ -127,12 +127,12 @@ void anon_vma_unlink(struct vm_area_stru
> if (!anon_vma)
> return;
>
> - spin_lock(&anon_vma->lock);
> + write_lock(&anon_vma->rwlock);
> list_del(&vma->anon_vma_node);
>
> /* We must garbage collect the anon_vma if it's empty */
> empty = list_empty(&anon_vma->head);
> - spin_unlock(&anon_vma->lock);
> + write_unlock(&anon_vma->rwlock);
>
> if (empty)
> anon_vma_free(anon_vma);
> @@ -143,7 +143,7 @@ static void anon_vma_ctor(void *data, st
> {
> struct anon_vma *anon_vma = data;
>
> - spin_lock_init(&anon_vma->lock);
> + rwlock_init(&anon_vma->rwlock);
> INIT_LIST_HEAD(&anon_vma->head);
> }
>
> @@ -170,7 +170,7 @@ static struct anon_vma *page_lock_anon_v
> goto out;
>
> anon_vma = (struct anon_vma *) (anon_mapping - PAGE_MAPPING_ANON);
> - spin_lock(&anon_vma->lock);
> + read_lock(&anon_vma->rwlock);
> return anon_vma;
> out:
> rcu_read_unlock();
> @@ -179,7 +179,7 @@ out:
>
> static void page_unlock_anon_vma(struct anon_vma *anon_vma)
> {
> - spin_unlock(&anon_vma->lock);
> + read_unlock(&anon_vma->rwlock);
> rcu_read_unlock();
> }
>
> Index: Linux/mm/mmap.c
> ===================================================================
> --- Linux.orig/mm/mmap.c 2007-09-10 10:09:47.000000000 -0400
> +++ Linux/mm/mmap.c 2007-09-10 11:43:11.000000000 -0400
> @@ -570,7 +570,7 @@ again: remove_next = 1 + (end > next->
> if (vma->anon_vma)
> anon_vma = vma->anon_vma;
> if (anon_vma) {
> - spin_lock(&anon_vma->lock);
> + write_lock(&anon_vma->rwlock);
> /*
> * Easily overlooked: when mprotect shifts the boundary,
> * make sure the expanding vma has anon_vma set if the
> @@ -624,7 +624,7 @@ again: remove_next = 1 + (end > next->
> }
>
> if (anon_vma)
> - spin_unlock(&anon_vma->lock);
> + write_unlock(&anon_vma->rwlock);
> if (mapping)
> spin_unlock(&mapping->i_mmap_lock);
>
> Index: Linux/mm/migrate.c
> ===================================================================
> --- Linux.orig/mm/migrate.c 2007-09-10 10:09:47.000000000 -0400
> +++ Linux/mm/migrate.c 2007-09-10 11:43:11.000000000 -0400
> @@ -234,12 +234,12 @@ static void remove_anon_migration_ptes(s
> * We hold the mmap_sem lock. So no need to call page_lock_anon_vma.
> */
> anon_vma = (struct anon_vma *) (mapping - PAGE_MAPPING_ANON);
> - spin_lock(&anon_vma->lock);
> + read_lock(&anon_vma->rwlock);
>
> list_for_each_entry(vma, &anon_vma->head, anon_vma_node)
> remove_migration_pte(vma, old, new);
>
> - spin_unlock(&anon_vma->lock);
> + read_unlock(&anon_vma->rwlock);
> }
>
> /*
>
--
Mel Gorman
Part-time Phd Student Linux Technology Center
University of Limerick IBM Dublin Software Lab
* Re: [PATCH/RFC 1/14] Reclaim Scalability: Convert anon_vma lock to read/write lock
2007-09-17 11:02 ` Mel Gorman
@ 2007-09-18 2:41 ` KAMEZAWA Hiroyuki
2007-09-18 11:01 ` Mel Gorman
2007-09-18 15:37 ` Lee Schermerhorn
2007-09-18 20:17 ` Lee Schermerhorn
1 sibling, 2 replies; 77+ messages in thread
From: KAMEZAWA Hiroyuki @ 2007-09-18 2:41 UTC (permalink / raw)
To: Mel Gorman
Cc: Lee Schermerhorn, linux-mm, akpm, clameter, riel, balbir, andrea,
a.p.zijlstra, eric.whitney, npiggin
On Mon, 17 Sep 2007 12:02:35 +0100
mel@skynet.ie (Mel Gorman) wrote:
> On (14/09/07 16:54), Lee Schermerhorn didst pronounce:
> > [PATCH/RFC] 01/14 Reclaim Scalability: Convert anon_vma list lock a read/write lock
> >
> > Against 2.6.23-rc4-mm1
> >
> > Make the anon_vma list lock a read/write lock. Heaviest use of this
> > lock is in the page_referenced()/try_to_unmap() calls from vmscan
> > [shrink_page_list()]. These functions can use a read lock to allow
> > some parallelism for different cpus trying to reclaim pages mapped
> > via the same set of vmas.
<snip>
> In light of what Peter and Linus said about rw-locks being more expensive
> than spinlocks, we'll need to measure this with some benchmark. The plus
> side is that this patch can be handled in isolation because it's either a
> scalability fix or it isn't. It's worth investigating because you say it
> fixed a real problem where under load the job was able to complete with
> this patch and live-locked without it.
>
> When you decide on a test-case, I can test just this patch and see what
> results I find.
>
One of the cases I can imagine is:
==
1. Use NUMA.
2. create *large* anon_vma and use it with MPOL_INTERLEAVE
3. When memory is exhausted (on several nodes), all kswapd on nodes will
see one anon_vma->lock.
==
Maybe the worst case.
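Something like the sketch below would be one way to set that up from
userspace; the sizes, node mask and child count are only illustrative
guesses, not a tested reproducer:

/*
 * Rough reproducer sketch for the scenario above -- illustrative only.
 * One large interleaved anonymous region is inherited by many forked
 * children, so every child's vma links into the same anon_vma; once
 * the COW copies exhaust memory, kswapd on each node ends up walking
 * the same anon_vma list under the same lock in
 * page_referenced_anon()/try_to_unmap_anon().
 */
#include <numaif.h>		/* mbind(), MPOL_INTERLEAVE; link with -lnuma */
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

#define SZ	(4UL << 30)	/* pick size * children to exceed RAM */
#define NCHILD	128

int main(void)
{
	unsigned long nodemask = 0xf;	/* nodes 0-3; adjust for the box */
	char *p;
	int i;

	p = mmap(NULL, SZ, PROT_READ | PROT_WRITE,
		 MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
	if (p == MAP_FAILED)
		return 1;
	mbind(p, SZ, MPOL_INTERLEAVE, &nodemask, sizeof(nodemask) * 8, 0);

	for (i = 0; i < NCHILD; i++)
		if (fork() == 0)
			break;

	for (;;)		/* COW-touch the whole region forever */
		memset(p, 0xaa, SZ);
}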
Thanks,
-Kame
* Re: [PATCH/RFC 1/14] Reclaim Scalability: Convert anon_vma lock to read/write lock
2007-09-18 2:41 ` KAMEZAWA Hiroyuki
@ 2007-09-18 11:01 ` Mel Gorman
2007-09-18 14:57 ` Rik van Riel
2007-09-18 15:37 ` Lee Schermerhorn
1 sibling, 1 reply; 77+ messages in thread
From: Mel Gorman @ 2007-09-18 11:01 UTC (permalink / raw)
To: KAMEZAWA Hiroyuki
Cc: Lee Schermerhorn, linux-mm, akpm, clameter, riel, balbir, andrea,
a.p.zijlstra, eric.whitney, npiggin
On (18/09/07 11:41), KAMEZAWA Hiroyuki didst pronounce:
> On Mon, 17 Sep 2007 12:02:35 +0100
> mel@skynet.ie (Mel Gorman) wrote:
>
> > On (14/09/07 16:54), Lee Schermerhorn didst pronounce:
> > > [PATCH/RFC] 01/14 Reclaim Scalability: Convert anon_vma list lock a read/write lock
> > >
> > > Against 2.6.23-rc4-mm1
> > >
> > > Make the anon_vma list lock a read/write lock. Heaviest use of this
> > > lock is in the page_referenced()/try_to_unmap() calls from vmscan
> > > [shrink_page_list()]. These functions can use a read lock to allow
> > > some parallelism for different cpus trying to reclaim pages mapped
> > > via the same set of vmas.
> <snip>
> > In light of what Peter and Linus said about rw-locks being more expensive
> > than spinlocks, we'll need to measure this with some benchmark. The plus
> > side is that this patch can be handled in isolation because it's either a
> > scalability fix or it isn't. It's worth investigating because you say it
> > fixed a real problem where under load the job was able to complete with
> > this patch and live-locked without it.
> >
> > When you decide on a test-case, I can test just this patch and see what
> > results I find.
> >
>
> One of the case I can imagine is..
> ==
> 1. Use NUMA.
> 2. create *large* anon_vma and use it with MPOL_INTERLEAVE
> 3. When memory is exhausted (on several nodes), all kswapd on nodes will
> see one anon_vma->lock.
> ==
> Maybe the worst case.
It certainly sounds like a bad case. Would be very difficult to measure
as part of a test though as latencies in kswapd are not very obvious.
--
Mel Gorman
Part-time Phd Student Linux Technology Center
University of Limerick IBM Dublin Software Lab
* Re: [PATCH/RFC 1/14] Reclaim Scalability: Convert anon_vma lock to read/write lock
2007-09-18 11:01 ` Mel Gorman
@ 2007-09-18 14:57 ` Rik van Riel
0 siblings, 0 replies; 77+ messages in thread
From: Rik van Riel @ 2007-09-18 14:57 UTC (permalink / raw)
To: Mel Gorman
Cc: KAMEZAWA Hiroyuki, Lee Schermerhorn, linux-mm, akpm, clameter,
balbir, andrea, a.p.zijlstra, eric.whitney, npiggin
Mel Gorman wrote:
> On (18/09/07 11:41), KAMEZAWA Hiroyuki didst pronounce:
>> On Mon, 17 Sep 2007 12:02:35 +0100
>> mel@skynet.ie (Mel Gorman) wrote:
>>
>>> On (14/09/07 16:54), Lee Schermerhorn didst pronounce:
>>>> [PATCH/RFC] 01/14 Reclaim Scalability: Convert anon_vma list lock a read/write lock
>>>>
>>>> Against 2.6.23-rc4-mm1
>>>>
>>>> Make the anon_vma list lock a read/write lock. Heaviest use of this
>>>> lock is in the page_referenced()/try_to_unmap() calls from vmscan
>>>> [shrink_page_list()]. These functions can use a read lock to allow
>>>> some parallelism for different cpus trying to reclaim pages mapped
>>>> via the same set of vmas.
>> <snip>
>>> In light of what Peter and Linus said about rw-locks being more expensive
>>> than spinlocks, we'll need to measure this with some benchmark. The plus
>>> side is that this patch can be handled in isolation because it's either a
>>> scalability fix or it isn't. It's worth investigating because you say it
>>> fixed a real problem where under load the job was able to complete with
>>> this patch and live-locked without it.
>>>
>>> When you decide on a test-case, I can test just this patch and see what
>>> results I find.
>>>
>> One of the case I can imagine is..
>> ==
>> 1. Use NUMA.
>> 2. create *large* anon_vma and use it with MPOL_INTERLEAVE
>> 3. When memory is exhausted (on several nodes), all kswapd on nodes will
>> see one anon_vma->lock.
>> ==
>> Maybe the worst case.
>
> It certainly sounds like a bad case. Would be very difficult to measure
> as part of a test though as latencies in kswapd are not very obvious.
We have observed this problem in customer workloads.
I believe Larry Woodman has a test program that may be
able to trigger the problem.
--
Politics is the struggle between those who want to make their country
the best in the world, and those who believe it already is. Each group
calls the other unpatriotic.
* Re: [PATCH/RFC 1/14] Reclaim Scalability: Convert anon_vma lock to read/write lock
2007-09-18 2:41 ` KAMEZAWA Hiroyuki
2007-09-18 11:01 ` Mel Gorman
@ 2007-09-18 15:37 ` Lee Schermerhorn
1 sibling, 0 replies; 77+ messages in thread
From: Lee Schermerhorn @ 2007-09-18 15:37 UTC (permalink / raw)
To: KAMEZAWA Hiroyuki
Cc: Mel Gorman, linux-mm, akpm, clameter, riel, balbir, andrea,
a.p.zijlstra, eric.whitney, npiggin
On Tue, 2007-09-18 at 11:41 +0900, KAMEZAWA Hiroyuki wrote:
> On Mon, 17 Sep 2007 12:02:35 +0100
> mel@skynet.ie (Mel Gorman) wrote:
>
> > On (14/09/07 16:54), Lee Schermerhorn didst pronounce:
> > > [PATCH/RFC] 01/14 Reclaim Scalability: Convert anon_vma list lock a read/write lock
> > >
> > > Against 2.6.23-rc4-mm1
> > >
> > > Make the anon_vma list lock a read/write lock. Heaviest use of this
> > > lock is in the page_referenced()/try_to_unmap() calls from vmscan
> > > [shrink_page_list()]. These functions can use a read lock to allow
> > > some parallelism for different cpus trying to reclaim pages mapped
> > > via the same set of vmas.
> <snip>
> > In light of what Peter and Linus said about rw-locks being more expensive
> > than spinlocks, we'll need to measure this with some benchmark. The plus
> > side is that this patch can be handled in isolation because it's either a
> > scalability fix or it isn't. It's worth investigating because you say it
> > fixed a real problem where under load the job was able to complete with
> > this patch and live-locked without it.
> >
> > When you decide on a test-case, I can test just this patch and see what
> > results I find.
> >
>
> One of the case I can imagine is..
> ==
> 1. Use NUMA.
> 2. create *large* anon_vma and use it with MPOL_INTERLEAVE
> 3. When memory is exhausted (on several nodes), all kswapd on nodes will
> see one anon_vma->lock.
> ==
> Maybe the worst case.
Actually, if you only have one mm/vma mapping the area, it won't be that
bad. You'll still have contention on the spinlock, but with only one
vma mapping it, page_referenced_anon() and try_to_unmap_anon() will be
relatively fast. The problem we've seen is when you have lots [10s of
thousands] of vmas referencing the same anon_vma. This occurs when the
tasks are all descendants of a single original parent w/o exec()ing.
I've only seen this with the AIM7 benchmark. This is the workload that
was able to make progress with this patch, but the system hung
indefinitely without it. But AIM7 is not necessarily something we want
to optimize for. However, I've been told that Apache can exhibit
similar behavior with thousands of incoming connections. Anyone know
if this is true?
I HAVE seen similar behavior in a high-user count Oracle OLTP workload,
but on the file rmap--the i_mmap_lock--in page_referenced_file(), etc.
That's probably worth fixing.
I'm in the process of running a series of parallel kernel builds on
kernels without the rmap rw_lock patches, and then with each one
individually. This load doesn't exhibit the problem these patches are
intended to address, but kernel builds do fork a lot of children that
should result in a lot of vma linking/unlinking [but maybe not if using
vfork()]. Similar for the i_mmap_lock. This lock is also used to
protect the truncate_count, so it must be taken for write in this
context--mostly in unmap_mapping_range*().
I'll post the results as soon as I have them for both ia64 and x86_64.
Later today.
Lee
* Re: [PATCH/RFC 1/14] Reclaim Scalability: Convert anon_vma lock to read/write lock
2007-09-17 11:02 ` Mel Gorman
2007-09-18 2:41 ` KAMEZAWA Hiroyuki
@ 2007-09-18 20:17 ` Lee Schermerhorn
2007-09-20 10:19 ` Mel Gorman
1 sibling, 1 reply; 77+ messages in thread
From: Lee Schermerhorn @ 2007-09-18 20:17 UTC (permalink / raw)
To: Mel Gorman
Cc: linux-mm, akpm, clameter, riel, balbir, andrea, a.p.zijlstra,
eric.whitney, npiggin
On Mon, 2007-09-17 at 12:02 +0100, Mel Gorman wrote:
> On (14/09/07 16:54), Lee Schermerhorn didst pronounce:
> > [PATCH/RFC] 01/14 Reclaim Scalability: Convert anon_vma list lock a read/write lock
> >
> > Against 2.6.23-rc4-mm1
> >
> > Make the anon_vma list lock a read/write lock. Heaviest use of this
> > lock is in the page_referenced()/try_to_unmap() calls from vmscan
> > [shrink_page_list()]. These functions can use a read lock to allow
> > some parallelism for different cpus trying to reclaim pages mapped
> > via the same set of vmas.
> >
> > This change should not change the footprint of the anon_vma in the
> > non-debug case.
> >
> > Note: I have seen systems livelock with all cpus in reclaim, down
> > in page_referenced_anon() or try_to_unmap_anon() spinning on the
> > anon_vma lock. I have only seen this with the AIM7 benchmark with
> > workloads of 10s of thousands of tasks. All of these tasks are
> > children of a single ancestor, so they all share the same anon_vma
> > for each vm area in their respective mm's. I'm told that Apache
> > can fork thousands of children to handle incoming connections, and
> > I've seen similar livelocks--albeit on the i_mmap_lock [next patch]
> > running 1000s of Oracle users on a large ia64 platform.
> >
> > With this patch [along with Rik van Riel's split LRU patch] we were
> > able to see the AIM7 workload start swapping, instead of hanging,
> > for the first time. Same workload DID hang with just Rik's patch,
> > so this patch is apparently useful.
> >
>
> In light of what Peter and Linus said about rw-locks being more expensive
> than spinlocks, we'll need to measure this with some benchmark. The plus
> side is that this patch can be handled in isolation because it's either a
> scalability fix or it isn't. It's worth investigating because you say it
> fixed a real problem where under load the job was able to complete with
> this patch and live-locked without it.
>
> kernbench is unlikely to show up anything useful here although it might be
> worth running anyway just in case. brk_test from aim9 might be useful as it's
> a micro-benchmark that uses brk() which is a path affected by this patch. As
> aim7 is exercising this path, it would be interesting to see does it show
> performance differences in the normal non-stressed case. Other suggestions?
As Mel predicted, kernel builds don't seem to be affected by this patch,
nor the i_mmap_lock rw_lock patch. Below I've included results for an
old ia64 system that I have pretty much exclusive access to. I can't
get 23-rc4-mm1 nor rc6-mm1 to boot on an x86_64 [AMD-based] right
now--still trying to capture stack trace [not easy from a remote
console :-(].
I don't have access to the large server with storage for testing Oracle
and AIM right now. When I get it back, I will try both of these patches
both for any added overhead and to verify that they alleviate the
problem they're trying to solve. [I do have evidence that the anon_vma
rw lock improves the situation with AIM7 on a ~21-rcx kernel earlier
this year].
These times are the average [+ std dev'n] of 10 consecutive runs after
reboot of a '-j32' build of ia64 defconfig.
23-rc4-mm1 - no rmap rw_lock -- i.e., spinlocks
                  Real      User    System
   avg          101.94   1205.10     92.85
   std dev        0.56      1.04      0.73

23-rc4-mm1 w/ anon_vma rw_lock
                  Real      User    System
   avg          101.64   1205.36     91.83
   std dev        0.65      0.59      0.67

23-rc4-mm1 w/ i_mmap_lock rw_lock
                  Real      User    System
   avg          101.70   1204.57     92.20
   std dev        0.51      0.73      0.39
This is a NUMA system, so the differences are more likely the result of
differences in locality--the roll of the dice--than of the lock types.
More data later, when I get it...
Lee
* Re: [PATCH/RFC 1/14] Reclaim Scalability: Convert anon_vma lock to read/write lock
2007-09-18 20:17 ` Lee Schermerhorn
@ 2007-09-20 10:19 ` Mel Gorman
0 siblings, 0 replies; 77+ messages in thread
From: Mel Gorman @ 2007-09-20 10:19 UTC (permalink / raw)
To: Lee Schermerhorn
Cc: linux-mm, akpm, clameter, riel, balbir, andrea, a.p.zijlstra,
eric.whitney, npiggin
On (18/09/07 16:17), Lee Schermerhorn didst pronounce:
> On Mon, 2007-09-17 at 12:02 +0100, Mel Gorman wrote:
> > On (14/09/07 16:54), Lee Schermerhorn didst pronounce:
> > > [PATCH/RFC] 01/14 Reclaim Scalability: Convert anon_vma list lock a read/write lock
> > >
> > > Against 2.6.23-rc4-mm1
> > >
> > > Make the anon_vma list lock a read/write lock. Heaviest use of this
> > > lock is in the page_referenced()/try_to_unmap() calls from vmscan
> > > [shrink_page_list()]. These functions can use a read lock to allow
> > > some parallelism for different cpus trying to reclaim pages mapped
> > > via the same set of vmas.
> > >
> > > This change should not change the footprint of the anon_vma in the
> > > non-debug case.
> > >
> > > Note: I have seen systems livelock with all cpus in reclaim, down
> > > in page_referenced_anon() or try_to_unmap_anon() spinning on the
> > > anon_vma lock. I have only seen this with the AIM7 benchmark with
> > > workloads of 10s of thousands of tasks. All of these tasks are
> > > children of a single ancestor, so they all share the same anon_vma
> > > for each vm area in their respective mm's. I'm told that Apache
> > > can fork thousands of children to handle incoming connections, and
> > > I've seen similar livelocks--albeit on the i_mmap_lock [next patch]
> > > running 1000s of Oracle users on a large ia64 platform.
> > >
> > > With this patch [along with Rik van Riel's split LRU patch] we were
> > > able to see the AIM7 workload start swapping, instead of hanging,
> > > for the first time. Same workload DID hang with just Rik's patch,
> > > so this patch is apparently useful.
> > >
> >
> > In light of what Peter and Linus said about rw-locks being more expensive
> > than spinlocks, we'll need to measure this with some benchmark. The plus
> > side is that this patch can be handled in isolation because it's either a
> > scalability fix or it isn't. It's worth investigating because you say it
> > fixed a real problem where under load the job was able to complete with
> > this patch and live-locked without it.
> >
> > kernbench is unlikely to show up anything useful here although it might be
> > worth running anyway just in case. brk_test from aim9 might be useful as it's
> > a micro-benchmark that uses brk() which is a path affected by this patch. As
> > aim7 is exercising this path, it would be interesting to see does it show
> > performance differences in the normal non-stressed case. Other suggestions?
>
> As Mel predicted, kernel builds don't seem to be affected by this patch,
> nor the i_mmap_lock rw_lock patch. Below I've included results for an
> old ia64 system that I have pretty much exclusive access to. I can't
> get 23-rc4-mm1 nor rc6-mm1 to boot on an x86_64 [AMD-based] right
> now--still trying to capture stack trace [not easy from a remote
> console :-(].
>
On x86_64, I got -0.34% and -0.03% regressions on two different machines with
kernbench. However, that is pretty close to noise. On a range of machines,
NUMA and non-NUMA, with 2.6.23-rc6-mm1 I saw Total CPU figures ranging from
-1.23% to 1.02% and System CPU figures from -1.09% to 6.54%. DBench figures
ranged from -2.54% to 4.94%. The DBench figures tend to vary by about this
much anyway, so that's at least a basic smoke test passed.
hackbench (tested just in case) didn't show up anything unusual. I
didn't do scalability testing with multiple processes like aim7 yet but
so far we're looking ok.
--
Mel Gorman
Part-time Phd Student Linux Technology Center
University of Limerick IBM Dublin Software Lab
* [PATCH/RFC 2/14] Reclaim Scalability: convert inode i_mmap_lock to reader/writer lock
2007-09-14 20:53 [PATCH/RFC 0/14] Page Reclaim Scalability Lee Schermerhorn
2007-09-14 20:54 ` [PATCH/RFC 1/14] Reclaim Scalability: Convert anon_vma lock to read/write lock Lee Schermerhorn
@ 2007-09-14 20:54 ` Lee Schermerhorn
2007-09-17 12:53 ` Mel Gorman
2007-09-20 1:24 ` Andrea Arcangeli
2007-09-14 20:54 ` [PATCH/RFC 3/14] Reclaim Scalability: move isolate_lru_page() to vmscan.c Lee Schermerhorn
` (13 subsequent siblings)
15 siblings, 2 replies; 77+ messages in thread
From: Lee Schermerhorn @ 2007-09-14 20:54 UTC (permalink / raw)
To: linux-mm
Cc: akpm, mel, clameter, riel, balbir, andrea, a.p.zijlstra,
eric.whitney, npiggin
PATCH/RFC 02/14 Reclaim Scalability: make the inode i_mmap_lock a reader/writer lock
Against: 2.6.23-rc4-mm1
I have seen soft cpu lockups in page_referenced_file() due to
contention on i_mmap_lock() for different pages. Making the
i_mmap_lock a reader/writer lock should increase parallelism
in vmscan for file-backed pages mapped into many address spaces.
Read lock the i_mmap_lock for all usage except:
1) mmap/munmap: linking vma into i_mmap prio_tree or removing
2) unmap_mapping_range: protecting vm_truncate_count
rmap: try_to_unmap_file() required a new cond_resched_rwlock().
To reduce code duplication, I recast cond_resched_lock() as a
[static inline] wrapper around the reworked core, now called
__cond_resched_lock(void *lock, int type).
The new cond_resched_rwlock() is implemented as another wrapper.
Note: This patch is meant to address a situation I've seen
running a large Oracle OLTP workload--1000s of users--on a
large HP ia64 NUMA platform. The system hung, spitting out
"soft lockup" messages on the console. Stack traces showed
that all cpus were in page_referenced(), as mentioned above.
I let the system run overnight in this state--it never
recovered before I decided to reboot.
TODO: I've yet to test this patch with the same workload
to see what happens. Don't have access to the system now.
Signed-off-by: Lee Schermerhorn <lee.schermerhorn@hp.com>
fs/hugetlbfs/inode.c | 7 +++--
fs/inode.c | 2 -
fs/revoke.c | 18 +++++++-------
include/linux/fs.h | 2 -
include/linux/mm.h | 2 -
include/linux/sched.h | 17 ++++++++++---
kernel/fork.c | 4 +--
kernel/sched.c | 64 ++++++++++++++++++++++++++++++++++++++++++--------
mm/filemap_xip.c | 4 +--
mm/fremap.c | 4 +--
mm/hugetlb.c | 8 +++---
mm/memory.c | 13 +++++-----
mm/migrate.c | 4 +--
mm/mmap.c | 18 +++++++-------
mm/mremap.c | 4 +--
mm/rmap.c | 16 ++++++------
16 files changed, 123 insertions(+), 64 deletions(-)
Index: Linux/include/linux/fs.h
===================================================================
--- Linux.orig/include/linux/fs.h 2007-09-10 10:09:47.000000000 -0400
+++ Linux/include/linux/fs.h 2007-09-10 11:43:26.000000000 -0400
@@ -506,7 +506,7 @@ struct address_space {
unsigned int i_mmap_writable;/* count VM_SHARED mappings */
struct prio_tree_root i_mmap; /* tree of private and shared mappings */
struct list_head i_mmap_nonlinear;/*list VM_NONLINEAR mappings */
- spinlock_t i_mmap_lock; /* protect tree, count, list */
+ rwlock_t i_mmap_lock; /* protect tree, count, list */
unsigned int truncate_count; /* Cover race condition with truncate */
unsigned long nrpages; /* number of total pages */
pgoff_t writeback_index;/* writeback starts here */
Index: Linux/include/linux/mm.h
===================================================================
--- Linux.orig/include/linux/mm.h 2007-09-10 10:09:47.000000000 -0400
+++ Linux/include/linux/mm.h 2007-09-10 11:43:26.000000000 -0400
@@ -684,7 +684,7 @@ struct zap_details {
struct address_space *check_mapping; /* Check page->mapping if set */
pgoff_t first_index; /* Lowest page->index to unmap */
pgoff_t last_index; /* Highest page->index to unmap */
- spinlock_t *i_mmap_lock; /* For unmap_mapping_range: */
+ rwlock_t *i_mmap_lock; /* For unmap_mapping_range: */
unsigned long truncate_count; /* Compare vm_truncate_count */
};
Index: Linux/fs/inode.c
===================================================================
--- Linux.orig/fs/inode.c 2007-09-10 10:09:43.000000000 -0400
+++ Linux/fs/inode.c 2007-09-10 11:43:26.000000000 -0400
@@ -203,7 +203,7 @@ void inode_init_once(struct inode *inode
init_rwsem(&inode->i_alloc_sem);
INIT_RADIX_TREE(&inode->i_data.page_tree, GFP_ATOMIC);
rwlock_init(&inode->i_data.tree_lock);
- spin_lock_init(&inode->i_data.i_mmap_lock);
+ rwlock_init(&inode->i_data.i_mmap_lock);
INIT_LIST_HEAD(&inode->i_data.private_list);
spin_lock_init(&inode->i_data.private_lock);
INIT_RAW_PRIO_TREE_ROOT(&inode->i_data.i_mmap);
Index: Linux/fs/hugetlbfs/inode.c
===================================================================
--- Linux.orig/fs/hugetlbfs/inode.c 2007-09-10 10:09:43.000000000 -0400
+++ Linux/fs/hugetlbfs/inode.c 2007-09-10 11:43:26.000000000 -0400
@@ -411,6 +411,9 @@ static void hugetlbfs_drop_inode(struct
hugetlbfs_forget_inode(inode);
}
+/*
+ * LOCKING: __unmap_hugepage_range() requires write lock on i_mmap_lock
+ */
static inline void
hugetlb_vmtruncate_list(struct prio_tree_root *root, pgoff_t pgoff)
{
@@ -445,10 +448,10 @@ static int hugetlb_vmtruncate(struct ino
pgoff = offset >> PAGE_SHIFT;
i_size_write(inode, offset);
- spin_lock(&mapping->i_mmap_lock);
+ write_lock(&mapping->i_mmap_lock);
if (!prio_tree_empty(&mapping->i_mmap))
hugetlb_vmtruncate_list(&mapping->i_mmap, pgoff);
- spin_unlock(&mapping->i_mmap_lock);
+ write_unlock(&mapping->i_mmap_lock);
truncate_hugepages(inode, offset);
return 0;
}
Index: Linux/fs/revoke.c
===================================================================
--- Linux.orig/fs/revoke.c 2007-09-10 10:09:44.000000000 -0400
+++ Linux/fs/revoke.c 2007-09-10 11:43:26.000000000 -0400
@@ -272,7 +272,7 @@ static int revoke_break_cow(struct files
/*
* LOCKING: down_write(&mm->mmap_sem)
- * -> spin_lock(&mapping->i_mmap_lock)
+ * -> write_lock(&mapping->i_mmap_lock)
*/
static int revoke_vma(struct vm_area_struct *vma, struct zap_details *details)
{
@@ -298,14 +298,14 @@ static int revoke_vma(struct vm_area_str
return 0;
out_need_break:
- spin_unlock(details->i_mmap_lock);
+ write_unlock(details->i_mmap_lock);
cond_resched();
- spin_lock(details->i_mmap_lock);
+ write_lock(details->i_mmap_lock);
return -EINTR;
}
/*
- * LOCKING: spin_lock(&mapping->i_mmap_lock)
+ * LOCKING: write_lock(&mapping->i_mmap_lock)
*/
static int revoke_mm(struct mm_struct *mm, struct address_space *mapping,
struct file *to_exclude)
@@ -335,7 +335,7 @@ static int revoke_mm(struct mm_struct *m
if (err)
break;
- __unlink_file_vma(vma);
+ __unlink_file_vma(vma); /* requires write_lock(i_mmap_lock) */
fput(vma->vm_file);
vma->vm_file = NULL;
}
@@ -345,7 +345,7 @@ static int revoke_mm(struct mm_struct *m
}
/*
- * LOCKING: spin_lock(&mapping->i_mmap_lock)
+ * LOCKING: write_lock(&mapping->i_mmap_lock)
*/
static void revoke_mapping_tree(struct address_space *mapping,
struct file *to_exclude)
@@ -377,7 +377,7 @@ static void revoke_mapping_tree(struct a
}
/*
- * LOCKING: spin_lock(&mapping->i_mmap_lock)
+ * LOCKING: write_lock(&mapping->i_mmap_lock)
*/
static void revoke_mapping_list(struct address_space *mapping,
struct file *to_exclude)
@@ -408,12 +408,12 @@ static void revoke_mapping_list(struct a
static void revoke_mapping(struct address_space *mapping, struct file *to_exclude)
{
- spin_lock(&mapping->i_mmap_lock);
+ write_lock(&mapping->i_mmap_lock);
if (unlikely(!prio_tree_empty(&mapping->i_mmap)))
revoke_mapping_tree(mapping, to_exclude);
if (unlikely(!list_empty(&mapping->i_mmap_nonlinear)))
revoke_mapping_list(mapping, to_exclude);
- spin_unlock(&mapping->i_mmap_lock);
+ write_unlock(&mapping->i_mmap_lock);
}
static void restore_file(struct revokefs_inode_info *info)
Index: Linux/kernel/fork.c
===================================================================
--- Linux.orig/kernel/fork.c 2007-09-10 10:09:47.000000000 -0400
+++ Linux/kernel/fork.c 2007-09-10 11:43:26.000000000 -0400
@@ -262,12 +262,12 @@ static inline int dup_mmap(struct mm_str
atomic_dec(&inode->i_writecount);
/* insert tmp into the share list, just after mpnt */
- spin_lock(&file->f_mapping->i_mmap_lock);
+ write_lock(&file->f_mapping->i_mmap_lock);
tmp->vm_truncate_count = mpnt->vm_truncate_count;
flush_dcache_mmap_lock(file->f_mapping);
vma_prio_tree_add(tmp, mpnt);
flush_dcache_mmap_unlock(file->f_mapping);
- spin_unlock(&file->f_mapping->i_mmap_lock);
+ write_unlock(&file->f_mapping->i_mmap_lock);
}
/*
Index: Linux/mm/filemap_xip.c
===================================================================
--- Linux.orig/mm/filemap_xip.c 2007-09-10 10:09:47.000000000 -0400
+++ Linux/mm/filemap_xip.c 2007-09-10 11:43:26.000000000 -0400
@@ -182,7 +182,7 @@ __xip_unmap (struct address_space * mapp
if (!page)
return;
- spin_lock(&mapping->i_mmap_lock);
+ read_lock(&mapping->i_mmap_lock);
vma_prio_tree_foreach(vma, &iter, &mapping->i_mmap, pgoff, pgoff) {
mm = vma->vm_mm;
address = vma->vm_start +
@@ -200,7 +200,7 @@ __xip_unmap (struct address_space * mapp
page_cache_release(page);
}
}
- spin_unlock(&mapping->i_mmap_lock);
+ read_unlock(&mapping->i_mmap_lock);
}
/*
Index: Linux/mm/fremap.c
===================================================================
--- Linux.orig/mm/fremap.c 2007-09-10 10:09:47.000000000 -0400
+++ Linux/mm/fremap.c 2007-09-10 11:43:26.000000000 -0400
@@ -200,13 +200,13 @@ asmlinkage long sys_remap_file_pages(uns
}
goto out;
}
- spin_lock(&mapping->i_mmap_lock);
+ write_lock(&mapping->i_mmap_lock);
flush_dcache_mmap_lock(mapping);
vma->vm_flags |= VM_NONLINEAR;
vma_prio_tree_remove(vma, &mapping->i_mmap);
vma_nonlinear_insert(vma, &mapping->i_mmap_nonlinear);
flush_dcache_mmap_unlock(mapping);
- spin_unlock(&mapping->i_mmap_lock);
+ write_unlock(&mapping->i_mmap_lock);
}
err = populate_range(mm, vma, start, size, pgoff);
Index: Linux/mm/hugetlb.c
===================================================================
--- Linux.orig/mm/hugetlb.c 2007-09-10 10:09:47.000000000 -0400
+++ Linux/mm/hugetlb.c 2007-09-10 11:43:26.000000000 -0400
@@ -451,9 +451,9 @@ void unmap_hugepage_range(struct vm_area
* do nothing in this case.
*/
if (vma->vm_file) {
- spin_lock(&vma->vm_file->f_mapping->i_mmap_lock);
+ write_lock(&vma->vm_file->f_mapping->i_mmap_lock);
__unmap_hugepage_range(vma, start, end);
- spin_unlock(&vma->vm_file->f_mapping->i_mmap_lock);
+ write_unlock(&vma->vm_file->f_mapping->i_mmap_lock);
}
}
@@ -693,7 +693,7 @@ void hugetlb_change_protection(struct vm
BUG_ON(address >= end);
flush_cache_range(vma, address, end);
- spin_lock(&vma->vm_file->f_mapping->i_mmap_lock);
+ read_lock(&vma->vm_file->f_mapping->i_mmap_lock);
spin_lock(&mm->page_table_lock);
for (; address < end; address += HPAGE_SIZE) {
ptep = huge_pte_offset(mm, address);
@@ -708,7 +708,7 @@ void hugetlb_change_protection(struct vm
}
}
spin_unlock(&mm->page_table_lock);
- spin_unlock(&vma->vm_file->f_mapping->i_mmap_lock);
+ read_unlock(&vma->vm_file->f_mapping->i_mmap_lock);
flush_tlb_range(vma, start, end);
}
Index: Linux/mm/memory.c
===================================================================
--- Linux.orig/mm/memory.c 2007-09-10 10:09:47.000000000 -0400
+++ Linux/mm/memory.c 2007-09-10 11:43:26.000000000 -0400
@@ -816,7 +816,7 @@ unsigned long unmap_vmas(struct mmu_gath
unsigned long tlb_start = 0; /* For tlb_finish_mmu */
int tlb_start_valid = 0;
unsigned long start = start_addr;
- spinlock_t *i_mmap_lock = details? details->i_mmap_lock: NULL;
+ rwlock_t *i_mmap_lock = details? details->i_mmap_lock: NULL;
int fullmm = (*tlbp)->fullmm;
for ( ; vma && vma->vm_start < end_addr; vma = vma->vm_next) {
@@ -1728,7 +1728,7 @@ unwritable_page:
* can't efficiently keep all vmas in step with mapping->truncate_count:
* so instead reset them all whenever it wraps back to 0 (then go to 1).
* mapping->truncate_count and vma->vm_truncate_count are protected by
- * i_mmap_lock.
+ * write locked i_mmap_lock.
*
* In order to make forward progress despite repeatedly restarting some
* large vma, note the restart_addr from unmap_vmas when it breaks out:
@@ -1793,9 +1793,10 @@ again:
goto again;
}
- spin_unlock(details->i_mmap_lock);
+//TODO: why not cond_resched_lock() here [rwlock version]?
+ write_unlock(details->i_mmap_lock);
cond_resched();
- spin_lock(details->i_mmap_lock);
+ write_lock(details->i_mmap_lock);
return -EINTR;
}
@@ -1891,7 +1892,7 @@ void unmap_mapping_range(struct address_
details.last_index = ULONG_MAX;
details.i_mmap_lock = &mapping->i_mmap_lock;
- spin_lock(&mapping->i_mmap_lock);
+ write_lock(&mapping->i_mmap_lock);
/* Protect against endless unmapping loops */
mapping->truncate_count++;
@@ -1906,7 +1907,7 @@ void unmap_mapping_range(struct address_
unmap_mapping_range_tree(&mapping->i_mmap, &details);
if (unlikely(!list_empty(&mapping->i_mmap_nonlinear)))
unmap_mapping_range_list(&mapping->i_mmap_nonlinear, &details);
- spin_unlock(&mapping->i_mmap_lock);
+ write_unlock(&mapping->i_mmap_lock);
}
EXPORT_SYMBOL(unmap_mapping_range);
Index: Linux/mm/migrate.c
===================================================================
--- Linux.orig/mm/migrate.c 2007-09-10 11:43:11.000000000 -0400
+++ Linux/mm/migrate.c 2007-09-10 11:43:26.000000000 -0400
@@ -207,12 +207,12 @@ static void remove_file_migration_ptes(s
if (!mapping)
return;
- spin_lock(&mapping->i_mmap_lock);
+ read_lock(&mapping->i_mmap_lock);
vma_prio_tree_foreach(vma, &iter, &mapping->i_mmap, pgoff, pgoff)
remove_migration_pte(vma, old, new);
- spin_unlock(&mapping->i_mmap_lock);
+ read_unlock(&mapping->i_mmap_lock);
}
/*
Index: Linux/mm/mmap.c
===================================================================
--- Linux.orig/mm/mmap.c 2007-09-10 11:43:11.000000000 -0400
+++ Linux/mm/mmap.c 2007-09-10 11:43:26.000000000 -0400
@@ -182,7 +182,7 @@ error:
}
/*
- * Requires inode->i_mapping->i_mmap_lock
+ * Requires write locked inode->i_mapping->i_mmap_lock
*/
static void __remove_shared_vm_struct(struct vm_area_struct *vma,
struct file *file, struct address_space *mapping)
@@ -201,7 +201,7 @@ static void __remove_shared_vm_struct(st
}
/*
- * Requires inode->i_mapping->i_mmap_lock
+ * Requires write locked inode->i_mapping->i_mmap_lock
*/
void __unlink_file_vma(struct vm_area_struct *vma)
{
@@ -221,9 +221,9 @@ void unlink_file_vma(struct vm_area_stru
if (file) {
struct address_space *mapping = file->f_mapping;
- spin_lock(&mapping->i_mmap_lock);
+ write_lock(&mapping->i_mmap_lock);
__remove_shared_vm_struct(vma, file, mapping);
- spin_unlock(&mapping->i_mmap_lock);
+ write_unlock(&mapping->i_mmap_lock);
}
}
@@ -445,7 +445,7 @@ static void vma_link(struct mm_struct *m
mapping = vma->vm_file->f_mapping;
if (mapping) {
- spin_lock(&mapping->i_mmap_lock);
+ write_lock(&mapping->i_mmap_lock);
vma->vm_truncate_count = mapping->truncate_count;
}
anon_vma_lock(vma);
@@ -455,7 +455,7 @@ static void vma_link(struct mm_struct *m
anon_vma_unlock(vma);
if (mapping)
- spin_unlock(&mapping->i_mmap_lock);
+ write_unlock(&mapping->i_mmap_lock);
mm->map_count++;
validate_mm(mm);
@@ -542,7 +542,7 @@ again: remove_next = 1 + (end > next->
mapping = file->f_mapping;
if (!(vma->vm_flags & VM_NONLINEAR))
root = &mapping->i_mmap;
- spin_lock(&mapping->i_mmap_lock);
+ write_lock(&mapping->i_mmap_lock);
if (importer &&
vma->vm_truncate_count != next->vm_truncate_count) {
/*
@@ -626,7 +626,7 @@ again: remove_next = 1 + (end > next->
if (anon_vma)
write_unlock(&anon_vma->rwlock);
if (mapping)
- spin_unlock(&mapping->i_mmap_lock);
+ write_unlock(&mapping->i_mmap_lock);
if (remove_next) {
if (file)
@@ -2064,7 +2064,7 @@ void exit_mmap(struct mm_struct *mm)
/* Insert vm structure into process list sorted by address
* and into the inode's i_mmap tree. If vm_file is non-NULL
- * then i_mmap_lock is taken here.
+ * then i_mmap_lock is write locked here.
*/
int insert_vm_struct(struct mm_struct * mm, struct vm_area_struct * vma)
{
Index: Linux/mm/mremap.c
===================================================================
--- Linux.orig/mm/mremap.c 2007-09-10 10:09:38.000000000 -0400
+++ Linux/mm/mremap.c 2007-09-10 11:43:26.000000000 -0400
@@ -83,7 +83,7 @@ static void move_ptes(struct vm_area_str
* and we propagate stale pages into the dst afterward.
*/
mapping = vma->vm_file->f_mapping;
- spin_lock(&mapping->i_mmap_lock);
+ read_lock(&mapping->i_mmap_lock);
if (new_vma->vm_truncate_count &&
new_vma->vm_truncate_count != vma->vm_truncate_count)
new_vma->vm_truncate_count = 0;
@@ -115,7 +115,7 @@ static void move_ptes(struct vm_area_str
pte_unmap_nested(new_pte - 1);
pte_unmap_unlock(old_pte - 1, old_ptl);
if (mapping)
- spin_unlock(&mapping->i_mmap_lock);
+ read_unlock(&mapping->i_mmap_lock);
}
#define LATENCY_LIMIT (64 * PAGE_SIZE)
Index: Linux/mm/rmap.c
===================================================================
--- Linux.orig/mm/rmap.c 2007-09-10 11:43:11.000000000 -0400
+++ Linux/mm/rmap.c 2007-09-10 11:43:26.000000000 -0400
@@ -365,7 +365,7 @@ static int page_referenced_file(struct p
*/
BUG_ON(!PageLocked(page));
- spin_lock(&mapping->i_mmap_lock);
+ read_lock(&mapping->i_mmap_lock);
/*
* i_mmap_lock does not stabilize mapcount at all, but mapcount
@@ -391,7 +391,7 @@ static int page_referenced_file(struct p
break;
}
- spin_unlock(&mapping->i_mmap_lock);
+ read_unlock(&mapping->i_mmap_lock);
return referenced;
}
@@ -472,12 +472,12 @@ static int page_mkclean_file(struct addr
BUG_ON(PageAnon(page));
- spin_lock(&mapping->i_mmap_lock);
+ read_lock(&mapping->i_mmap_lock);
vma_prio_tree_foreach(vma, &iter, &mapping->i_mmap, pgoff, pgoff) {
if (vma->vm_flags & VM_SHARED)
ret += page_mkclean_one(page, vma);
}
- spin_unlock(&mapping->i_mmap_lock);
+ read_unlock(&mapping->i_mmap_lock);
return ret;
}
@@ -904,7 +904,7 @@ static int try_to_unmap_file(struct page
unsigned long max_nl_size = 0;
unsigned int mapcount;
- spin_lock(&mapping->i_mmap_lock);
+ read_lock(&mapping->i_mmap_lock);
vma_prio_tree_foreach(vma, &iter, &mapping->i_mmap, pgoff, pgoff) {
ret = try_to_unmap_one(page, vma, migration);
if (ret == SWAP_FAIL || !page_mapped(page))
@@ -941,7 +941,7 @@ static int try_to_unmap_file(struct page
mapcount = page_mapcount(page);
if (!mapcount)
goto out;
- cond_resched_lock(&mapping->i_mmap_lock);
+ cond_resched_rwlock(&mapping->i_mmap_lock, 0);
max_nl_size = (max_nl_size + CLUSTER_SIZE - 1) & CLUSTER_MASK;
if (max_nl_cursor == 0)
@@ -963,7 +963,7 @@ static int try_to_unmap_file(struct page
}
vma->vm_private_data = (void *) max_nl_cursor;
}
- cond_resched_lock(&mapping->i_mmap_lock);
+ cond_resched_rwlock(&mapping->i_mmap_lock, 0);
max_nl_cursor += CLUSTER_SIZE;
} while (max_nl_cursor <= max_nl_size);
@@ -975,7 +975,7 @@ static int try_to_unmap_file(struct page
list_for_each_entry(vma, &mapping->i_mmap_nonlinear, shared.vm_set.list)
vma->vm_private_data = NULL;
out:
- spin_unlock(&mapping->i_mmap_lock);
+ read_unlock(&mapping->i_mmap_lock);
return ret;
}
Index: Linux/include/linux/sched.h
===================================================================
--- Linux.orig/include/linux/sched.h 2007-09-10 10:09:47.000000000 -0400
+++ Linux/include/linux/sched.h 2007-09-10 11:43:26.000000000 -0400
@@ -1823,12 +1823,23 @@ static inline int need_resched(void)
* cond_resched() and cond_resched_lock(): latency reduction via
* explicit rescheduling in places that are safe. The return
* value indicates whether a reschedule was done in fact.
- * cond_resched_lock() will drop the spinlock before scheduling,
- * cond_resched_softirq() will enable bhs before scheduling.
+ * cond_resched_softirq() will enable bhs before scheduling,
+ * cond_resched_*lock() will drop the *lock before scheduling.
*/
extern int cond_resched(void);
-extern int cond_resched_lock(spinlock_t * lock);
extern int cond_resched_softirq(void);
+extern int __cond_resched_lock(void * lock, int lock_type);
+
+#define COND_RESCHED_SPIN 2
+static inline int cond_resched_lock(spinlock_t * lock)
+{
+ return __cond_resched_lock(lock, COND_RESCHED_SPIN);
+}
+
+static inline int cond_resched_rwlock(rwlock_t * lock, int write_lock)
+{
+ return __cond_resched_lock(lock, !!write_lock);
+}
/*
* Does a critical section need to be broken due to another
Index: Linux/kernel/sched.c
===================================================================
--- Linux.orig/kernel/sched.c 2007-09-10 10:09:47.000000000 -0400
+++ Linux/kernel/sched.c 2007-09-10 11:43:26.000000000 -0400
@@ -4635,34 +4635,78 @@ int __sched cond_resched(void)
EXPORT_SYMBOL(cond_resched);
/*
- * cond_resched_lock() - if a reschedule is pending, drop the given lock,
+ * helper functions for __cond_resched_lock()
+ */
+static int __need_lockbreak(void *lock, int type)
+{
+ if (likely(type == COND_RESCHED_SPIN))
+ return need_lockbreak((spinlock_t *)lock);
+ else
+ return need_lockbreak((rwlock_t *)lock);
+}
+
+static void __reacquire_lock(void *lock, int type)
+{
+ if (likely(type == COND_RESCHED_SPIN))
+ spin_lock((spinlock_t *)lock);
+ else if (type)
+ write_lock((rwlock_t *)lock);
+ else
+ read_lock((rwlock_t *)lock);
+}
+
+static void __drop_lock(void *lock, int type)
+{
+ if (likely(type == COND_RESCHED_SPIN))
+ spin_unlock((spinlock_t *)lock);
+ else if (type)
+ write_unlock((rwlock_t *)lock);
+ else
+ read_unlock((rwlock_t *)lock);
+}
+
+static void __release_lock(void *lock, int type)
+{
+ if (likely(type == COND_RESCHED_SPIN))
+ spin_release(&((spinlock_t *)lock)->dep_map, 1, _RET_IP_);
+ else
+ rwlock_release(&((rwlock_t *)lock)->dep_map, 1, _RET_IP_);
+}
+
+/*
+ * __cond_resched_lock() - if a reschedule is pending, drop the given lock,
* call schedule, and on return reacquire the lock.
*
+ * Lock type:
+ * 0 = rwlock held for read
+ * 1 = rwlock held for write
+ * 2 = COND_RESCHED_SPIN = spinlock
+ *
* This works OK both with and without CONFIG_PREEMPT. We do strange low-level
* operations here to prevent schedule() from being called twice (once via
- * spin_unlock(), once by hand).
+ * *_unlock(), once by hand).
*/
-int cond_resched_lock(spinlock_t *lock)
+int __cond_resched_lock(void *lock, int type)
{
int ret = 0;
- if (need_lockbreak(lock)) {
- spin_unlock(lock);
+ if (__need_lockbreak(lock, type)) {
+ __drop_lock(lock, type);
cpu_relax();
ret = 1;
- spin_lock(lock);
+ __reacquire_lock(lock, type);
}
if (need_resched() && system_state == SYSTEM_RUNNING) {
- spin_release(&lock->dep_map, 1, _THIS_IP_);
- _raw_spin_unlock(lock);
+ __release_lock(lock, type);
+ __drop_lock(lock, type);
preempt_enable_no_resched();
__cond_resched();
ret = 1;
- spin_lock(lock);
+ __reacquire_lock(lock, type);
}
return ret;
}
-EXPORT_SYMBOL(cond_resched_lock);
+EXPORT_SYMBOL(__cond_resched_lock);
int __sched cond_resched_softirq(void)
{
--
* Re: [PATCH/RFC 2/14] Reclaim Scalability: convert inode i_mmap_lock to reader/writer lock
2007-09-14 20:54 ` [PATCH/RFC 2/14] Reclaim Scalability: convert inode i_mmap_lock to reader/writer lock Lee Schermerhorn
@ 2007-09-17 12:53 ` Mel Gorman
2007-09-20 1:24 ` Andrea Arcangeli
1 sibling, 0 replies; 77+ messages in thread
From: Mel Gorman @ 2007-09-17 12:53 UTC (permalink / raw)
To: Lee Schermerhorn
Cc: linux-mm, akpm, clameter, riel, balbir, andrea, a.p.zijlstra,
eric.whitney, npiggin
On (14/09/07 16:54), Lee Schermerhorn didst pronounce:
> PATCH/RFC 02/14 Reclaim Scalability: make the inode i_mmap_lock a reader/writer lock
>
> Against: 2.6.23-rc4-mm1
>
> I have seen soft cpu lockups in page_referenced_file() due to
> contention on i_mmap_lock() for different pages. Making the
> i_mmap_lock a reader/writer lock should increase parallelism
> in vmscan for file backed pages mapped into many address spaces.
>
Same as the last patch. With respect to the rw-lock, we need to be sure
this is not regressing the general case. An mmap() microbenchmark would
show it up, but rapid mapping and unmapping doesn't feel particularly
realistic.
Considering the nature of the lock though, pretty much any benchmark
will show up problems, right?
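For reference, a minimal userspace sketch of the kind of microbenchmark in
question (illustrative only, not something posted in this thread): each
mmap()/munmap() of a file-backed region links and unlinks a vma under
i_mmap_lock, so running one copy per cpu hammers the write side of the lock.

#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/mman.h>
#include <unistd.h>

int main(int argc, char **argv)
{
	long i, iters = argc > 1 ? atol(argv[1]) : 1000000;
	int fd = open("mapfile", O_RDWR | O_CREAT, 0600);	/* file name is arbitrary */

	if (fd < 0 || ftruncate(fd, 4096) < 0) {
		perror("setup");
		return 1;
	}
	for (i = 0; i < iters; i++) {
		/* map/unmap repeatedly: vma_link()/unlink_file_vma() take
		 * i_mmap_lock for write on every iteration */
		void *p = mmap(NULL, 4096, PROT_READ | PROT_WRITE,
			       MAP_SHARED, fd, 0);
		if (p == MAP_FAILED) {
			perror("mmap");
			return 1;
		}
		munmap(p, 4096);
	}
	close(fd);
	return 0;
}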
> Read lock the i_mmap_lock for all usage except:
>
> 1) mmap/munmap: linking vma into i_mmap prio_tree or removing
> 2) unmap_mapping_range: protecting vm_truncate_count
>
> rmap: try_to_unmap_file() required new cond_resched_rwlock().
> To reduce code duplication, I recast cond_resched_lock() as a
> [static inline] wrapper around reworked cond_resched_lock() =>
> __cond_resched_lock(void *lock, int type).
> New cond_resched_rwlock() implemented as another wrapper.
>
> Note: This patch is meant to address a situation I've seen
> running a large Oracle OLTP workload--1000s of users--on a
> large HP ia64 NUMA platform. The system hung, spitting out
> "soft lockup" messages on the console. Stack traces showed
> that all cpus were in page_referenced(), as mentioned above.
> I let the system run overnight in this state--it never
> recovered before I decided to reboot.
>
> TODO: I've yet to test this patch with the same workload
> to see what happens. Don't have access to the system now.
>
If you do get access to the machine, can you see if the patch reduces
the number of transactions Oracle is capable of?
>
> Signed-off-by: Lee Schermerhorn <lee.schermerhorn@hp.com>
>
> fs/hugetlbfs/inode.c | 7 +++--
> fs/inode.c | 2 -
> fs/revoke.c | 18 +++++++-------
> include/linux/fs.h | 2 -
> include/linux/mm.h | 2 -
> include/linux/sched.h | 17 ++++++++++---
> kernel/fork.c | 4 +--
> kernel/sched.c | 64 ++++++++++++++++++++++++++++++++++++++++++--------
> mm/filemap_xip.c | 4 +--
> mm/fremap.c | 4 +--
> mm/hugetlb.c | 8 +++---
> mm/memory.c | 13 +++++-----
> mm/migrate.c | 4 +--
> mm/mmap.c | 18 +++++++-------
> mm/mremap.c | 4 +--
> mm/rmap.c | 16 ++++++------
> 16 files changed, 123 insertions(+), 64 deletions(-)
>
> Index: Linux/include/linux/fs.h
> ===================================================================
> --- Linux.orig/include/linux/fs.h 2007-09-10 10:09:47.000000000 -0400
> +++ Linux/include/linux/fs.h 2007-09-10 11:43:26.000000000 -0400
> @@ -506,7 +506,7 @@ struct address_space {
> unsigned int i_mmap_writable;/* count VM_SHARED mappings */
> struct prio_tree_root i_mmap; /* tree of private and shared mappings */
> struct list_head i_mmap_nonlinear;/*list VM_NONLINEAR mappings */
> - spinlock_t i_mmap_lock; /* protect tree, count, list */
> + rwlock_t i_mmap_lock; /* protect tree, count, list */
> unsigned int truncate_count; /* Cover race condition with truncate */
> unsigned long nrpages; /* number of total pages */
> pgoff_t writeback_index;/* writeback starts here */
> Index: Linux/include/linux/mm.h
> ===================================================================
> --- Linux.orig/include/linux/mm.h 2007-09-10 10:09:47.000000000 -0400
> +++ Linux/include/linux/mm.h 2007-09-10 11:43:26.000000000 -0400
> @@ -684,7 +684,7 @@ struct zap_details {
> struct address_space *check_mapping; /* Check page->mapping if set */
> pgoff_t first_index; /* Lowest page->index to unmap */
> pgoff_t last_index; /* Highest page->index to unmap */
> - spinlock_t *i_mmap_lock; /* For unmap_mapping_range: */
> + rwlock_t *i_mmap_lock; /* For unmap_mapping_range: */
> unsigned long truncate_count; /* Compare vm_truncate_count */
> };
>
> Index: Linux/fs/inode.c
> ===================================================================
> --- Linux.orig/fs/inode.c 2007-09-10 10:09:43.000000000 -0400
> +++ Linux/fs/inode.c 2007-09-10 11:43:26.000000000 -0400
> @@ -203,7 +203,7 @@ void inode_init_once(struct inode *inode
> init_rwsem(&inode->i_alloc_sem);
> INIT_RADIX_TREE(&inode->i_data.page_tree, GFP_ATOMIC);
> rwlock_init(&inode->i_data.tree_lock);
> - spin_lock_init(&inode->i_data.i_mmap_lock);
> + rwlock_init(&inode->i_data.i_mmap_lock);
> INIT_LIST_HEAD(&inode->i_data.private_list);
> spin_lock_init(&inode->i_data.private_lock);
> INIT_RAW_PRIO_TREE_ROOT(&inode->i_data.i_mmap);
> Index: Linux/fs/hugetlbfs/inode.c
> ===================================================================
> --- Linux.orig/fs/hugetlbfs/inode.c 2007-09-10 10:09:43.000000000 -0400
> +++ Linux/fs/hugetlbfs/inode.c 2007-09-10 11:43:26.000000000 -0400
> @@ -411,6 +411,9 @@ static void hugetlbfs_drop_inode(struct
> hugetlbfs_forget_inode(inode);
> }
>
> +/*
> + * LOCKING: __unmap_hugepage_range() requires write lock on i_mmap_lock
> + */
> static inline void
> hugetlb_vmtruncate_list(struct prio_tree_root *root, pgoff_t pgoff)
> {
> @@ -445,10 +448,10 @@ static int hugetlb_vmtruncate(struct ino
> pgoff = offset >> PAGE_SHIFT;
>
> i_size_write(inode, offset);
> - spin_lock(&mapping->i_mmap_lock);
> + write_lock(&mapping->i_mmap_lock);
> if (!prio_tree_empty(&mapping->i_mmap))
> hugetlb_vmtruncate_list(&mapping->i_mmap, pgoff);
> - spin_unlock(&mapping->i_mmap_lock);
> + write_unlock(&mapping->i_mmap_lock);
> truncate_hugepages(inode, offset);
> return 0;
> }
> Index: Linux/fs/revoke.c
> ===================================================================
> --- Linux.orig/fs/revoke.c 2007-09-10 10:09:44.000000000 -0400
> +++ Linux/fs/revoke.c 2007-09-10 11:43:26.000000000 -0400
> @@ -272,7 +272,7 @@ static int revoke_break_cow(struct files
>
> /*
> * LOCKING: down_write(&mm->mmap_sem)
> - * -> spin_lock(&mapping->i_mmap_lock)
> + * -> write_lock(&mapping->i_mmap_lock)
> */
> static int revoke_vma(struct vm_area_struct *vma, struct zap_details *details)
> {
> @@ -298,14 +298,14 @@ static int revoke_vma(struct vm_area_str
> return 0;
>
> out_need_break:
> - spin_unlock(details->i_mmap_lock);
> + write_unlock(details->i_mmap_lock);
> cond_resched();
> - spin_lock(details->i_mmap_lock);
> + write_lock(details->i_mmap_lock);
> return -EINTR;
> }
>
> /*
> - * LOCKING: spin_lock(&mapping->i_mmap_lock)
> + * LOCKING: write_lock(&mapping->i_mmap_lock)
> */
> static int revoke_mm(struct mm_struct *mm, struct address_space *mapping,
> struct file *to_exclude)
> @@ -335,7 +335,7 @@ static int revoke_mm(struct mm_struct *m
> if (err)
> break;
>
> - __unlink_file_vma(vma);
> + __unlink_file_vma(vma); /* requires write_lock(i_mmap_lock) */
> fput(vma->vm_file);
> vma->vm_file = NULL;
> }
> @@ -345,7 +345,7 @@ static int revoke_mm(struct mm_struct *m
> }
>
> /*
> - * LOCKING: spin_lock(&mapping->i_mmap_lock)
> + * LOCKING: write_lock(&mapping->i_mmap_lock)
> */
> static void revoke_mapping_tree(struct address_space *mapping,
> struct file *to_exclude)
> @@ -377,7 +377,7 @@ static void revoke_mapping_tree(struct a
> }
>
> /*
> - * LOCKING: spin_lock(&mapping->i_mmap_lock)
> + * LOCKING: write_lock(&mapping->i_mmap_lock)
> */
> static void revoke_mapping_list(struct address_space *mapping,
> struct file *to_exclude)
> @@ -408,12 +408,12 @@ static void revoke_mapping_list(struct a
>
> static void revoke_mapping(struct address_space *mapping, struct file *to_exclude)
> {
> - spin_lock(&mapping->i_mmap_lock);
> + write_lock(&mapping->i_mmap_lock);
> if (unlikely(!prio_tree_empty(&mapping->i_mmap)))
> revoke_mapping_tree(mapping, to_exclude);
> if (unlikely(!list_empty(&mapping->i_mmap_nonlinear)))
> revoke_mapping_list(mapping, to_exclude);
> - spin_unlock(&mapping->i_mmap_lock);
> + write_unlock(&mapping->i_mmap_lock);
> }
>
> static void restore_file(struct revokefs_inode_info *info)
> Index: Linux/kernel/fork.c
> ===================================================================
> --- Linux.orig/kernel/fork.c 2007-09-10 10:09:47.000000000 -0400
> +++ Linux/kernel/fork.c 2007-09-10 11:43:26.000000000 -0400
> @@ -262,12 +262,12 @@ static inline int dup_mmap(struct mm_str
> atomic_dec(&inode->i_writecount);
>
> /* insert tmp into the share list, just after mpnt */
> - spin_lock(&file->f_mapping->i_mmap_lock);
> + write_lock(&file->f_mapping->i_mmap_lock);
> tmp->vm_truncate_count = mpnt->vm_truncate_count;
> flush_dcache_mmap_lock(file->f_mapping);
> vma_prio_tree_add(tmp, mpnt);
> flush_dcache_mmap_unlock(file->f_mapping);
> - spin_unlock(&file->f_mapping->i_mmap_lock);
> + write_unlock(&file->f_mapping->i_mmap_lock);
> }
>
> /*
> Index: Linux/mm/filemap_xip.c
> ===================================================================
> --- Linux.orig/mm/filemap_xip.c 2007-09-10 10:09:47.000000000 -0400
> +++ Linux/mm/filemap_xip.c 2007-09-10 11:43:26.000000000 -0400
> @@ -182,7 +182,7 @@ __xip_unmap (struct address_space * mapp
> if (!page)
> return;
>
> - spin_lock(&mapping->i_mmap_lock);
> + read_lock(&mapping->i_mmap_lock);
> vma_prio_tree_foreach(vma, &iter, &mapping->i_mmap, pgoff, pgoff) {
> mm = vma->vm_mm;
> address = vma->vm_start +
> @@ -200,7 +200,7 @@ __xip_unmap (struct address_space * mapp
> page_cache_release(page);
> }
> }
> - spin_unlock(&mapping->i_mmap_lock);
> + read_unlock(&mapping->i_mmap_lock);
> }
>
> /*
> Index: Linux/mm/fremap.c
> ===================================================================
> --- Linux.orig/mm/fremap.c 2007-09-10 10:09:47.000000000 -0400
> +++ Linux/mm/fremap.c 2007-09-10 11:43:26.000000000 -0400
> @@ -200,13 +200,13 @@ asmlinkage long sys_remap_file_pages(uns
> }
> goto out;
> }
> - spin_lock(&mapping->i_mmap_lock);
> + write_lock(&mapping->i_mmap_lock);
> flush_dcache_mmap_lock(mapping);
> vma->vm_flags |= VM_NONLINEAR;
> vma_prio_tree_remove(vma, &mapping->i_mmap);
> vma_nonlinear_insert(vma, &mapping->i_mmap_nonlinear);
> flush_dcache_mmap_unlock(mapping);
> - spin_unlock(&mapping->i_mmap_lock);
> + write_unlock(&mapping->i_mmap_lock);
> }
>
> err = populate_range(mm, vma, start, size, pgoff);
> Index: Linux/mm/hugetlb.c
> ===================================================================
> --- Linux.orig/mm/hugetlb.c 2007-09-10 10:09:47.000000000 -0400
> +++ Linux/mm/hugetlb.c 2007-09-10 11:43:26.000000000 -0400
> @@ -451,9 +451,9 @@ void unmap_hugepage_range(struct vm_area
> * do nothing in this case.
> */
> if (vma->vm_file) {
> - spin_lock(&vma->vm_file->f_mapping->i_mmap_lock);
> + write_lock(&vma->vm_file->f_mapping->i_mmap_lock);
> __unmap_hugepage_range(vma, start, end);
> - spin_unlock(&vma->vm_file->f_mapping->i_mmap_lock);
> + write_unlock(&vma->vm_file->f_mapping->i_mmap_lock);
> }
> }
>
> @@ -693,7 +693,7 @@ void hugetlb_change_protection(struct vm
> BUG_ON(address >= end);
> flush_cache_range(vma, address, end);
>
> - spin_lock(&vma->vm_file->f_mapping->i_mmap_lock);
> + read_lock(&vma->vm_file->f_mapping->i_mmap_lock);
> spin_lock(&mm->page_table_lock);
> for (; address < end; address += HPAGE_SIZE) {
> ptep = huge_pte_offset(mm, address);
> @@ -708,7 +708,7 @@ void hugetlb_change_protection(struct vm
> }
> }
> spin_unlock(&mm->page_table_lock);
> - spin_unlock(&vma->vm_file->f_mapping->i_mmap_lock);
> + read_unlock(&vma->vm_file->f_mapping->i_mmap_lock);
>
> flush_tlb_range(vma, start, end);
> }
> Index: Linux/mm/memory.c
> ===================================================================
> --- Linux.orig/mm/memory.c 2007-09-10 10:09:47.000000000 -0400
> +++ Linux/mm/memory.c 2007-09-10 11:43:26.000000000 -0400
> @@ -816,7 +816,7 @@ unsigned long unmap_vmas(struct mmu_gath
> unsigned long tlb_start = 0; /* For tlb_finish_mmu */
> int tlb_start_valid = 0;
> unsigned long start = start_addr;
> - spinlock_t *i_mmap_lock = details? details->i_mmap_lock: NULL;
> + rwlock_t *i_mmap_lock = details? details->i_mmap_lock: NULL;
> int fullmm = (*tlbp)->fullmm;
>
> for ( ; vma && vma->vm_start < end_addr; vma = vma->vm_next) {
> @@ -1728,7 +1728,7 @@ unwritable_page:
> * can't efficiently keep all vmas in step with mapping->truncate_count:
> * so instead reset them all whenever it wraps back to 0 (then go to 1).
> * mapping->truncate_count and vma->vm_truncate_count are protected by
> - * i_mmap_lock.
> + * write locked i_mmap_lock.
> *
> * In order to make forward progress despite repeatedly restarting some
> * large vma, note the restart_addr from unmap_vmas when it breaks out:
> @@ -1793,9 +1793,10 @@ again:
> goto again;
> }
>
> - spin_unlock(details->i_mmap_lock);
> +//TODO: why not cond_resched_lock() here [rwlock version]?
> + write_unlock(details->i_mmap_lock);
> cond_resched();
> - spin_lock(details->i_mmap_lock);
> + write_lock(details->i_mmap_lock);
I guess it's not used because it just doesn't exist :/ . Not sure why
that is, although "no one thought it was necessary" is the likely
answer.
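Just to illustrate what that would look like with the helper added by this
series (a sketch only, not part of the posted patch; the semantics differ
slightly, since cond_resched_rwlock() only drops the lock when a lock break or
reschedule is actually pending, while the open-coded path above always drops it):

	/* hypothetical helper mirroring the need-break path above */
	static int zap_need_break_rwlock(struct zap_details *details)
	{
		cond_resched_rwlock(details->i_mmap_lock, 1);	/* 1 == held for write */
		return -EINTR;
	}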
> return -EINTR;
> }
>
> @@ -1891,7 +1892,7 @@ void unmap_mapping_range(struct address_
> details.last_index = ULONG_MAX;
> details.i_mmap_lock = &mapping->i_mmap_lock;
>
> - spin_lock(&mapping->i_mmap_lock);
> + write_lock(&mapping->i_mmap_lock);
>
> /* Protect against endless unmapping loops */
> mapping->truncate_count++;
> @@ -1906,7 +1907,7 @@ void unmap_mapping_range(struct address_
> unmap_mapping_range_tree(&mapping->i_mmap, &details);
> if (unlikely(!list_empty(&mapping->i_mmap_nonlinear)))
> unmap_mapping_range_list(&mapping->i_mmap_nonlinear, &details);
> - spin_unlock(&mapping->i_mmap_lock);
> + write_unlock(&mapping->i_mmap_lock);
> }
> EXPORT_SYMBOL(unmap_mapping_range);
>
> Index: Linux/mm/migrate.c
> ===================================================================
> --- Linux.orig/mm/migrate.c 2007-09-10 11:43:11.000000000 -0400
> +++ Linux/mm/migrate.c 2007-09-10 11:43:26.000000000 -0400
> @@ -207,12 +207,12 @@ static void remove_file_migration_ptes(s
> if (!mapping)
> return;
>
> - spin_lock(&mapping->i_mmap_lock);
> + read_lock(&mapping->i_mmap_lock);
>
> vma_prio_tree_foreach(vma, &iter, &mapping->i_mmap, pgoff, pgoff)
> remove_migration_pte(vma, old, new);
>
> - spin_unlock(&mapping->i_mmap_lock);
> + read_unlock(&mapping->i_mmap_lock);
> }
>
> /*
> Index: Linux/mm/mmap.c
> ===================================================================
> --- Linux.orig/mm/mmap.c 2007-09-10 11:43:11.000000000 -0400
> +++ Linux/mm/mmap.c 2007-09-10 11:43:26.000000000 -0400
> @@ -182,7 +182,7 @@ error:
> }
>
> /*
> - * Requires inode->i_mapping->i_mmap_lock
> + * Requires write locked inode->i_mapping->i_mmap_lock
> */
> static void __remove_shared_vm_struct(struct vm_area_struct *vma,
> struct file *file, struct address_space *mapping)
> @@ -201,7 +201,7 @@ static void __remove_shared_vm_struct(st
> }
>
> /*
> - * Requires inode->i_mapping->i_mmap_lock
> + * Requires write locked inode->i_mapping->i_mmap_lock
> */
> void __unlink_file_vma(struct vm_area_struct *vma)
> {
> @@ -221,9 +221,9 @@ void unlink_file_vma(struct vm_area_stru
>
> if (file) {
> struct address_space *mapping = file->f_mapping;
> - spin_lock(&mapping->i_mmap_lock);
> + write_lock(&mapping->i_mmap_lock);
> __remove_shared_vm_struct(vma, file, mapping);
> - spin_unlock(&mapping->i_mmap_lock);
> + write_unlock(&mapping->i_mmap_lock);
> }
> }
>
> @@ -445,7 +445,7 @@ static void vma_link(struct mm_struct *m
> mapping = vma->vm_file->f_mapping;
>
> if (mapping) {
> - spin_lock(&mapping->i_mmap_lock);
> + write_lock(&mapping->i_mmap_lock);
> vma->vm_truncate_count = mapping->truncate_count;
> }
> anon_vma_lock(vma);
> @@ -455,7 +455,7 @@ static void vma_link(struct mm_struct *m
>
> anon_vma_unlock(vma);
> if (mapping)
> - spin_unlock(&mapping->i_mmap_lock);
> + write_unlock(&mapping->i_mmap_lock);
>
> mm->map_count++;
> validate_mm(mm);
> @@ -542,7 +542,7 @@ again: remove_next = 1 + (end > next->
> mapping = file->f_mapping;
> if (!(vma->vm_flags & VM_NONLINEAR))
> root = &mapping->i_mmap;
> - spin_lock(&mapping->i_mmap_lock);
> + write_lock(&mapping->i_mmap_lock);
> if (importer &&
> vma->vm_truncate_count != next->vm_truncate_count) {
> /*
> @@ -626,7 +626,7 @@ again: remove_next = 1 + (end > next->
> if (anon_vma)
> write_unlock(&anon_vma->rwlock);
> if (mapping)
> - spin_unlock(&mapping->i_mmap_lock);
> + write_unlock(&mapping->i_mmap_lock);
>
> if (remove_next) {
> if (file)
> @@ -2064,7 +2064,7 @@ void exit_mmap(struct mm_struct *mm)
>
> /* Insert vm structure into process list sorted by address
> * and into the inode's i_mmap tree. If vm_file is non-NULL
> - * then i_mmap_lock is taken here.
> + * then i_mmap_lock is write locked here.
> */
> int insert_vm_struct(struct mm_struct * mm, struct vm_area_struct * vma)
> {
> Index: Linux/mm/mremap.c
> ===================================================================
> --- Linux.orig/mm/mremap.c 2007-09-10 10:09:38.000000000 -0400
> +++ Linux/mm/mremap.c 2007-09-10 11:43:26.000000000 -0400
> @@ -83,7 +83,7 @@ static void move_ptes(struct vm_area_str
> * and we propagate stale pages into the dst afterward.
> */
> mapping = vma->vm_file->f_mapping;
> - spin_lock(&mapping->i_mmap_lock);
> + read_lock(&mapping->i_mmap_lock);
> if (new_vma->vm_truncate_count &&
> new_vma->vm_truncate_count != vma->vm_truncate_count)
> new_vma->vm_truncate_count = 0;
> @@ -115,7 +115,7 @@ static void move_ptes(struct vm_area_str
> pte_unmap_nested(new_pte - 1);
> pte_unmap_unlock(old_pte - 1, old_ptl);
> if (mapping)
> - spin_unlock(&mapping->i_mmap_lock);
> + read_unlock(&mapping->i_mmap_lock);
> }
>
> #define LATENCY_LIMIT (64 * PAGE_SIZE)
> Index: Linux/mm/rmap.c
> ===================================================================
> --- Linux.orig/mm/rmap.c 2007-09-10 11:43:11.000000000 -0400
> +++ Linux/mm/rmap.c 2007-09-10 11:43:26.000000000 -0400
> @@ -365,7 +365,7 @@ static int page_referenced_file(struct p
> */
> BUG_ON(!PageLocked(page));
>
> - spin_lock(&mapping->i_mmap_lock);
> + read_lock(&mapping->i_mmap_lock);
>
> /*
> * i_mmap_lock does not stabilize mapcount at all, but mapcount
> @@ -391,7 +391,7 @@ static int page_referenced_file(struct p
> break;
> }
>
> - spin_unlock(&mapping->i_mmap_lock);
> + read_unlock(&mapping->i_mmap_lock);
> return referenced;
> }
>
> @@ -472,12 +472,12 @@ static int page_mkclean_file(struct addr
>
> BUG_ON(PageAnon(page));
>
> - spin_lock(&mapping->i_mmap_lock);
> + read_lock(&mapping->i_mmap_lock);
> vma_prio_tree_foreach(vma, &iter, &mapping->i_mmap, pgoff, pgoff) {
> if (vma->vm_flags & VM_SHARED)
> ret += page_mkclean_one(page, vma);
> }
> - spin_unlock(&mapping->i_mmap_lock);
> + read_unlock(&mapping->i_mmap_lock);
> return ret;
> }
>
> @@ -904,7 +904,7 @@ static int try_to_unmap_file(struct page
> unsigned long max_nl_size = 0;
> unsigned int mapcount;
>
> - spin_lock(&mapping->i_mmap_lock);
> + read_lock(&mapping->i_mmap_lock);
> vma_prio_tree_foreach(vma, &iter, &mapping->i_mmap, pgoff, pgoff) {
> ret = try_to_unmap_one(page, vma, migration);
> if (ret == SWAP_FAIL || !page_mapped(page))
> @@ -941,7 +941,7 @@ static int try_to_unmap_file(struct page
> mapcount = page_mapcount(page);
> if (!mapcount)
> goto out;
> - cond_resched_lock(&mapping->i_mmap_lock);
> + cond_resched_rwlock(&mapping->i_mmap_lock, 0);
>
> max_nl_size = (max_nl_size + CLUSTER_SIZE - 1) & CLUSTER_MASK;
> if (max_nl_cursor == 0)
> @@ -963,7 +963,7 @@ static int try_to_unmap_file(struct page
> }
> vma->vm_private_data = (void *) max_nl_cursor;
> }
> - cond_resched_lock(&mapping->i_mmap_lock);
> + cond_resched_rwlock(&mapping->i_mmap_lock, 0);
> max_nl_cursor += CLUSTER_SIZE;
> } while (max_nl_cursor <= max_nl_size);
>
> @@ -975,7 +975,7 @@ static int try_to_unmap_file(struct page
> list_for_each_entry(vma, &mapping->i_mmap_nonlinear, shared.vm_set.list)
> vma->vm_private_data = NULL;
> out:
> - spin_unlock(&mapping->i_mmap_lock);
> + read_unlock(&mapping->i_mmap_lock);
> return ret;
> }
>
> Index: Linux/include/linux/sched.h
> ===================================================================
> --- Linux.orig/include/linux/sched.h 2007-09-10 10:09:47.000000000 -0400
> +++ Linux/include/linux/sched.h 2007-09-10 11:43:26.000000000 -0400
> @@ -1823,12 +1823,23 @@ static inline int need_resched(void)
> * cond_resched() and cond_resched_lock(): latency reduction via
> * explicit rescheduling in places that are safe. The return
> * value indicates whether a reschedule was done in fact.
> - * cond_resched_lock() will drop the spinlock before scheduling,
> - * cond_resched_softirq() will enable bhs before scheduling.
> + * cond_resched_softirq() will enable bhs before scheduling,
> + * cond_resched_*lock() will drop the *lock before scheduling.
> */
> extern int cond_resched(void);
> -extern int cond_resched_lock(spinlock_t * lock);
> extern int cond_resched_softirq(void);
> +extern int __cond_resched_lock(void * lock, int lock_type);
> +
> +#define COND_RESCHED_SPIN 2
> +static inline int cond_resched_lock(spinlock_t * lock)
> +{
> + return __cond_resched_lock(lock, COND_RESCHED_SPIN);
> +}
> +
> +static inline int cond_resched_rwlock(rwlock_t * lock, int write_lock)
> +{
> + return __cond_resched_lock(lock, !!write_lock);
> +}
>
> /*
> * Does a critical section need to be broken due to another
> Index: Linux/kernel/sched.c
> ===================================================================
> --- Linux.orig/kernel/sched.c 2007-09-10 10:09:47.000000000 -0400
> +++ Linux/kernel/sched.c 2007-09-10 11:43:26.000000000 -0400
> @@ -4635,34 +4635,78 @@ int __sched cond_resched(void)
> EXPORT_SYMBOL(cond_resched);
>
> /*
> - * cond_resched_lock() - if a reschedule is pending, drop the given lock,
> + * helper functions for __cond_resched_lock()
> + */
> +static int __need_lockbreak(void *lock, int type)
> +{
> + if (likely(type == COND_RESCHED_SPIN))
> + return need_lockbreak((spinlock_t *)lock);
> + else
> + return need_lockbreak((rwlock_t *)lock);
> +}
> +
> +static void __reacquire_lock(void *lock, int type)
> +{
> + if (likely(type == COND_RESCHED_SPIN))
> + spin_lock((spinlock_t *)lock);
> + else if (type)
> + write_lock((rwlock_t *)lock);
> + else
> + read_lock((rwlock_t *)lock);
> +}
> +
> +static void __drop_lock(void *lock, int type)
> +{
> + if (likely(type == COND_RESCHED_SPIN))
> + spin_unlock((spinlock_t *)lock);
> + else if (type)
> + write_unlock((rwlock_t *)lock);
> + else
> + read_unlock((rwlock_t *)lock);
> +}
> +
> +static void __release_lock(void *lock, int type)
> +{
> + if (likely(type == COND_RESCHED_SPIN))
> + spin_release(&((spinlock_t *)lock)->dep_map, 1, _RET_IP_);
> + else
> + rwlock_release(&((rwlock_t *)lock)->dep_map, 1, _RET_IP_);
> +}
> +
> +/*
> + * __cond_resched_lock() - if a reschedule is pending, drop the given lock,
> * call schedule, and on return reacquire the lock.
> *
> + * Lock type:
> + * 0 = rwlock held for read
> + * 1 = rwlock held for write
> + * 2 = COND_RESCHED_SPIN = spinlock
> + *
> * This works OK both with and without CONFIG_PREEMPT. We do strange low-level
> * operations here to prevent schedule() from being called twice (once via
> - * spin_unlock(), once by hand).
> + * *_unlock(), once by hand).
> */
> -int cond_resched_lock(spinlock_t *lock)
> +int __cond_resched_lock(void *lock, int type)
> {
> int ret = 0;
>
> - if (need_lockbreak(lock)) {
> - spin_unlock(lock);
> + if (__need_lockbreak(lock, type)) {
> + __drop_lock(lock, type);
> cpu_relax();
> ret = 1;
> - spin_lock(lock);
> + __reacquire_lock(lock, type);
> }
> if (need_resched() && system_state == SYSTEM_RUNNING) {
> - spin_release(&lock->dep_map, 1, _THIS_IP_);
> - _raw_spin_unlock(lock);
> + __release_lock(lock, type);
> + __drop_lock(lock, type);
> preempt_enable_no_resched();
> __cond_resched();
> ret = 1;
> - spin_lock(lock);
> + __reacquire_lock(lock, type);
> }
> return ret;
> }
> -EXPORT_SYMBOL(cond_resched_lock);
> +EXPORT_SYMBOL(__cond_resched_lock);
>
This whole block looks like it belongs in another patch. Also, is it really
worth having single functions that handle all locks with loads of branches
instead of specific versions?
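For comparison, the "specific versions" alternative might look something like
the sketch below (not code from this series, and it falls back to plain
cond_resched() in kernel/sched.c rather than the low-level double-schedule
avoidance above):

int cond_resched_wlock(rwlock_t *lock)
{
	int ret = 0;

	if (need_lockbreak(lock) ||
	    (need_resched() && system_state == SYSTEM_RUNNING)) {
		write_unlock(lock);
		cond_resched();
		ret = 1;
		write_lock(lock);
	}
	return ret;
}

with an equivalent cond_resched_rlock() built on read_lock()/read_unlock(), so
the hot path never branches on a lock-type argument.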
> int __sched cond_resched_softirq(void)
> {
>
--
--
Mel Gorman
Part-time Phd Student Linux Technology Center
University of Limerick IBM Dublin Software Lab
--
* Re: [PATCH/RFC 2/14] Reclaim Scalability: convert inode i_mmap_lock to reader/writer lock
2007-09-14 20:54 ` [PATCH/RFC 2/14] Reclaim Scalability: convert inode i_mmap_lock to reader/writer lock Lee Schermerhorn
2007-09-17 12:53 ` Mel Gorman
@ 2007-09-20 1:24 ` Andrea Arcangeli
2007-09-20 14:10 ` Lee Schermerhorn
1 sibling, 1 reply; 77+ messages in thread
From: Andrea Arcangeli @ 2007-09-20 1:24 UTC (permalink / raw)
To: Lee Schermerhorn
Cc: linux-mm, akpm, mel, clameter, riel, balbir, a.p.zijlstra,
eric.whitney, npiggin
On Fri, Sep 14, 2007 at 04:54:12PM -0400, Lee Schermerhorn wrote:
> Note: This patch is meant to address a situation I've seen
> running a large Oracle OLTP workload--1000s of users--on a
> large HP ia64 NUMA platform. The system hung, spitting out
> "soft lockup" messages on the console. Stack traces showed
> that all cpus were in page_referenced(), as mentioned above.
> I let the system run overnight in this state--it never
> recovered before I decided to reboot.
Just to understand better, was that an oom condition? Can you press
SYSRQ+M to check the RAM and swap levels? If it's an oom condition the
problem may be quite different.
Still making those spinlocks rw sounds good to me.
--
* Re: [PATCH/RFC 2/14] Reclaim Scalability: convert inode i_mmap_lock to reader/writer lock
2007-09-20 1:24 ` Andrea Arcangeli
@ 2007-09-20 14:10 ` Lee Schermerhorn
2007-09-20 14:16 ` Andrea Arcangeli
0 siblings, 1 reply; 77+ messages in thread
From: Lee Schermerhorn @ 2007-09-20 14:10 UTC (permalink / raw)
To: Andrea Arcangeli
Cc: linux-mm, akpm, mel, clameter, riel, balbir, a.p.zijlstra,
eric.whitney, npiggin
On Thu, 2007-09-20 at 03:24 +0200, Andrea Arcangeli wrote:
> On Fri, Sep 14, 2007 at 04:54:12PM -0400, Lee Schermerhorn wrote:
> > Note: This patch is meant to address a situation I've seen
> > running a large Oracle OLTP workload--1000s of users--on a
> > large HP ia64 NUMA platform. The system hung, spitting out
> > "soft lockup" messages on the console. Stack traces showed
> > that all cpus were in page_referenced(), as mentioned above.
> > I let the system run overnight in this state--it never
> > recovered before I decided to reboot.
>
> Just to understand better, was that an oom condition? Can you press
> SYSRQ+M to check the RAM and swap levels? If it's an oom condition the
> problem may be quite different.
Actually, the system never went OOM. Didn't get that far. I was trying
to create an Oracle workload that would put me at the brink of reclaim,
and then by running some app that would eat page cache, push it over the
edge. But, I apparently went too far--too many Oracle users for this
system--and it went into reclaim, got hung up with all cpus spinning on
the i_mmap_lock in page_referenced_file().
I just got this system back for testing. Soon as I build a 23-rc6-mm1
kernel for it, I'll retest that with the same workload to demonstrate
the problem. Then I'll try it with the rw_lock patch to see if that
helps.
>
> Still making those spinlocks rw sounds good to me.
Well, except for the concern about the extra overhead of rw_locks. I'm
more worried about this for the i_mmap_lock than the anon_vma lock. The
only time we need to take the anon_vma lock for write is when adding a
new vma to the list, or removing one [vma_link(), et al]. But, the
i_mmap_lock is also used to protect the truncate_count, and must be
taken for write there. I expected that a kernel build might show
something with all the forks for parallel make, mapping of libc, cc
executable, ... but nothing.
Thanks,
Lee
--
* Re: [PATCH/RFC 2/14] Reclaim Scalability: convert inode i_mmap_lock to reader/writer lock
2007-09-20 14:10 ` Lee Schermerhorn
@ 2007-09-20 14:16 ` Andrea Arcangeli
0 siblings, 0 replies; 77+ messages in thread
From: Andrea Arcangeli @ 2007-09-20 14:16 UTC (permalink / raw)
To: Lee Schermerhorn
Cc: linux-mm, akpm, mel, clameter, riel, balbir, a.p.zijlstra,
eric.whitney, npiggin
On Thu, Sep 20, 2007 at 10:10:48AM -0400, Lee Schermerhorn wrote:
> Actually, the system never went OOM. Didn't get that far. I was trying
> to create an Oracle workload that would put me at the brink of reclaim,
> and then by running some app that would eat page cache, push it over the
> edge. But, I apparently went too far--too many Oracle users for this
> system--and it went into reclaim, got hung up with all cpus spinning on
> the i_mmap_lock in page_referenced_file().
>
> I just got this system back for testing. Soon as I build a 23-rc6-mm1
> kernel for it, I'll retest that with the same workload to demonstrate
> the problem. Then I'll try it with the rw_lock patch to see if that
> helps.
Ok, I guess it's a NUMA scalability issue. All pages belong to that
file... and they all thrash on the same spinlock. So I doubt the
rw_lock will help much; the thrashing where most time is probably spent
should be the same. The rw_lock still looks like a good idea, for smaller
systems with faster interconnects like dualcore ;)
> Well, except for the concern about the extra overhead of rw_locks. I'm
> more worried about this for the i_mmap_lock than the anon_vma lock. The
> only time we need to take the anon_vma lock for write is when adding a
> new vma to the list, or removing one [vma_link(), et al]. But, the
> i_mmap_lock is also used to protect the truncate_count, and must be
> taken for write there. I expected that a kernel build might show
> something with all the forks for parallel make, mapping of libc, cc
> executable, ... but nothing.
You mean it's not actually slower? Well, I doubt a few more instructions
count these days; the major hit is the cacheline miss, and
that'll be the same for rwlock or spinlock... (which is why it
probably won't help much on systems with tons of cpus where
cacheline bouncing thrashes so badly). Ironically, I think it's more an
optimization for small SMP with lots of RAM than for big SMP/NUMA.
--
* [PATCH/RFC 3/14] Reclaim Scalability: move isolate_lru_page() to vmscan.c
2007-09-14 20:53 [PATCH/RFC 0/14] Page Reclaim Scalability Lee Schermerhorn
2007-09-14 20:54 ` [PATCH/RFC 1/14] Reclaim Scalability: Convert anon_vma lock to read/write lock Lee Schermerhorn
2007-09-14 20:54 ` [PATCH/RFC 2/14] Reclaim Scalability: convert inode i_mmap_lock to reader/writer lock Lee Schermerhorn
@ 2007-09-14 20:54 ` Lee Schermerhorn
2007-09-14 21:34 ` Peter Zijlstra
2007-09-17 9:20 ` Balbir Singh
2007-09-14 20:54 ` [PATCH/RFC 4/14] Reclaim Scalability: Define page_anon() function Lee Schermerhorn
` (12 subsequent siblings)
15 siblings, 2 replies; 77+ messages in thread
From: Lee Schermerhorn @ 2007-09-14 20:54 UTC (permalink / raw)
To: linux-mm
Cc: akpm, mel, clameter, riel, balbir, andrea, a.p.zijlstra,
eric.whitney, npiggin
PATCH/RFC 03/14 Reclaim Scalability: move isolate_lru_page() to vmscan.c
Against 2.6.23-rc4-mm1
From: Nick Piggin <npiggin@suse.de>
To: Linux Memory Management <linux-mm@kvack.org>
Subject: [patch 1/4] mm: move and rework isolate_lru_page
Date: Mon, 12 Mar 2007 07:38:44 +0100 (CET)
isolate_lru_page logically belongs in vmscan.c rather than migrate.c.
It is a tough call, because we don't need that function without memory migration
so there is a valid argument to have it in migrate.c. However a subsequent
patch needs to make use of it in the core mm, so we can happily move it
to vmscan.c.
Also, make the function a little more generic by not requiring that it
adds an isolated page to a given list. Callers can do that.
Note that we now have '__isolate_lru_page()', that does
something quite different, visible outside of vmscan.c
for use with memory controller. Methinks we need to
rationalize these names/purposes. --lts
Signed-off-by: Nick Piggin <npiggin@suse.de>
include/linux/migrate.h | 3 ---
mm/internal.h | 2 ++
mm/mempolicy.c | 10 ++++++++--
mm/migrate.c | 47 ++++++++++-------------------------------------
mm/vmscan.c | 41 +++++++++++++++++++++++++++++++++++++++++
5 files changed, 61 insertions(+), 42 deletions(-)
Index: Linux/include/linux/migrate.h
===================================================================
--- Linux.orig/include/linux/migrate.h 2007-09-12 16:08:51.000000000 -0400
+++ Linux/include/linux/migrate.h 2007-09-12 16:10:11.000000000 -0400
@@ -25,7 +25,6 @@ static inline int vma_migratable(struct
return 1;
}
-extern int isolate_lru_page(struct page *p, struct list_head *pagelist);
extern int putback_lru_pages(struct list_head *l);
extern int migrate_page(struct address_space *,
struct page *, struct page *);
@@ -42,8 +41,6 @@ extern int migrate_vmas(struct mm_struct
static inline int vma_migratable(struct vm_area_struct *vma)
{ return 0; }
-static inline int isolate_lru_page(struct page *p, struct list_head *list)
- { return -ENOSYS; }
static inline int putback_lru_pages(struct list_head *l) { return 0; }
static inline int migrate_pages(struct list_head *l, new_page_t x,
unsigned long private) { return -ENOSYS; }
Index: Linux/mm/internal.h
===================================================================
--- Linux.orig/mm/internal.h 2007-09-12 16:08:51.000000000 -0400
+++ Linux/mm/internal.h 2007-09-14 10:17:54.000000000 -0400
@@ -34,6 +34,8 @@ static inline void __put_page(struct pag
atomic_dec(&page->_count);
}
+extern int isolate_lru_page(struct page *page);
+
extern void fastcall __init __free_pages_bootmem(struct page *page,
unsigned int order);
Index: Linux/mm/migrate.c
===================================================================
--- Linux.orig/mm/migrate.c 2007-09-12 16:10:11.000000000 -0400
+++ Linux/mm/migrate.c 2007-09-14 10:17:54.000000000 -0400
@@ -36,36 +36,6 @@
#define lru_to_page(_head) (list_entry((_head)->prev, struct page, lru))
/*
- * Isolate one page from the LRU lists. If successful put it onto
- * the indicated list with elevated page count.
- *
- * Result:
- * -EBUSY: page not on LRU list
- * 0: page removed from LRU list and added to the specified list.
- */
-int isolate_lru_page(struct page *page, struct list_head *pagelist)
-{
- int ret = -EBUSY;
-
- if (PageLRU(page)) {
- struct zone *zone = page_zone(page);
-
- spin_lock_irq(&zone->lru_lock);
- if (PageLRU(page) && get_page_unless_zero(page)) {
- ret = 0;
- ClearPageLRU(page);
- if (PageActive(page))
- del_page_from_active_list(zone, page);
- else
- del_page_from_inactive_list(zone, page);
- list_add_tail(&page->lru, pagelist);
- }
- spin_unlock_irq(&zone->lru_lock);
- }
- return ret;
-}
-
-/*
* migrate_prep() needs to be called before we start compiling a list of pages
* to be migrated using isolate_lru_page().
*/
@@ -850,14 +820,17 @@ static int do_move_pages(struct mm_struc
!migrate_all)
goto put_and_set;
- err = isolate_lru_page(page, &pagelist);
+ err = isolate_lru_page(page);
+ if (err) {
put_and_set:
- /*
- * Either remove the duplicate refcount from
- * isolate_lru_page() or drop the page ref if it was
- * not isolated.
- */
- put_page(page);
+ /*
+ * Either remove the duplicate refcount from
+ * isolate_lru_page() or drop the page ref if it was
+ * not isolated.
+ */
+ put_page(page);
+ } else
+ list_add_tail(&page->lru, &pagelist);
set_status:
pp->status = err;
}
Index: Linux/mm/vmscan.c
===================================================================
--- Linux.orig/mm/vmscan.c 2007-09-12 16:08:51.000000000 -0400
+++ Linux/mm/vmscan.c 2007-09-14 10:21:38.000000000 -0400
@@ -810,6 +810,47 @@ static unsigned long clear_active_flags(
return nr_active;
}
+/**
+ * isolate_lru_page(@page)
+ *
+ * Isolate one @page from the LRU lists. Must be called with an elevated
+ * refcount on the page, which is a fundamental difference from
+ * isolate_lru_pages (which is called without a stable reference).
+ *
+ * The returned page will have PageLRU() cleared, and PageActive() set,
+ * if it was found on the active list. This flag generally will need to be
+ * cleared by the caller before letting the page go.
+ *
+ * The vmstat page counts corresponding to the list on which the page was
+ * found will be decremented.
+ *
+ * lru_lock must not be held, interrupts must be enabled.
+ *
+ * Returns:
+ * -EBUSY: page not on LRU list
+ * 0: page removed from LRU list.
+ */
+int isolate_lru_page(struct page *page)
+{
+ int ret = -EBUSY;
+
+ if (PageLRU(page)) {
+ struct zone *zone = page_zone(page);
+
+ spin_lock_irq(&zone->lru_lock);
+ if (PageLRU(page)) {
+ ret = 0;
+ ClearPageLRU(page);
+ if (PageActive(page))
+ del_page_from_active_list(zone, page);
+ else
+ del_page_from_inactive_list(zone, page);
+ }
+ spin_unlock_irq(&zone->lru_lock);
+ }
+ return ret;
+}
+
/*
* shrink_inactive_list() is a helper for shrink_zone(). It returns the number
* of reclaimed pages
Index: Linux/mm/mempolicy.c
===================================================================
--- Linux.orig/mm/mempolicy.c 2007-09-12 16:08:51.000000000 -0400
+++ Linux/mm/mempolicy.c 2007-09-14 10:17:54.000000000 -0400
@@ -93,6 +93,8 @@
#include <asm/tlbflush.h>
#include <asm/uaccess.h>
+#include "internal.h"
+
/* Internal flags */
#define MPOL_MF_DISCONTIG_OK (MPOL_MF_INTERNAL << 0) /* Skip checks for continuous vmas */
#define MPOL_MF_INVERT (MPOL_MF_INTERNAL << 1) /* Invert check for nodemask */
@@ -593,8 +595,12 @@ static void migrate_page_add(struct page
/*
* Avoid migrating a page that is shared with others.
*/
- if ((flags & MPOL_MF_MOVE_ALL) || page_mapcount(page) == 1)
- isolate_lru_page(page, pagelist);
+ if ((flags & MPOL_MF_MOVE_ALL) || page_mapcount(page) == 1) {
+ if (!isolate_lru_page(page)) {
+ get_page(page);
+ list_add_tail(&page->lru, pagelist);
+ }
+ }
}
static struct page *new_node_page(struct page *page, unsigned long node, int **x)
--
* Re: [PATCH/RFC 3/14] Reclaim Scalability: move isolate_lru_page() to vmscan.c
2007-09-14 20:54 ` [PATCH/RFC 3/14] Reclaim Scalability: move isolate_lru_page() to vmscan.c Lee Schermerhorn
@ 2007-09-14 21:34 ` Peter Zijlstra
2007-09-15 1:55 ` Rik van Riel
2007-09-17 14:11 ` Lee Schermerhorn
2007-09-17 9:20 ` Balbir Singh
1 sibling, 2 replies; 77+ messages in thread
From: Peter Zijlstra @ 2007-09-14 21:34 UTC (permalink / raw)
To: Lee Schermerhorn
Cc: linux-mm, akpm, mel, clameter, riel, balbir, andrea, eric.whitney,
npiggin
On Fri, 2007-09-14 at 16:54 -0400, Lee Schermerhorn wrote:
> Note that we now have '__isolate_lru_page()', that does
> something quite different, visible outside of vmscan.c
> for use with memory controller. Methinks we need to
> rationalize these names/purposes. --lts
>
Actually it comes from lumpy reclaim, and does something very similar to
what this one does. When one looks at the mainline version one could
write:
int isolate_lru_page(struct page *page, struct list_head *pagelist)
{
int ret = -EBUSY;
if (PageLRU(page)) {
struct zone *zone = page_zone(page);
spin_lock_irq(&zone->lru_lock);
ret = __isolate_lru_page(page, ISOLATE_BOTH);
if (!ret) {
__dec_zone_state(zone, PageActive(page)
? NR_ACTIVE : NR_INACTIVE);
list_move_tail(&page->lru, pagelist);
}
spin_unlock_irq(&zone->lru_lock);
}
return ret;
}
Obviously the container stuff somewhat complicates matters in -mm.
> /*
> - * Isolate one page from the LRU lists. If successful put it onto
> - * the indicated list with elevated page count.
> - *
> - * Result:
> - * -EBUSY: page not on LRU list
> - * 0: page removed from LRU list and added to the specified list.
> - */
> -int isolate_lru_page(struct page *page, struct list_head *pagelist)
> -{
> - int ret = -EBUSY;
> -
> - if (PageLRU(page)) {
> - struct zone *zone = page_zone(page);
> -
> - spin_lock_irq(&zone->lru_lock);
> - if (PageLRU(page) && get_page_unless_zero(page)) {
> - ret = 0;
> - ClearPageLRU(page);
> - if (PageActive(page))
> - del_page_from_active_list(zone, page);
> - else
> - del_page_from_inactive_list(zone, page);
> - list_add_tail(&page->lru, pagelist);
> - }
> - spin_unlock_irq(&zone->lru_lock);
> - }
> - return ret;
> -}
A remarkable change is the disappearance of get_page_unless_zero() in the
new version.
> +/**
> + * isolate_lru_page(@page)
> + *
> + * Isolate one @page from the LRU lists. Must be called with an elevated
> + * refcount on the page, which is a fundamental difference from
> + * isolate_lru_pages (which is called without a stable reference).
> + *
> + * The returned page will have PageLRU() cleared, and PageActive() set,
> + * if it was found on the active list. This flag generally will need to be
> + * cleared by the caller before letting the page go.
> + *
> + * The vmstat page counts corresponding to the list on which the page was
> + * found will be decremented.
> + *
> + * lru_lock must not be held, interrupts must be enabled.
> + *
> + * Returns:
> + * -EBUSY: page not on LRU list
> + * 0: page removed from LRU list.
> + */
> +int isolate_lru_page(struct page *page)
> +{
> + int ret = -EBUSY;
> +
> + if (PageLRU(page)) {
> + struct zone *zone = page_zone(page);
> +
> + spin_lock_irq(&zone->lru_lock);
> + if (PageLRU(page)) {
> + ret = 0;
> + ClearPageLRU(page);
> + if (PageActive(page))
> + del_page_from_active_list(zone, page);
> + else
> + del_page_from_inactive_list(zone, page);
> + }
> + spin_unlock_irq(&zone->lru_lock);
> + }
> + return ret;
> +}
--
* Re: [PATCH/RFC 3/14] Reclaim Scalability: move isolate_lru_page() to vmscan.c
2007-09-14 21:34 ` Peter Zijlstra
@ 2007-09-15 1:55 ` Rik van Riel
2007-09-17 14:11 ` Lee Schermerhorn
1 sibling, 0 replies; 77+ messages in thread
From: Rik van Riel @ 2007-09-15 1:55 UTC (permalink / raw)
To: Peter Zijlstra
Cc: Lee Schermerhorn, linux-mm, akpm, mel, clameter, balbir, andrea,
eric.whitney, npiggin
Peter Zijlstra wrote:
> A remarkable change is the disappearance of get_page_unless_zero() in the
> new version.
That can't be right. The get_page_unless_zero() test
removes a real SMP race.
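To spell out the race being closed here (an illustrative sketch, not from the
original mail): with only the PageLRU() check, isolate_lru_page() could take a
reference on a page whose last reference is concurrently being dropped.

	/*
	 *  CPU0: isolate_lru_page()        CPU1: final put_page()
	 *  -------------------------       ----------------------------
	 *  sees PageLRU(page) set
	 *                                  put_page_testzero(): count -> 0,
	 *                                  page is on its way to the allocator
	 *  get_page(page) resurrects a
	 *  page that is already being freed
	 *
	 * get_page_unless_zero() takes the reference only while the count
	 * is still non-zero (under zone->lru_lock), so isolation backs off
	 * with -EBUSY instead of racing the free path.
	 */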
--
Politics is the struggle between those who want to make their country
the best in the world, and those who believe it already is. Each group
calls the other unpatriotic.
--
* Re: [PATCH/RFC 3/14] Reclaim Scalability: move isolate_lru_page() to vmscan.c
2007-09-14 21:34 ` Peter Zijlstra
2007-09-15 1:55 ` Rik van Riel
@ 2007-09-17 14:11 ` Lee Schermerhorn
1 sibling, 0 replies; 77+ messages in thread
From: Lee Schermerhorn @ 2007-09-17 14:11 UTC (permalink / raw)
To: Peter Zijlstra
Cc: linux-mm, akpm, mel, clameter, riel, balbir, andrea, eric.whitney,
npiggin
On Fri, 2007-09-14 at 23:34 +0200, Peter Zijlstra wrote:
> On Fri, 2007-09-14 at 16:54 -0400, Lee Schermerhorn wrote:
>
> > Note that we now have '__isolate_lru_page()', that does
> > something quite different, visible outside of vmscan.c
> > for use with memory controller. Methinks we need to
> > rationalize these names/purposes. --lts
> >
>
> Actually it comes from lumpy reclaim, and does something very similar to
> what this one does.
Sorry. My statement was a bit ambiguous. I meant that the visibility
of __isolate_lru_page() outside of vmscan.c comes about from the mem
controller patches. Lumpy reclaim did add the "isolation mode" [active,
inactive, both].
> When one looks at the mainline version one could
> write:
>
> int isolate_lru_page(struct page *page, struct list_head *pagelist)
> {
> int ret = -EBUSY;
>
> if (PageLRU(page)) {
> struct zone *zone = page_zone(page);
>
> spin_lock_irq(&zone->lru_lock);
> ret = __isolate_lru_page(page, ISOLATE_BOTH);
> if (!ret) {
> __dec_zone_state(zone, PageActive(page)
> ? NR_ACTIVE : NR_INACTIVE);
> list_move_tail(&page->lru, pagelist);
> }
> spin_unlock_irq(&zone->lru_lock);
> }
>
> return ret;
> }
>
In its initial form, yes. Later [in the first noreclaim patch] you'll
see that I hacked both isolate_lru_page and __isolate_lru_page() to
handle non-reclaimable pages: the former to recognize
non-reclaimable pages and isolate them from the noreclaim list; the
latter to allow isolation of non-reclaimable pages only when scanning
the active list, but not during lumpy reclaim.
I had to allow __isolate_lru_page() to accept non-reclaimable pages from
the active list in order to splice the noreclaim list back there when we
want to scan it--as you mentioned to me was discussed at the vm summit.
I'm not very happy with the result, and think we need to revisit how we
scan the noreclaim list for various conditions. I plan to fork off a
separate discussion on this point, real soon now.
> Obviously the container stuff somewhat complicates mattters in -mm.
>
> > /*
> > - * Isolate one page from the LRU lists. If successful put it onto
> > - * the indicated list with elevated page count.
> > - *
> > - * Result:
> > - * -EBUSY: page not on LRU list
> > - * 0: page removed from LRU list and added to the specified list.
> > - */
> > -int isolate_lru_page(struct page *page, struct list_head *pagelist)
> > -{
> > - int ret = -EBUSY;
> > -
> > - if (PageLRU(page)) {
> > - struct zone *zone = page_zone(page);
> > -
> > - spin_lock_irq(&zone->lru_lock);
> > - if (PageLRU(page) && get_page_unless_zero(page)) {
> > - ret = 0;
> > - ClearPageLRU(page);
> > - if (PageActive(page))
> > - del_page_from_active_list(zone, page);
> > - else
> > - del_page_from_inactive_list(zone, page);
> > - list_add_tail(&page->lru, pagelist);
> > - }
> > - spin_unlock_irq(&zone->lru_lock);
> > - }
> > - return ret;
> > -}
>
> A remarkable change is the disappearance of get_page_unless_zero() in the
> new version.
Good catch! What happened here is this:
The original version of isolate_lru_page() that Nick's patch moved had a
get_page() in the "if (PageLRU(page))" block--no get_page_unless_zero().
This was fine for Christoph's migration usage, because it was always
called in task context, holding the mm semaphore. Mel and Kame-san want
to use migration for defragmentation and hotplug from outside task
context, so one or the other of them [not sure] removed the get_page()
and added the get_page_unless_zero() into the if condition--around
mid-June. Apparently, during resolution of a forced patch conflict, I
managed to drop the get_page(), but not pick up the
get_page_unless_zero(). So much for following "established protocol for
handling pages on the LRU lists", huh?
<snip new, botched version>
Below is a patch to add back the get_page_unless_zero(). I'll roll this
into the move_and_rework... patch for the next posting, but in the
meantime, if anyone wants to try these, here's a quick fix.
I just tested with this and my tests ran much better. I still managed
to push my system into OOM during mbind() migration, but I am repeatedly
locking and unlocking 16G, sometimes in 8G chunks, of an 18G anon
segment to force swapping and such. Another test is creating 256MB anon
and private file-backed segments, binding them down and migrating them
around the platform. Eventually, this second test dies with OOM because
of CONSTRAINT_MEMORY_POLICY--insufficient memory on the target node.
The noreclaim statistics seemed to be behaving better as well, but once
the memtoy/mlock test went OOM with locked pages, quite a few pages
remained non-reclaimable after I killed off the other tests. Still a
lot of work to do on reviving non-reclaimable pages.
Thanks,
Lee
======================
PATCH move and rework isolate_lru_page fix
I accidentally dropped the recently added "get_page_unless_zero(page)"
from isolate_lru_page() during resolution of a forced patch
conflict.
Put it back!!!
Signed-off-by: Lee Schermerhorn <lee.schermerhorn@hp.com>
mm/vmscan.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
Index: Linux/mm/vmscan.c
===================================================================
--- Linux.orig/mm/vmscan.c 2007-09-17 09:06:01.000000000 -0400
+++ Linux/mm/vmscan.c 2007-09-17 09:07:37.000000000 -0400
@@ -838,7 +838,7 @@ int isolate_lru_page(struct page *page)
struct zone *zone = page_zone(page);
spin_lock_irq(&zone->lru_lock);
- if (PageLRU(page)) {
+ if (PageLRU(page) && get_page_unless_zero(page)) {
ret = 0;
ClearPageLRU(page);
if (PageActive(page))
--
* Re: [PATCH/RFC 3/14] Reclaim Scalability: move isolate_lru_page() to vmscan.c
2007-09-14 20:54 ` [PATCH/RFC 3/14] Reclaim Scalability: move isolate_lru_page() to vmscan.c Lee Schermerhorn
2007-09-14 21:34 ` Peter Zijlstra
@ 2007-09-17 9:20 ` Balbir Singh
2007-09-17 19:19 ` Lee Schermerhorn
1 sibling, 1 reply; 77+ messages in thread
From: Balbir Singh @ 2007-09-17 9:20 UTC (permalink / raw)
To: Lee Schermerhorn
Cc: linux-mm, akpm, mel, clameter, riel, andrea, a.p.zijlstra,
eric.whitney, npiggin
Lee Schermerhorn wrote:
> +int isolate_lru_page(struct page *page)
> +{
> + int ret = -EBUSY;
> +
> + if (PageLRU(page)) {
> + struct zone *zone = page_zone(page);
> +
> + spin_lock_irq(&zone->lru_lock);
> + if (PageLRU(page)) {
> + ret = 0;
> + ClearPageLRU(page);
> + if (PageActive(page))
> + del_page_from_active_list(zone, page);
> + else
> + del_page_from_inactive_list(zone, page);
> + }
Wouldn't using a pagelist as an argument and moving to that be easier?
Are there any cases where we just remove from the list and not move it
elsewhere?
> + spin_unlock_irq(&zone->lru_lock);
> + }
> + return ret;
> +}
> +
Any chance we could merge __isolate_lru_page() and isolate_lru_page()?
--
Warm Regards,
Balbir Singh
Linux Technology Center
IBM, ISTL
--
* Re: [PATCH/RFC 3/14] Reclaim Scalability: move isolate_lru_page() to vmscan.c
2007-09-17 9:20 ` Balbir Singh
@ 2007-09-17 19:19 ` Lee Schermerhorn
0 siblings, 0 replies; 77+ messages in thread
From: Lee Schermerhorn @ 2007-09-17 19:19 UTC (permalink / raw)
To: balbir
Cc: linux-mm, akpm, mel, clameter, riel, andrea, a.p.zijlstra,
eric.whitney, npiggin
On Mon, 2007-09-17 at 14:50 +0530, Balbir Singh wrote:
> Lee Schermerhorn wrote:
> > +int isolate_lru_page(struct page *page)
> > +{
> > + int ret = -EBUSY;
> > +
> > + if (PageLRU(page)) {
> > + struct zone *zone = page_zone(page);
> > +
> > + spin_lock_irq(&zone->lru_lock);
> > + if (PageLRU(page)) {
> > + ret = 0;
> > + ClearPageLRU(page);
> > + if (PageActive(page))
> > + del_page_from_active_list(zone, page);
> > + else
> > + del_page_from_inactive_list(zone, page);
> > + }
>
> Wouldn't using a pagelist as an argument and moving to that be easier?
> Are there any cases where we just remove from the list and not move it
> elsewhere?
Actually, isolate_lru_page() used to do that, and Nick removed that
aspect for use with the mlock patches--so as not to have to use a dummy
list for single pages in the mlock code. Nick's way is probably OK
performance-wise for the current usage in migration and mlock code.
Might need to rethink this if this function gets wider usage, now that
it's globally visible.
>
> > + spin_unlock_irq(&zone->lru_lock);
> > + }
> > + return ret;
> > +}
> > +
>
> Any chance we could merge __isolate_lru_page() and isolate_lru_page()?
I wondered about this in one of the patch descriptions. Peter Z
proposed a wrapper around __isolate_lru_page() to do this. The other
changes that I made to __isolate_lru_page() complicate this a bit, but I
think it could be managable. Note that isolate_lru_page() is called
with the zone lru_lock unlocked and it updates the list stats.
__isolate_lru_page is called from a batch *isolate_lru_pages*() function
and does not update the stats. The wrapper can handle these extra
tasks, I think--efficiently, I hope.
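A rough sketch of such a wrapper, building on the version suggested earlier in
the thread (illustrative only, against the mainline __isolate_lru_page() API;
the -mm container bits and the noreclaim handling would layer on top):

int isolate_lru_page(struct page *page)
{
	int ret = -EBUSY;

	if (PageLRU(page)) {
		struct zone *zone = page_zone(page);

		spin_lock_irq(&zone->lru_lock);
		/* __isolate_lru_page() rechecks PageLRU and does the
		 * get_page_unless_zero() under the lock */
		ret = __isolate_lru_page(page, ISOLATE_BOTH);
		if (!ret)
			__dec_zone_state(zone, PageActive(page) ?
						NR_ACTIVE : NR_INACTIVE);
		spin_unlock_irq(&zone->lru_lock);
	}
	return ret;
}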
Lee
--
* [PATCH/RFC 4/14] Reclaim Scalability: Define page_anon() function
2007-09-14 20:53 [PATCH/RFC 0/14] Page Reclaim Scalability Lee Schermerhorn
` (2 preceding siblings ...)
2007-09-14 20:54 ` [PATCH/RFC 3/14] Reclaim Scalability: move isolate_lru_page() to vmscan.c Lee Schermerhorn
@ 2007-09-14 20:54 ` Lee Schermerhorn
2007-09-15 2:00 ` Rik van Riel
` (2 more replies)
2007-09-14 20:54 ` [PATCH/RFC 5/14] Reclaim Scalability: Use an indexed array for LRU variables Lee Schermerhorn
` (11 subsequent siblings)
15 siblings, 3 replies; 77+ messages in thread
From: Lee Schermerhorn @ 2007-09-14 20:54 UTC (permalink / raw)
To: linux-mm
Cc: akpm, mel, clameter, riel, balbir, andrea, a.p.zijlstra,
eric.whitney, npiggin
PATCH/RFC 04/14 Reclaim Scalability: Define page_anon() function
to answer the question: is page backed by swap space?
Against: 2.6.23-rc4-mm1
Originally part of Rik van Riel's split-lru patch. Extracted
to make available for other, independent reclaim patches.
Moved page_anon() inline function to linux/mm_inline.h where it will
be needed by subsequent "split LRU" and "noreclaim" patches.
page_anon() requires the definition of struct address_space() from
linux/fs.h. A patch in 2.6.23-rc1-mm* removed the include of fs.h
from linux/mm.h in favor of including it where it's needed. Add it
back to mm_inline.h, which is included from very few places. These
include all the places where page_anon() is needed--so far.
Originally posted, but not Signed-off-by: Rik van Riel <riel@redhat.com>
Signed-off-by: Lee Schermerhorn <lee.schermerhorn@hp.com>
include/linux/mm_inline.h | 29 +++++++++++++++++++++++++++++
mm/shmem.c | 4 ++--
2 files changed, 31 insertions(+), 2 deletions(-)
Index: Linux/include/linux/mm_inline.h
===================================================================
--- Linux.orig/include/linux/mm_inline.h 2007-07-08 19:32:17.000000000 -0400
+++ Linux/include/linux/mm_inline.h 2007-09-10 11:45:22.000000000 -0400
@@ -1,3 +1,31 @@
+#ifndef LINUX_MM_INLINE_H
+#define LINUX_MM_INLINE_H
+
+#include <linux/fs.h> /* need struct address_space for page_anon() */
+
+/*
+ * Returns true if this page is anonymous, tmpfs or otherwise swap backed.
+ */
+extern const struct address_space_operations shmem_aops;
+static inline int page_anon(struct page *page)
+{
+ struct address_space *mapping;
+
+ if (PageAnon(page) || PageSwapCache(page))
+ return 1;
+ mapping = page_mapping(page);
+ if (!mapping || !mapping->a_ops)
+ return 0;
+ if (mapping->a_ops == &shmem_aops)
+ return 1;
+ /* Should ramfs pages go onto an mlocked list instead? */
+ if ((unlikely(mapping->a_ops->writepage == NULL && PageDirty(page))))
+ return 1;
+
+ /* The page is page cache backed by a normal filesystem. */
+ return 0;
+}
+
static inline void
add_page_to_active_list(struct zone *zone, struct page *page)
{
@@ -38,3 +66,4 @@ del_page_from_lru(struct zone *zone, str
}
}
+#endif
Index: Linux/mm/shmem.c
===================================================================
--- Linux.orig/mm/shmem.c 2007-09-10 10:09:47.000000000 -0400
+++ Linux/mm/shmem.c 2007-09-10 11:45:22.000000000 -0400
@@ -180,7 +180,7 @@ static inline void shmem_unacct_blocks(u
}
static const struct super_operations shmem_ops;
-static const struct address_space_operations shmem_aops;
+const struct address_space_operations shmem_aops;
static const struct file_operations shmem_file_operations;
static const struct inode_operations shmem_inode_operations;
static const struct inode_operations shmem_dir_inode_operations;
@@ -2353,7 +2353,7 @@ static void destroy_inodecache(void)
kmem_cache_destroy(shmem_inode_cachep);
}
-static const struct address_space_operations shmem_aops = {
+const struct address_space_operations shmem_aops = {
.writepage = shmem_writepage,
.set_page_dirty = __set_page_dirty_no_writeback,
#ifdef CONFIG_TMPFS
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org. For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>
^ permalink raw reply [flat|nested] 77+ messages in thread
* Re: [PATCH/RFC 4/14] Reclaim Scalability: Define page_anon() function
2007-09-14 20:54 ` [PATCH/RFC 4/14] Reclaim Scalability: Define page_anon() function Lee Schermerhorn
@ 2007-09-15 2:00 ` Rik van Riel
2007-09-17 13:19 ` Mel Gorman
2007-09-18 1:58 ` KAMEZAWA Hiroyuki
2 siblings, 0 replies; 77+ messages in thread
From: Rik van Riel @ 2007-09-15 2:00 UTC (permalink / raw)
To: Lee Schermerhorn
Cc: linux-mm, akpm, mel, clameter, balbir, andrea, a.p.zijlstra,
eric.whitney, npiggin
Lee Schermerhorn wrote:
> PATCH/RFC 03/14 Reclaim Scalability: Define page_anon() function
> to answer the question: is page backed by swap space?
>
> Against: 2.6.23-rc4-mm1
>
> Originally part of Rik van Riel's split-lru patch. Extracted
> to make available for other, independent reclaim patches.
> Originally posted, but not Signed-off-by: Rik van Riel <riel@redhat.com>
> Signed-off-by: Lee Schermerhorn <lee.schermerhorn@hp.com>
Here it is:
Signed-off-by: Rik van Riel <riel@redhat.com>
--
Politics is the struggle between those who want to make their country
the best in the world, and those who believe it already is. Each group
calls the other unpatriotic.
^ permalink raw reply [flat|nested] 77+ messages in thread
* Re: [PATCH/RFC 4/14] Reclaim Scalability: Define page_anon() function
2007-09-14 20:54 ` [PATCH/RFC 4/14] Reclaim Scalability: Define page_anon() function Lee Schermerhorn
2007-09-15 2:00 ` Rik van Riel
@ 2007-09-17 13:19 ` Mel Gorman
2007-09-18 1:58 ` KAMEZAWA Hiroyuki
2 siblings, 0 replies; 77+ messages in thread
From: Mel Gorman @ 2007-09-17 13:19 UTC (permalink / raw)
To: Lee Schermerhorn
Cc: linux-mm, akpm, clameter, riel, balbir, andrea, a.p.zijlstra,
eric.whitney, npiggin
On (14/09/07 16:54), Lee Schermerhorn didst pronounce:
> PATCH/RFC 03/14 Reclaim Scalability: Define page_anon() function
> to answer the question: is page backed by swap space?
>
> Against: 2.6.23-rc4-mm1
>
> Originally part of Rik van Riel's split-lru patch. Extracted
> to make available for other, independent reclaim patches.
>
> Moved page_anon() inline function to linux/mm_inline.h where it will
> be needed by subsequent "split LRU" and "noreclaim" patches.
>
> page_anon() requires the definition of struct address_space() from
> linux/fs.h. A patch in 2.6.23-rc1-mm* removed the include of fs.h
> from linux/mm.h in favor of including it where it's needed. Add it
> back to mm_inline.h, which is included from very few places. These
> include all the places where page_anon() is needed--so far.
>
> Originally posted, but not Signed-off-by: Rik van Riel <riel@redhat.com>
> Signed-off-by: Lee Schermerhorn <lee.schermerhorn@hp.com>
>
> include/linux/mm_inline.h | 29 +++++++++++++++++++++++++++++
> mm/shmem.c | 4 ++--
> 2 files changed, 31 insertions(+), 2 deletions(-)
>
> Index: Linux/include/linux/mm_inline.h
> ===================================================================
> --- Linux.orig/include/linux/mm_inline.h 2007-07-08 19:32:17.000000000 -0400
> +++ Linux/include/linux/mm_inline.h 2007-09-10 11:45:22.000000000 -0400
> @@ -1,3 +1,31 @@
> +#ifndef LINUX_MM_INLINE_H
> +#define LINUX_MM_INLINE_H
> +
> +#include <linux/fs.h> /* need struct address_space for page_anon() */
> +
> +/*
> + * Returns true if this page is anonymous, tmpfs or otherwise swap backed.
> + */
> +extern const struct address_space_operations shmem_aops;
> +static inline int page_anon(struct page *page)
> +{
> + struct address_space *mapping;
> +
> + if (PageAnon(page) || PageSwapCache(page))
> + return 1;
> + mapping = page_mapping(page);
> + if (!mapping || !mapping->a_ops)
> + return 0;
> + if (mapping->a_ops == &shmem_aops)
> + return 1;
> + /* Should ramfs pages go onto an mlocked list instead? */
> + if ((unlikely(mapping->a_ops->writepage == NULL && PageDirty(page))))
> + return 1;
Does this last check belong as part of this patch? It looks like it
should be later in the set and doesn't seem to be checking if the page
is anon or not.
Separate helper perhaps?
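If it were split out, a sketch of such a helper could look like this (the name
is made up here; nothing like it was posted):

/*
 * Sketch: pages that are dirty but whose mapping has no ->writepage
 * (ramfs and friends) cannot be cleaned, so they are effectively
 * pinned in memory rather than swap backed in the usual sense.
 */
static inline int page_unwritable(struct page *page)
{
	struct address_space *mapping = page_mapping(page);

	return mapping && mapping->a_ops &&
	       mapping->a_ops->writepage == NULL && PageDirty(page);
}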
> +
> + /* The page is page cache backed by a normal filesystem. */
> + return 0;
> +}
> +
> static inline void
> add_page_to_active_list(struct zone *zone, struct page *page)
> {
> @@ -38,3 +66,4 @@ del_page_from_lru(struct zone *zone, str
> }
> }
>
> +#endif
> Index: Linux/mm/shmem.c
> ===================================================================
> --- Linux.orig/mm/shmem.c 2007-09-10 10:09:47.000000000 -0400
> +++ Linux/mm/shmem.c 2007-09-10 11:45:22.000000000 -0400
> @@ -180,7 +180,7 @@ static inline void shmem_unacct_blocks(u
> }
>
> static const struct super_operations shmem_ops;
> -static const struct address_space_operations shmem_aops;
> +const struct address_space_operations shmem_aops;
> static const struct file_operations shmem_file_operations;
> static const struct inode_operations shmem_inode_operations;
> static const struct inode_operations shmem_dir_inode_operations;
> @@ -2353,7 +2353,7 @@ static void destroy_inodecache(void)
> kmem_cache_destroy(shmem_inode_cachep);
> }
>
> -static const struct address_space_operations shmem_aops = {
> +const struct address_space_operations shmem_aops = {
> .writepage = shmem_writepage,
> .set_page_dirty = __set_page_dirty_no_writeback,
> #ifdef CONFIG_TMPFS
>
--
--
Mel Gorman
Part-time Phd Student Linux Technology Center
University of Limerick IBM Dublin Software Lab
^ permalink raw reply [flat|nested] 77+ messages in thread
* Re: [PATCH/RFC 4/14] Reclaim Scalability: Define page_anon() function
2007-09-14 20:54 ` [PATCH/RFC 4/14] Reclaim Scalability: Define page_anon() function Lee Schermerhorn
2007-09-15 2:00 ` Rik van Riel
2007-09-17 13:19 ` Mel Gorman
@ 2007-09-18 1:58 ` KAMEZAWA Hiroyuki
2007-09-18 2:27 ` Rik van Riel
2007-09-18 15:04 ` Lee Schermerhorn
2 siblings, 2 replies; 77+ messages in thread
From: KAMEZAWA Hiroyuki @ 2007-09-18 1:58 UTC (permalink / raw)
To: Lee Schermerhorn
Cc: linux-mm, akpm, mel, clameter, riel, balbir, andrea, a.p.zijlstra,
eric.whitney, npiggin
On Fri, 14 Sep 2007 16:54:25 -0400
Lee Schermerhorn <lee.schermerhorn@hp.com> wrote:
> +/*
> + * Returns true if this page is anonymous, tmpfs or otherwise swap backed.
> + */
> +extern const struct address_space_operations shmem_aops;
> +static inline int page_anon(struct page *page)
> +{
> + struct address_space *mapping;
> +
> + if (PageAnon(page) || PageSwapCache(page))
> + return 1;
> + mapping = page_mapping(page);
> + if (!mapping || !mapping->a_ops)
> + return 0;
> + if (mapping->a_ops == &shmem_aops)
> + return 1;
> + /* Should ramfs pages go onto an mlocked list instead? */
> + if ((unlikely(mapping->a_ops->writepage == NULL && PageDirty(page))))
> + return 1;
> +
> + /* The page is page cache backed by a normal filesystem. */
> + return 0;
> +}
> +
Hi, it seems the name 'page_anon()' is not clear..
In my understanding, an anonymous page is a MAP_ANONYMOUS page.
Can't we have better name ?
Thanks,
-Kame
^ permalink raw reply [flat|nested] 77+ messages in thread
* Re: [PATCH/RFC 4/14] Reclaim Scalability: Define page_anon() function
2007-09-18 1:58 ` KAMEZAWA Hiroyuki
@ 2007-09-18 2:27 ` Rik van Riel
2007-09-18 2:40 ` KAMEZAWA Hiroyuki
2007-09-18 15:04 ` Lee Schermerhorn
1 sibling, 1 reply; 77+ messages in thread
From: Rik van Riel @ 2007-09-18 2:27 UTC (permalink / raw)
To: KAMEZAWA Hiroyuki
Cc: Lee Schermerhorn, linux-mm, akpm, mel, clameter, balbir, andrea,
a.p.zijlstra, eric.whitney, npiggin
KAMEZAWA Hiroyuki wrote:
> On Fri, 14 Sep 2007 16:54:25 -0400
> Lee Schermerhorn <lee.schermerhorn@hp.com> wrote:
>> +/*
>> + * Returns true if this page is anonymous, tmpfs or otherwise swap backed.
>> + */
>> +extern const struct address_space_operations shmem_aops;
>> +static inline int page_anon(struct page *page)
>> +{
>> + struct address_space *mapping;
>> +
>> + if (PageAnon(page) || PageSwapCache(page))
>> + return 1;
>> + mapping = page_mapping(page);
>> + if (!mapping || !mapping->a_ops)
>> + return 0;
>> + if (mapping->a_ops == &shmem_aops)
>> + return 1;
>> + /* Should ramfs pages go onto an mlocked list instead? */
>> + if ((unlikely(mapping->a_ops->writepage == NULL && PageDirty(page))))
>> + return 1;
>> +
>> + /* The page is page cache backed by a normal filesystem. */
>> + return 0;
>> +}
>> +
>
> Hi, it seems the name 'page_anon()' is not clear..
> In my understanding, an anonymous page is a MAP_ANONYMOUS page.
> Can't we have better name ?
The idea is to distinguish pages that are (or could be) swap backed
from pages that are filesystem backed.
--
Politics is the struggle between those who want to make their country
the best in the world, and those who believe it already is. Each group
calls the other unpatriotic.
^ permalink raw reply [flat|nested] 77+ messages in thread
* Re: [PATCH/RFC 4/14] Reclaim Scalability: Define page_anon() function
2007-09-18 2:27 ` Rik van Riel
@ 2007-09-18 2:40 ` KAMEZAWA Hiroyuki
0 siblings, 0 replies; 77+ messages in thread
From: KAMEZAWA Hiroyuki @ 2007-09-18 2:40 UTC (permalink / raw)
To: Rik van Riel
Cc: Lee Schermerhorn, linux-mm, akpm, mel, clameter, balbir, andrea,
a.p.zijlstra, eric.whitney, npiggin
On Mon, 17 Sep 2007 22:27:09 -0400
Rik van Riel <riel@redhat.com> wrote:
> >
> > Hi, it seems the name 'page_anon()' is not clear..
> > In my understanding, an anonymous page is a MAP_ANONYMOUS page.
> > Can't we have better name ?
>
> The idea is to distinguish pages that are (or could be) swap backed
> from pages that are filesystem backed.
>
Yes, I know the concept.
how about page_not_persistent(), or write precise text about the difference between
page_anon() and PageAnon().
Thanks,
-Kame
^ permalink raw reply [flat|nested] 77+ messages in thread
* Re: [PATCH/RFC 4/14] Reclaim Scalability: Define page_anon() function
2007-09-18 1:58 ` KAMEZAWA Hiroyuki
2007-09-18 2:27 ` Rik van Riel
@ 2007-09-18 15:04 ` Lee Schermerhorn
2007-09-18 19:41 ` Christoph Lameter
2007-09-19 0:30 ` KAMEZAWA Hiroyuki
1 sibling, 2 replies; 77+ messages in thread
From: Lee Schermerhorn @ 2007-09-18 15:04 UTC (permalink / raw)
To: KAMEZAWA Hiroyuki
Cc: linux-mm, akpm, mel, clameter, riel, balbir, andrea, a.p.zijlstra,
eric.whitney, npiggin
On Tue, 2007-09-18 at 10:58 +0900, KAMEZAWA Hiroyuki wrote:
> On Fri, 14 Sep 2007 16:54:25 -0400
> Lee Schermerhorn <lee.schermerhorn@hp.com> wrote:
> > +/*
> > + * Returns true if this page is anonymous, tmpfs or otherwise swap backed.
> > + */
> > +extern const struct address_space_operations shmem_aops;
> > +static inline int page_anon(struct page *page)
> > +{
> > + struct address_space *mapping;
> > +
> > + if (PageAnon(page) || PageSwapCache(page))
> > + return 1;
> > + mapping = page_mapping(page);
> > + if (!mapping || !mapping->a_ops)
> > + return 0;
> > + if (mapping->a_ops == &shmem_aops)
> > + return 1;
> > + /* Should ramfs pages go onto an mlocked list instead? */
> > + if ((unlikely(mapping->a_ops->writepage == NULL && PageDirty(page))))
> > + return 1;
> > +
> > + /* The page is page cache backed by a normal filesystem. */
> > + return 0;
> > +}
> > +
>
> Hi, it seems the name 'page_anon()' is not clear..
> In my understanding, an anonymous page is a MAP_ANONYMOUS page.
> Can't we have better name ?
Hi, Kame-san:
I'm open to a "better name". Probably Rik, too -- it's his original
name.
How about one of these?
- page_is_swap_backed() or page_is_backed_by_swap_space()
- page_needs_swap_space() or page_uses_swap_space()
- pageNeedSwapSpaceToBeReclaimable() [X11-style :-)]
Other ideas?
Lee
^ permalink raw reply [flat|nested] 77+ messages in thread
* Re: [PATCH/RFC 4/14] Reclaim Scalability: Define page_anon() function
2007-09-18 15:04 ` Lee Schermerhorn
@ 2007-09-18 19:41 ` Christoph Lameter
2007-09-19 0:30 ` KAMEZAWA Hiroyuki
1 sibling, 0 replies; 77+ messages in thread
From: Christoph Lameter @ 2007-09-18 19:41 UTC (permalink / raw)
To: Lee Schermerhorn
Cc: KAMEZAWA Hiroyuki, linux-mm, akpm, mel, riel, balbir, andrea,
a.p.zijlstra, eric.whitney, npiggin
On Tue, 18 Sep 2007, Lee Schermerhorn wrote:
> > > +static inline int page_anon(struct page *page)
> > > +{
> > > + struct address_space *mapping;
> > > +
> > > + if (PageAnon(page) || PageSwapCache(page))
> > > + return 1;
> > > + mapping = page_mapping(page);
> > > + if (!mapping || !mapping->a_ops)
> > > + return 0;
> > > + if (mapping->a_ops == &shmem_aops)
> > > + return 1;
> > > + /* Should ramfs pages go onto an mlocked list instead? */
> > > + if ((unlikely(mapping->a_ops->writepage == NULL && PageDirty(page))))
> > > + return 1;
> > > +
> > > + /* The page is page cache backed by a normal filesystem. */
> > > + return 0;
> > > +}
> Other ideas?
page_memory_backed()?
^ permalink raw reply [flat|nested] 77+ messages in thread
* Re: [PATCH/RFC 4/14] Reclaim Scalability: Define page_anon() function
2007-09-18 15:04 ` Lee Schermerhorn
2007-09-18 19:41 ` Christoph Lameter
@ 2007-09-19 0:30 ` KAMEZAWA Hiroyuki
2007-09-19 16:58 ` Lee Schermerhorn
1 sibling, 1 reply; 77+ messages in thread
From: KAMEZAWA Hiroyuki @ 2007-09-19 0:30 UTC (permalink / raw)
To: Lee Schermerhorn
Cc: linux-mm, akpm, mel, clameter, riel, balbir, andrea, a.p.zijlstra,
eric.whitney, npiggin
On Tue, 18 Sep 2007 11:04:46 -0400
Lee Schermerhorn <Lee.Schermerhorn@hp.com> wrote:
> > Hi, it seems the name 'page_anon()' is not clear..
> > In my understanding, an anonymous page is a MAP_ANONYMOUS page.
> > Can't we have better name ?
>
> Hi, Kame-san:
>
> I'm open to a "better name". Probably Rik, too -- it's his original
> name.
>
> How about one of these?
>
> - page_is_swap_backed() or page_is_backed_by_swap_space()
> - page_needs_swap_space() or page_uses_swap_space()
> - pageNeedSwapSpaceToBeReclaimable() [X11-style :-)]
>
My point is that the word "anonymous" is traditionally used for user's
work memory. and page_anon() page seems not to be swap-backed always.
(you includes ramfs etc..)
Hmm...how about page_anon_cache() ?
But finally, please name it as you like. sorry for nitpicks.
Thanks,
-Kame
^ permalink raw reply [flat|nested] 77+ messages in thread
* Re: [PATCH/RFC 4/14] Reclaim Scalability: Define page_anon() function
2007-09-19 0:30 ` KAMEZAWA Hiroyuki
@ 2007-09-19 16:58 ` Lee Schermerhorn
2007-09-20 0:56 ` KAMEZAWA Hiroyuki
0 siblings, 1 reply; 77+ messages in thread
From: Lee Schermerhorn @ 2007-09-19 16:58 UTC (permalink / raw)
To: KAMEZAWA Hiroyuki
Cc: linux-mm, akpm, mel, clameter, riel, balbir, andrea, a.p.zijlstra,
eric.whitney, npiggin
On Wed, 2007-09-19 at 09:30 +0900, KAMEZAWA Hiroyuki wrote:
> On Tue, 18 Sep 2007 11:04:46 -0400
> Lee Schermerhorn <Lee.Schermerhorn@hp.com> wrote:
>
> > > Hi, it seems the name 'page_anon()' is not clear..
> > > In my understanding, an anonymous page is a MAP_ANONYMOUS page.
> > > Can't we have better name ?
> >
> > Hi, Kame-san:
> >
> > I'm open to a "better name". Probably Rik, too -- it's his original
> > name.
> >
> > How about one of these?
> >
> > - page_is_swap_backed() or page_is_backed_by_swap_space()
> > - page_needs_swap_space() or page_uses_swap_space()
> > - pageNeedSwapSpaceToBeReclaimable() [X11-style :-)]
> >
> My point is that the word "anonymous" is traditionally used for user's
> work memory. and page_anon() page seems not to be swap-backed always.
> (you includes ramfs etc..)
I did understand your point. Sorry if my response was confusing.
Next respin will remove ramfs. That will be treated as a
non-reclaimable address space.
>
> Hmm...how about page_anon_cache() ?
That would work.
>
> But finally, please name it as you like. sorry for nitpicks.
No problem. I agree that the name doesn't precisely match the
meaning.
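For reference, the narrower helper being converged on here might look roughly
as follows; the name is only illustrative (the final choice was still open at
this point) and the ramfs case is assumed to move to the noreclaim side:

/*
 * Sketch of the discussed respin: answer only "is this page
 * (potentially) swap backed?" -- anon, swap cache or tmpfs/shmem.
 */
static inline int page_swap_backed(struct page *page)
{
	struct address_space *mapping;

	if (PageAnon(page) || PageSwapCache(page))
		return 1;
	mapping = page_mapping(page);
	return mapping && mapping->a_ops == &shmem_aops;
}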
Lee
^ permalink raw reply [flat|nested] 77+ messages in thread
* Re: [PATCH/RFC 4/14] Reclaim Scalability: Define page_anon() function
2007-09-19 16:58 ` Lee Schermerhorn
@ 2007-09-20 0:56 ` KAMEZAWA Hiroyuki
0 siblings, 0 replies; 77+ messages in thread
From: KAMEZAWA Hiroyuki @ 2007-09-20 0:56 UTC (permalink / raw)
To: Lee Schermerhorn
Cc: linux-mm, akpm, mel, clameter, riel, balbir, andrea, a.p.zijlstra,
eric.whitney, npiggin
On Wed, 19 Sep 2007 12:58:39 -0400
Lee Schermerhorn <Lee.Schermerhorn@hp.com> wrote:
> On Wed, 2007-09-19 at 09:30 +0900, KAMEZAWA Hiroyuki wrote:
> > On Tue, 18 Sep 2007 11:04:46 -0400
> > Lee Schermerhorn <Lee.Schermerhorn@hp.com> wrote:
> >
> > > > Hi, it seems the name 'page_anon()' is not clear..
> > > > In my understanding, an anonymous page is a MAP_ANONYMOUS page.
> > > > Can't we have better name ?
> > >
> > > Hi, Kame-san:
> > >
> > > I'm open to a "better name". Probably Rik, too -- it's his original
> > > name.
> > >
> > > How about one of these?
> > >
> > > - page_is_swap_backed() or page_is_backed_by_swap_space()
> > > - page_needs_swap_space() or page_uses_swap_space()
> > > - pageNeedSwapSpaceToBeReclaimable() [X11-style :-)]
> > >
> > My point is that the word "anonymous" is traditionally used for user's
> > work memory. and page_anon() page seems not to be swap-backed always.
> > (you includes ramfs etc..)
>
> I did understand your point. Sorry if my response was confusing.
>
> Next respin will remove ramfs. That will be treated as a
> non-reclaimable address space.
>
Ok, seems reasonable.
Thanks,
-Kame
^ permalink raw reply [flat|nested] 77+ messages in thread
* [PATCH/RFC 5/14] Reclaim Scalability: Use an indexed array for LRU variables
2007-09-14 20:53 [PATCH/RFC 0/14] Page Reclaim Scalability Lee Schermerhorn
` (3 preceding siblings ...)
2007-09-14 20:54 ` [PATCH/RFC 4/14] Reclaim Scalability: Define page_anon() function Lee Schermerhorn
@ 2007-09-14 20:54 ` Lee Schermerhorn
2007-09-17 13:40 ` Mel Gorman
2007-09-17 18:58 ` Balbir Singh
2007-09-14 20:54 ` [PATCH/RFC 6/14] Reclaim Scalability: "No Reclaim LRU Infrastructure" Lee Schermerhorn
` (10 subsequent siblings)
15 siblings, 2 replies; 77+ messages in thread
From: Lee Schermerhorn @ 2007-09-14 20:54 UTC (permalink / raw)
To: linux-mm
Cc: akpm, mel, clameter, riel, balbir, andrea, a.p.zijlstra,
eric.whitney, npiggin
[PATCH/RFC] 05/15 Reclaim Scalability: Use an indexed array for LRU variables
^ permalink raw reply [flat|nested] 77+ messages in thread
* Re: [PATCH/RFC 5/14] Reclaim Scalability: Use an indexed array for LRU variables
2007-09-14 20:54 ` [PATCH/RFC 5/14] Reclaim Scalability: Use an indexed array for LRU variables Lee Schermerhorn
@ 2007-09-17 13:40 ` Mel Gorman
2007-09-17 14:17 ` Lee Schermerhorn
2007-09-17 18:58 ` Balbir Singh
1 sibling, 1 reply; 77+ messages in thread
From: Mel Gorman @ 2007-09-17 13:40 UTC (permalink / raw)
To: Lee Schermerhorn
Cc: linux-mm, akpm, clameter, riel, balbir, andrea, a.p.zijlstra,
eric.whitney, npiggin
On (14/09/07 16:54), Lee Schermerhorn didst pronounce:
> [PATCH/RFC] 05/15 Reclaim Scalability: Use an indexed array for LRU variables
>
> From clameter@sgi.com Wed Aug 29 11:39:51 2007
>
> Currently we are defining explicit variables for the inactive
> and active list. An indexed array can be more generic and avoid
> repeating similar code in several places in the reclaim code.
>
> We are saving a few bytes in terms of code size:
>
> Before:
>
> text data bss dec hex filename
> 4097753 573120 4092484 8763357 85b7dd vmlinux
>
> After:
>
> text data bss dec hex filename
> 4097729 573120 4092484 8763333 85b7c5 vmlinux
>
> Having an easy way to add new lru lists may ease future work on
> the reclaim code.
>
> [CL's signoff added by lts based on mail from CL]
> Signed-off-by: Christoph Lameter <clameter@sgi.com>
>
> include/linux/mm_inline.h | 33 ++++++++---
> include/linux/mmzone.h | 17 +++--
> mm/page_alloc.c | 9 +--
> mm/swap.c | 2
> mm/vmscan.c | 132 ++++++++++++++++++++++------------------------
> mm/vmstat.c | 3 -
> 6 files changed, 107 insertions(+), 89 deletions(-)
>
> Index: Linux/include/linux/mmzone.h
> ===================================================================
> --- Linux.orig/include/linux/mmzone.h 2007-09-10 12:21:31.000000000 -0400
> +++ Linux/include/linux/mmzone.h 2007-09-10 12:22:33.000000000 -0400
> @@ -81,8 +81,8 @@ struct zone_padding {
> enum zone_stat_item {
> /* First 128 byte cacheline (assuming 64 bit words) */
> NR_FREE_PAGES,
> - NR_INACTIVE,
> - NR_ACTIVE,
> + NR_INACTIVE, /* must match order of LRU_[IN]ACTIVE */
> + NR_ACTIVE, /* " " " " " */
> NR_ANON_PAGES, /* Mapped anonymous pages */
> NR_FILE_MAPPED, /* pagecache pages mapped into pagetables.
> only modified from process context */
> @@ -106,6 +106,13 @@ enum zone_stat_item {
> #endif
> NR_VM_ZONE_STAT_ITEMS };
>
> +enum lru_list {
> + LRU_INACTIVE, /* must match order of NR_[IN]ACTIVE */
> + LRU_ACTIVE, /* " " " " " */
> + NR_LRU_LISTS };
> +
> +#define for_each_lru(l) for (l = 0; l < NR_LRU_LISTS; l++)
> +
> struct per_cpu_pages {
> int count; /* number of pages in the list */
> int high; /* high watermark, emptying needed */
> @@ -259,10 +266,8 @@ struct zone {
>
> /* Fields commonly accessed by the page reclaim scanner */
> spinlock_t lru_lock;
> - struct list_head active_list;
> - struct list_head inactive_list;
> - unsigned long nr_scan_active;
> - unsigned long nr_scan_inactive;
> + struct list_head list[NR_LRU_LISTS];
> + unsigned long nr_scan[NR_LRU_LISTS];
> unsigned long pages_scanned; /* since last reclaim */
> int all_unreclaimable; /* All pages pinned */
>
> Index: Linux/include/linux/mm_inline.h
> ===================================================================
> --- Linux.orig/include/linux/mm_inline.h 2007-09-10 12:22:33.000000000 -0400
> +++ Linux/include/linux/mm_inline.h 2007-09-10 12:22:33.000000000 -0400
> @@ -27,43 +27,56 @@ static inline int page_anon(struct page
> }
>
> static inline void
> +add_page_to_lru_list(struct zone *zone, struct page *page, enum lru_list l)
> +{
> + list_add(&page->lru, &zone->list[l]);
> + __inc_zone_state(zone, NR_INACTIVE + l);
> +}
> +
> +static inline void
> +del_page_from_lru_list(struct zone *zone, struct page *page, enum lru_list l)
> +{
> + list_del(&page->lru);
> + __dec_zone_state(zone, NR_INACTIVE + l);
> +}
> +
> +
> +static inline void
> add_page_to_active_list(struct zone *zone, struct page *page)
> {
> - list_add(&page->lru, &zone->active_list);
> - __inc_zone_state(zone, NR_ACTIVE);
> + add_page_to_lru_list(zone, page, LRU_ACTIVE);
> }
>
> static inline void
> add_page_to_inactive_list(struct zone *zone, struct page *page)
> {
> - list_add(&page->lru, &zone->inactive_list);
> - __inc_zone_state(zone, NR_INACTIVE);
> + add_page_to_lru_list(zone, page, LRU_INACTIVE);
> }
>
> static inline void
> del_page_from_active_list(struct zone *zone, struct page *page)
> {
> - list_del(&page->lru);
> - __dec_zone_state(zone, NR_ACTIVE);
> + del_page_from_lru_list(zone, page, LRU_ACTIVE);
> }
>
> static inline void
> del_page_from_inactive_list(struct zone *zone, struct page *page)
> {
> - list_del(&page->lru);
> - __dec_zone_state(zone, NR_INACTIVE);
> + del_page_from_lru_list(zone, page, LRU_INACTIVE);
> }
>
> static inline void
> del_page_from_lru(struct zone *zone, struct page *page)
> {
> + enum lru_list l = LRU_INACTIVE;
> +
> list_del(&page->lru);
> if (PageActive(page)) {
> __ClearPageActive(page);
> __dec_zone_state(zone, NR_ACTIVE);
> - } else {
> - __dec_zone_state(zone, NR_INACTIVE);
> + l = LRU_ACTIVE;
> }
> + __dec_zone_state(zone, NR_INACTIVE + l);
It looks like you can call __dec_zone_state() twice for active pages
here. Was it meant to be?
enum lru_list l = LRU_INACTIVE;
If (PageActive(page)) {
__ClearPageActive(page);
l = LRU_INACTIVE;
} else {
l = LRU_ACTIVE;
}
__dec_zone_state(zone, NR_INACTIVE + l);
?
> }
>
> #endif
> Index: Linux/mm/page_alloc.c
> ===================================================================
> --- Linux.orig/mm/page_alloc.c 2007-09-10 12:22:23.000000000 -0400
> +++ Linux/mm/page_alloc.c 2007-09-10 12:22:33.000000000 -0400
> @@ -3422,6 +3422,7 @@ static void __meminit free_area_init_cor
> for (j = 0; j < MAX_NR_ZONES; j++) {
> struct zone *zone = pgdat->node_zones + j;
> unsigned long size, realsize, memmap_pages;
> + enum lru_list l;
>
> size = zone_spanned_pages_in_node(nid, j, zones_size);
> realsize = size - zone_absent_pages_in_node(nid, j,
> @@ -3471,10 +3472,10 @@ static void __meminit free_area_init_cor
> zone->prev_priority = DEF_PRIORITY;
>
> zone_pcp_init(zone);
> - INIT_LIST_HEAD(&zone->active_list);
> - INIT_LIST_HEAD(&zone->inactive_list);
> - zone->nr_scan_active = 0;
> - zone->nr_scan_inactive = 0;
> + for_each_lru(l) {
> + INIT_LIST_HEAD(&zone->list[l]);
> + zone->nr_scan[l] = 0;
> + }
> zap_zone_vm_stats(zone);
> atomic_set(&zone->reclaim_in_progress, 0);
> if (!size)
> Index: Linux/mm/swap.c
> ===================================================================
> --- Linux.orig/mm/swap.c 2007-09-10 12:21:31.000000000 -0400
> +++ Linux/mm/swap.c 2007-09-10 12:22:33.000000000 -0400
> @@ -124,7 +124,7 @@ int rotate_reclaimable_page(struct page
> zone = page_zone(page);
> spin_lock_irqsave(&zone->lru_lock, flags);
> if (PageLRU(page) && !PageActive(page)) {
> - list_move_tail(&page->lru, &zone->inactive_list);
> + list_move_tail(&page->lru, &zone->list[LRU_INACTIVE]);
> __count_vm_event(PGROTATED);
> }
> if (!test_clear_page_writeback(page))
> Index: Linux/mm/vmscan.c
> ===================================================================
> --- Linux.orig/mm/vmscan.c 2007-09-10 12:22:32.000000000 -0400
> +++ Linux/mm/vmscan.c 2007-09-10 12:22:33.000000000 -0400
> @@ -785,10 +785,10 @@ static unsigned long isolate_pages_globa
> int active)
> {
> if (active)
> - return isolate_lru_pages(nr, &z->active_list, dst,
> + return isolate_lru_pages(nr, &z->list[LRU_ACTIVE], dst,
> scanned, order, mode);
> else
> - return isolate_lru_pages(nr, &z->inactive_list, dst,
> + return isolate_lru_pages(nr, &z->list[LRU_INACTIVE], dst,
> scanned, order, mode);
> }
>
> @@ -928,10 +928,7 @@ static unsigned long shrink_inactive_lis
> VM_BUG_ON(PageLRU(page));
> SetPageLRU(page);
> list_del(&page->lru);
> - if (PageActive(page))
> - add_page_to_active_list(zone, page);
> - else
> - add_page_to_inactive_list(zone, page);
> + add_page_to_lru_list(zone, page, PageActive(page));
> if (!pagevec_add(&pvec, page)) {
> spin_unlock_irq(&zone->lru_lock);
> __pagevec_release(&pvec);
> @@ -990,11 +987,14 @@ static void shrink_active_list(unsigned
> int pgdeactivate = 0;
> unsigned long pgscanned;
> LIST_HEAD(l_hold); /* The pages which were snipped off */
> - LIST_HEAD(l_inactive); /* Pages to go onto the inactive_list */
> - LIST_HEAD(l_active); /* Pages to go onto the active_list */
> + struct list_head list[NR_LRU_LISTS];
> struct page *page;
> struct pagevec pvec;
> int reclaim_mapped = 0;
> + enum lru_list l;
> +
> + for_each_lru(l)
> + INIT_LIST_HEAD(&list[l]);
>
> if (sc->may_swap) {
> long mapped_ratio;
> @@ -1101,28 +1101,28 @@ force_reclaim_mapped:
> if (!reclaim_mapped ||
> (total_swap_pages == 0 && PageAnon(page)) ||
> page_referenced(page, 0, sc->mem_container)) {
> - list_add(&page->lru, &l_active);
> + list_add(&page->lru, &list[LRU_ACTIVE]);
> continue;
> }
> } else if (TestClearPageReferenced(page)) {
> - list_add(&page->lru, &l_active);
> + list_add(&page->lru, &list[LRU_ACTIVE]);
> continue;
> }
> - list_add(&page->lru, &l_inactive);
> + list_add(&page->lru, &list[LRU_INACTIVE]);
> }
>
> pagevec_init(&pvec, 1);
> pgmoved = 0;
> spin_lock_irq(&zone->lru_lock);
> - while (!list_empty(&l_inactive)) {
> - page = lru_to_page(&l_inactive);
> - prefetchw_prev_lru_page(page, &l_inactive, flags);
> + while (!list_empty(&list[LRU_INACTIVE])) {
> + page = lru_to_page(&list[LRU_INACTIVE]);
> + prefetchw_prev_lru_page(page, &list[LRU_INACTIVE], flags);
> VM_BUG_ON(PageLRU(page));
> SetPageLRU(page);
> VM_BUG_ON(!PageActive(page));
> ClearPageActive(page);
>
> - list_move(&page->lru, &zone->inactive_list);
> + list_move(&page->lru, &zone->list[LRU_INACTIVE]);
> mem_container_move_lists(page_get_page_container(page), false);
> pgmoved++;
> if (!pagevec_add(&pvec, page)) {
> @@ -1145,13 +1145,13 @@ force_reclaim_mapped:
> }
>
> pgmoved = 0;
> - while (!list_empty(&l_active)) {
> - page = lru_to_page(&l_active);
> - prefetchw_prev_lru_page(page, &l_active, flags);
> + while (!list_empty(&list[LRU_ACTIVE])) {
> + page = lru_to_page(&list[LRU_ACTIVE]);
> + prefetchw_prev_lru_page(page, &list[LRU_ACTIVE], flags);
> VM_BUG_ON(PageLRU(page));
> SetPageLRU(page);
> VM_BUG_ON(!PageActive(page));
> - list_move(&page->lru, &zone->active_list);
> + list_move(&page->lru, &zone->list[LRU_ACTIVE]);
> mem_container_move_lists(page_get_page_container(page), true);
> pgmoved++;
> if (!pagevec_add(&pvec, page)) {
> @@ -1171,16 +1171,26 @@ force_reclaim_mapped:
> pagevec_release(&pvec);
> }
>
> +static unsigned long shrink_list(enum lru_list l, unsigned long nr_to_scan,
> + struct zone *zone, struct scan_control *sc, int priority)
> +{
> + if (l == LRU_ACTIVE) {
> + shrink_active_list(nr_to_scan, zone, sc, priority);
> + return 0;
> + }
> + return shrink_inactive_list(nr_to_scan, zone, sc);
> +}
> +
> /*
> * This is a basic per-zone page freer. Used by both kswapd and direct reclaim.
> */
> static unsigned long shrink_zone(int priority, struct zone *zone,
> struct scan_control *sc)
> {
> - unsigned long nr_active;
> - unsigned long nr_inactive;
> + unsigned long nr[NR_LRU_LISTS];
> unsigned long nr_to_scan;
> unsigned long nr_reclaimed = 0;
> + enum lru_list l;
>
> atomic_inc(&zone->reclaim_in_progress);
>
> @@ -1188,36 +1198,26 @@ static unsigned long shrink_zone(int pri
> * Add one to `nr_to_scan' just to make sure that the kernel will
> * slowly sift through the active list.
> */
> - zone->nr_scan_active +=
> - (zone_page_state(zone, NR_ACTIVE) >> priority) + 1;
> - nr_active = zone->nr_scan_active;
> - if (nr_active >= sc->swap_cluster_max)
> - zone->nr_scan_active = 0;
> - else
> - nr_active = 0;
> -
> - zone->nr_scan_inactive +=
> - (zone_page_state(zone, NR_INACTIVE) >> priority) + 1;
> - nr_inactive = zone->nr_scan_inactive;
> - if (nr_inactive >= sc->swap_cluster_max)
> - zone->nr_scan_inactive = 0;
> - else
> - nr_inactive = 0;
> + for_each_lru(l) {
> + zone->nr_scan[l] += (zone_page_state(zone, NR_INACTIVE + l)
> + >> priority) + 1;
> + nr[l] = zone->nr_scan[l];
> + if (nr[l] >= sc->swap_cluster_max)
> + zone->nr_scan[l] = 0;
> + else
> + nr[l] = 0;
> + }
>
> - while (nr_active || nr_inactive) {
> - if (nr_active) {
> - nr_to_scan = min(nr_active,
> + while (nr[LRU_ACTIVE] || nr[LRU_INACTIVE]) {
> + for_each_lru(l) {
> + if (nr[l]) {
> + nr_to_scan = min(nr[l],
> (unsigned long)sc->swap_cluster_max);
> - nr_active -= nr_to_scan;
> - shrink_active_list(nr_to_scan, zone, sc, priority);
> - }
> + nr[l] -= nr_to_scan;
>
> - if (nr_inactive) {
> - nr_to_scan = min(nr_inactive,
> - (unsigned long)sc->swap_cluster_max);
> - nr_inactive -= nr_to_scan;
> - nr_reclaimed += shrink_inactive_list(nr_to_scan, zone,
> - sc);
> + nr_reclaimed += shrink_list(l, nr_to_scan,
> + zone, sc, priority);
> + }
> }
> }
>
> @@ -1733,6 +1733,7 @@ static unsigned long shrink_all_zones(un
> {
> struct zone *zone;
> unsigned long nr_to_scan, ret = 0;
> + enum lru_list l;
>
> for_each_zone(zone) {
>
> @@ -1742,28 +1743,25 @@ static unsigned long shrink_all_zones(un
> if (zone->all_unreclaimable && prio != DEF_PRIORITY)
> continue;
>
> - /* For pass = 0 we don't shrink the active list */
> - if (pass > 0) {
> - zone->nr_scan_active +=
> - (zone_page_state(zone, NR_ACTIVE) >> prio) + 1;
> - if (zone->nr_scan_active >= nr_pages || pass > 3) {
> - zone->nr_scan_active = 0;
> + for_each_lru(l) {
> + /* For pass = 0 we don't shrink the active list */
> + if (pass == 0 && l == LRU_ACTIVE)
> + continue;
> +
> + zone->nr_scan[l] +=
> + (zone_page_state(zone, NR_INACTIVE + l)
> + >> prio) + 1;
> + if (zone->nr_scan[l] >= nr_pages || pass > 3) {
> + zone->nr_scan[l] = 0;
> nr_to_scan = min(nr_pages,
> - zone_page_state(zone, NR_ACTIVE));
> - shrink_active_list(nr_to_scan, zone, sc, prio);
> + zone_page_state(zone,
> + NR_INACTIVE + l));
> + ret += shrink_list(l, nr_to_scan, zone,
> + sc, prio);
> + if (ret >= nr_pages)
> + return ret;
> }
> }
> -
> - zone->nr_scan_inactive +=
> - (zone_page_state(zone, NR_INACTIVE) >> prio) + 1;
> - if (zone->nr_scan_inactive >= nr_pages || pass > 3) {
> - zone->nr_scan_inactive = 0;
> - nr_to_scan = min(nr_pages,
> - zone_page_state(zone, NR_INACTIVE));
> - ret += shrink_inactive_list(nr_to_scan, zone, sc);
> - if (ret >= nr_pages)
> - return ret;
> - }
> }
>
> return ret;
> Index: Linux/mm/vmstat.c
> ===================================================================
> --- Linux.orig/mm/vmstat.c 2007-09-10 12:21:31.000000000 -0400
> +++ Linux/mm/vmstat.c 2007-09-10 12:22:33.000000000 -0400
> @@ -756,7 +756,8 @@ static void zoneinfo_show_print(struct s
> zone->pages_low,
> zone->pages_high,
> zone->pages_scanned,
> - zone->nr_scan_active, zone->nr_scan_inactive,
> + zone->nr_scan[LRU_ACTIVE],
> + zone->nr_scan[LRU_INACTIVE],
> zone->spanned_pages,
> zone->present_pages);
>
>
--
Mel Gorman
Part-time Phd Student Linux Technology Center
University of Limerick IBM Dublin Software Lab
^ permalink raw reply [flat|nested] 77+ messages in thread
* Re: [PATCH/RFC 5/14] Reclaim Scalability: Use an indexed array for LRU variables
2007-09-17 13:40 ` Mel Gorman
@ 2007-09-17 14:17 ` Lee Schermerhorn
2007-09-17 14:39 ` Lee Schermerhorn
0 siblings, 1 reply; 77+ messages in thread
From: Lee Schermerhorn @ 2007-09-17 14:17 UTC (permalink / raw)
To: Mel Gorman
Cc: linux-mm, akpm, clameter, riel, balbir, andrea, a.p.zijlstra,
eric.whitney, npiggin
On Mon, 2007-09-17 at 14:40 +0100, Mel Gorman wrote:
> On (14/09/07 16:54), Lee Schermerhorn didst pronounce:
> > [PATCH/RFC] 05/15 Reclaim Scalability: Use an indexed array for LRU variables
> >
> > From clameter@sgi.com Wed Aug 29 11:39:51 2007
> >
<snip>
> >
> > static inline void
> > del_page_from_lru(struct zone *zone, struct page *page)
> > {
> > + enum lru_list l = LRU_INACTIVE;
> > +
> > list_del(&page->lru);
> > if (PageActive(page)) {
> > __ClearPageActive(page);
> > __dec_zone_state(zone, NR_ACTIVE);
> > - } else {
> > - __dec_zone_state(zone, NR_INACTIVE);
> > + l = LRU_ACTIVE;
> > }
> > + __dec_zone_state(zone, NR_INACTIVE + l);
>
> It looks like you can call __dec_zone_state() twice for active pages
> here. Was it meant to be?
>
> enum lru_list l = LRU_INACTIVE;
>
> If (PageActive(page)) {
> __ClearPageActive(page);
> l = LRU_INACTIVE;
> } else {
> l = LRU_ACTIVE;
> }
> __dec_zone_state(zone, NR_INACTIVE + l);
>
> ?
Yes, another botched merge :-(. This does explain why I'm seeing the
active memory in meminfo going to zero and staying there after I kill
off the tests. Will fix and retest.
It's great to have other eyes looking at these!
Thanks.
Lee
<snip>
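The shape of the fix is simply to pick the list index first and issue exactly
one decrement -- a sketch of the intent, not the respun patch:

static inline void
del_page_from_lru(struct zone *zone, struct page *page)
{
	enum lru_list l = LRU_INACTIVE;

	list_del(&page->lru);
	if (PageActive(page)) {
		__ClearPageActive(page);
		l = LRU_ACTIVE;
	}
	__dec_zone_state(zone, NR_INACTIVE + l);	/* one decrement only */
}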
^ permalink raw reply [flat|nested] 77+ messages in thread
* Re: [PATCH/RFC 5/14] Reclaim Scalability: Use an indexed array for LRU variables
2007-09-17 14:17 ` Lee Schermerhorn
@ 2007-09-17 14:39 ` Lee Schermerhorn
0 siblings, 0 replies; 77+ messages in thread
From: Lee Schermerhorn @ 2007-09-17 14:39 UTC (permalink / raw)
To: Mel Gorman
Cc: linux-mm, akpm, clameter, riel, balbir, andrea, a.p.zijlstra,
eric.whitney, npiggin
On Mon, 2007-09-17 at 10:17 -0400, Lee Schermerhorn wrote:
> On Mon, 2007-09-17 at 14:40 +0100, Mel Gorman wrote:
> > On (14/09/07 16:54), Lee Schermerhorn didst pronounce:
> > > [PATCH/RFC] 05/15 Reclaim Scalability: Use an indexed array for LRU variables
> > >
> > > From clameter@sgi.com Wed Aug 29 11:39:51 2007
> > >
> <snip>
> > >
> > > static inline void
> > > del_page_from_lru(struct zone *zone, struct page *page)
> > > {
> > > + enum lru_list l = LRU_INACTIVE;
> > > +
> > > list_del(&page->lru);
> > > if (PageActive(page)) {
> > > __ClearPageActive(page);
> > > __dec_zone_state(zone, NR_ACTIVE);
> > > - } else {
> > > - __dec_zone_state(zone, NR_INACTIVE);
> > > + l = LRU_ACTIVE;
> > > }
> > > + __dec_zone_state(zone, NR_INACTIVE + l);
> >
> > It looks like you can call __dec_zone_state() twice for active pages
> > here. Was it meant to be?
> >
> > enum lru_list l = LRU_INACTIVE;
> >
> > If (PageActive(page)) {
> > __ClearPageActive(page);
> > l = LRU_INACTIVE;
> > } else {
> > l = LRU_ACTIVE;
> > }
> > __dec_zone_state(zone, NR_INACTIVE + l);
> >
> > ?
>
> Yes, another botched merge :-(. This does explain why I'm seeing the
> active memory in meminfo going to zero and staying there after I kill
> off the tests.
Or maybe not...
> Will fix and retest.
Looks like I fix this in the noreclaim infrastructure patch when I add
in support for non-reclaimable pages.
>
> It's great to have other eyes looking at these!
But, it's still great to have other eyes... I will fix up this patch.
>
> Thanks.
>
> Lee
>
> <snip>
>
^ permalink raw reply [flat|nested] 77+ messages in thread
* Re: [PATCH/RFC 5/14] Reclaim Scalability: Use an indexed array for LRU variables
2007-09-14 20:54 ` [PATCH/RFC 5/14] Reclaim Scalability: Use an indexed array for LRU variables Lee Schermerhorn
2007-09-17 13:40 ` Mel Gorman
@ 2007-09-17 18:58 ` Balbir Singh
2007-09-17 19:12 ` Lee Schermerhorn
2007-09-17 19:36 ` Rik van Riel
1 sibling, 2 replies; 77+ messages in thread
From: Balbir Singh @ 2007-09-17 18:58 UTC (permalink / raw)
To: Lee Schermerhorn
Cc: linux-mm, akpm, mel, clameter, riel, andrea, a.p.zijlstra,
eric.whitney, npiggin
Lee Schermerhorn wrote:
> [PATCH/RFC] 05/15 Reclaim Scalability: Use an indexed array for LRU variables
>
> From clameter@sgi.com Wed Aug 29 11:39:51 2007
>
> Currently we are defining explicit variables for the inactive
> and active list. An indexed array can be more generic and avoid
> repeating similar code in several places in the reclaim code.
>
> We are saving a few bytes in terms of code size:
>
> Before:
>
> text data bss dec hex filename
> 4097753 573120 4092484 8763357 85b7dd vmlinux
>
> After:
>
> text data bss dec hex filename
> 4097729 573120 4092484 8763333 85b7c5 vmlinux
>
> Having an easy way to add new lru lists may ease future work on
> the reclaim code.
>
> [CL's signoff added by lts based on mail from CL]
> Signed-off-by: Christoph Lameter <clameter@sgi.com>
>
> include/linux/mm_inline.h | 33 ++++++++---
> include/linux/mmzone.h | 17 +++--
> mm/page_alloc.c | 9 +--
> mm/swap.c | 2
> mm/vmscan.c | 132 ++++++++++++++++++++++------------------------
> mm/vmstat.c | 3 -
> 6 files changed, 107 insertions(+), 89 deletions(-)
>
> Index: Linux/include/linux/mmzone.h
> ===================================================================
> --- Linux.orig/include/linux/mmzone.h 2007-09-10 12:21:31.000000000 -0400
> +++ Linux/include/linux/mmzone.h 2007-09-10 12:22:33.000000000 -0400
> @@ -81,8 +81,8 @@ struct zone_padding {
> enum zone_stat_item {
> /* First 128 byte cacheline (assuming 64 bit words) */
> NR_FREE_PAGES,
> - NR_INACTIVE,
> - NR_ACTIVE,
> + NR_INACTIVE, /* must match order of LRU_[IN]ACTIVE */
> + NR_ACTIVE, /* " " " " " */
> NR_ANON_PAGES, /* Mapped anonymous pages */
> NR_FILE_MAPPED, /* pagecache pages mapped into pagetables.
> only modified from process context */
> @@ -106,6 +106,13 @@ enum zone_stat_item {
> #endif
> NR_VM_ZONE_STAT_ITEMS };
>
> +enum lru_list {
> + LRU_INACTIVE, /* must match order of NR_[IN]ACTIVE */
> + LRU_ACTIVE, /* " " " " " */
> + NR_LRU_LISTS };
> +
> +#define for_each_lru(l) for (l = 0; l < NR_LRU_LISTS; l++)
> +
> struct per_cpu_pages {
> int count; /* number of pages in the list */
> int high; /* high watermark, emptying needed */
> @@ -259,10 +266,8 @@ struct zone {
>
> /* Fields commonly accessed by the page reclaim scanner */
> spinlock_t lru_lock;
> - struct list_head active_list;
> - struct list_head inactive_list;
> - unsigned long nr_scan_active;
> - unsigned long nr_scan_inactive;
> + struct list_head list[NR_LRU_LISTS];
> + unsigned long nr_scan[NR_LRU_LISTS];
I wonder if it makes sense to have an array of the form
struct reclaim_lists {
struct list_head list[NR_LRU_LISTS];
unsigned long nr_scan[NR_LRU_LISTS];
reclaim_function_t list_reclaim_function[NR_LRU_LISTS];
}
where reclaim_function is an array of reclaim functions for each list
(in our case shrink_active_list/shrink_inactive_list).
> static inline void
> del_page_from_lru(struct zone *zone, struct page *page)
> {
> + enum lru_list l = LRU_INACTIVE;
> +
> list_del(&page->lru);
> if (PageActive(page)) {
> __ClearPageActive(page);
> __dec_zone_state(zone, NR_ACTIVE);
> - } else {
> - __dec_zone_state(zone, NR_INACTIVE);
> + l = LRU_ACTIVE;
> }
> + __dec_zone_state(zone, NR_INACTIVE + l);
This is unconditional, does not seem right.
> }
>
--
Warm Regards,
Balbir Singh
Linux Technology Center
IBM, ISTL
^ permalink raw reply [flat|nested] 77+ messages in thread
* Re: [PATCH/RFC 5/14] Reclaim Scalability: Use an indexed array for LRU variables
2007-09-17 18:58 ` Balbir Singh
@ 2007-09-17 19:12 ` Lee Schermerhorn
2007-09-17 19:36 ` Balbir Singh
2007-09-17 19:36 ` Rik van Riel
1 sibling, 1 reply; 77+ messages in thread
From: Lee Schermerhorn @ 2007-09-17 19:12 UTC (permalink / raw)
To: balbir
Cc: linux-mm, akpm, mel, clameter, riel, andrea, a.p.zijlstra,
eric.whitney, npiggin
On Tue, 2007-09-18 at 00:28 +0530, Balbir Singh wrote:
> Lee Schermerhorn wrote:
> > [PATCH/RFC] 05/15 Reclaim Scalability: Use an indexed array for LRU variables
> >
> > From clameter@sgi.com Wed Aug 29 11:39:51 2007
> >
> > Currently we are defining explicit variables for the inactive
> > and active list. An indexed array can be more generic and avoid
> > repeating similar code in several places in the reclaim code.
> >
> > We are saving a few bytes in terms of code size:
> >
> > Before:
> >
> > text data bss dec hex filename
> > 4097753 573120 4092484 8763357 85b7dd vmlinux
> >
> > After:
> >
> > text data bss dec hex filename
> > 4097729 573120 4092484 8763333 85b7c5 vmlinux
> >
> > Having an easy way to add new lru lists may ease future work on
> > the reclaim code.
> >
> > [CL's signoff added by lts based on mail from CL]
> > Signed-off-by: Christoph Lameter <clameter@sgi.com>
> >
> > include/linux/mm_inline.h | 33 ++++++++---
> > include/linux/mmzone.h | 17 +++--
> > mm/page_alloc.c | 9 +--
> > mm/swap.c | 2
> > mm/vmscan.c | 132 ++++++++++++++++++++++------------------------
> > mm/vmstat.c | 3 -
> > 6 files changed, 107 insertions(+), 89 deletions(-)
> >
> > Index: Linux/include/linux/mmzone.h
> > ===================================================================
> > --- Linux.orig/include/linux/mmzone.h 2007-09-10 12:21:31.000000000 -0400
> > +++ Linux/include/linux/mmzone.h 2007-09-10 12:22:33.000000000 -0400
> > @@ -81,8 +81,8 @@ struct zone_padding {
> > enum zone_stat_item {
> > /* First 128 byte cacheline (assuming 64 bit words) */
> > NR_FREE_PAGES,
> > - NR_INACTIVE,
> > - NR_ACTIVE,
> > + NR_INACTIVE, /* must match order of LRU_[IN]ACTIVE */
> > + NR_ACTIVE, /* " " " " " */
> > NR_ANON_PAGES, /* Mapped anonymous pages */
> > NR_FILE_MAPPED, /* pagecache pages mapped into pagetables.
> > only modified from process context */
> > @@ -106,6 +106,13 @@ enum zone_stat_item {
> > #endif
> > NR_VM_ZONE_STAT_ITEMS };
> >
> > +enum lru_list {
> > + LRU_INACTIVE, /* must match order of NR_[IN]ACTIVE */
> > + LRU_ACTIVE, /* " " " " " */
> > + NR_LRU_LISTS };
> > +
> > +#define for_each_lru(l) for (l = 0; l < NR_LRU_LISTS; l++)
> > +
> > struct per_cpu_pages {
> > int count; /* number of pages in the list */
> > int high; /* high watermark, emptying needed */
> > @@ -259,10 +266,8 @@ struct zone {
> >
> > /* Fields commonly accessed by the page reclaim scanner */
> > spinlock_t lru_lock;
> > - struct list_head active_list;
> > - struct list_head inactive_list;
> > - unsigned long nr_scan_active;
> > - unsigned long nr_scan_inactive;
> > + struct list_head list[NR_LRU_LISTS];
> > + unsigned long nr_scan[NR_LRU_LISTS];
>
> I wonder if it makes sense to have an array of the form
>
> struct reclaim_lists {
> struct list_head list[NR_LRU_LISTS];
> unsigned long nr_scan[NR_LRU_LISTS];
> reclaim_function_t list_reclaim_function[NR_LRU_LISTS];
> }
>
> where reclaim_function is an array of reclaim functions for each list
> (in our case shrink_active_list/shrink_inactive_list).
Are you thinking that memory controller would use the reclaim functions
switch--e.g., because of it's private lru lists? And what sort of
reclaim functions do you have in mind? Would it add additional
indirection in the fault path where we add pages to the LRU and move
them between LRU lists in the case of page activation? That could be a
concern. In any case, maybe should be named something like 'lru_lists'
and 'lru_list_functions'?
>
>
> > static inline void
> > del_page_from_lru(struct zone *zone, struct page *page)
> > {
> > + enum lru_list l = LRU_INACTIVE;
> > +
> > list_del(&page->lru);
> > if (PageActive(page)) {
> > __ClearPageActive(page);
> > __dec_zone_state(zone, NR_ACTIVE);
> > - } else {
> > - __dec_zone_state(zone, NR_INACTIVE);
> > + l = LRU_ACTIVE;
> > }
> > + __dec_zone_state(zone, NR_INACTIVE + l);
>
> This is unconditional, does not seem right.
It's not the unconditional one that's wrong. As Mel pointed out earlier,
I forgot to remove the explicit decrement of NR_ACTIVE. Turns out that
I unknowingly fixed this in the subsequent noreclaim infrastructure
patch, but I need to fix it in this patch so that it stands alone. Next
respin.
Lee
^ permalink raw reply [flat|nested] 77+ messages in thread
* Re: [PATCH/RFC 5/14] Reclaim Scalability: Use an indexed array for LRU variables
2007-09-17 19:12 ` Lee Schermerhorn
@ 2007-09-17 19:36 ` Balbir Singh
0 siblings, 0 replies; 77+ messages in thread
From: Balbir Singh @ 2007-09-17 19:36 UTC (permalink / raw)
To: Lee Schermerhorn
Cc: linux-mm, akpm, mel, clameter, riel, andrea, a.p.zijlstra,
eric.whitney, npiggin
Lee Schermerhorn wrote:
> On Tue, 2007-09-18 at 00:28 +0530, Balbir Singh wrote:
>> Lee Schermerhorn wrote:
>>> [PATCH/RFC] 05/15 Reclaim Scalability: Use an indexed array for LRU variables
>>>
>>> From clameter@sgi.com Wed Aug 29 11:39:51 2007
>>>
>>> Currently we are defining explicit variables for the inactive
>>> and active list. An indexed array can be more generic and avoid
>>> repeating similar code in several places in the reclaim code.
>>>
>>> We are saving a few bytes in terms of code size:
>>>
>>> Before:
>>>
>>> text data bss dec hex filename
>>> 4097753 573120 4092484 8763357 85b7dd vmlinux
>>>
>>> After:
>>>
>>> text data bss dec hex filename
>>> 4097729 573120 4092484 8763333 85b7c5 vmlinux
>>>
>>> Having an easy way to add new lru lists may ease future work on
>>> the reclaim code.
>>>
>>> [CL's signoff added by lts based on mail from CL]
>>> Signed-off-by: Christoph Lameter <clameter@sgi.com>
>>>
>>> include/linux/mm_inline.h | 33 ++++++++---
>>> include/linux/mmzone.h | 17 +++--
>>> mm/page_alloc.c | 9 +--
>>> mm/swap.c | 2
>>> mm/vmscan.c | 132 ++++++++++++++++++++++------------------------
>>> mm/vmstat.c | 3 -
>>> 6 files changed, 107 insertions(+), 89 deletions(-)
>>>
>>> Index: Linux/include/linux/mmzone.h
>>> ===================================================================
>>> --- Linux.orig/include/linux/mmzone.h 2007-09-10 12:21:31.000000000 -0400
>>> +++ Linux/include/linux/mmzone.h 2007-09-10 12:22:33.000000000 -0400
>>> @@ -81,8 +81,8 @@ struct zone_padding {
>>> enum zone_stat_item {
>>> /* First 128 byte cacheline (assuming 64 bit words) */
>>> NR_FREE_PAGES,
>>> - NR_INACTIVE,
>>> - NR_ACTIVE,
>>> + NR_INACTIVE, /* must match order of LRU_[IN]ACTIVE */
>>> + NR_ACTIVE, /* " " " " " */
>>> NR_ANON_PAGES, /* Mapped anonymous pages */
>>> NR_FILE_MAPPED, /* pagecache pages mapped into pagetables.
>>> only modified from process context */
>>> @@ -106,6 +106,13 @@ enum zone_stat_item {
>>> #endif
>>> NR_VM_ZONE_STAT_ITEMS };
>>>
>>> +enum lru_list {
>>> + LRU_INACTIVE, /* must match order of NR_[IN]ACTIVE */
>>> + LRU_ACTIVE, /* " " " " " */
>>> + NR_LRU_LISTS };
>>> +
>>> +#define for_each_lru(l) for (l = 0; l < NR_LRU_LISTS; l++)
>>> +
>>> struct per_cpu_pages {
>>> int count; /* number of pages in the list */
>>> int high; /* high watermark, emptying needed */
>>> @@ -259,10 +266,8 @@ struct zone {
>>>
>>> /* Fields commonly accessed by the page reclaim scanner */
>>> spinlock_t lru_lock;
>>> - struct list_head active_list;
>>> - struct list_head inactive_list;
>>> - unsigned long nr_scan_active;
>>> - unsigned long nr_scan_inactive;
>>> + struct list_head list[NR_LRU_LISTS];
>>> + unsigned long nr_scan[NR_LRU_LISTS];
>> I wonder if it makes sense to have an array of the form
>>
>> struct reclaim_lists {
>> struct list_head list[NR_LRU_LISTS];
>> unsigned long nr_scan[NR_LRU_LISTS];
>> reclaim_function_t list_reclaim_function[NR_LRU_LISTS];
>> }
>>
>> where reclaim_function is an array of reclaim functions for each list
>> (in our case shrink_active_list/shrink_inactive_list).
>
> Are you thinking that memory controller would use the reclaim functions
> switch--e.g., because of it's private lru lists? And what sort of
> reclaim functions do you have in mind? Would it add additional
> indirection in the fault path where we add pages to the LRU and move
> them between LRU lists in the case of page activation? That could be a
> concern. In any case, maybe should be named something like 'lru_lists'
> and 'lru_list_functions'?
>
Well we already have our own isolate function and we re-use
shrink_(in)active_list() functions. The idea behind associating
a function was that shrink_list() would look cleaner. In addition,
the reclaim function for locked page lists would be NULL. I prefer
lru_list_functions.
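A minimal sketch of that kind of dispatch, with every name here hypothetical
(it is not part of the posted series), and with small wrappers because
shrink_active_list() and shrink_inactive_list() have different signatures:

typedef unsigned long (*reclaim_function_t)(unsigned long nr_to_scan,
		struct zone *zone, struct scan_control *sc, int priority);

static unsigned long shrink_active_wrap(unsigned long nr_to_scan,
		struct zone *zone, struct scan_control *sc, int priority)
{
	shrink_active_list(nr_to_scan, zone, sc, priority);
	return 0;	/* the active scan reclaims nothing directly */
}

static unsigned long shrink_inactive_wrap(unsigned long nr_to_scan,
		struct zone *zone, struct scan_control *sc, int priority)
{
	return shrink_inactive_list(nr_to_scan, zone, sc);
}

static const reclaim_function_t lru_list_functions[NR_LRU_LISTS] = {
	[LRU_INACTIVE]	= shrink_inactive_wrap,
	[LRU_ACTIVE]	= shrink_active_wrap,
};

static unsigned long shrink_list(enum lru_list l, unsigned long nr_to_scan,
		struct zone *zone, struct scan_control *sc, int priority)
{
	reclaim_function_t fn = lru_list_functions[l];

	if (!fn)	/* e.g. a list of mlocked/noreclaim pages */
		return 0;
	return fn(nr_to_scan, zone, sc, priority);
}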
>>
>>> static inline void
>>> del_page_from_lru(struct zone *zone, struct page *page)
>>> {
>>> + enum lru_list l = LRU_INACTIVE;
>>> +
>>> list_del(&page->lru);
>>> if (PageActive(page)) {
>>> __ClearPageActive(page);
>>> __dec_zone_state(zone, NR_ACTIVE);
>>> - } else {
>>> - __dec_zone_state(zone, NR_INACTIVE);
>>> + l = LRU_ACTIVE;
>>> }
>>> + __dec_zone_state(zone, NR_INACTIVE + l);
>> This is unconditional, does not seem right.
>
> It's not the unconditional one that's wrong. As Mel pointed out earlier,
> I forgot to remove the explicit decrement of NR_ACTIVE. Turns out that
> I unknowingly fixed this in the subsequent noreclaim infrastructure
> patch, but I need to fix it in this patch so that it stands alone. Next
> respin.
>
Yes, my bad. It's the NR_ACTIVE that is.
> Lee
>
--
Warm Regards,
Balbir Singh
Linux Technology Center
IBM, ISTL
^ permalink raw reply [flat|nested] 77+ messages in thread
* Re: [PATCH/RFC 5/14] Reclaim Scalability: Use an indexed array for LRU variables
2007-09-17 18:58 ` Balbir Singh
2007-09-17 19:12 ` Lee Schermerhorn
@ 2007-09-17 19:36 ` Rik van Riel
2007-09-17 20:21 ` Balbir Singh
1 sibling, 1 reply; 77+ messages in thread
From: Rik van Riel @ 2007-09-17 19:36 UTC (permalink / raw)
To: balbir
Cc: Lee Schermerhorn, linux-mm, akpm, mel, clameter, andrea,
a.p.zijlstra, eric.whitney, npiggin
Balbir Singh wrote:
> I wonder if it makes sense to have an array of the form
>
> struct reclaim_lists {
> struct list_head list[NR_LRU_LISTS];
> unsigned long nr_scan[NR_LRU_LISTS];
> reclaim_function_t list_reclaim_function[NR_LRU_LISTS];
> }
>
> where reclaim_function is an array of reclaim functions for each list
> (in our case shrink_active_list/shrink_inactive_list).
I am not convinced, since that does not give us any way
to balance between the calls made to each function...
--
Politics is the struggle between those who want to make their country
the best in the world, and those who believe it already is. Each group
calls the other unpatriotic.
* Re: [PATCH/RFC 5/14] Reclaim Scalability: Use an indexed array for LRU variables
2007-09-17 19:36 ` Rik van Riel
@ 2007-09-17 20:21 ` Balbir Singh
2007-09-17 21:01 ` Rik van Riel
0 siblings, 1 reply; 77+ messages in thread
From: Balbir Singh @ 2007-09-17 20:21 UTC (permalink / raw)
To: Rik van Riel
Cc: Lee Schermerhorn, linux-mm, akpm, mel, clameter, andrea,
a.p.zijlstra, eric.whitney, npiggin
Rik van Riel wrote:
> Balbir Singh wrote:
>
>> I wonder if it makes sense to have an array of the form
>>
>> struct reclaim_lists {
>> struct list_head list[NR_LRU_LISTS];
>> unsigned long nr_scan[NR_LRU_LISTS];
>> reclaim_function_t list_reclaim_function[NR_LRU_LISTS];
>> }
>>
>> where reclaim_function is an array of reclaim functions for each list
>> (in our case shrink_active_list/shrink_inactive_list).
>
> I am not convinced, since that does not give us any way
> to balance between the calls made to each function...
>
Currently the balancing is based on the number of pages on each list,
the priority and the pass. We could still do that with the functions
encapsulated. Am I missing something?
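That is, keep the per-list scan pressure calculation roughly as shrink_zone()
does it today (from memory, so treat this as a sketch rather than the exact
code):

	for_each_lru(l) {
		/*
		 * scan count grows with the list size and shrinks as
		 * the priority becomes less desperate (i.e. larger)
		 */
		zone->nr_scan[l] += (zone_page_state(zone, NR_INACTIVE + l)
							>> priority) + 1;
		nr[l] = zone->nr_scan[l];
		if (nr[l] >= sc->swap_cluster_max)
			zone->nr_scan[l] = 0;
		else
			nr[l] = 0;
	}

The "pass" only comes into play in shrink_all_zones(), which skips the
active list on pass 0. None of that needs to change just because the
per-list reclaim function is looked up through an array.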
--
Warm Regards,
Balbir Singh
Linux Technology Center
IBM, ISTL
* Re: [PATCH/RFC 5/14] Reclaim Scalability: Use an indexed array for LRU variables
2007-09-17 20:21 ` Balbir Singh
@ 2007-09-17 21:01 ` Rik van Riel
0 siblings, 0 replies; 77+ messages in thread
From: Rik van Riel @ 2007-09-17 21:01 UTC (permalink / raw)
To: balbir
Cc: Lee Schermerhorn, linux-mm, akpm, mel, clameter, andrea,
a.p.zijlstra, eric.whitney, npiggin
Balbir Singh wrote:
> Rik van Riel wrote:
>> Balbir Singh wrote:
>>
>>> I wonder if it makes sense to have an array of the form
>>>
>>> struct reclaim_lists {
>>> struct list_head list[NR_LRU_LISTS];
>>> unsigned long nr_scan[NR_LRU_LISTS];
>>> reclaim_function_t list_reclaim_function[NR_LRU_LISTS];
>>> }
>>>
>>> where reclaim_function is an array of reclaim functions for each list
>>> (in our case shrink_active_list/shrink_inactive_list).
>> I am not convinced, since that does not give us any way
>> to balance between the calls made to each function...
>
> Currently the balancing done is based on the number of pages
> on each list, the priority and the pass. We could still do
> that with the functions encapsulated. Am I missing something?
Yes, that balancing does not work very well in all
workloads and will need to be changed some time.
Your scheme would remove the flexibility needed
to make such fixes.
--
Politics is the struggle between those who want to make their country
the best in the world, and those who believe it already is. Each group
calls the other unpatriotic.
* [PATCH/RFC 6/14] Reclaim Scalability: "No Reclaim LRU Infrastructure"
2007-09-14 20:53 [PATCH/RFC 0/14] Page Reclaim Scalability Lee Schermerhorn
` (4 preceding siblings ...)
2007-09-14 20:54 ` [PATCH/RFC 5/14] Reclaim Scalability: Use an indexed array for LRU variables Lee Schermerhorn
@ 2007-09-14 20:54 ` Lee Schermerhorn
2007-09-14 22:47 ` Christoph Lameter
2007-09-19 6:00 ` Balbir Singh
2007-09-14 20:54 ` [PATCH/RFC 7/14] Reclaim Scalability: Non-reclaimable page statistics Lee Schermerhorn
` (9 subsequent siblings)
15 siblings, 2 replies; 77+ messages in thread
From: Lee Schermerhorn @ 2007-09-14 20:54 UTC (permalink / raw)
To: linux-mm
Cc: akpm, mel, clameter, riel, balbir, andrea, a.p.zijlstra,
eric.whitney, npiggin
PATCH/RFC 06/14 Reclaim Scalability: "No Reclaim LRU Infrastructure"
Against: 2.6.23-rc4-mm1
Infrastructure to manage pages excluded from reclaim--i.e., hidden
from vmscan. Based on a patch by Larry Woodman of Red Hat. Reworked
to maintain "nonreclaimable" pages on a separate per-zone LRU list,
to "hide" them from vmscan. A separate noreclaim pagevec is provided
for shrink_active_list() to move nonreclaimable pages to the noreclaim
list without over burdening the zone lru_lock.
Pages on the noreclaim list have both PG_noreclaim and PG_lru set.
Thus, PG_noreclaim is analogous to and mutually exclusive with
PG_active--it specifies which LRU list the page is on.
The noreclaim infrastructure is enabled by a new mm Kconfig option
[CONFIG_]NORECLAIM.
A new function 'page_reclaimable(page, vma)' in vmscan.c tests whether
or not a page is reclaimable. Subsequent patches will add the various
!reclaimable tests. We'll want to keep these tests light-weight for
use in shrink_active_list() and, possibly, the fault path.
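For illustration only [not part of this patch], the sort of tests that the
later patches plug in will look something like the following; the mlock
flag test is just a placeholder name here:

	int page_reclaimable(struct page *page, struct vm_area_struct *vma)
	{
		VM_BUG_ON(PageNoreclaim(page));

		/* anon page, but no swap space to reclaim it to */
		if (PageAnon(page) && !total_swap_pages)
			return 0;

		/* page locked into memory via mlock() [later patch] */
		if (PageMlocked(page))
			return 0;

		return 1;
	}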
Notes:
1. for now, use bit 30 in page flags. This restricts the no reclaim
infrastructure to 64-bit systems. [The mlock patch, later in this
series, uses another of these 64-bit-system-only flags.]
Rationale: 32-bit systems have no free page flags and are less
likely to have the large amounts of memory that exhibit the problems
this series attempts to solve. [I'm sure someone will disabuse me
of this notion.]
Thus, NORECLAIM currently depends on [CONFIG_]64BIT.
2. The pagevec to move pages to the noreclaim list results in another
loop at the end of shrink_active_list(). If we ultimately adopt Rik
van Riel's split lru approach, I think we'll need to find a way to
factor all of these loops into some common code.
3. Based on a suggestion from the developers at the VM summit, this
patch adds a function--putback_all_noreclaim_pages()--to splice the
per zone noreclaim list[s] back to the end of their respective active
lists when conditions dictate rechecking the pages for reclaimability.
This required some rework to '__isolate_lru_page()' in vmscan.c to allow
nonreclaimable pages to be isolated from the active list, but only
when scanning that list--i.e., not when lumpy reclaim is looking for
adjacent pages.
TODO: This approach needs a lot of refinement.
4. TODO: Memory Controllers maintain separate active and inactive lists.
Need to consider whether they should also maintain a noreclaim list.
Also, convert to use Christoph's array of indexed lru variables?
See //TODO note in mm/memcontrol.c re: isolating non-reclaimable
pages.
Signed-off-by: Lee Schermerhorn <lee.schermerhorn@hp.com>
include/linux/mm_inline.h | 26 ++++++-
include/linux/mmzone.h | 12 ++-
include/linux/page-flags.h | 18 +++++
include/linux/pagevec.h | 5 +
include/linux/swap.h | 27 +++++++
mm/Kconfig | 10 ++
mm/memcontrol.c | 6 +
mm/mempolicy.c | 2
mm/migrate.c | 16 ++++
mm/page_alloc.c | 3
mm/swap.c | 83 +++++++++++++++++++++--
mm/vmscan.c | 157 ++++++++++++++++++++++++++++++++++++++++-----
12 files changed, 332 insertions(+), 33 deletions(-)
Index: Linux/mm/Kconfig
===================================================================
--- Linux.orig/mm/Kconfig 2007-09-14 10:17:54.000000000 -0400
+++ Linux/mm/Kconfig 2007-09-14 10:22:02.000000000 -0400
@@ -194,3 +194,13 @@ config NR_QUICK
config VIRT_TO_BUS
def_bool y
depends on !ARCH_NO_VIRT_TO_BUS
+
+config NORECLAIM
+ bool "Track non-reclaimable pages (EXPERIMENTAL; 64BIT only)"
+ depends on EXPERIMENTAL && 64BIT
+ help
+ Supports tracking of non-reclaimable pages off the [in]active lists
+ to avoid excessive reclaim overhead on large memory systems. Pages
+ may be non-reclaimable because: they are locked into memory, they
+ are anonymous pages for which no swap space exists, or they are anon
+ pages that are expensive to unmap [long anon_vma "related vma" list.]
Index: Linux/include/linux/page-flags.h
===================================================================
--- Linux.orig/include/linux/page-flags.h 2007-09-14 10:17:54.000000000 -0400
+++ Linux/include/linux/page-flags.h 2007-09-14 10:21:48.000000000 -0400
@@ -94,6 +94,7 @@
/* PG_readahead is only used for file reads; PG_reclaim is only for writes */
#define PG_readahead PG_reclaim /* Reminder to do async read-ahead */
+
/* PG_owner_priv_1 users should have descriptive aliases */
#define PG_checked PG_owner_priv_1 /* Used by some filesystems */
#define PG_pinned PG_owner_priv_1 /* Xen pinned pagetable */
@@ -107,6 +108,8 @@
* 63 32 0
*/
#define PG_uncached 31 /* Page has been mapped as uncached */
+
+#define PG_noreclaim 30 /* Page is "non-reclaimable" */
#endif
/*
@@ -261,6 +264,21 @@ static inline void __ClearPageTail(struc
#define PageSwapCache(page) 0
#endif
+#ifdef CONFIG_NORECLAIM
+#define PageNoreclaim(page) test_bit(PG_noreclaim, &(page)->flags)
+#define SetPageNoreclaim(page) set_bit(PG_noreclaim, &(page)->flags)
+#define ClearPageNoreclaim(page) clear_bit(PG_noreclaim, &(page)->flags)
+#define __ClearPageNoreclaim(page) __clear_bit(PG_noreclaim, &(page)->flags)
+#define TestClearPageNoreclaim(page) test_and_clear_bit(PG_noreclaim, \
+ &(page)->flags)
+#else
+#define PageNoreclaim(page) 0
+#define SetPageNoreclaim(page)
+#define ClearPageNoreclaim(page)
+#define __ClearPageNoreclaim(page)
+#define TestClearPageNoreclaim(page) 0
+#endif
+
#define PageUncached(page) test_bit(PG_uncached, &(page)->flags)
#define SetPageUncached(page) set_bit(PG_uncached, &(page)->flags)
#define ClearPageUncached(page) clear_bit(PG_uncached, &(page)->flags)
Index: Linux/include/linux/mmzone.h
===================================================================
--- Linux.orig/include/linux/mmzone.h 2007-09-14 10:21:45.000000000 -0400
+++ Linux/include/linux/mmzone.h 2007-09-14 10:21:48.000000000 -0400
@@ -81,8 +81,11 @@ struct zone_padding {
enum zone_stat_item {
/* First 128 byte cacheline (assuming 64 bit words) */
NR_FREE_PAGES,
- NR_INACTIVE, /* must match order of LRU_[IN]ACTIVE */
+ NR_INACTIVE, /* must match order of LRU_[IN]ACTIVE, ... */
NR_ACTIVE, /* " " " " " */
+#ifdef CONFIG_NORECLAIM
+ NR_NORECLAIM, /* " " " " " */
+#endif
NR_ANON_PAGES, /* Mapped anonymous pages */
NR_FILE_MAPPED, /* pagecache pages mapped into pagetables.
only modified from process context */
@@ -107,12 +110,17 @@ enum zone_stat_item {
NR_VM_ZONE_STAT_ITEMS };
enum lru_list {
- LRU_INACTIVE, /* must match order of NR_[IN]ACTIVE */
+ LRU_INACTIVE, /* must match order of NR_[IN]ACTIVE, ... */
LRU_ACTIVE, /* " " " " " */
+#ifdef CONFIG_NORECLAIM
+ LRU_NORECLAIM, /* must be last -- i.e., NR_LRU_LISTS - 1 */
+#endif
NR_LRU_LISTS };
#define for_each_lru(l) for (l = 0; l < NR_LRU_LISTS; l++)
+#define for_each_reclaimable_lru(l) for (l = 0; l <= LRU_ACTIVE; l++)
+
struct per_cpu_pages {
int count; /* number of pages in the list */
int high; /* high watermark, emptying needed */
Index: Linux/mm/page_alloc.c
===================================================================
--- Linux.orig/mm/page_alloc.c 2007-09-14 10:21:45.000000000 -0400
+++ Linux/mm/page_alloc.c 2007-09-14 10:22:05.000000000 -0400
@@ -247,6 +247,7 @@ static void bad_page(struct page *page)
1 << PG_private |
1 << PG_locked |
1 << PG_active |
+ 1 << PG_noreclaim |
1 << PG_dirty |
1 << PG_reclaim |
1 << PG_slab |
@@ -481,6 +482,7 @@ static inline int free_pages_check(struc
1 << PG_swapcache |
1 << PG_writeback |
1 << PG_reserved |
+ 1 << PG_noreclaim |
1 << PG_buddy ))))
bad_page(page);
if (PageDirty(page))
@@ -626,6 +628,7 @@ static int prep_new_page(struct page *pa
1 << PG_private |
1 << PG_locked |
1 << PG_active |
+ 1 << PG_noreclaim |
1 << PG_dirty |
1 << PG_slab |
1 << PG_swapcache |
Index: Linux/include/linux/mm_inline.h
===================================================================
--- Linux.orig/include/linux/mm_inline.h 2007-09-14 10:21:45.000000000 -0400
+++ Linux/include/linux/mm_inline.h 2007-09-14 10:21:48.000000000 -0400
@@ -65,15 +65,37 @@ del_page_from_inactive_list(struct zone
del_page_from_lru_list(zone, page, LRU_INACTIVE);
}
+#ifdef CONFIG_NORECLAIM
+static inline void
+add_page_to_noreclaim_list(struct zone *zone, struct page *page)
+{
+ add_page_to_lru_list(zone, page, LRU_NORECLAIM);
+}
+
+static inline void
+del_page_from_noreclaim_list(struct zone *zone, struct page *page)
+{
+ del_page_from_lru_list(zone, page, LRU_NORECLAIM);
+}
+#else
+static inline void
+add_page_to_noreclaim_list(struct zone *zone, struct page *page) { }
+
+static inline void
+del_page_from_noreclaim_list(struct zone *zone, struct page *page) { }
+#endif
+
static inline void
del_page_from_lru(struct zone *zone, struct page *page)
{
enum lru_list l = LRU_INACTIVE;
list_del(&page->lru);
- if (PageActive(page)) {
+ if (PageNoreclaim(page)) {
+ __ClearPageNoreclaim(page);
+ l = NR_LRU_LISTS - 1; /* == LRU_NORECLAIM, if config'd */
+ } else if (PageActive(page)) {
__ClearPageActive(page);
- __dec_zone_state(zone, NR_ACTIVE);
l = LRU_ACTIVE;
}
__dec_zone_state(zone, NR_INACTIVE + l);
Index: Linux/include/linux/swap.h
===================================================================
--- Linux.orig/include/linux/swap.h 2007-09-14 10:17:54.000000000 -0400
+++ Linux/include/linux/swap.h 2007-09-14 10:22:02.000000000 -0400
@@ -187,12 +187,25 @@ extern void lru_add_drain(void);
extern int lru_add_drain_all(void);
extern int rotate_reclaimable_page(struct page *page);
extern void swap_setup(void);
+#ifdef CONFIG_NORECLAIM
+extern void FASTCALL(lru_cache_add_noreclaim(struct page *page));
+extern void FASTCALL(lru_cache_add_active_or_noreclaim(struct page *page,
+ struct vm_area_struct *vma));
+#else
+static inline void lru_cache_add_noreclaim(struct page *page) { }
+static inline void lru_cache_add_active_or_noreclaim(struct page *page,
+ struct vm_area_struct *vma)
+{
+ lru_cache_add_active(page);
+}
+#endif
/* linux/mm/vmscan.c */
extern unsigned long try_to_free_pages(struct zone **zones, int order,
gfp_t gfp_mask);
extern unsigned long try_to_free_mem_container_pages(struct mem_container *mem);
-extern int __isolate_lru_page(struct page *page, int mode);
+extern int __isolate_lru_page(struct page *page, int mode,
+ int take_nonreclaimable);
extern unsigned long shrink_all_memory(unsigned long nr_pages);
extern int vm_swappiness;
extern int remove_mapping(struct address_space *mapping, struct page *page);
@@ -211,6 +224,18 @@ static inline int zone_reclaim(struct zo
}
#endif
+#ifdef CONFIG_NORECLAIM
+extern int page_reclaimable(struct page *page, struct vm_area_struct *vma);
+extern void putback_all_noreclaim_pages(void);
+#else
+static inline int page_reclaimable(struct page *page,
+ struct vm_area_struct *vma)
+{
+ return 1;
+}
+static inline void putback_all_noreclaim_pages(void) { }
+#endif
+
extern int kswapd_run(int nid);
#ifdef CONFIG_MMU
Index: Linux/include/linux/pagevec.h
===================================================================
--- Linux.orig/include/linux/pagevec.h 2007-09-14 10:17:54.000000000 -0400
+++ Linux/include/linux/pagevec.h 2007-09-14 10:21:48.000000000 -0400
@@ -25,6 +25,11 @@ void __pagevec_release_nonlru(struct pag
void __pagevec_free(struct pagevec *pvec);
void __pagevec_lru_add(struct pagevec *pvec);
void __pagevec_lru_add_active(struct pagevec *pvec);
+#ifdef CONFIG_NORECLAIM
+void __pagevec_lru_add_noreclaim(struct pagevec *pvec);
+#else
+static inline void __pagevec_lru_add_noreclaim(struct pagevec *pvec) { }
+#endif
void pagevec_strip(struct pagevec *pvec);
unsigned pagevec_lookup(struct pagevec *pvec, struct address_space *mapping,
pgoff_t start, unsigned nr_pages);
Index: Linux/mm/swap.c
===================================================================
--- Linux.orig/mm/swap.c 2007-09-14 10:21:45.000000000 -0400
+++ Linux/mm/swap.c 2007-09-14 10:21:48.000000000 -0400
@@ -116,14 +116,14 @@ int rotate_reclaimable_page(struct page
return 1;
if (PageDirty(page))
return 1;
- if (PageActive(page))
+ if (PageActive(page) | PageNoreclaim(page))
return 1;
if (!PageLRU(page))
return 1;
zone = page_zone(page);
spin_lock_irqsave(&zone->lru_lock, flags);
- if (PageLRU(page) && !PageActive(page)) {
+ if (PageLRU(page) && !PageActive(page) && !PageNoreclaim(page)) {
list_move_tail(&page->lru, &zone->list[LRU_INACTIVE]);
__count_vm_event(PGROTATED);
}
@@ -141,7 +141,7 @@ void fastcall activate_page(struct page
struct zone *zone = page_zone(page);
spin_lock_irq(&zone->lru_lock);
- if (PageLRU(page) && !PageActive(page)) {
+ if (PageLRU(page) && !PageActive(page) && !PageNoreclaim(page)) {
del_page_from_inactive_list(zone, page);
SetPageActive(page);
add_page_to_active_list(zone, page);
@@ -160,7 +160,8 @@ void fastcall activate_page(struct page
*/
void fastcall mark_page_accessed(struct page *page)
{
- if (!PageActive(page) && PageReferenced(page) && PageLRU(page)) {
+ if (!PageActive(page) && !PageNoreclaim(page) &&
+ PageReferenced(page) && PageLRU(page)) {
activate_page(page);
ClearPageReferenced(page);
} else if (!PageReferenced(page)) {
@@ -197,6 +198,38 @@ void fastcall lru_cache_add_active(struc
put_cpu_var(lru_add_active_pvecs);
}
+#ifdef CONFIG_NORECLAIM
+static DEFINE_PER_CPU(struct pagevec, lru_add_noreclaim_pvecs) = { 0, };
+
+void fastcall lru_cache_add_noreclaim(struct page *page)
+{
+ struct pagevec *pvec = &get_cpu_var(lru_add_noreclaim_pvecs);
+
+ page_cache_get(page);
+ if (!pagevec_add(pvec, page))
+ __pagevec_lru_add_noreclaim(pvec);
+ put_cpu_var(lru_add_noreclaim_pvecs);
+}
+
+void fastcall lru_cache_add_active_or_noreclaim(struct page *page,
+ struct vm_area_struct *vma)
+{
+ if (page_reclaimable(page, vma))
+ lru_cache_add_active(page);
+ else
+ lru_cache_add_noreclaim(page);
+}
+
+static inline void __drain_noreclaim_pvec(struct pagevec **pvec, int cpu)
+{
+ *pvec = &per_cpu(lru_add_noreclaim_pvecs, cpu);
+ if (pagevec_count(*pvec))
+ __pagevec_lru_add_noreclaim(*pvec);
+}
+#else
+static inline void __drain_noreclaim_pvec(struct pagevec **pvec, int cpu) { }
+#endif
+
static void __lru_add_drain(int cpu)
{
struct pagevec *pvec = &per_cpu(lru_add_pvecs, cpu);
@@ -207,6 +240,8 @@ static void __lru_add_drain(int cpu)
pvec = &per_cpu(lru_add_active_pvecs, cpu);
if (pagevec_count(pvec))
__pagevec_lru_add_active(pvec);
+
+ __drain_noreclaim_pvec(&pvec, cpu);
}
void lru_add_drain(void)
@@ -277,14 +312,18 @@ void release_pages(struct page **pages,
if (PageLRU(page)) {
struct zone *pagezone = page_zone(page);
+ int is_lru_page;
+
if (pagezone != zone) {
if (zone)
spin_unlock_irq(&zone->lru_lock);
zone = pagezone;
spin_lock_irq(&zone->lru_lock);
}
- VM_BUG_ON(!PageLRU(page));
- __ClearPageLRU(page);
+ is_lru_page = PageLRU(page);
+ VM_BUG_ON(!(is_lru_page));
+ if (is_lru_page)
+ __ClearPageLRU(page);
del_page_from_lru(zone, page);
}
@@ -363,6 +402,7 @@ void __pagevec_lru_add(struct pagevec *p
zone = pagezone;
spin_lock_irq(&zone->lru_lock);
}
+ VM_BUG_ON(PageActive(page) || PageNoreclaim(page));
VM_BUG_ON(PageLRU(page));
SetPageLRU(page);
add_page_to_inactive_list(zone, page);
@@ -392,7 +432,7 @@ void __pagevec_lru_add_active(struct pag
}
VM_BUG_ON(PageLRU(page));
SetPageLRU(page);
- VM_BUG_ON(PageActive(page));
+ VM_BUG_ON(PageActive(page) || PageNoreclaim(page));
SetPageActive(page);
add_page_to_active_list(zone, page);
}
@@ -402,6 +442,35 @@ void __pagevec_lru_add_active(struct pag
pagevec_reinit(pvec);
}
+#ifdef CONFIG_NORECLAIM
+void __pagevec_lru_add_noreclaim(struct pagevec *pvec)
+{
+ int i;
+ struct zone *zone = NULL;
+
+ for (i = 0; i < pagevec_count(pvec); i++) {
+ struct page *page = pvec->pages[i];
+ struct zone *pagezone = page_zone(page);
+
+ if (pagezone != zone) {
+ if (zone)
+ spin_unlock_irq(&zone->lru_lock);
+ zone = pagezone;
+ spin_lock_irq(&zone->lru_lock);
+ }
+ VM_BUG_ON(PageLRU(page));
+ SetPageLRU(page);
+ VM_BUG_ON(PageActive(page) || PageNoreclaim(page));
+ SetPageNoreclaim(page);
+ add_page_to_noreclaim_list(zone, page);
+ }
+ if (zone)
+ spin_unlock_irq(&zone->lru_lock);
+ release_pages(pvec->pages, pvec->nr, pvec->cold);
+ pagevec_reinit(pvec);
+}
+#endif
+
/*
* Try to drop buffers from the pages in a pagevec
*/
Index: Linux/mm/migrate.c
===================================================================
--- Linux.orig/mm/migrate.c 2007-09-14 10:17:54.000000000 -0400
+++ Linux/mm/migrate.c 2007-09-14 10:21:48.000000000 -0400
@@ -52,13 +52,22 @@ int migrate_prep(void)
return 0;
}
+/*
+ * move_to_lru() - place @page onto appropriate lru list
+ * based on preserved page flags: active, noreclaim, none
+ */
static inline void move_to_lru(struct page *page)
{
- if (PageActive(page)) {
+ if (PageNoreclaim(page)) {
+ VM_BUG_ON(PageActive(page));
+ ClearPageNoreclaim(page);
+ lru_cache_add_noreclaim(page);
+ } else if (PageActive(page)) {
/*
* lru_cache_add_active checks that
* the PG_active bit is off.
*/
+ VM_BUG_ON(PageNoreclaim(page)); /* race ? */
ClearPageActive(page);
lru_cache_add_active(page);
} else {
@@ -340,8 +349,11 @@ static void migrate_page_copy(struct pag
SetPageReferenced(newpage);
if (PageUptodate(page))
SetPageUptodate(newpage);
- if (PageActive(page))
+ if (PageActive(page)) {
+ VM_BUG_ON(PageNoreclaim(page));
SetPageActive(newpage);
+ } else if (PageNoreclaim(page))
+ SetPageNoreclaim(newpage);
if (PageChecked(page))
SetPageChecked(newpage);
if (PageMappedToDisk(page))
Index: Linux/mm/vmscan.c
===================================================================
--- Linux.orig/mm/vmscan.c 2007-09-14 10:21:45.000000000 -0400
+++ Linux/mm/vmscan.c 2007-09-14 10:23:46.000000000 -0400
@@ -485,6 +485,11 @@ static unsigned long shrink_page_list(st
sc->nr_scanned++;
+ if (!page_reclaimable(page, NULL)) {
+ SetPageNoreclaim(page);
+ goto keep_locked;
+ }
+
if (!sc->may_swap && page_mapped(page))
goto keep_locked;
@@ -613,6 +618,7 @@ free_it:
continue;
activate_locked:
+ VM_BUG_ON(PageActive(page));
SetPageActive(page);
pgactivate++;
keep_locked:
@@ -640,10 +646,12 @@ keep:
*
* page: page to consider
* mode: one of the LRU isolation modes defined above
+ * take_nonreclaimable: isolate non-reclaimable pages -- i.e., from active
+ * list
*
* returns 0 on success, -ve errno on failure.
*/
-int __isolate_lru_page(struct page *page, int mode)
+int __isolate_lru_page(struct page *page, int mode, int take_nonreclaimable)
{
int ret = -EINVAL;
@@ -652,12 +660,27 @@ int __isolate_lru_page(struct page *page
return ret;
/*
- * When checking the active state, we need to be sure we are
- * dealing with comparible boolean values. Take the logical not
- * of each.
+ * Non-reclaimable pages shouldn't make it onto the inactive list,
+ * so if we encounter one, we should be scanning either the active
+ * list--e.g., after splicing noreclaim list to end of active list--
+ * or nearby pages [lumpy reclaim]. Take it only if scanning active
+ * list.
*/
- if (mode != ISOLATE_BOTH && (!PageActive(page) != !mode))
- return ret;
+ if (PageNoreclaim(page)) {
+ if (!take_nonreclaimable)
+ return -EBUSY; /* lumpy reclaim -- skip this page */
+ /*
+ * else fall thru' and try to isolate
+ */
+ } else {
+ /*
+ * When checking the active state, we need to be sure we are
+ * dealing with comparible boolean values. Take the logical
+ * not of each.
+ */
+ if ((mode != ISOLATE_BOTH && (!PageActive(page) != !mode)))
+ return ret;
+ }
ret = -EBUSY;
if (likely(get_page_unless_zero(page))) {
@@ -670,6 +693,8 @@ int __isolate_lru_page(struct page *page
ret = 0;
}
+ if (TestClearPageNoreclaim(page))
+ SetPageActive(page); /* will recheck in shrink_active_list */
return ret;
}
@@ -711,7 +736,8 @@ static unsigned long isolate_lru_pages(u
VM_BUG_ON(!PageLRU(page));
- switch (__isolate_lru_page(page, mode)) {
+ switch (__isolate_lru_page(page, mode,
+ (mode == ISOLATE_ACTIVE))) {
case 0:
list_move(&page->lru, dst);
nr_taken++;
@@ -757,7 +783,7 @@ static unsigned long isolate_lru_pages(u
/* Check that we have not crossed a zone boundary. */
if (unlikely(page_zone_id(cursor_page) != zone_id))
continue;
- switch (__isolate_lru_page(cursor_page, mode)) {
+ switch (__isolate_lru_page(cursor_page, mode, 0)) {
case 0:
list_move(&cursor_page->lru, dst);
nr_taken++;
@@ -817,9 +843,10 @@ static unsigned long clear_active_flags(
* refcount on the page, which is a fundamental difference from
* isolate_lru_pages (which is called without a stable reference).
*
- * The returned page will have PageLru() cleared, and PageActive set,
- * if it was found on the active list. This flag generally will need to be
- * cleared by the caller before letting the page go.
+ * The returned page will have the PageLru() cleared, and the PageActive or
+ * PageNoreclaim will be set, if it was found on the active or noreclaim list,
+ * respectively. This flag generally will need to be cleared by the caller
+ * before letting the page go.
*
* The vmstat page counts corresponding to the list on which the page was
* found will be decremented.
@@ -843,6 +870,8 @@ int isolate_lru_page(struct page *page)
ClearPageLRU(page);
if (PageActive(page))
del_page_from_active_list(zone, page);
+ else if (PageNoreclaim(page))
+ del_page_from_noreclaim_list(zone, page);
else
del_page_from_inactive_list(zone, page);
}
@@ -933,14 +962,21 @@ static unsigned long shrink_inactive_lis
VM_BUG_ON(PageLRU(page));
SetPageLRU(page);
list_del(&page->lru);
- add_page_to_lru_list(zone, page, PageActive(page));
+ if (PageActive(page)) {
+ VM_BUG_ON(PageNoreclaim(page));
+ add_page_to_active_list(zone, page);
+ } else if (PageNoreclaim(page)) {
+ VM_BUG_ON(PageActive(page));
+ add_page_to_noreclaim_list(zone, page);
+ } else
+ add_page_to_inactive_list(zone, page);
if (!pagevec_add(&pvec, page)) {
spin_unlock_irq(&zone->lru_lock);
__pagevec_release(&pvec);
spin_lock_irq(&zone->lru_lock);
}
}
- } while (nr_scanned < max_scan);
+ } while (nr_scanned < max_scan);
spin_unlock(&zone->lru_lock);
done:
local_irq_enable();
@@ -998,7 +1034,7 @@ static void shrink_active_list(unsigned
int reclaim_mapped = 0;
enum lru_list l;
- for_each_lru(l)
+ for_each_lru(l) /* includes '_NORECLAIM */
INIT_LIST_HEAD(&list[l]);
if (sc->may_swap) {
@@ -1102,6 +1138,14 @@ force_reclaim_mapped:
cond_resched();
page = lru_to_page(&l_hold);
list_del(&page->lru);
+ if (!page_reclaimable(page, NULL)) {
+ /*
+ * divert any non-reclaimable pages onto the
+ * noreclaim list
+ */
+ list_add(&page->lru, &list[LRU_NORECLAIM]);
+ continue;
+ }
if (page_mapped(page)) {
if (!reclaim_mapped ||
(total_swap_pages == 0 && PageAnon(page)) ||
@@ -1169,6 +1213,30 @@ force_reclaim_mapped:
}
__mod_zone_page_state(zone, NR_ACTIVE, pgmoved);
+#ifdef CONFIG_NORECLAIM
+ pgmoved = 0;
+ while (!list_empty(&list[LRU_NORECLAIM])) {
+ page = lru_to_page(&list[LRU_NORECLAIM]);
+ prefetchw_prev_lru_page(page, &list[LRU_NORECLAIM], flags);
+ VM_BUG_ON(PageLRU(page));
+ SetPageLRU(page);
+ VM_BUG_ON(!PageActive(page));
+ ClearPageActive(page);
+ VM_BUG_ON(PageNoreclaim(page));
+ SetPageNoreclaim(page);
+ list_move(&page->lru, &zone->list[LRU_NORECLAIM]);
+ pgmoved++;
+ if (!pagevec_add(&pvec, page)) {
+ __mod_zone_page_state(zone, NR_NORECLAIM, pgmoved);
+ pgmoved = 0;
+ spin_unlock_irq(&zone->lru_lock);
+ __pagevec_release(&pvec);
+ spin_lock_irq(&zone->lru_lock);
+ }
+ }
+ __mod_zone_page_state(zone, NR_NORECLAIM, pgmoved);
+#endif
+
__count_zone_vm_events(PGREFILL, zone, pgscanned);
__count_vm_events(PGDEACTIVATE, pgdeactivate);
spin_unlock_irq(&zone->lru_lock);
@@ -1203,7 +1271,7 @@ static unsigned long shrink_zone(int pri
* Add one to `nr_to_scan' just to make sure that the kernel will
* slowly sift through the active list.
*/
- for_each_lru(l) {
+ for_each_reclaimable_lru(l) {
zone->nr_scan[l] += (zone_page_state(zone, NR_INACTIVE + l)
>> priority) + 1;
nr[l] = zone->nr_scan[l];
@@ -1214,7 +1282,7 @@ static unsigned long shrink_zone(int pri
}
while (nr[LRU_ACTIVE] || nr[LRU_INACTIVE]) {
- for_each_lru(l) {
+ for_each_reclaimable_lru(l) {
if (nr[l]) {
nr_to_scan = min(nr[l],
(unsigned long)sc->swap_cluster_max);
@@ -1748,7 +1816,7 @@ static unsigned long shrink_all_zones(un
if (zone->all_unreclaimable && prio != DEF_PRIORITY)
continue;
- for_each_lru(l) {
+ for_each_reclaimable_lru(l) {
/* For pass = 0 we don't shrink the active list */
if (pass == 0 && l == LRU_ACTIVE)
continue;
@@ -2084,3 +2152,58 @@ int zone_reclaim(struct zone *zone, gfp_
return __zone_reclaim(zone, gfp_mask, order);
}
#endif
+
+#ifdef CONFIG_NORECLAIM
+/*
+ * page_reclaimable(struct page *page, struct vm_area_struct *vma)
+ * Test whether page is reclaimable--i.e., should be placed on active/inactive
+ * lists vs noreclaim list.
+ *
+ * @page - page to test
+ * @vma - vm area in which page is/will be mapped. May be NULL.
+ * If !NULL, called from fault path.
+ *
+ * Reasons page might not be reclaimable:
+ * TODO - later patches
+ *
+ * TODO: specify locking assumptions
+ */
+int page_reclaimable(struct page *page, struct vm_area_struct *vma)
+{
+
+ VM_BUG_ON(PageNoreclaim(page));
+
+ /* TODO: test page [!]reclaimable conditions */
+
+ return 1;
+}
+
+/*
+ * putback_all_noreclaim_pages()
+ *
+ * A really big hammer: put back all pages on each zone's noreclaim list
+ * to the zone's active list to give vmscan a chance to re-evaluate the
+ * reclaimability of the pages. This occurs when, e.g., we have
+ * unswappable pages on the noreclaim lists, and we add swap to the
+ * system.
+//TODO: or as a last resort under extreme memory pressure--before OOM?
+ */
+void putback_all_noreclaim_pages(void)
+{
+ struct zone *zone;
+
+ for_each_zone(zone) {
+ spin_lock(&zone->lru_lock);
+
+ list_splice(&zone->list[LRU_NORECLAIM],
+ &zone->list[LRU_ACTIVE]);
+ INIT_LIST_HEAD(&zone->list[LRU_NORECLAIM]);
+
+ zone_page_state_add(zone_page_state(zone, NR_NORECLAIM), zone,
+ NR_ACTIVE);
+ atomic_long_set(&zone->vm_stat[NR_NORECLAIM], 0);
+
+ spin_unlock(&zone->lru_lock);
+ }
+}
+#endif
Index: Linux/mm/mempolicy.c
===================================================================
--- Linux.orig/mm/mempolicy.c 2007-09-14 10:17:54.000000000 -0400
+++ Linux/mm/mempolicy.c 2007-09-14 10:21:48.000000000 -0400
@@ -1831,7 +1831,7 @@ static void gather_stats(struct page *pa
if (PageSwapCache(page))
md->swapcache++;
- if (PageActive(page))
+ if (PageActive(page) || PageNoreclaim(page))
md->active++;
if (PageWriteback(page))
Index: Linux/mm/memcontrol.c
===================================================================
--- Linux.orig/mm/memcontrol.c 2007-09-14 10:17:54.000000000 -0400
+++ Linux/mm/memcontrol.c 2007-09-14 10:21:48.000000000 -0400
@@ -242,7 +242,11 @@ unsigned long mem_container_isolate_page
else
continue;
- if (__isolate_lru_page(page, mode) == 0) {
+//TODO: for now, don't isolate non-reclaimable pages. When/if
+// mem controller supports a noreclaim list, we'll need to make
+// at least ISOLATE_ACTIVE visible outside of vm_scan and pass
+// the 'take_nonreclaimable' flag accordingly.
+ if (__isolate_lru_page(page, mode, 0) == 0) {
list_move(&page->lru, dst);
nr_taken++;
}
* Re: [PATCH/RFC 6/14] Reclaim Scalability: "No Reclaim LRU Infrastructure"
2007-09-14 20:54 ` [PATCH/RFC 6/14] Reclaim Scalability: "No Reclaim LRU Infrastructure" Lee Schermerhorn
@ 2007-09-14 22:47 ` Christoph Lameter
2007-09-17 15:17 ` Lee Schermerhorn
2007-09-19 6:00 ` Balbir Singh
1 sibling, 1 reply; 77+ messages in thread
From: Christoph Lameter @ 2007-09-14 22:47 UTC (permalink / raw)
To: Lee Schermerhorn
Cc: linux-mm, akpm, mel, riel, balbir, andrea, a.p.zijlstra,
eric.whitney, npiggin
On Fri, 14 Sep 2007, Lee Schermerhorn wrote:
> 1. for now, use bit 30 in page flags. This restricts the no reclaim
> infrastructure to 64-bit systems. [The mlock patch, later in this
> series, uses another of these 64-bit-system-only flags.]
>
> Rationale: 32-bit systems have no free page flags and are less
> likely to have the large amounts of memory that exhibit the problems
> this series attempts to solve. [I'm sure someone will disabuse me
> of this notion.]
>
> Thus, NORECLAIM currently depends on [CONFIG_]64BIT.
Hmmm.. Good, a creative solution to the page flag dilemma.
> +#ifdef CONFIG_NORECLAIM
> +static inline void
> +add_page_to_noreclaim_list(struct zone *zone, struct page *page)
> +{
> + add_page_to_lru_list(zone, page, LRU_NORECLAIM);
> +}
> +
> +static inline void
> +del_page_from_noreclaim_list(struct zone *zone, struct page *page)
> +{
> + del_page_from_lru_list(zone, page, LRU_NORECLAIM);
> +}
> +#else
> +static inline void
> +add_page_to_noreclaim_list(struct zone *zone, struct page *page) { }
> +
> +static inline void
> +del_page_from_noreclaim_list(struct zone *zone, struct page *page) { }
> +#endif
> +
Do we really need to spell these out separately?
> Index: Linux/mm/migrate.c
> ===================================================================
> --- Linux.orig/mm/migrate.c 2007-09-14 10:17:54.000000000 -0400
> +++ Linux/mm/migrate.c 2007-09-14 10:21:48.000000000 -0400
> @@ -52,13 +52,22 @@ int migrate_prep(void)
> return 0;
> }
>
> +/*
> + * move_to_lru() - place @page onto appropriate lru list
> + * based on preserved page flags: active, noreclaim, none
> + */
> static inline void move_to_lru(struct page *page)
> {
> - if (PageActive(page)) {
> + if (PageNoreclaim(page)) {
> + VM_BUG_ON(PageActive(page));
> + ClearPageNoreclaim(page);
> + lru_cache_add_noreclaim(page);
> + } else if (PageActive(page)) {
> /*
> * lru_cache_add_active checks that
> * the PG_active bit is off.
> */
> + VM_BUG_ON(PageNoreclaim(page)); /* race ? */
> ClearPageActive(page);
> lru_cache_add_active(page);
> } else {
Could this be unified with the generic LRU handling in mm_inline.h? If you
have a function that determines the LRU_xxx from the page flags then you
can target the right list by indexing.
Maybe also create a generic lru_cache_add(page, list) function?
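Something like this [sketch]:

static inline enum lru_list page_lru(struct page *page)
{
	if (PageNoreclaim(page))
		return NR_LRU_LISTS - 1;	/* == LRU_NORECLAIM if configured */
	if (PageActive(page))
		return LRU_ACTIVE;
	return LRU_INACTIVE;
}

Then the various open-coded if/else cascades collapse to

	del_page_from_lru_list(zone, page, page_lru(page));

with only the flag clearing left to do at the call sites.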
> + * Non-reclaimable pages shouldn't make it onto the inactive list,
> + * so if we encounter one, we should be scanning either the active
> + * list--e.g., after splicing noreclaim list to end of active list--
> + * or nearby pages [lumpy reclaim]. Take it only if scanning active
> + * list.
One fleeting thought here: It may be useful to *not* allocate known
unreclaimable pages with __GFP_MOVABLE.
> @@ -670,6 +693,8 @@ int __isolate_lru_page(struct page *page
> ret = 0;
> }
>
> + if (TestClearPageNoreclaim(page))
> + SetPageActive(page); /* will recheck in shrink_active_list */
> return ret;
> }
Failing to do the isolation in vmscan.c is not an option?
> @@ -843,6 +870,8 @@ int isolate_lru_page(struct page *page)
> ClearPageLRU(page);
> if (PageActive(page))
> del_page_from_active_list(zone, page);
> + else if (PageNoreclaim(page))
> + del_page_from_noreclaim_list(zone, page);
> else
> del_page_from_inactive_list(zone, page);
> }
Another place where an indexing function from page flags to type of LRU
list could simplify code.
> @@ -933,14 +962,21 @@ static unsigned long shrink_inactive_lis
> VM_BUG_ON(PageLRU(page));
> SetPageLRU(page);
> list_del(&page->lru);
> - add_page_to_lru_list(zone, page, PageActive(page));
> + if (PageActive(page)) {
> + VM_BUG_ON(PageNoreclaim(page));
> + add_page_to_active_list(zone, page);
> + } else if (PageNoreclaim(page)) {
> + VM_BUG_ON(PageActive(page));
> + add_page_to_noreclaim_list(zone, page);
> + } else
> + add_page_to_inactive_list(zone, page);
> if (!pagevec_add(&pvec, page)) {
Ditto.
> +void putback_all_noreclaim_pages(void)
> +{
> + struct zone *zone;
> +
> + for_each_zone(zone) {
> + spin_lock(&zone->lru_lock);
> +
> + list_splice(&zone->list[LRU_NORECLAIM],
> + &zone->list[LRU_ACTIVE]);
> + INIT_LIST_HEAD(&zone->list[LRU_NORECLAIM]);
> +
> + zone_page_state_add(zone_page_state(zone, NR_NORECLAIM), zone,
> + NR_ACTIVE);
> + atomic_long_set(&zone->vm_stat[NR_NORECLAIM], 0);
Racy if multiple reclaims are ongoing. Better subtract the value via
mod_zone_page_state
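I.e. something like (untested):

	long nr_noreclaim;

	/* still under zone->lru_lock, as before */
	nr_noreclaim = zone_page_state(zone, NR_NORECLAIM);
	list_splice_init(&zone->list[LRU_NORECLAIM], &zone->list[LRU_ACTIVE]);
	__mod_zone_page_state(zone, NR_ACTIVE, nr_noreclaim);
	__mod_zone_page_state(zone, NR_NORECLAIM, -nr_noreclaim);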
* Re: [PATCH/RFC 6/14] Reclaim Scalability: "No Reclaim LRU Infrastructure"
2007-09-14 22:47 ` Christoph Lameter
@ 2007-09-17 15:17 ` Lee Schermerhorn
2007-09-17 18:41 ` Christoph Lameter
0 siblings, 1 reply; 77+ messages in thread
From: Lee Schermerhorn @ 2007-09-17 15:17 UTC (permalink / raw)
To: Christoph Lameter
Cc: linux-mm, akpm, mel, riel, balbir, andrea, a.p.zijlstra,
eric.whitney, npiggin
On Fri, 2007-09-14 at 15:47 -0700, Christoph Lameter wrote:
> On Fri, 14 Sep 2007, Lee Schermerhorn wrote:
>
> > 1. for now, use bit 30 in page flags. This restricts the no reclaim
> > infrastructure to 64-bit systems. [The mlock patch, later in this
> > series, uses another of these 64-bit-system-only flags.]
> >
> > Rationale: 32-bit systems have no free page flags and are less
> > likely to have the large amounts of memory that exhibit the problems
> > this series attempts to solve. [I'm sure someone will disabuse me
> > of this notion.]
> >
> > Thus, NORECLAIM currently depends on [CONFIG_]64BIT.
>
> Hmmm.. Good, a creative solution to the page flag dilemma.
Still not sure I can get away with it tho' :-).
>
> > +#ifdef CONFIG_NORECLAIM
> > +static inline void
> > +add_page_to_noreclaim_list(struct zone *zone, struct page *page)
> > +{
> > + add_page_to_lru_list(zone, page, LRU_NORECLAIM);
> > +}
> > +
> > +static inline void
> > +del_page_from_noreclaim_list(struct zone *zone, struct page *page)
> > +{
> > + del_page_from_lru_list(zone, page, LRU_NORECLAIM);
> > +}
> > +#else
> > +static inline void
> > +add_page_to_noreclaim_list(struct zone *zone, struct page *page) { }
> > +
> > +static inline void
> > +del_page_from_noreclaim_list(struct zone *zone, struct page *page) { }
> > +#endif
> > +
>
> Do we really need to spell these out separately?
Well, you left the "{add|del}_page_to_[in]active_list()" functions, so I
kept these separate as well. We could make another cleanup pass and
replace all of these with calls to the "{add|del}_page_to_lru_list()"
functions with the appropriate list enum.
Also, nothing calls del_page_from_noreclaim_list() right now, so we can
probably lose it.
>
> > Index: Linux/mm/migrate.c
> > ===================================================================
> > --- Linux.orig/mm/migrate.c 2007-09-14 10:17:54.000000000 -0400
> > +++ Linux/mm/migrate.c 2007-09-14 10:21:48.000000000 -0400
> > @@ -52,13 +52,22 @@ int migrate_prep(void)
> > return 0;
> > }
> >
> > +/*
> > + * move_to_lru() - place @page onto appropriate lru list
> > + * based on preserved page flags: active, noreclaim, none
> > + */
> > static inline void move_to_lru(struct page *page)
> > {
> > - if (PageActive(page)) {
> > + if (PageNoreclaim(page)) {
> > + VM_BUG_ON(PageActive(page));
> > + ClearPageNoreclaim(page);
> > + lru_cache_add_noreclaim(page);
> > + } else if (PageActive(page)) {
> > /*
> > * lru_cache_add_active checks that
> > * the PG_active bit is off.
> > */
> > + VM_BUG_ON(PageNoreclaim(page)); /* race ? */
> > ClearPageActive(page);
> > lru_cache_add_active(page);
> > } else {
>
> Could this be unified with the generic LRU handling in mm_inline.h? If you
> have a function that determines the LRU_xxx from the page flags then you
> can target the right list by indexing.
>
> Maybe also create a generic lru_cache_add(page, list) function?
Possibly. When you created the migration facility, you had these as
separate. It's still private to migrate.c. There are a number of
different variants of this. Here, we put the page back onto the
appropriate list based on the Active|Noreclaim|<none> flag, preserved by
isolate_lru_page(). In mm/mlock.c, I have a similar
function--putback_lru_page() that just clears the flags and calls
lru_cache_add_active_or_noreclaim() to retest page_reclaimable() and
chose the appropriate list. It never chooses the inactive list, tho'
Maybe that's a mistake?.
Neither of these is a particularly hot path, I think. So, maybe we can
come up with one that serves both purposes with a steering argument.
I'd want to test it for performance regression, of course.
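Something along these lines, maybe [rough sketch, untested--putback_lru_page()
here is a hypothetical unified helper, not what's in the current patches]:

static void putback_lru_page(struct page *page, int recheck)
{
	if (recheck) {
		/* mlock.c case: re-test reclaimability */
		ClearPageActive(page);
		ClearPageNoreclaim(page);
		lru_cache_add_active_or_noreclaim(page, NULL);
	} else if (PageNoreclaim(page)) {
		/* migrate.c case: trust the flags preserved by isolation */
		ClearPageNoreclaim(page);
		lru_cache_add_noreclaim(page);
	} else if (PageActive(page)) {
		ClearPageActive(page);
		lru_cache_add_active(page);
	} else
		lru_cache_add(page);
}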
>
> > + * Non-reclaimable pages shouldn't make it onto the inactive list,
> > + * so if we encounter one, we should be scanning either the active
> > + * list--e.g., after splicing noreclaim list to end of active list--
> > + * or nearby pages [lumpy reclaim]. Take it only if scanning active
> > + * list.
>
>
> One fleeting thought here: It may be useful to *not* allocate known
> unreclaimable pages with __GFP_MOVABLE.
Sorry, I don't understand where you're coming from here.
Non-reclaimable pages should be migratable, but maybe __GFP_MOVABLE
means something else?
>
> > @@ -670,6 +693,8 @@ int __isolate_lru_page(struct page *page
> > ret = 0;
> > }
> >
> > + if (TestClearPageNoreclaim(page))
> > + SetPageActive(page); /* will recheck in shrink_active_list */
> > return ret;
> > }
>
> Failing to do the isoilation in vmscan.c is not an option?
1) This test doesn't fail the isolation. It just ensures that the
noreclaim flag is cleared and, if it was set, replaces it with Active.
I think this is OK because I only accept non-reclaimable pages if we're
scanning the active list. This is in support of splicing the noreclaim
list onto the active list when we want to rescan. As I mentioned in
mail to Peter, I'm not too happy with this approach--my current
implementation anyway. Need to revisit/discuss this...
2) Since the lumpy reclaim changes, __isolate_lru_page() CAN return -EBUSY and
isolate_lru_pages() will just stick the page back on the list being
scanned. We do this if the page state [active/inactive] doesn't match
the isolation "mode"--i.e., when lumpy reclaim is looking for physically
adjacent pages. I also do this if the page is non-reclaimable and
isolate_lru_pages() doesn't specify that it's OK to take them. As
mentioned above, it's only OK if we're scanning the active list from
shrink_active_list(). I think this whole thing is fragile--thus my
dissatisfaction...
>
> > @@ -843,6 +870,8 @@ int isolate_lru_page(struct page *page)
> > ClearPageLRU(page);
> > if (PageActive(page))
> > del_page_from_active_list(zone, page);
> > + else if (PageNoreclaim(page))
> > + del_page_from_noreclaim_list(zone, page);
> > else
> > del_page_from_inactive_list(zone, page);
> > }
>
> Another place where an indexing function from page flags to type of LRU
> list could simplify code.
Agreed. Need another pass...
>
> > @@ -933,14 +962,21 @@ static unsigned long shrink_inactive_lis
> > VM_BUG_ON(PageLRU(page));
> > SetPageLRU(page);
> > list_del(&page->lru);
> > - add_page_to_lru_list(zone, page, PageActive(page));
> > + if (PageActive(page)) {
> > + VM_BUG_ON(PageNoreclaim(page));
> > + add_page_to_active_list(zone, page);
> > + } else if (PageNoreclaim(page)) {
> > + VM_BUG_ON(PageActive(page));
> > + add_page_to_noreclaim_list(zone, page);
> > + } else
> > + add_page_to_inactive_list(zone, page);
> > if (!pagevec_add(&pvec, page)) {
>
> Ditto.
>
> > +void putback_all_noreclaim_pages(void)
> > +{
> > + struct zone *zone;
> > +
> > + for_each_zone(zone) {
> > + spin_lock(&zone->lru_lock);
> > +
> > + list_splice(&zone->list[LRU_NORECLAIM],
> > + &zone->list[LRU_ACTIVE]);
> > + INIT_LIST_HEAD(&zone->list[LRU_NORECLAIM]);
> > +
> > + zone_page_state_add(zone_page_state(zone, NR_NORECLAIM), zone,
> > + NR_ACTIVE);
> > + atomic_long_set(&zone->vm_stat[NR_NORECLAIM], 0);
>
> Racy if multiple reclaims are ongoing. Better subtract the value via
> mod_zone_page_state
OK. I'll make that change. But, again, I need to revisit this entire
concept--splicing the noreclaim list back to the active.
Thanks for the review.
Lee
* Re: [PATCH/RFC 6/14] Reclaim Scalability: "No Reclaim LRU Infrastructure"
2007-09-17 15:17 ` Lee Schermerhorn
@ 2007-09-17 18:41 ` Christoph Lameter
2007-09-18 9:54 ` Mel Gorman
0 siblings, 1 reply; 77+ messages in thread
From: Christoph Lameter @ 2007-09-17 18:41 UTC (permalink / raw)
To: Lee Schermerhorn
Cc: linux-mm, akpm, mel, riel, balbir, andrea, a.p.zijlstra,
eric.whitney, npiggin
On Mon, 17 Sep 2007, Lee Schermerhorn wrote:
> > One fleeting thought here: It may be useful to *not* allocate known
> > unreclaimable pages with __GFP_MOVABLE.
>
> Sorry, I don't understand where you're coming from here.
> Non-reclaimable pages should be migratable, but maybe __GFP_MOVABLE
> means something else?
True. __GFP_MOVABLE + MLOCK is movable. Also the ramfs/shmem pages. There
may be uses though that require a page to stay put because it is used for
some nefarious I/O purpose by a driver. RDMA comes to mind. Maybe we need
some additional option that works like MLOCK but forbids migration. Those
would then be unreclaimable and not __GFP_MOVABLE. I know some of our
applications create huge amount of these.
* Re: [PATCH/RFC 6/14] Reclaim Scalability: "No Reclaim LRU Infrastructure"
2007-09-17 18:41 ` Christoph Lameter
@ 2007-09-18 9:54 ` Mel Gorman
2007-09-18 19:45 ` Christoph Lameter
0 siblings, 1 reply; 77+ messages in thread
From: Mel Gorman @ 2007-09-18 9:54 UTC (permalink / raw)
To: Christoph Lameter
Cc: Lee Schermerhorn, linux-mm, akpm, riel, balbir, andrea,
a.p.zijlstra, eric.whitney, npiggin
On (17/09/07 11:41), Christoph Lameter didst pronounce:
> On Mon, 17 Sep 2007, Lee Schermerhorn wrote:
>
> > > One fleeting thought here: It may be useful to *not* allocate known
> > > unreclaimable pages with __GFP_MOVABLE.
> >
> > Sorry, I don't understand where you're coming from here.
> > Non-reclaimable pages should be migratable, but maybe __GFP_MOVABLE
> > means something else?
>
> True. __GFP_MOVABLE + MLOCK is movable.
Yes. Right now, it's rare we actually move them but the patches exist to
make it possible when we find it to be a real problem.
> Also the ramfs/shmem pages. There
> may be uses though that require a page to stay put because it is used for
> some nefarious I/O purpose by a driver. RDMA comes to mind.
Yeah :/
> Maybe we need
> some additional option that works like MLOCK but forbids migration.
The problem with RDMA that I recall is that we don't know at allocation
time that they may be unmovable sometimes in the future. I didn't think
of a way around that problem.
> Those
> would then be unreclaimable and not __GFP_MOVABLE. I know some of our
> applications create huge amount of these.
>
Can you think of a way that pages that will be later pinned by something
like RDMA can be identified in advance?
--
Mel Gorman
Part-time Phd Student Linux Technology Center
University of Limerick IBM Dublin Software Lab
* Re: [PATCH/RFC 6/14] Reclaim Scalability: "No Reclaim LRU Infrastructure"
2007-09-18 9:54 ` Mel Gorman
@ 2007-09-18 19:45 ` Christoph Lameter
2007-09-19 11:11 ` Mel Gorman
0 siblings, 1 reply; 77+ messages in thread
From: Christoph Lameter @ 2007-09-18 19:45 UTC (permalink / raw)
To: Mel Gorman
Cc: Lee Schermerhorn, linux-mm, akpm, riel, balbir, andrea,
a.p.zijlstra, eric.whitney, npiggin
On Tue, 18 Sep 2007, Mel Gorman wrote:
> > Also the ramfs/shmem pages. There
> > may be uses though that require a page to stay put because it is used for
> > some nefarious I/O purpose by a driver. RDMA comes to mind.
>
> Yeah :/
>
> > Maybe we need
> > some additional option that works like MLOCK but forbids migration.
>
> The problem with RDMA that I recall is that we don't know at allocation
> time that they may be unmovable sometimes in the future. I didn't think
> of a way around that problem.
The current way that we have around the problem is to increase the page
count. With that all attempts to unmap the page by migration or otherwise
fail and the page stays put.
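Roughly [this is just a sketch of today's approach, not new code]:

	/*
	 * Pin by taking an extra reference, typically via get_user_pages().
	 * Migration and reclaim see the elevated page count and leave the
	 * page where it is.
	 */
	get_page(page);
	/* ... page used for RDMA / cross-partition access ... */
	put_page(page);		/* RDMA drops the pin when the I/O completes;
				 * XPMEM may hold it indefinitely */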
RDMA is probably only temporarily pinning these while I/O is in
progress? Our applications (XPMEM) may pin them for good.
> > Those
> > would then be unreclaimable and not __GFP_MOVABLE. I know some of our
> > applications create huge amount of these.
> >
>
> Can you think of a way that pages that will be later pinned by something
> like RDMA can be identified in advance?
No. Nor in our XPMEM situation. We could move them at the point when they
are pinned to another section?
* Re: [PATCH/RFC 6/14] Reclaim Scalability: "No Reclaim LRU Infrastructure"
2007-09-18 19:45 ` Christoph Lameter
@ 2007-09-19 11:11 ` Mel Gorman
2007-09-19 18:03 ` Christoph Lameter
0 siblings, 1 reply; 77+ messages in thread
From: Mel Gorman @ 2007-09-19 11:11 UTC (permalink / raw)
To: Christoph Lameter
Cc: Lee Schermerhorn, linux-mm, akpm, riel, balbir, andrea,
a.p.zijlstra, eric.whitney, npiggin
On (18/09/07 12:45), Christoph Lameter didst pronounce:
> On Tue, 18 Sep 2007, Mel Gorman wrote:
>
> > > Also the ramfs/shmem pages. There
> > > may be uses though that require a page to stay put because it is used for
> > > some nefarious I/O purpose by a driver. RDMA comes to mind.
> >
> > Yeah :/
> >
> > > Maybe we need
> > > some additional option that works like MLOCK but forbids migration.
> >
> > The problem with RDMA that I recall is that we don't know at allocation
> > time that they may be unmovable sometimes in the future. I didn't think
> > of a way around that problem.
>
> The current way that we have around the problem is to increase the page
> count. With that all attempts to unmap the page by migration or otherwise
> fail and the page stays put.
>
> RDMA is probably only temporarily pinning these while I/O is in progress?.
> Our applications (XPMEM)
> may pins them for good.
>
I'm not that familiar with XPMEM. What is it doing that can pin memory
permanently?
> > > Those
> > > would then be unreclaimable and not __GFP_MOVABLE. I know some of our
> > > applications create huge amount of these.
> > >
> >
> > Can you think of a way that pages that will be later pinned by something
> > like RDMA can be identified in advance?
>
> No. Nor in our XPMEM situation. We could move them at the point when they
> are pinned to another section?
>
XPMEM could do that all right. Allocate a non-movable page, copy and
pin.
--
Mel Gorman
Part-time Phd Student Linux Technology Center
University of Limerick IBM Dublin Software Lab
* Re: [PATCH/RFC 6/14] Reclaim Scalability: "No Reclaim LRU Infrastructure"
2007-09-19 11:11 ` Mel Gorman
@ 2007-09-19 18:03 ` Christoph Lameter
0 siblings, 0 replies; 77+ messages in thread
From: Christoph Lameter @ 2007-09-19 18:03 UTC (permalink / raw)
To: Mel Gorman
Cc: Lee Schermerhorn, linux-mm, akpm, riel, balbir, andrea,
a.p.zijlstra, eric.whitney, npiggin
On Wed, 19 Sep 2007, Mel Gorman wrote:
> > RDMA is probably only temporarily pinning these while I/O is in
> > progress? Our applications (XPMEM) may pin them for good.
> >
>
> I'm not that familiar with XPMEM. What is it doing that can pin memory
> permanently?
It exports a process address space to another Linux instance over a
network or coherent memory.
> > No. Nor in our XPMEM situation. We could move them at the point when they
> > are pinned to another section?
> >
>
> XPMEM could do that all right. Allocate a non-movable page, copy and
> pin.
I think we need a general mechanism that also covers RDMA and other uses.
* Re: [PATCH/RFC 6/14] Reclaim Scalability: "No Reclaim LRU Infrastructure"
2007-09-14 20:54 ` [PATCH/RFC 6/14] Reclaim Scalability: "No Reclaim LRU Infrastructure" Lee Schermerhorn
2007-09-14 22:47 ` Christoph Lameter
@ 2007-09-19 6:00 ` Balbir Singh
2007-09-19 14:47 ` Lee Schermerhorn
1 sibling, 1 reply; 77+ messages in thread
From: Balbir Singh @ 2007-09-19 6:00 UTC (permalink / raw)
To: Lee Schermerhorn
Cc: linux-mm, akpm, mel, clameter, riel, andrea, a.p.zijlstra,
eric.whitney, npiggin
Lee Schermerhorn wrote:
> PATCH/RFC 06/14 Reclaim Scalability: "No Reclaim LRU Infrastructure"
>
> Against: 2.6.23-rc4-mm1
>
> Infrastructure to manage pages excluded from reclaim--i.e., hidden
> from vmscan. Based on a patch by Larry Woodman of Red Hat. Reworked
> to maintain "nonreclaimable" pages on a separate per-zone LRU list,
> to "hide" them from vmscan. A separate noreclaim pagevec is provided
> for shrink_active_list() to move nonreclaimable pages to the noreclaim
> list without over burdening the zone lru_lock.
>
> Pages on the noreclaim list have both PG_noreclaim and PG_lru set.
> Thus, PG_noreclaim is analogous to and mutually exclusive with
> PG_active--it specifies which LRU list the page is on.
>
> The noreclaim infrastructure is enabled by a new mm Kconfig option
> [CONFIG_]NORECLAIM.
>
Could we use a different name? CONFIG_NORECLAIM could be misunderstood
to mean that reclaim is disabled on the system altogether.
>
> 4. TODO: Memory Controllers maintain separate active and inactive lists.
> Need to consider whether they should also maintain a noreclaim list.
> Also, convert to use Christoph's array of indexed lru variables?
>
> See //TODO note in mm/memcontrol.c re: isolating non-reclaimable
> pages.
>
Thanks, I'll look into exploiting this in the memory controller.
> Index: Linux/mm/swap.c
> ===================================================================
> --- Linux.orig/mm/swap.c 2007-09-14 10:21:45.000000000 -0400
> +++ Linux/mm/swap.c 2007-09-14 10:21:48.000000000 -0400
> @@ -116,14 +116,14 @@ int rotate_reclaimable_page(struct page
> return 1;
> if (PageDirty(page))
> return 1;
> - if (PageActive(page))
> + if (PageActive(page) | PageNoreclaim(page))
Did you intend to make this bitwise or?
> - if (PageLRU(page) && !PageActive(page)) {
> + if (PageLRU(page) && !PageActive(page) && !PageNoreclaim(page)) {
Since we use this again below, does it make sense to wrap it in an
inline function and call it check_page_lru_inactive_reclaimable()?
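e.g. [sketch]:

static inline int check_page_lru_inactive_reclaimable(struct page *page)
{
	/* on the LRU, and neither active nor hidden from reclaim */
	return PageLRU(page) && !PageActive(page) && !PageNoreclaim(page);
}

Then rotate_reclaimable_page() and activate_page() can both call it.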
> void lru_add_drain(void)
> @@ -277,14 +312,18 @@ void release_pages(struct page **pages,
>
> if (PageLRU(page)) {
> struct zone *pagezone = page_zone(page);
> + int is_lru_page;
> +
> if (pagezone != zone) {
> if (zone)
> spin_unlock_irq(&zone->lru_lock);
> zone = pagezone;
> spin_lock_irq(&zone->lru_lock);
> }
> - VM_BUG_ON(!PageLRU(page));
> - __ClearPageLRU(page);
> + is_lru_page = PageLRU(page);
> + VM_BUG_ON(!(is_lru_page));
> + if (is_lru_page)
This is a little confusing: after asserting that the page is indeed
on the LRU, why add the check for is_lru_page again?
Comments would be helpful here.
> +#ifdef CONFIG_NORECLAIM
> +void __pagevec_lru_add_noreclaim(struct pagevec *pvec)
> +{
> + int i;
> + struct zone *zone = NULL;
> +
> + for (i = 0; i < pagevec_count(pvec); i++) {
> + struct page *page = pvec->pages[i];
> + struct zone *pagezone = page_zone(page);
> +
> + if (pagezone != zone) {
> + if (zone)
> + spin_unlock_irq(&zone->lru_lock);
> + zone = pagezone;
> + spin_lock_irq(&zone->lru_lock);
> + }
> + VM_BUG_ON(PageLRU(page));
> + SetPageLRU(page);
> + VM_BUG_ON(PageActive(page) || PageNoreclaim(page));
> + SetPageNoreclaim(page);
> + add_page_to_noreclaim_list(zone, page);
These two calls seem to be the only difference between __pagevec_lru_add
and this routine. Any chance we could refactor to reuse most of the
code? Something like __pagevec_lru_add_prepare(), do the per-page work,
and then call __pagevec_lru_add_finish()?
> +/*
> + * move_to_lru() - place @page onto appropriate lru list
> + * based on preserved page flags: active, noreclaim, none
> + */
> static inline void move_to_lru(struct page *page)
> {
> - if (PageActive(page)) {
> + if (PageNoreclaim(page)) {
> + VM_BUG_ON(PageActive(page));
> + ClearPageNoreclaim(page);
> + lru_cache_add_noreclaim(page);
I know that lru_cache_add_noreclaim() does the right thing
by looking at PageNoReclaim(), but the sequence is a little
confusing to read.
> -int __isolate_lru_page(struct page *page, int mode)
> +int __isolate_lru_page(struct page *page, int mode, int take_nonreclaimable)
> {
> int ret = -EINVAL;
>
> @@ -652,12 +660,27 @@ int __isolate_lru_page(struct page *page
> return ret;
>
> /*
> - * When checking the active state, we need to be sure we are
> - * dealing with comparible boolean values. Take the logical not
> - * of each.
> + * Non-reclaimable pages shouldn't make it onto the inactive list,
> + * so if we encounter one, we should be scanning either the active
> + * list--e.g., after splicing noreclaim list to end of active list--
> + * or nearby pages [lumpy reclaim]. Take it only if scanning active
> + * list.
> */
> - if (mode != ISOLATE_BOTH && (!PageActive(page) != !mode))
> - return ret;
> + if (PageNoreclaim(page)) {
> + if (!take_nonreclaimable)
> + return -EBUSY; /* lumpy reclaim -- skip this page */
> + /*
> + * else fall thru' and try to isolate
> + */
I think we need to distinguish between the types of nonreclaimable
pages. Is it the heavily mapped pages that you pass on further?
A casual reader like me finds it hard to understand how lumpy reclaim
might try to reclaim a non-reclaimable page :-)
--
Warm Regards,
Balbir Singh
Linux Technology Center
IBM, ISTL
* Re: [PATCH/RFC 6/14] Reclaim Scalability: "No Reclaim LRU Infrastructure"
2007-09-19 6:00 ` Balbir Singh
@ 2007-09-19 14:47 ` Lee Schermerhorn
0 siblings, 0 replies; 77+ messages in thread
From: Lee Schermerhorn @ 2007-09-19 14:47 UTC (permalink / raw)
To: balbir
Cc: linux-mm, akpm, mel, clameter, riel, andrea, a.p.zijlstra,
eric.whitney, npiggin
On Wed, 2007-09-19 at 11:30 +0530, Balbir Singh wrote:
> Lee Schermerhorn wrote:
> > PATCH/RFC 06/14 Reclaim Scalability: "No Reclaim LRU Infrastructure"
> >
> > Against: 2.6.23-rc4-mm1
> >
> > Infrastructure to manage pages excluded from reclaim--i.e., hidden
> > from vmscan. Based on a patch by Larry Woodman of Red Hat. Reworked
> > to maintain "nonreclaimable" pages on a separate per-zone LRU list,
> > to "hide" them from vmscan. A separate noreclaim pagevec is provided
> > for shrink_active_list() to move nonreclaimable pages to the noreclaim
> > list without over burdening the zone lru_lock.
> >
> > Pages on the noreclaim list have both PG_noreclaim and PG_lru set.
> > Thus, PG_noreclaim is analogous to and mutually exclusive with
> > PG_active--it specifies which LRU list the page is on.
> >
> > The noreclaim infrastructure is enabled by a new mm Kconfig option
> > [CONFIG_]NORECLAIM.
> >
>
> Could we use a different name? CONFIG_NORECLAIM could be misunderstood
> to mean that reclaim is disabled on the system altogether.
Sure. When this settles down, if something like it gets accepted, we
can choose a different name--if we still want it to be configurable.
<snip>
> > Index: Linux/mm/swap.c
> > ===================================================================
> > --- Linux.orig/mm/swap.c 2007-09-14 10:21:45.000000000 -0400
> > +++ Linux/mm/swap.c 2007-09-14 10:21:48.000000000 -0400
> > @@ -116,14 +116,14 @@ int rotate_reclaimable_page(struct page
> > return 1;
> > if (PageDirty(page))
> > return 1;
> > - if (PageActive(page))
> > + if (PageActive(page) | PageNoreclaim(page))
>
> Did you intend to make this bitwise or?
Uh, no... Thanks, will fix.
>
> > - if (PageLRU(page) && !PageActive(page)) {
> > + if (PageLRU(page) && !PageActive(page) && !PageNoreclaim(page)) {
>
> Since we use this check again below, does it make sense to wrap it in an
> inline function, say check_page_lru_inactive_reclaimable()?
Perhaps. But sometimes folks complain that the kernel is programmed in
C, not in cpp, so I tend to err on the side of open coding.
>
>
> > void lru_add_drain(void)
> > @@ -277,14 +312,18 @@ void release_pages(struct page **pages,
> >
> > if (PageLRU(page)) {
> > struct zone *pagezone = page_zone(page);
> > + int is_lru_page;
> > +
> > if (pagezone != zone) {
> > if (zone)
> > spin_unlock_irq(&zone->lru_lock);
> > zone = pagezone;
> > spin_lock_irq(&zone->lru_lock);
> > }
> > - VM_BUG_ON(!PageLRU(page));
> > - __ClearPageLRU(page);
> > + is_lru_page = PageLRU(page);
> > + VM_BUG_ON(!(is_lru_page));
> > + if (is_lru_page)
>
> This is a little confusing: after asserting that the page
> is indeed on the LRU, why check is_lru_page again?
> A comment would be helpful here.
Not sure why I did this. Maybe a holdover from previous code. I'll
check and either fix it or add a comment.
>
>
> > +#ifdef CONFIG_NORECLAIM
> > +void __pagevec_lru_add_noreclaim(struct pagevec *pvec)
> > +{
> > + int i;
> > + struct zone *zone = NULL;
> > +
> > + for (i = 0; i < pagevec_count(pvec); i++) {
> > + struct page *page = pvec->pages[i];
> > + struct zone *pagezone = page_zone(page);
> > +
> > + if (pagezone != zone) {
> > + if (zone)
> > + spin_unlock_irq(&zone->lru_lock);
> > + zone = pagezone;
> > + spin_lock_irq(&zone->lru_lock);
> > + }
> > + VM_BUG_ON(PageLRU(page));
> > + SetPageLRU(page);
>
> > + VM_BUG_ON(PageActive(page) || PageNoreclaim(page));
> > + SetPageNoreclaim(page);
> > + add_page_to_noreclaim_list(zone, page);
>
> These two calls seem to be the only difference between __pagevec_lru_add
> and this routine. Any chance we could refactor to reuse most of the
> code? Something like __pagevec_lru_add_prepare(), do the per-page work,
> and then call __pagevec_lru_add_finish()?
Yeah. There is a lot of duplicated code in the lru pagevec management.
I assumed this was intentional because it's fast path [fault path] code.
As the number of lru lists increases [Rik has a patch to double the
number of active/inactive lists per zone] we may want to factor this
area and the loops at the end of shrink_active_list() that distribute
pages back to the appropriate lists. Have to be careful of performance
regression, tho'.
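Purely as a sketch of the shape such factoring could take (names are made up,
nothing here was posted; add_page_to_noreclaim_list() and the PG_noreclaim
helpers are the ones introduced earlier in this series, and the indirect call
per page is exactly the fast-path cost that would need measuring):

static void __pagevec_lru_add_common(struct pagevec *pvec,
			void (*add_one)(struct zone *, struct page *))
{
	int i;
	struct zone *zone = NULL;

	for (i = 0; i < pagevec_count(pvec); i++) {
		struct page *page = pvec->pages[i];
		struct zone *pagezone = page_zone(page);

		if (pagezone != zone) {
			if (zone)
				spin_unlock_irq(&zone->lru_lock);
			zone = pagezone;
			spin_lock_irq(&zone->lru_lock);
		}
		VM_BUG_ON(PageLRU(page));
		SetPageLRU(page);
		add_one(zone, page);	/* per-list work */
	}
	if (zone)
		spin_unlock_irq(&zone->lru_lock);
	release_pages(pvec->pages, pvec->nr, pvec->cold);
	pagevec_reinit(pvec);
}

static void lru_add_one_noreclaim(struct zone *zone, struct page *page)
{
	VM_BUG_ON(PageActive(page) || PageNoreclaim(page));
	SetPageNoreclaim(page);
	add_page_to_noreclaim_list(zone, page);
}

void __pagevec_lru_add_noreclaim(struct pagevec *pvec)
{
	__pagevec_lru_add_common(pvec, lru_add_one_noreclaim);
}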
>
>
> > +/*
> > + * move_to_lru() - place @page onto appropriate lru list
> > + * based on preserved page flags: active, noreclaim, none
> > + */
> > static inline void move_to_lru(struct page *page)
> > {
> > - if (PageActive(page)) {
> > + if (PageNoreclaim(page)) {
> > + VM_BUG_ON(PageActive(page));
> > + ClearPageNoreclaim(page);
> > + lru_cache_add_noreclaim(page);
>
> I know that lru_cache_add_noreclaim() does the right thing
> by looking at PageNoReclaim(), but the sequence is a little
> confusing to read.
I don't understand your comment. I was just following the pre-existing
pattern in move_to_lru(). This function just moves pages that have been
isolated for migration back to the appropriate lru list based on the
page flags via the pagevec. The page flag that determines the
"appropriate list" must be cleared to avoid a VM_BUG_ON later. The
VM_BUG_ON here is just my paranoia--to ensure that I don't have both
PG_active and PG_noreclaim set at the same time.
>
>
> > -int __isolate_lru_page(struct page *page, int mode)
> > +int __isolate_lru_page(struct page *page, int mode, int take_nonreclaimable)
> > {
> > int ret = -EINVAL;
> >
> > @@ -652,12 +660,27 @@ int __isolate_lru_page(struct page *page
> > return ret;
> >
> > /*
> > - * When checking the active state, we need to be sure we are
> > - * dealing with comparible boolean values. Take the logical not
> > - * of each.
> > + * Non-reclaimable pages shouldn't make it onto the inactive list,
> > + * so if we encounter one, we should be scanning either the active
> > + * list--e.g., after splicing noreclaim list to end of active list--
> > + * or nearby pages [lumpy reclaim]. Take it only if scanning active
> > + * list.
> > */
> > - if (mode != ISOLATE_BOTH && (!PageActive(page) != !mode))
> > - return ret;
> > + if (PageNoreclaim(page)) {
> > + if (!take_nonreclaimable)
> > + return -EBUSY; /* lumpy reclaim -- skip this page */
> > + /*
> > + * else fall thru' and try to isolate
> > + */
>
> I think we need to distinguish between the types of nonreclaimable
> pages. Is it the heavily mapped pages that you pass on further?
> A casual reader like me finds it hard to understand how lumpy reclaim
> might try to reclaim a non-reclaimable page :-)
If you look at isolate_lru_pages(), after it calls __isolate_lru_page()
to isolate a page off the list that it's scanning, if order is non-zero,
it attempts to isolate other pages that are part of the same higher
order page, w/o regard to what list they're on. If it succeeded in
taking a non-reclaimable page here, this page would have to be tested
again in shrink_active_list() where it would probably be found to still
be non-reclaimable [but maybe not--more on this below] and we would have
wasted part of this batch.
I explicitly allowed isolating non-reclaimable pages from the active
list, because currently I have a function to splice the zones' noreclaim
lists to the end of the active list for another check for
reclaimability. This would occur when swap was added, etc. Note that
when I do take a non-reclaimable page, I reset PG_noreclaim [via
TestClear...] and set PG_active if it was non-reclaimable.
Now, I don't have a feel for how frequent lumpy reclaim will be, but
maybe it's not so bad to just allow non-reclaimable pages to be
unconditionally isolated and rechecked for reclaimability in
shrink_[in]active_list(). We just need to ensure that the page flags
[active or not] are set correctly depending on which list we're
scanning.
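A rough sketch of that alternative, for discussion only (loosely based on the
2.6.23-rc4-mm1 __isolate_lru_page() with the take_nonreclaimable parameter
dropped; not a posted patch):

int __isolate_lru_page(struct page *page, int mode)
{
	int ret = -EINVAL;

	/* Only take pages that are on an LRU list. */
	if (!PageLRU(page))
		return ret;

	/*
	 * Take PG_noreclaim pages unconditionally; shrink_[in]active_list()
	 * would recheck page_reclaimable() and push still-nonreclaimable
	 * pages back onto the noreclaim list with the right flags.
	 */
	if (!PageNoreclaim(page) &&
	    mode != ISOLATE_BOTH && (!PageActive(page) != !mode))
		return ret;

	ret = -EBUSY;
	if (likely(get_page_unless_zero(page))) {
		/*
		 * Don't clear PageLRU until we're sure the page isn't
		 * being freed elsewhere.
		 */
		ClearPageLRU(page);
		ret = 0;
	}
	return ret;
}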
Thanks,
Lee
* [PATCH/RFC 7/14] Reclaim Scalability: Non-reclaimable page statistics
2007-09-14 20:53 [PATCH/RFC 0/14] Page Reclaim Scalability Lee Schermerhorn
` (5 preceding siblings ...)
2007-09-14 20:54 ` [PATCH/RFC 6/14] Reclaim Scalability: "No Reclaim LRU Infrastructure" Lee Schermerhorn
@ 2007-09-14 20:54 ` Lee Schermerhorn
2007-09-17 1:56 ` Rik van Riel
2007-09-14 20:54 ` [PATCH/RFC 8/14] Reclaim Scalability: Ram Disk Pages are non-reclaimable Lee Schermerhorn
` (8 subsequent siblings)
15 siblings, 1 reply; 77+ messages in thread
From: Lee Schermerhorn @ 2007-09-14 20:54 UTC (permalink / raw)
To: linux-mm
Cc: akpm, mel, clameter, riel, balbir, andrea, a.p.zijlstra,
eric.whitney, npiggin
PATCH/RFC 07/14 Reclaim Scalability: Non-reclaimable page statistics
Against: 2.6.23-rc4-mm1
Report non-reclaimable pages per zone and system wide.
Note: may want to track/report some specific reasons for
nonreclaimability for deciding when to splice the noreclaim
lists back to the normal lru. That will be tricky,
especially in shrink_active_list(), where we'd need someplace
to save the per page reason for non-reclaimability until the
pages are dumped back onto the noreclaim list from the pagevec.
Note: my tests indicate that NR_NORECLAIM and probably the
other LRU stats aren't being maintained properly--especially
with large amounts of mlocked memory and the mlock patch in
this series installed. Can't be sure of this, as I don't
know why the pages are on the noreclaim list. Needs further
investigation.
Signed-off-by: Lee Schermerhorn <lee.schermerhorn@hp.com>
drivers/base/node.c | 6 ++++++
fs/proc/proc_misc.c | 6 ++++++
mm/page_alloc.c | 16 +++++++++++++++-
mm/vmstat.c | 3 +++
4 files changed, 30 insertions(+), 1 deletion(-)
Index: Linux/mm/page_alloc.c
===================================================================
--- Linux.orig/mm/page_alloc.c 2007-09-14 10:22:05.000000000 -0400
+++ Linux/mm/page_alloc.c 2007-09-14 10:23:49.000000000 -0400
@@ -1913,10 +1913,18 @@ void show_free_areas(void)
}
}
- printk("Active:%lu inactive:%lu dirty:%lu writeback:%lu unstable:%lu\n"
+//TODO: check/adjust line lengths
+ printk("Active:%lu inactive:%lu"
+#ifdef CONFIG_NORECLAIM
+ " noreclaim:%lu"
+#endif
+ " dirty:%lu writeback:%lu unstable:%lu\n"
" free:%lu slab:%lu mapped:%lu pagetables:%lu bounce:%lu\n",
global_page_state(NR_ACTIVE),
global_page_state(NR_INACTIVE),
+#ifdef CONFIG_NORECLAIM
+ global_page_state(NR_NORECLAIM),
+#endif
global_page_state(NR_FILE_DIRTY),
global_page_state(NR_WRITEBACK),
global_page_state(NR_UNSTABLE_NFS),
@@ -1941,6 +1949,9 @@ void show_free_areas(void)
" high:%lukB"
" active:%lukB"
" inactive:%lukB"
+#ifdef CONFIG_NORECLAIM
+ " noreclaim:%lukB"
+#endif
" present:%lukB"
" pages_scanned:%lu"
" all_unreclaimable? %s"
@@ -1952,6 +1963,9 @@ void show_free_areas(void)
K(zone->pages_high),
K(zone_page_state(zone, NR_ACTIVE)),
K(zone_page_state(zone, NR_INACTIVE)),
+#ifdef CONFIG_NORECLAIM
+ K(zone_page_state(zone, NR_NORECLAIM)),
+#endif
K(zone->present_pages),
zone->pages_scanned,
(zone->all_unreclaimable ? "yes" : "no")
Index: Linux/mm/vmstat.c
===================================================================
--- Linux.orig/mm/vmstat.c 2007-09-14 10:22:05.000000000 -0400
+++ Linux/mm/vmstat.c 2007-09-14 10:23:49.000000000 -0400
@@ -686,6 +686,9 @@ static const char * const vmstat_text[]
"nr_free_pages",
"nr_inactive",
"nr_active",
+#ifdef CONFIG_NORECLAIM
+ "nr_noreclaim",
+#endif
"nr_anon_pages",
"nr_mapped",
"nr_file_pages",
Index: Linux/drivers/base/node.c
===================================================================
--- Linux.orig/drivers/base/node.c 2007-09-14 10:22:05.000000000 -0400
+++ Linux/drivers/base/node.c 2007-09-14 10:23:49.000000000 -0400
@@ -49,6 +49,9 @@ static ssize_t node_read_meminfo(struct
"Node %d MemUsed: %8lu kB\n"
"Node %d Active: %8lu kB\n"
"Node %d Inactive: %8lu kB\n"
+#ifdef CONFIG_NORECLAIM
+ "Node %d Noreclaim: %8lu kB\n"
+#endif
#ifdef CONFIG_HIGHMEM
"Node %d HighTotal: %8lu kB\n"
"Node %d HighFree: %8lu kB\n"
@@ -71,6 +74,9 @@ static ssize_t node_read_meminfo(struct
nid, K(i.totalram - i.freeram),
nid, node_page_state(nid, NR_ACTIVE),
nid, node_page_state(nid, NR_INACTIVE),
+#ifdef CONFIG_NORECLAIM
+ nid, node_page_state(nid, NR_NORECLAIM),
+#endif
#ifdef CONFIG_HIGHMEM
nid, K(i.totalhigh),
nid, K(i.freehigh),
Index: Linux/fs/proc/proc_misc.c
===================================================================
--- Linux.orig/fs/proc/proc_misc.c 2007-09-14 10:22:05.000000000 -0400
+++ Linux/fs/proc/proc_misc.c 2007-09-14 10:23:49.000000000 -0400
@@ -157,6 +157,9 @@ static int meminfo_read_proc(char *page,
"SwapCached: %8lu kB\n"
"Active: %8lu kB\n"
"Inactive: %8lu kB\n"
+#ifdef CONFIG_NORECLAIM
+ "Noreclaim: %8lu kB\n"
+#endif
#ifdef CONFIG_HIGHMEM
"HighTotal: %8lu kB\n"
"HighFree: %8lu kB\n"
@@ -187,6 +190,9 @@ static int meminfo_read_proc(char *page,
K(total_swapcache_pages),
K(global_page_state(NR_ACTIVE)),
K(global_page_state(NR_INACTIVE)),
+#ifdef CONFIG_NORECLAIM
+ K(global_page_state(NR_NORECLAIM)),
+#endif
#ifdef CONFIG_HIGHMEM
K(i.totalhigh),
K(i.freehigh),
* Re: [PATCH/RFC 7/14] Reclaim Scalability: Non-reclaimable page statistics
2007-09-14 20:54 ` [PATCH/RFC 7/14] Reclaim Scalability: Non-reclaimable page statistics Lee Schermerhorn
@ 2007-09-17 1:56 ` Rik van Riel
0 siblings, 0 replies; 77+ messages in thread
From: Rik van Riel @ 2007-09-17 1:56 UTC (permalink / raw)
To: Lee Schermerhorn
Cc: linux-mm, akpm, mel, clameter, balbir, andrea, a.p.zijlstra,
eric.whitney, npiggin
Lee Schermerhorn wrote:
> Note: my tests indicate that NR_NORECLAIM and probably the
> other LRU stats aren't being maintained properly
Interesting, I have had the same suspicion when testing
my split LRU patch. Somewhere something seems to be
going wrong...
--
Politics is the struggle between those who want to make their country
the best in the world, and those who believe it already is. Each group
calls the other unpatriotic.
* [PATCH/RFC 8/14] Reclaim Scalability: Ram Disk Pages are non-reclaimable
2007-09-14 20:53 [PATCH/RFC 0/14] Page Reclaim Scalability Lee Schermerhorn
` (6 preceding siblings ...)
2007-09-14 20:54 ` [PATCH/RFC 7/14] Reclaim Scalability: Non-reclaimable page statistics Lee Schermerhorn
@ 2007-09-14 20:54 ` Lee Schermerhorn
2007-09-17 1:57 ` Rik van Riel
2007-09-14 20:54 ` [PATCH/RFC 9/14] Reclaim Scalability: SHM_LOCKED pages are nonreclaimable Lee Schermerhorn
` (7 subsequent siblings)
15 siblings, 1 reply; 77+ messages in thread
From: Lee Schermerhorn @ 2007-09-14 20:54 UTC (permalink / raw)
To: linux-mm
Cc: akpm, mel, clameter, riel, balbir, andrea, a.p.zijlstra,
eric.whitney, npiggin
PATCH/RFC 08/14 Reclaim Scalability: Ram Disk Pages are non-reclaimable
Against: 2.6.23-rc4-mm1
Christoph Lameter pointed out that ram disk pages also clutter the
LRU lists. When vmscan finds them dirty and tries to clean them,
the ram disk writeback function just redirties the page so that it
goes back onto the active list. Round and round she goes...
Define a new address_space flag [it shares the address_space flags member
with the mapping's gfp mask] to indicate that all pages in the address
space are non-reclaimable. This provides for efficient testing of
ramdisk pages in page_reclaimable().
Also provide wrapper functions to set/test the noreclaim state to
minimize #ifdefs in ramdisk driver and any other users of this
facility.
Set the noreclaim state on address_space structures for new
ramdisk inodes. Test the noreclaim state in page_reclaimable()
to cull non-reclaimable pages.
These changes depend on [CONFIG_]NORECLAIM.
TODO: see Rik's note in mm_inline.h:page_anon() re: ramfs pages.
Should they be "wired into memory"--i.e., marked non-reclaimable
like ramdisk pages? If so, just call mapping_set_noreclaim() on
the mapping in fs/ramfs/inode.c:ramfs_get_inode(). Then we could
remove the explicit test from page_anon().
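For concreteness, a minimal sketch of what that TODO would amount to
(assuming the mapping_set_noreclaim() wrapper added by this patch; the
surrounding ramfs code is elided and may differ):

/* fs/ramfs/inode.c -- sketch only, not part of this patch */
struct inode *ramfs_get_inode(struct super_block *sb, int mode, dev_t dev)
{
	struct inode *inode = new_inode(sb);

	if (inode) {
		/* ... existing ramfs inode and mapping setup ... */
		mapping_set_noreclaim(inode->i_mapping);	/* ramfs pages are not reclaimable */
	}
	return inode;
}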
Signed-off-by: Lee Schermerhorn <lee.schermerhorn@hp.com>
drivers/block/rd.c | 5 +++++
include/linux/pagemap.h | 22 ++++++++++++++++++++++
mm/vmscan.c | 4 ++++
3 files changed, 31 insertions(+)
Index: Linux/drivers/block/rd.c
===================================================================
--- Linux.orig/drivers/block/rd.c 2007-09-14 10:22:04.000000000 -0400
+++ Linux/drivers/block/rd.c 2007-09-14 10:23:50.000000000 -0400
@@ -381,6 +381,11 @@ static int rd_open(struct inode *inode,
gfp_mask &= ~(__GFP_FS|__GFP_IO);
gfp_mask |= __GFP_HIGH;
mapping_set_gfp_mask(mapping, gfp_mask);
+
+ /*
+ * ram disk pages are not reclaimable
+ */
+ mapping_set_noreclaim(mapping);
}
return 0;
Index: Linux/include/linux/pagemap.h
===================================================================
--- Linux.orig/include/linux/pagemap.h 2007-09-14 10:22:04.000000000 -0400
+++ Linux/include/linux/pagemap.h 2007-09-14 10:23:50.000000000 -0400
@@ -30,6 +30,28 @@ static inline void mapping_set_error(str
}
}
+#ifdef CONFIG_NORECLAIM
+#define AS_NORECLAIM (__GFP_BITS_SHIFT + 2) /* e.g., ramdisk, SHM_LOCK */
+
+static inline void mapping_set_noreclaim(struct address_space *mapping)
+{
+ set_bit(AS_NORECLAIM, &mapping->flags);
+}
+
+static inline int mapping_non_reclaimable(struct address_space *mapping)
+{
+ if (mapping && (mapping->flags & AS_NORECLAIM))
+ return 1;
+ return 0;
+}
+#else
+static inline void mapping_set_noreclaim(struct address_space *mapping) { }
+static inline int mapping_non_reclaimable(struct address_space *mapping)
+{
+ return 0;
+}
+#endif
+
static inline gfp_t mapping_gfp_mask(struct address_space * mapping)
{
return (__force gfp_t)mapping->flags & __GFP_BITS_MASK;
Index: Linux/mm/vmscan.c
===================================================================
--- Linux.orig/mm/vmscan.c 2007-09-14 10:23:46.000000000 -0400
+++ Linux/mm/vmscan.c 2007-09-14 10:23:50.000000000 -0400
@@ -2164,6 +2164,7 @@ int zone_reclaim(struct zone *zone, gfp_
* If !NULL, called from fault path.
*
* Reasons page might not be reclaimable:
+ * + page's mapping marked non-reclaimable
* TODO - later patches
*
* TODO: specify locking assumptions
@@ -2173,6 +2174,9 @@ int page_reclaimable(struct page *page,
VM_BUG_ON(PageNoreclaim(page));
+ if (mapping_non_reclaimable(page_mapping(page)))
+ return 0;
+
/* TODO: test page [!]reclaimable conditions */
return 1;
* Re: [PATCH/RFC 8/14] Reclaim Scalability: Ram Disk Pages are non-reclaimable
2007-09-14 20:54 ` [PATCH/RFC 8/14] Reclaim Scalability: Ram Disk Pages are non-reclaimable Lee Schermerhorn
@ 2007-09-17 1:57 ` Rik van Riel
2007-09-17 14:40 ` Lee Schermerhorn
0 siblings, 1 reply; 77+ messages in thread
From: Rik van Riel @ 2007-09-17 1:57 UTC (permalink / raw)
To: Lee Schermerhorn
Cc: linux-mm, akpm, mel, clameter, balbir, andrea, a.p.zijlstra,
eric.whitney, npiggin
Lee Schermerhorn wrote:
> PATCH/RFC 08/14 Reclaim Scalability: Ram Disk Pages are non-reclaimable
>
> Against: 2.6.23-rc4-mm1
>
> Christoph Lameter pointed out that ram disk pages also clutter the
> LRU lists.
Agreed, these should be moved out of the way to a nonreclaimable
list.
--
Politics is the struggle between those who want to make their country
the best in the world, and those who believe it already is. Each group
calls the other unpatriotic.
* Re: [PATCH/RFC 8/14] Reclaim Scalability: Ram Disk Pages are non-reclaimable
2007-09-17 1:57 ` Rik van Riel
@ 2007-09-17 14:40 ` Lee Schermerhorn
2007-09-17 18:42 ` Christoph Lameter
0 siblings, 1 reply; 77+ messages in thread
From: Lee Schermerhorn @ 2007-09-17 14:40 UTC (permalink / raw)
To: Rik van Riel
Cc: linux-mm, akpm, mel, clameter, balbir, andrea, a.p.zijlstra,
eric.whitney, npiggin
On Sun, 2007-09-16 at 21:57 -0400, Rik van Riel wrote:
> Lee Schermerhorn wrote:
> > PATCH/RFC 08/14 Reclaim Scalability: Ram Disk Pages are non-reclaimable
> >
> > Against: 2.6.23-rc4-mm1
> >
> > Christoph Lameter pointed out that ram disk pages also clutter the
> > LRU lists.
>
> Agreed, these should be moved out of the way to a nonreclaimable
> list.
Should we also treat ramfs pages the same way? In your page_anon()
function, which I use in this series, you return '1' for ramfs pages,
indicating that they are swap-backed. But, this doesn't seem to be the
case. Looking at the ramfs code, I see that the ramfs
address_space_operations have no writepage op, so pageout() will just
reactivate the page. You do have a comment/question there about whether
these should be treated as mlocked. Mel also questions this test in a
later message.
So, I think I should just mark ramfs address space as nonreclaimable,
similar to ram disk. Do you agree?
Lee
* Re: [PATCH/RFC 8/14] Reclaim Scalability: Ram Disk Pages are non-reclaimable
2007-09-17 14:40 ` Lee Schermerhorn
@ 2007-09-17 18:42 ` Christoph Lameter
0 siblings, 0 replies; 77+ messages in thread
From: Christoph Lameter @ 2007-09-17 18:42 UTC (permalink / raw)
To: Lee Schermerhorn
Cc: Rik van Riel, linux-mm, akpm, mel, balbir, andrea, a.p.zijlstra,
eric.whitney, npiggin
On Mon, 17 Sep 2007, Lee Schermerhorn wrote:
> So, I think I should just mark ramfs address space as nonreclaimable,
> similar to ram disk. Do you agree?
Ack.
* [PATCH/RFC 9/14] Reclaim Scalability: SHM_LOCKED pages are nonreclaimable
2007-09-14 20:53 [PATCH/RFC 0/14] Page Reclaim Scalability Lee Schermerhorn
` (7 preceding siblings ...)
2007-09-14 20:54 ` [PATCH/RFC 8/14] Reclaim Scalability: Ram Disk Pages are non-reclaimable Lee Schermerhorn
@ 2007-09-14 20:54 ` Lee Schermerhorn
2007-09-17 2:18 ` Rik van Riel
2007-09-14 20:55 ` [PATCH/RFC 10/14] Reclaim Scalability: track anon_vma "related vmas" Lee Schermerhorn
` (6 subsequent siblings)
15 siblings, 1 reply; 77+ messages in thread
From: Lee Schermerhorn @ 2007-09-14 20:54 UTC (permalink / raw)
To: linux-mm
Cc: akpm, mel, clameter, riel, balbir, andrea, a.p.zijlstra,
eric.whitney, npiggin
PATCH/RFC 09/14 Reclaim Scalability: SHM_LOCKED pages are nonreclaimable
Against: 2.6.23-rc4-mm1
While working with Nick Piggin's mlock patches, I noticed that
shmem segments locked via shmctl(SHM_LOCKED) were not being handled.
SHM_LOCKed pages work like ramdisk pages--the writeback function
just redirties the page so that it can't be reclaimed. Deal with
these using the same approach as for ram disk pages.
Use the AS_NORECLAIM flag to mark address_space of SHM_LOCKed
shared memory regions as non-reclaimable. Then these pages
will be culled off the normal LRU lists during vmscan.
Add new wrapper function to clear the mapping's noreclaim state
when/if shared memory segment is munlocked.
Changes depend on [CONFIG_]NORECLAIM.
TODO: patch currently splices all zones' noreclaim lists back
onto normal LRU lists when shmem region unlocked. Could just
putback pages from this region/file--e.g., by scanning the
address space's radix tree using find_get_pages().
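A rough sketch of the per-file putback that TODO describes, built on
find_get_pages() (everything here is illustrative: TestClearPageNoreclaim()
is assumed from the page-flag patch, putback_noreclaim_page() is a
hypothetical helper that would take the zone lru_lock and move the page to
the proper LRU list, and locking against vmscan is hand-waved):

static void putback_mapping_noreclaim_pages(struct address_space *mapping)
{
	struct page *pages[PAGEVEC_SIZE];
	pgoff_t next = 0;
	unsigned int i, nr;

	while ((nr = find_get_pages(mapping, next, PAGEVEC_SIZE, pages)) != 0) {
		for (i = 0; i < nr; i++) {
			struct page *page = pages[i];

			next = page->index + 1;
			if (TestClearPageNoreclaim(page))
				putback_noreclaim_page(page);	/* hypothetical */
			page_cache_release(page);
		}
	}
}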
Signed-off-by: Lee Schermerhorn <lee.schermerhorn@hp.com>
include/linux/pagemap.h | 10 ++++++++--
mm/shmem.c | 5 +++++
2 files changed, 13 insertions(+), 2 deletions(-)
Index: Linux/mm/shmem.c
===================================================================
--- Linux.orig/mm/shmem.c 2007-09-14 10:22:03.000000000 -0400
+++ Linux/mm/shmem.c 2007-09-14 10:23:51.000000000 -0400
@@ -1357,10 +1357,15 @@ int shmem_lock(struct file *file, int lo
if (!user_shm_lock(inode->i_size, user))
goto out_nomem;
info->flags |= VM_LOCKED;
+ mapping_set_noreclaim(file->f_mapping);
}
if (!lock && (info->flags & VM_LOCKED) && user) {
user_shm_unlock(inode->i_size, user);
info->flags &= ~VM_LOCKED;
+ mapping_clear_noreclaim(file->f_mapping);
+ putback_all_noreclaim_pages();
+//TODO: could just putback pages from this file.
+// e.g., "munlock_file_pages()" using find_get_pages() ?
}
retval = 0;
out_nomem:
Index: Linux/include/linux/pagemap.h
===================================================================
--- Linux.orig/include/linux/pagemap.h 2007-09-14 10:23:50.000000000 -0400
+++ Linux/include/linux/pagemap.h 2007-09-14 10:23:51.000000000 -0400
@@ -38,14 +38,20 @@ static inline void mapping_set_noreclaim
set_bit(AS_NORECLAIM, &mapping->flags);
}
+static inline void mapping_clear_noreclaim(struct address_space *mapping)
+{
+ clear_bit(AS_NORECLAIM, &mapping->flags);
+}
+
static inline int mapping_non_reclaimable(struct address_space *mapping)
{
- if (mapping && (mapping->flags & AS_NORECLAIM))
- return 1;
+ if (mapping)
+ return test_bit(AS_NORECLAIM, &mapping->flags);
return 0;
}
#else
static inline void mapping_set_noreclaim(struct address_space *mapping) { }
+static inline void mapping_clear_noreclaim(struct address_space *mapping) { }
static inline int mapping_non_reclaimable(struct address_space *mapping)
{
return 0;
* Re: [PATCH/RFC 9/14] Reclaim Scalability: SHM_LOCKED pages are nonreclaimable
2007-09-14 20:54 ` [PATCH/RFC 9/14] Reclaim Scalability: SHM_LOCKED pages are nonreclaimable Lee Schermerhorn
@ 2007-09-17 2:18 ` Rik van Riel
0 siblings, 0 replies; 77+ messages in thread
From: Rik van Riel @ 2007-09-17 2:18 UTC (permalink / raw)
To: Lee Schermerhorn
Cc: linux-mm, akpm, mel, clameter, balbir, andrea, a.p.zijlstra,
eric.whitney, npiggin
Lee Schermerhorn wrote:
> PATCH/RFC 09/14 Reclaim Scalability: SHM_LOCKED pages are nonreclaimable
>
> Against: 2.6.23-rc4-mm1
>
> While working with Nick Piggin's mlock patches, I noticed that
> shmem segments locked via shmctl(SHM_LOCKED) were not being handled.
> SHM_LOCKed pages work like ramdisk pages--the writeback function
> just redirties the page so that it can't be reclaimed. Deal with
> these using the same approach as for ram disk pages.
Agreed, that needs to be done.
> TODO: patch currently splices all zones' noreclaim lists back
> onto normal LRU lists when shmem region unlocked. Could just
> putback pages from this region/file--e.g., by scanning the
> address space's radix tree using find_get_pages().
Yeah, I guess we'll want this :)
--
Politics is the struggle between those who want to make their country
the best in the world, and those who believe it already is. Each group
calls the other unpatriotic.
* [PATCH/RFC 10/14] Reclaim Scalability: track anon_vma "related vmas"
2007-09-14 20:53 [PATCH/RFC 0/14] Page Reclaim Scalability Lee Schermerhorn
` (8 preceding siblings ...)
2007-09-14 20:54 ` [PATCH/RFC 9/14] Reclaim Scalability: SHM_LOCKED pages are nonreclaimable Lee Schermerhorn
@ 2007-09-14 20:55 ` Lee Schermerhorn
2007-09-17 2:52 ` Rik van Riel
2007-09-14 20:55 ` [PATCH/RFC 11/14] Reclaim Scalability: swap backed pages are nonreclaimable when no swap space available Lee Schermerhorn
` (5 subsequent siblings)
15 siblings, 1 reply; 77+ messages in thread
From: Lee Schermerhorn @ 2007-09-14 20:55 UTC (permalink / raw)
To: linux-mm
Cc: akpm, mel, clameter, riel, balbir, andrea, a.p.zijlstra,
eric.whitney, npiggin
PATCH/RFC 10/14 Reclaim Scalability: track anon_vma "related vmas"
Against: 2.6.23-rc4-mm1
When a single parent forks a large number [thousands, 10s of thousands]
of children, the anon_vma list of related vmas becomes very long. In
reclaim, this list must be traversed twice--once in page_referenced_anon()
and once in try_to_unmap_anon()--under a spin lock to reclaim the page.
Multiple cpus can end up spinning behind the same anon_vma spinlock and
traversing the lists. This patch, part of the "noreclaim" series, treats
anon pages with list lengths longer than a tunable threshold as non-
reclaimable.
1) add mm Kconfig option NORECLAIM_ANON_VMA, dependent on NORECLAIM.
2) add a counter of related vmas to the anon_vma structure. This won't
increase the size of the structure on 64-bit systems, as it will fit
in a padding slot.
TODO: do we need a ref count > 4 billion?
3) In [__]anon_vma_[un]link(), track number of related vmas. The
count is only incremented/decremented while the anon_vma lock
is held, so regular, non-atomic, increment/decrement is used.
4) in page_reclaimable(), check anon_vma count in vma's anon_vma, if
vma supplied, or in page's anon_vma. In fault path, new anon pages are
placed on the LRU before adding the anon rmap, so we need to check
the vma's anon_vma. Fortunately, the vma is available at that point.
In vmscan, we can just check the page's anon_vma for any anon pages
that made it onto the [in]active list before the anon_vma list length
became "excessive".
5) make the threshold tunable via /proc/sys/vm/anon_vma_reclaim_limit.
Default value of 64 is totally arbitrary, but should be high enough
that most applications won't hit it.
Notes:
1) a separate patch makes the anon_vma lock a reader/writer lock.
This allows some parallelism--different cpus can work on different
pages that reference the same anon_vma--but this does not address the
problem of long lists and potentially many pte's to unmap.
2) TODO: do same for file rmap in address_space with excessive number
of mapping vmas?
3) Treating what are theoretically reclaimable pages as nonreclaimable
[in practice they ARE nonreclaimable] will result in oom-kill of some
tasks rather than system hang/livelock. We can debate which is
preferable. However, with these patches, Andrea Arcangeli's oom-kill
cleanups may become more important.
4) an alternate approach: rather than treat these pages as nonreclaimable,
we could track the anon_vma references and in fork() [dup_mmap()], when
the count reaches some limit, give the anon_vma to the child and its
siblings and their descendants, and allocate a new one for the parent.
This requires breaking COW sharing of all anon pages [only the parent
has complete enough state to do this at this point], as tasks can't
share pages using different anon_vmas. This will increase memory
pressure and hasten the onset of reclaim. I was working on this
alternate approach, but shelved it to try the noreclaim list approach.
Signed-off-by: Lee Schermerhorn <lee.schermerhorn@hp.com>
include/linux/rmap.h | 61 ++++++++++++++++++++++++++++++++++++++++++++++++++-
include/linux/swap.h | 3 ++
kernel/sysctl.c | 12 ++++++++++
mm/Kconfig | 11 +++++++++
mm/rmap.c | 12 ++++++++--
mm/vmscan.c | 23 +++++++++++++++++--
6 files changed, 117 insertions(+), 5 deletions(-)
Index: Linux/mm/Kconfig
===================================================================
--- Linux.orig/mm/Kconfig 2007-09-14 10:22:02.000000000 -0400
+++ Linux/mm/Kconfig 2007-09-14 10:23:52.000000000 -0400
@@ -204,3 +204,14 @@ config NORECLAIM
may be non-reclaimable because: they are locked into memory, they
are anonymous pages for which no swap space exists, or they are anon
pages that are expensive to unmap [long anon_vma "related vma" list.]
+
+config NORECLAIM_ANON_VMA
+ bool "Exclude pages with excessively long anon_vma lists"
+ depends on NORECLAIM
+ help
+ Treats anonymous pages with excessively long anon_vma lists as
+ non-reclaimable. Long anon_vma lists result from fork()ing
+ many [hundreds, thousands] of children from a single parent. The
+ anonymous pages in such tasks are very expensive [sometimes almost
+ impossible] to reclaim. Treating them as non-reclaimable avoids
+ the overhead of attempting to reclaim them.
Index: Linux/include/linux/rmap.h
===================================================================
--- Linux.orig/include/linux/rmap.h 2007-09-14 10:22:02.000000000 -0400
+++ Linux/include/linux/rmap.h 2007-09-14 10:23:52.000000000 -0400
@@ -11,6 +11,20 @@
#include <linux/memcontrol.h>
/*
+ * Optionally, limit the growth of the anon_vma list of "related" vmas
+ * to the anon_vma_reclaim_limit threshold. Add a count member
+ * to the anon_vma structure where we'd have padding on a 64-bit
+ * system w/o lock debugging.
+ */
+#ifdef CONFIG_NORECLAIM_ANON_VMA
+#define DEFAULT_ANON_VMA_RECLAIM_LIMIT 64
+#define TRACK_ANON_VMA_COUNT 1
+#else
+#define DEFAULT_ANON_VMA_RECLAIM_LIMIT 0
+#define TRACK_ANON_VMA_COUNT 0
+#endif
+
+/*
* The anon_vma heads a list of private "related" vmas, to scan if
* an anonymous page pointing to this anon_vma needs to be unmapped:
* the vmas on the list will be related by forking, or by splitting.
@@ -26,6 +40,9 @@
*/
struct anon_vma {
rwlock_t rwlock; /* Serialize access to vma list */
+#ifdef CONFIG_NORECLAIM_ANON_VMA
+ int count; /* number of "related" vmas */
+#endif
struct list_head head; /* List of private "related" vmas */
};
@@ -35,11 +52,20 @@ extern struct kmem_cache *anon_vma_cache
static inline struct anon_vma *anon_vma_alloc(void)
{
- return kmem_cache_alloc(anon_vma_cachep, GFP_KERNEL);
+ struct anon_vma *anon_vma;
+
+ anon_vma = kmem_cache_alloc(anon_vma_cachep, GFP_KERNEL);
+#ifdef CONFIG_NORECLAIM_ANON_VMA
+ if (anon_vma)
+ anon_vma->count = 0;
+#endif
+ return anon_vma;
}
static inline void anon_vma_free(struct anon_vma *anon_vma)
{
+ if (TRACK_ANON_VMA_COUNT)
+ VM_BUG_ON(anon_vma->count);
kmem_cache_free(anon_vma_cachep, anon_vma);
}
@@ -60,6 +86,39 @@ static inline void anon_vma_unlock(struc
write_unlock(&anon_vma->rwlock);
}
+#ifdef CONFIG_NORECLAIM_ANON_VMA
+
+/*
+ * Track number of "related" vmas on anon_vma list.
+ * Only called with anon_vma lock held.
+ * Note: we track related vmas on fork() and splits, but
+ * only enforce the limit on fork().
+ */
+static inline void add_related_vma(struct anon_vma *anon_vma)
+{
+ ++anon_vma->count;
+}
+
+static inline void remove_related_vma(struct anon_vma *anon_vma)
+{
+ --anon_vma->count;
+ VM_BUG_ON(anon_vma->count < 0);
+}
+
+static inline struct anon_vma *page_anon_vma(struct page *page)
+{
+ VM_BUG_ON(!PageAnon(page));
+ return (struct anon_vma *)((unsigned long)page->mapping &
+ ~PAGE_MAPPING_ANON);
+}
+
+#else
+
+#define add_related_vma(A)
+#define remove_related_vma(A)
+
+#endif
+
/*
* anon_vma helper functions.
*/
Index: Linux/mm/rmap.c
===================================================================
--- Linux.orig/mm/rmap.c 2007-09-14 10:22:02.000000000 -0400
+++ Linux/mm/rmap.c 2007-09-14 10:23:52.000000000 -0400
@@ -82,6 +82,7 @@ int anon_vma_prepare(struct vm_area_stru
if (likely(!vma->anon_vma)) {
vma->anon_vma = anon_vma;
list_add_tail(&vma->anon_vma_node, &anon_vma->head);
+ add_related_vma(anon_vma);
allocated = NULL;
}
spin_unlock(&mm->page_table_lock);
@@ -96,16 +97,21 @@ int anon_vma_prepare(struct vm_area_stru
void __anon_vma_merge(struct vm_area_struct *vma, struct vm_area_struct *next)
{
- BUG_ON(vma->anon_vma != next->anon_vma);
+ struct anon_vma *anon_vma = vma->anon_vma;
+
+ BUG_ON(anon_vma != next->anon_vma);
list_del(&next->anon_vma_node);
+ remove_related_vma(anon_vma);
}
void __anon_vma_link(struct vm_area_struct *vma)
{
struct anon_vma *anon_vma = vma->anon_vma;
- if (anon_vma)
+ if (anon_vma) {
list_add_tail(&vma->anon_vma_node, &anon_vma->head);
+ add_related_vma(anon_vma);
+ }
}
void anon_vma_link(struct vm_area_struct *vma)
@@ -115,6 +121,7 @@ void anon_vma_link(struct vm_area_struct
if (anon_vma) {
write_lock(&anon_vma->rwlock);
list_add_tail(&vma->anon_vma_node, &anon_vma->head);
+ add_related_vma(anon_vma);
write_unlock(&anon_vma->rwlock);
}
}
@@ -129,6 +136,7 @@ void anon_vma_unlink(struct vm_area_stru
write_lock(&anon_vma->rwlock);
list_del(&vma->anon_vma_node);
+ remove_related_vma(anon_vma);
/* We must garbage collect the anon_vma if it's empty */
empty = list_empty(&anon_vma->head);
Index: Linux/include/linux/swap.h
===================================================================
--- Linux.orig/include/linux/swap.h 2007-09-14 10:22:02.000000000 -0400
+++ Linux/include/linux/swap.h 2007-09-14 10:23:52.000000000 -0400
@@ -227,6 +227,9 @@ static inline int zone_reclaim(struct zo
#ifdef CONFIG_NORECLAIM
extern int page_reclaimable(struct page *page, struct vm_area_struct *vma);
extern void putback_all_noreclaim_pages(void);
+#ifdef CONFIG_NORECLAIM_ANON_VMA
+extern int anon_vma_reclaim_limit;
+#endif
#else
static inline int page_reclaimable(struct page *page,
struct vm_area_struct *vma)
Index: Linux/mm/vmscan.c
===================================================================
--- Linux.orig/mm/vmscan.c 2007-09-14 10:23:50.000000000 -0400
+++ Linux/mm/vmscan.c 2007-09-14 10:23:52.000000000 -0400
@@ -2154,6 +2154,10 @@ int zone_reclaim(struct zone *zone, gfp_
#endif
#ifdef CONFIG_NORECLAIM
+
+#ifdef CONFIG_NORECLAIM_ANON_VMA
+int anon_vma_reclaim_limit = DEFAULT_ANON_VMA_RECLAIM_LIMIT;
+#endif
/*
* page_reclaimable(struct page *page, struct vm_area_struct *vma)
* Test whether page is reclaimable--i.e., should be placed on active/inactive
@@ -2164,8 +2168,9 @@ int zone_reclaim(struct zone *zone, gfp_
* If !NULL, called from fault path.
*
* Reasons page might not be reclaimable:
- * + page's mapping marked non-reclaimable
- * TODO - later patches
+ * 1) page's mapping marked non-reclaimable
+ * 2) anon_vma [if any] has too many related vmas
+ * [more TBD. e.g., anon page and no swap available, page mlocked, ...]
*
* TODO: specify locking assumptions
*/
@@ -2177,6 +2182,20 @@ int page_reclaimable(struct page *page,
if (mapping_non_reclaimable(page_mapping(page)))
return 0;
+#ifdef CONFIG_NORECLAIM_ANON_VMA
+ if (PageAnon(page)) {
+ struct anon_vma *anon_vma;
+
+ /*
+ * anon page with too many related vmas?
+ */
+ anon_vma = page_anon_vma(page);
+ VM_BUG_ON(!anon_vma);
+ if (anon_vma_reclaim_limit &&
+ anon_vma->count > anon_vma_reclaim_limit)
+ return 0;
+ }
+#endif
/* TODO: test page [!]reclaimable conditions */
return 1;
Index: Linux/kernel/sysctl.c
===================================================================
--- Linux.orig/kernel/sysctl.c 2007-09-14 10:22:02.000000000 -0400
+++ Linux/kernel/sysctl.c 2007-09-14 10:23:52.000000000 -0400
@@ -1060,6 +1060,18 @@ static struct ctl_table vm_table[] = {
.extra1 = &zero,
},
#endif
+#ifdef CONFIG_NORECLAIM_ANON_VMA
+ {
+ .ctl_name = CTL_UNNUMBERED,
+ .procname = "anon_vma_reclaim_limit",
+ .data = &anon_vma_reclaim_limit,
+ .maxlen = sizeof(anon_vma_reclaim_limit),
+ .mode = 0644,
+ .proc_handler = &proc_dointvec,
+ .strategy = &sysctl_intvec,
+ .extra1 = &zero,
+ },
+#endif
/*
* NOTE: do not add new entries to this table unless you have read
* Documentation/sysctl/ctl_unnumbered.txt
* Re: [PATCH/RFC 10/14] Reclaim Scalability: track anon_vma "related vmas"
2007-09-14 20:55 ` [PATCH/RFC 10/14] Reclaim Scalability: track anon_vma "related vmas" Lee Schermerhorn
@ 2007-09-17 2:52 ` Rik van Riel
2007-09-17 15:52 ` Lee Schermerhorn
0 siblings, 1 reply; 77+ messages in thread
From: Rik van Riel @ 2007-09-17 2:52 UTC (permalink / raw)
To: Lee Schermerhorn
Cc: linux-mm, akpm, mel, clameter, balbir, andrea, a.p.zijlstra,
eric.whitney, npiggin
Lee Schermerhorn wrote:
> PATCH/RFC 10/14 Reclaim Scalability: track anon_vma "related vmas"
>
> Against: 2.6.23-rc4-mm1
>
> When a single parent forks a large number [thousands, 10s of thousands]
> of children, the anon_vma list of related vmas becomes very long. In
> reclaim, this list must be traversed twice--once in page_referenced_anon()
> and once in try_to_unmap_anon()--under a spin lock to reclaim the page.
> Multiple cpus can end up spinning behind the same anon_vma spinlock and
> traversing the lists. This patch, part of the "noreclaim" series, treats
> anon pages with list lengths longer than a tunable threshold as non-
> reclaimable.
I do not agree with this approach and think it is somewhat
dangerous.
If the threshold is set too high, this code has no effect.
If the threshold is too low, or an unexpectedly high number
of processes get forked (hey, now we *need* to swap something
out), the system goes out of memory.
I would rather we reduce the amount of work we need to do in
selecting what to page out in a different way, eg. by doing
SEQ replacement for anonymous pages.
I will cook up a patch implementing that other approach in a
way that it will fit into your patch series, since the rest
of the series (so far) looks good to me.
*takes out the chainsaw to cut up his patch*
--
Politics is the struggle between those who want to make their country
the best in the world, and those who believe it already is. Each group
calls the other unpatriotic.
* Re: [PATCH/RFC 10/14] Reclaim Scalability: track anon_vma "related vmas"
2007-09-17 2:52 ` Rik van Riel
@ 2007-09-17 15:52 ` Lee Schermerhorn
0 siblings, 0 replies; 77+ messages in thread
From: Lee Schermerhorn @ 2007-09-17 15:52 UTC (permalink / raw)
To: Rik van Riel
Cc: linux-mm, akpm, mel, clameter, balbir, andrea, a.p.zijlstra,
eric.whitney, npiggin
On Sun, 2007-09-16 at 22:52 -0400, Rik van Riel wrote:
> Lee Schermerhorn wrote:
> > PATCH/RFC 10/14 Reclaim Scalability: track anon_vma "related vmas"
> >
> > Against: 2.6.23-rc4-mm1
> >
> > When a single parent forks a large number [thousands, 10s of thousands]
> > of children, the anon_vma list of related vmas becomes very long. In
> > reclaim, this list must be traversed twice--once in page_referenced_anon()
> > and once in try_to_unmap_anon()--under a spin lock to reclaim the page.
> > Multiple cpus can end up spinning behind the same anon_vma spinlock and
> > traversing the lists. This patch, part of the "noreclaim" series, treats
> > anon pages with list lengths longer than a tunable threshold as non-
> > reclaimable.
>
> I do not agree with this approach and think it is somewhat
> dangerous.
>
> If the threshold is set too high, this code has no effect.
>
> If the threshold is too low, or an unexpectedly high number
> of processes get forked (hey, now we *need* to swap something
> out), the system goes out of memory.
>
> I would rather we reduce the amount of work we need to do in
> selecting what to page out in a different way, eg. by doing
> SEQ replacement for anonymous pages.
>
> I will cook up a patch implementing that other approach in a
> way that it will fit into your patch series, since the rest
> of the series (so far) looks good to me.
>
> *takes out the chainsaw to cut up his patch*
>
I do understand your revulsion to this patch. In our testing [AIM7], it
behaves exactly as you say--instead of spinning trying to unmap the anon
pages whose anon_vma lists are "excessive"--the system starts killing
off tasks. It would be nice to have a better way to handle these.
While you're thinking about it, a couple of things to consider:
1) I think we don't want vmscan to spend a lot of time trying to reclaim
these pages when/if there are other, more easily reclaimable pages on
the lists. That is sort of my rationale for stuffing them on the
noreclaim list. I think any approach should stick these pages aside
somewhere--maybe just back on the end of the list, but that's behavior
I'm trying to eliminate/reduce--and only attempt to reclaim them as a last
resort.
2) If the system gets into enough trouble that these are the only
reclaimable pages, I think we're pretty close to totally hosed anyway.
Lee
* [PATCH/RFC 11/14] Reclaim Scalability: swap backed pages are nonreclaimable when no swap space available
2007-09-14 20:53 [PATCH/RFC 0/14] Page Reclaim Scalability Lee Schermerhorn
` (9 preceding siblings ...)
2007-09-14 20:55 ` [PATCH/RFC 10/14] Reclaim Scalability: track anon_vma "related vmas" Lee Schermerhorn
@ 2007-09-14 20:55 ` Lee Schermerhorn
2007-09-17 2:53 ` Rik van Riel
2007-09-18 2:59 ` KAMEZAWA Hiroyuki
2007-09-14 20:55 ` [PATCH/RFC 12/14] Reclaim Scalability: Non-reclaimable Mlock'ed pages Lee Schermerhorn
` (4 subsequent siblings)
15 siblings, 2 replies; 77+ messages in thread
From: Lee Schermerhorn @ 2007-09-14 20:55 UTC (permalink / raw)
To: linux-mm
Cc: akpm, mel, clameter, riel, balbir, andrea, a.p.zijlstra,
eric.whitney, npiggin
PATCH/RFC 11/14 Reclaim Scalability: treat swap backed pages as
non-reclaimable when no swap space is available.
Against: 2.6.23-rc4-mm1
Move swap backed pages [anon, shmem/tmpfs] to noreclaim list when
nr_swap_pages goes to zero. Use Rik van Riel's page_anon()
function in page_reclaimable() to detect swap backed pages.
Depends on NORECLAIM_NO_SWAP Kconfig sub-option of NORECLAIM
TODO: Splice zones' noreclaim list when "sufficient" swap becomes
available--either by being freed by other pages or by additional
swap being added. How much is "sufficient" swap? Don't want to
splice huge noreclaim lists every time a swap page gets freed.
Might want to track per zone "unswappable" pages as a separate
statistic to make intelligent decisions here. That will complicate
page_reclaimable() and non-reclaimable page culling in vmscan. E.g.,
where to keep "reason" while page is on the "hold" list? Not
necessary if we don't cull in shrink_active_list(), but then we get
to visit non-reclaimable pages more often.
Signed-off-by: Lee Schermerhorn <lee.schermerhorn@hp.com>
mm/Kconfig | 11 +++++++++++
mm/vmscan.c | 9 +++++++--
2 files changed, 18 insertions(+), 2 deletions(-)
Index: Linux/mm/Kconfig
===================================================================
--- Linux.orig/mm/Kconfig 2007-09-14 10:23:52.000000000 -0400
+++ Linux/mm/Kconfig 2007-09-14 10:23:53.000000000 -0400
@@ -215,3 +215,14 @@ config NORECLAIM_ANON_VMA
anonymous pages in such tasks are very expensive [sometimes almost
impossible] to reclaim. Treating them as non-reclaimable avoids
the overhead of attempting to reclaim them.
+
+config NORECLAIM_NO_SWAP
+ bool "Exclude anon/shmem pages when no swap space available"
+ depends on NORECLAIM
+ help
+ Treats swap backed pages [anonymous, shmem, tmpfs] as non-reclaimable
+ when no swap space exists. Removing these pages from the LRU lists
+ avoids the overhead of attempting to reclaim them. Pages marked
+ non-reclaimable for this reason will become reclaimable again when/if
+ sufficient swap space is added to the system.
+
Index: Linux/mm/vmscan.c
===================================================================
--- Linux.orig/mm/vmscan.c 2007-09-14 10:23:52.000000000 -0400
+++ Linux/mm/vmscan.c 2007-09-14 10:23:53.000000000 -0400
@@ -2169,8 +2169,9 @@ int anon_vma_reclaim_limit = DEFAULT_ANO
*
* Reasons page might not be reclaimable:
* 1) page's mapping marked non-reclaimable
- * 2) anon_vma [if any] has too many related vmas
- * [more TBD. e.g., anon page and no swap available, page mlocked, ...]
+ * 2) anon/shmem/tmpfs page, but no swap space avail
+ * 3) anon_vma [if any] has too many related vmas
+ * [more TBD. e.g., page mlocked, ...]
*
* TODO: specify locking assumptions
*/
@@ -2182,6 +2183,10 @@ int page_reclaimable(struct page *page,
if (mapping_non_reclaimable(page_mapping(page)))
return 0;
+#ifdef CONFIG_NORECLAIM_NO_SWAP
+ if (page_anon(page) && !PageSwapCache(page) && !nr_swap_pages)
+ return 0;
+#endif
#ifdef CONFIG_NORECLAIM_ANON_VMA
if (PageAnon(page)) {
struct anon_vma *anon_vma;
* Re: [PATCH/RFC 11/14] Reclaim Scalability: swap backed pages are nonreclaimable when no swap space available
2007-09-14 20:55 ` [PATCH/RFC 11/14] Reclaim Scalability: swap backed pages are nonreclaimable when no swap space available Lee Schermerhorn
@ 2007-09-17 2:53 ` Rik van Riel
2007-09-18 17:46 ` Lee Schermerhorn
2007-09-18 2:59 ` KAMEZAWA Hiroyuki
1 sibling, 1 reply; 77+ messages in thread
From: Rik van Riel @ 2007-09-17 2:53 UTC (permalink / raw)
To: Lee Schermerhorn
Cc: linux-mm, akpm, mel, clameter, balbir, andrea, a.p.zijlstra,
eric.whitney, npiggin
Lee Schermerhorn wrote:
> PATCH/RFC 11/14 Reclaim Scalability: treat swap backed pages as
> non-reclaimable when no swap space is available.
>
> Against: 2.6.23-rc4-mm1
>
> Move swap backed pages [anon, shmem/tmpfs] to noreclaim list when
> nr_swap_pages goes to zero. Use Rik van Riel's page_anon()
> function in page_reclaimable() to detect swap backed pages.
>
> Depends on NORECLAIM_NO_SWAP Kconfig sub-option of NORECLAIM
>
> TODO: Splice zones' noreclaim list when "sufficient" swap becomes
> available--either by being freed by other pages or by additional
> swap being added. How much is "sufficient" swap? Don't want to
> splice huge noreclaim lists every time a swap page gets freed.
Yet another reason for my LRU list split between filesystem
backed and swap backed pages: we can simply stop scanning the
anon lists when swap space is full and resume scanning when
swap space becomes available.
--
Politics is the struggle between those who want to make their country
the best in the world, and those who believe it already is. Each group
calls the other unpatriotic.
* Re: [PATCH/RFC 11/14] Reclaim Scalability: swap backed pages are nonreclaimable when no swap space available
2007-09-17 2:53 ` Rik van Riel
@ 2007-09-18 17:46 ` Lee Schermerhorn
2007-09-18 20:01 ` Rik van Riel
0 siblings, 1 reply; 77+ messages in thread
From: Lee Schermerhorn @ 2007-09-18 17:46 UTC (permalink / raw)
To: Rik van Riel
Cc: linux-mm, akpm, mel, clameter, balbir, andrea, a.p.zijlstra,
eric.whitney, npiggin
On Sun, 2007-09-16 at 22:53 -0400, Rik van Riel wrote:
> Lee Schermerhorn wrote:
> > PATCH/RFC 11/14 Reclaim Scalability: treat swap backed pages as
> > non-reclaimable when no swap space is available.
> >
> > Against: 2.6.23-rc4-mm1
> >
> > Move swap backed pages [anon, shmem/tmpfs] to noreclaim list when
> > nr_swap_pages goes to zero. Use Rik van Riel's page_anon()
> > function in page_reclaimable() to detect swap backed pages.
> >
> > Depends on NORECLAIM_NO_SWAP Kconfig sub-option of NORECLAIM
> >
> > TODO: Splice zones' noreclaim list when "sufficient" swap becomes
> > available--either by being freed by other pages or by additional
> > swap being added. How much is "sufficient" swap? Don't want to
> > splice huge noreclaim lists every time a swap page gets freed.
>
> Yet another reason for my LRU list split between filesystem
> backed and swap backed pages: we can simply stop scanning the
> anon lists when swap space is full and resume scanning when
> swap space becomes available.
Hi, Rik:
It occurs to me that we probably don't want to stop scanning the anon
lists [active/inactive] when swap space is full. We might have LOTS of
anon pages that already have swap space allocated to them that can be
reclaimed. It's just those that don't already have swap space that
aren't reclaimable until more swap space becomes available.
Or did I misunderstand you?
Hmmm, swap in [do_swap_page()] will free swap if it's > 1/2 full, so
that would reenable scanning the anon lists. Still, wouldn't it be
possible to have resident PageSwapCache() pages that we'd want to find
and reclaim under memory pressure, even when !nr_swap_pages?
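For reference, the "> 1/2 full" heuristic alluded to here is, if memory
serves, the vm_swap_full() test used on the swap-in path:

/* include/linux/swap.h: swap more than half full? */
#define vm_swap_full() (nr_swap_pages*2 < total_swap_pages)

/* mm/memory.c:do_swap_page(), roughly: */
	swap_free(entry);
	if (vm_swap_full())
		remove_exclusive_swap_page(page);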
Lee
* Re: [PATCH/RFC 11/14] Reclaim Scalability: swap backed pages are nonreclaimable when no swap space available
2007-09-18 17:46 ` Lee Schermerhorn
@ 2007-09-18 20:01 ` Rik van Riel
2007-09-19 14:55 ` Lee Schermerhorn
0 siblings, 1 reply; 77+ messages in thread
From: Rik van Riel @ 2007-09-18 20:01 UTC (permalink / raw)
To: Lee Schermerhorn
Cc: linux-mm, akpm, mel, clameter, balbir, andrea, a.p.zijlstra,
eric.whitney, npiggin
Lee Schermerhorn wrote:
> On Sun, 2007-09-16 at 22:53 -0400, Rik van Riel wrote:
>> Lee Schermerhorn wrote:
>>> PATCH/RFC 11/14 Reclaim Scalability: treat swap backed pages as
>>> non-reclaimable when no swap space is available.
>>>
>>> Against: 2.6.23-rc4-mm1
>>>
>>> Move swap backed pages [anon, shmem/tmpfs] to noreclaim list when
>>> nr_swap_pages goes to zero. Use Rik van Riel's page_anon()
>>> function in page_reclaimable() to detect swap backed pages.
>>>
>>> Depends on NORECLAIM_NO_SWAP Kconfig sub-option of NORECLAIM
>>>
>>> TODO: Splice zones' noreclaim list when "sufficient" swap becomes
>>> available--either by being freed by other pages or by additional
>>> swap being added. How much is "sufficient" swap? Don't want to
>>> splice huge noreclaim lists every time a swap page gets freed.
>> Yet another reason for my LRU list split between filesystem
>> backed and swap backed pages: we can simply stop scanning the
>> anon lists when swap space is full and resume scanning when
>> swap space becomes available.
>
>
> Hi, Rik:
>
> It occurs to me that we probably don't want to stop scanning the anon
> lists [active/inactive] when swap space is full. We might have LOTS of
> anon pages that already have swap space allocated to them that can be
> reclaimed. It's just those that don't already have swap space that
> aren't reclaimable until more swap space becomes available.
Well, "lots" is a relative thing.
If we run into those pages in our normal course of scanning,
we should free the swap space.
If swap space finally ran out, I suspect we should just give
up.
If you have a system with 128GB RAM and 2GB swap, it really
does not make a lot of sense to scan all the way through 90GB
of anonymous pages to free maybe 1GB of swap...
If swap is large, we can free swap space during the normal
LRU scanning, before we completely run out.
--
All Rights Reversed
^ permalink raw reply [flat|nested] 77+ messages in thread
* Re: [PATCH/RFC 11/14] Reclaim Scalability: swap backed pages are nonreclaimable when no swap space available
2007-09-18 20:01 ` Rik van Riel
@ 2007-09-19 14:55 ` Lee Schermerhorn
0 siblings, 0 replies; 77+ messages in thread
From: Lee Schermerhorn @ 2007-09-19 14:55 UTC (permalink / raw)
To: Rik van Riel
Cc: linux-mm, akpm, mel, clameter, balbir, andrea, a.p.zijlstra,
eric.whitney, npiggin
On Tue, 2007-09-18 at 16:01 -0400, Rik van Riel wrote:
> Lee Schermerhorn wrote:
> > On Sun, 2007-09-16 at 22:53 -0400, Rik van Riel wrote:
> >> Lee Schermerhorn wrote:
> >>> PATCH/RFC 11/14 Reclaim Scalability: treat swap backed pages as
> >>> non-reclaimable when no swap space is available.
> >>>
> >>> Against: 2.6.23-rc4-mm1
> >>>
> >>> Move swap backed pages [anon, shmem/tmpfs] to noreclaim list when
> >>> nr_swap_pages goes to zero. Use Rik van Riel's page_anon()
> >>> function in page_reclaimable() to detect swap backed pages.
> >>>
> >>> Depends on NORECLAIM_NO_SWAP Kconfig sub-option of NORECLAIM
> >>>
> >>> TODO: Splice zones' noreclaim list when "sufficient" swap becomes
> >>> available--either by being freed by other pages or by additional
> >>> swap being added. How much is "sufficient" swap? Don't want to
> >>> splice huge noreclaim lists every time a swap page gets freed.
> >> Yet another reason for my LRU list split between filesystem
> >> backed and swap backed pages: we can simply stop scanning the
> >> anon lists when swap space is full and resume scanning when
> >> swap space becomes available.
> >
> >
> > Hi, Rik:
> >
> > It occurs to me that we probably don't want to stop scanning the anon
> > lists [active/inactive] when swap space is full. We might have LOTS of
> > anon pages that already have swap space allocated to them that can be
> > reclaimed. It's just those that don't already have swap space that
> > aren't reclaimable until more swap space becomes available.
>
> Well, "lots" is a relative thing.
Agreed. See below.
>
> If we run into those pages in our normal course of scanning,
> we should free the swap space.
>
> If swap space finally ran out, I suspect we should just give
> up.
>
> If you have a system with 128GB RAM and 2GB swap, it really
> does not make a lot of sense to scan all the way through 90GB
> of anonymous pages to free maybe 1GB of swap...
I agree. However:
1) that's the reason I'm putting swap-backed pages that are in excess of
available swap space on a noreclaim list: so that only reclaimable
pages end up on the [anon] lru list.
2) consider the case of 128GB RAM and 64GB swap: that's plenty of swap
space to make scanning of anon pages worthwhile. But, if we can avoid
scanning the other 26GB [your "90GB" of anon less the 64GB of swappable
anon] in the process, scanning will be more efficient, I think. Theory
needs testing, of course.
>
> If swap is large, we can free swap space during the normal
> LRU scanning, before we completely run out.
If this works--we keep sufficient swap free during scanning--we'll never
declare anon/shmem/tmpfs pages non-reclaimable due to lack of swap
space.  If it doesn't, we can still move the non-reclaimable ones
aside--if that's a performance win overall. This depends on how
efficiently we can bring "unswappable" pages back from noreclaim-land.
Later,
Lee
^ permalink raw reply [flat|nested] 77+ messages in thread
* Re: [PATCH/RFC 11/14] Reclaim Scalability: swap backed pages are nonreclaimable when no swap space available
2007-09-14 20:55 ` [PATCH/RFC 11/14] Reclaim Scalability: swap backed pages are nonreclaimable when no swap space available Lee Schermerhorn
2007-09-17 2:53 ` Rik van Riel
@ 2007-09-18 2:59 ` KAMEZAWA Hiroyuki
2007-09-18 15:47 ` Lee Schermerhorn
1 sibling, 1 reply; 77+ messages in thread
From: KAMEZAWA Hiroyuki @ 2007-09-18 2:59 UTC (permalink / raw)
To: Lee Schermerhorn
Cc: linux-mm, akpm, mel, clameter, riel, balbir, andrea, a.p.zijlstra,
eric.whitney, npiggin
On Fri, 14 Sep 2007 16:55:12 -0400
Lee Schermerhorn <lee.schermerhorn@hp.com> wrote:
> +#ifdef CONFIG_NORECLAIM_NO_SWAP
> + if (page_anon(page) && !PageSwapCache(page) && !nr_swap_pages)
> + return 0;
> +#endif
nr_swap_pages depends on CONFIG_SWAP.
So I recommend using total_swap_pages instead (if CONFIG_SWAP=n, total_swap_pages is
compiled to be 0).
==
if (!total_swap_pages && page_anon(page))
return 0;
==
By the way, nr_swap_pages is "# of currently available swap pages".
Should this page really be put on the NORECLAIM list?  This number can
easily become > 0 again.
Cheers,
-Kame
^ permalink raw reply [flat|nested] 77+ messages in thread
* Re: [PATCH/RFC 11/14] Reclaim Scalability: swap backed pages are nonreclaimable when no swap space available
2007-09-18 2:59 ` KAMEZAWA Hiroyuki
@ 2007-09-18 15:47 ` Lee Schermerhorn
0 siblings, 0 replies; 77+ messages in thread
From: Lee Schermerhorn @ 2007-09-18 15:47 UTC (permalink / raw)
To: KAMEZAWA Hiroyuki
Cc: linux-mm, akpm, mel, clameter, riel, balbir, andrea, a.p.zijlstra,
eric.whitney, npiggin
On Tue, 2007-09-18 at 11:59 +0900, KAMEZAWA Hiroyuki wrote:
> On Fri, 14 Sep 2007 16:55:12 -0400
> Lee Schermerhorn <lee.schermerhorn@hp.com> wrote:
>
> > +#ifdef CONFIG_NORECLAIM_NO_SWAP
> > + if (page_anon(page) && !PageSwapCache(page) && !nr_swap_pages)
> > + return 0;
> > +#endif
>
> nr_swap_pages depends on CONFIG_SWAP.
I didn't think that was the case [see definition in page_alloc.c and
extern declaration in swap.h]. If this is the case, I'll have to change
that.
>
> So I recommend using total_swap_pages instead (if CONFIG_SWAP=n, total_swap_pages is
> compiled to be 0).
total_swap_pages is not appropriate in this context. total_swap_pages
can be non-zero, but we can have MANY more swap-backed pages than we
have room for. So, we want to declare any such pages as non-reclaimable
once the existing swap space has been exhausted. That's the theory,
anyway.
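For reference, a sketch of the distinction being discussed; page_anon()
is the helper from Rik's split-LRU work, and the wrapper name below is
illustrative only, not code from the posted series:

/*
 * total_swap_pages == 0 means no swap is configured at all;
 * nr_swap_pages == 0 means all configured swap is currently in use.
 * The posted patch keys off the latter, so that swap-backed pages in
 * excess of available swap get parked on the noreclaim list.
 */
static inline int swap_backed_page_reclaimable(struct page *page)
{
	if (!page_anon(page))
		return 1;	/* file-backed pages don't need swap */
	if (PageSwapCache(page))
		return 1;	/* already has a swap slot assigned */
	return nr_swap_pages > 0;
}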
>
> ==
> if (!total_swap_pages && page_anon(page))
> return 0;
> ==
> By the way, nr_swap_pages is "# of currently available swap pages".
> Should this page really be put on the NORECLAIM list?  This number can
> easily become > 0 again.
Right. This means we need to come up with a way to bring pages back
from the noreclaim list when swap becomes available. This is currently
an unsolved problem--the noreclaim series IS a work in progress :-).
Now, Rik vR has a patch that I've kept around in another tree, that
frees swap space when pages are swapped in, if we're under "swap
pressure" [swap space > 1/2 full]. We might want to add this patch
into the mix and, perhaps, keep the various types of non-reclaimable
pages on different lists--e.g., in this case, the unswappable list.
Then, if the list is non-empty when we free a page of swap space, we can
bring back one page from the "unswappable" list. Thus, we'd rotate
pages through the unswappable noreclaim state so as to not penalize
late-comers after swap space has all been allocated.
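A rough sketch of that rotation idea (zone->unswappable_list and the
function below are hypothetical, not part of the posted patches, and
per-zone page accounting is omitted):

/*
 * Hypothetical: when a swap entry is freed, pull one page back off a
 * per-zone "unswappable" noreclaim list so vmscan can re-evaluate it.
 */
static void rotate_one_unswappable_page(struct zone *zone)
{
	struct page *page;

	spin_lock_irq(&zone->lru_lock);
	if (!list_empty(&zone->unswappable_list)) {
		page = list_entry(zone->unswappable_list.prev,
					struct page, lru);
		list_del(&page->lru);
		ClearPageNoreclaim(page);
		/* let reclaim look at it again */
		list_add(&page->lru, &zone->inactive_list);
	}
	spin_unlock_irq(&zone->lru_lock);
}

swap_free() [or a wrapper around it] would call this whenever freeing
an entry drops swap usage below whatever threshold we settle on.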
Again, I haven't gotten there yet, and am open to suggestions, patches, ...
Thanks,
Lee
^ permalink raw reply [flat|nested] 77+ messages in thread
* [PATCH/RFC 12/14] Reclaim Scalability: Non-reclaimable Mlock'ed pages
2007-09-14 20:53 [PATCH/RFC 0/14] Page Reclaim Scalability Lee Schermerhorn
` (10 preceding siblings ...)
2007-09-14 20:55 ` [PATCH/RFC 11/14] Reclaim Scalability: swap backed pages are nonreclaimable when no swap space available Lee Schermerhorn
@ 2007-09-14 20:55 ` Lee Schermerhorn
2007-09-14 20:55 ` [PATCH/RFC 13/14] Reclaim Scalability: Handle Mlock'ed pages during map/unmap and truncate Lee Schermerhorn
` (3 subsequent siblings)
15 siblings, 0 replies; 77+ messages in thread
From: Lee Schermerhorn @ 2007-09-14 20:55 UTC (permalink / raw)
To: linux-mm
Cc: akpm, mel, clameter, riel, balbir, andrea, a.p.zijlstra,
eric.whitney, npiggin
PATCH/RFC 12/14 Reclaim Scalability: Non-reclaimable Mlock'ed pages
Against: 2.6.23-rc4-mm1
Rework of a patch by Nick Piggin -- part 1 of 2.
This patch:
1) defines the [CONFIG_]NORECLAIM_MLOCK sub-option and the
stub version of the mlock/noreclaim APIs when it's
not configured. Depends on [CONFIG_]NORECLAIM.
2) add yet another page flag--PG_mlocked--to indicate that
the page is locked for efficient testing in vmscan and,
optionally, fault path. This allows early culling of
nonreclaimable pages, preventing them from getting to
page_referenced()/try_to_unmap().
Uses a bit available only to 64-bit systems.
3) add the mlock/noreclaim infrastructure to mm/mlock.c,
with internal APIs in mm/internal.h. This is a rework
of Nick's original patch to these files, taking into
account that mlocked pages are now kept on noreclaim
LRU list.
4) update vmscan.c:page_reclaimable() to check PageMlocked()
and, if vma passed in, the vm_flags. Note that the vma
will only be passed in for new pages in the fault path;
and then only if the "cull nonreclaimable pages in fault
path" patch is included.
5) add try_to_unlock() to rmap.c to walk a page's rmap and
ClearPageMlocked() if no other vmas have it mlocked.
Reuses as much of try_to_unmap() as possible. This
effectively replaces the use of one of the lru list links
as an mlock count.  If this mechanism lets pages in mlocked
vmas leak through w/o PG_mlocked set [I don't know that it
does], we should catch them later in try_to_unmap(). One
hopes this will be rare, as it will be relatively expensive.
Note: most of my recent testing of the noreclaim infrastructure
has been with mlocked pages. I'm seeing gigabytes of memory
left nonreclaimable, according to my vmstats, when the tests
finish. I don't know if this is just a statistics glitch, or
if I'm leaking mlocked pages. Under investigation.
mm/internal.h and mm/mlock.c changes:
Originally Signed-off-by: Nick Piggin <npiggin@suse.de>
Signed-off-by: Lee Schermerhorn <lee.schermerhorn@hp.com>
include/linux/page-flags.h | 21 +++-
include/linux/rmap.h | 12 ++
mm/Kconfig | 13 ++
mm/internal.h | 50 +++++++++
mm/migrate.c | 2
mm/mlock.c | 227 ++++++++++++++++++++++++++++++++++++++++++---
mm/rmap.c | 167 ++++++++++++++++++++++++++++-----
mm/vmscan.c | 9 +
8 files changed, 459 insertions(+), 42 deletions(-)
Index: Linux/mm/Kconfig
===================================================================
--- Linux.orig/mm/Kconfig 2007-09-14 10:23:53.000000000 -0400
+++ Linux/mm/Kconfig 2007-09-14 10:23:55.000000000 -0400
@@ -226,3 +226,16 @@ config NORECLAIM_NO_SWAP
non-reclaimable for this reason will become reclaimable again when/if
sufficient swap space is added to the system.
+config NORECLAIM_MLOCK
+ bool "Exclude mlock'ed pages from reclaim"
+ depends on NORECLAIM
+ help
+ Treats mlock'ed pages as non-reclaimable. Removing these pages from
+ the LRU [in]active lists avoids the overhead of attempting to reclaim
+ them. Pages marked non-reclaimable for this reason will become
+ reclaimable again when the last mlock is removed.
+
Index: Linux/mm/internal.h
===================================================================
--- Linux.orig/mm/internal.h 2007-09-14 10:17:54.000000000 -0400
+++ Linux/mm/internal.h 2007-09-14 10:23:55.000000000 -0400
@@ -36,6 +36,56 @@ static inline void __put_page(struct pag
extern int isolate_lru_page(struct page *page);
+#ifdef CONFIG_NORECLAIM_MLOCK
+/*
+ * called only for new pages in fault path
+ */
+extern int is_mlocked_vma(struct vm_area_struct *, struct page *);
+
+/*
+ * must be called with vma's mmap_sem held for read, and page locked.
+ */
+extern void mlock_vma_page(struct page *page);
+
+extern int __mlock_vma_pages_range(struct vm_area_struct *vma,
+ unsigned long start, unsigned long end, int lock);
+
+/*
+ * mlock all pages in this vma range. For mmap()/mremap()/...
+ */
+static inline void mlock_vma_pages_range(struct vm_area_struct *vma,
+ unsigned long start, unsigned long end)
+{
+ __mlock_vma_pages_range(vma, start, end, 1);
+}
+
+/*
+ * munlock range of pages. For munmap() and exit().
+ */
+static inline void munlock_vma_pages_range(struct vm_area_struct *vma,
+ unsigned long start, unsigned long end)
+{
+ vma->vm_flags &= ~VM_LOCKED; /* try_to_unlock() needs this */
+ __mlock_vma_pages_range(vma, start, end, 0);
+}
+
+extern void clear_page_mlock(struct page *page);
+
+#else /* CONFIG_NORECLAIM_MLOCK */
+static inline int is_mlocked_vma(struct vm_area_struct *v, struct page *p)
+{
+ return 0;
+}
+static inline void clear_page_mlock(struct page *page) { }
+static inline void mlock_vma_page(struct page *page) { }
+static inline void mlock_vma_pages_range(struct vm_area_struct *vma,
+ unsigned long start, unsigned long end) { }
+static inline void munlock_vma_pages_range(struct vm_area_struct *vma,
+ unsigned long start, unsigned long end) { }
+
+#endif /* CONFIG_NORECLAIM_MLOCK */
+
+
extern void fastcall __init __free_pages_bootmem(struct page *page,
unsigned int order);
Index: Linux/mm/mlock.c
===================================================================
--- Linux.orig/mm/mlock.c 2007-09-14 10:17:54.000000000 -0400
+++ Linux/mm/mlock.c 2007-09-14 10:23:55.000000000 -0400
@@ -8,10 +8,16 @@
#include <linux/capability.h>
#include <linux/mman.h>
#include <linux/mm.h>
+#include <linux/swap.h>
+#include <linux/pagemap.h>
#include <linux/mempolicy.h>
#include <linux/syscalls.h>
#include <linux/sched.h>
#include <linux/module.h>
+#include <linux/rmap.h>
+#include <linux/mmzone.h>
+
+#include "internal.h"
int can_do_mlock(void)
{
@@ -23,19 +29,213 @@ int can_do_mlock(void)
}
EXPORT_SYMBOL(can_do_mlock);
+#ifdef CONFIG_NORECLAIM_MLOCK
+/*
+ * Mlocked pages are marked with PageMlocked() flag for efficient testing
+ * in vmscan and, possibly, the fault path.
+ *
+ * An mlocked page [PageMlocked(page)] is non-reclaimable. As such, it will
+ * be placed on the LRU "noreclaim" list, rather than the [in]active lists.
+ * The noreclaim list is an LRU sibling list to the [in]active lists.
+ * PageNoreclaim is set to indicate the non-reclaimable state.
+ *
+//TODO: no longer counting, but does this still apply to lazy setting
+// of PageMlocked() ??
+ * When lazy incrementing via vmscan, it is important to ensure that the
+ * vma's VM_LOCKED status is not concurrently being modified, otherwise we
+ * may have elevated mlock_count of a page that is being munlocked. So lazy
+ * mlocked must take the mmap_sem for read, and verify that the vma really
+ * is locked (see mm/rmap.c).
+ */
+
+/*
+ * add isolated page to appropriate LRU list, adjusting stats as needed.
+ * Page may still be non-reclaimable for other reasons.
+//TODO: move to vmscan.c as global along with isolate_lru_page()?
+ */
+static void putback_lru_page(struct page *page)
+{
+ struct zone *zone = page_zone(page);
+ VM_BUG_ON(PageLRU(page));
+
+ ClearPageNoreclaim(page);
+ ClearPageActive(page);
+ lru_cache_add_active_or_noreclaim(page, NULL);
+}
+
+/*
+ * Clear the page's PageMlocked(). This can be useful in a situation where
+ * we want to unconditionally remove a page from the pagecache.
+ *
+ * It is legal to call this function for any page, mlocked or not.
+ * If called for a page that is still mapped by mlocked vmas, all we do
+ * is revert to lazy LRU behaviour -- semantics are not broken.
+ */
+void clear_page_mlock(struct page *page)
+{
+ BUG_ON(!PageLocked(page));
+
+ if (likely(!PageMlocked(page)))
+ return;
+ ClearPageMlocked(page);
+ if (!isolate_lru_page(page))
+ putback_lru_page(page);
+}
+
+/*
+ * Mark page as mlocked if not already.
+ * If page on LRU, isolate and putback to move to noreclaim list.
+ */
+void mlock_vma_page(struct page *page)
+{
+ BUG_ON(!PageLocked(page));
+
+ if (!TestSetPageMlocked(page) && !isolate_lru_page(page))
+ putback_lru_page(page);
+}
+
+/*
+ * called from munlock()/munmap() path.
+ * If page on LRU, isolate and putback to move from noreclaim list.
+ */
+static void munlock_vma_page(struct page *page)
+{
+ int ret;
+ BUG_ON(!PageLocked(page));
+
+ if (PageMlocked(page)) {
+ ret = try_to_unlock(page); /* walks rmap */
+ if (ret != SWAP_MLOCK && !isolate_lru_page(page))
+ putback_lru_page(page);
+ }
+}
+
+/*
+ * Called in fault path via page_reclaimable() for a new page
+ * to determine if it's being mapped into a LOCKED vma.
+ * If so, mark page as mlocked.
+ * TODO: do I really need to try to lock the page? We have added
+ * the new page to the rmap before calling page_reclaimable().
+ * Could another task have found it? If not, no need to
+ * [try to] lock page here.
+ * Also, we're just setting a page flag now.
+ */
+int is_mlocked_vma(struct vm_area_struct *vma, struct page *page)
+{
+ VM_BUG_ON(PageMlocked(page)); // TODO: needed?
+ VM_BUG_ON(PageLRU(page));
+
+ if (likely(!(vma->vm_flags & VM_LOCKED)) || TestSetPageLocked(page))
+ return 0;
+
+ SetPageMlocked(page);
+ unlock_page(page);
+ return 1;
+}
+
+/*
+ * mlock or munlock a range of pages in the vma depending on whether
+ * @lock is 1 or 0, respectively. @lock must match vm_flags VM_LOCKED
+ * state.
+TODO: we don't really need @lock, as we can determine it from vm_flags
+ *
+ * This takes care of making the pages present too.
+ *
+ * vma->vm_mm->mmap_sem must be held for write.
+ */
+int __mlock_vma_pages_range(struct vm_area_struct *vma,
+ unsigned long start, unsigned long end, int lock)
+{
+ struct mm_struct *mm = vma->vm_mm;
+ unsigned long addr = start;
+ struct page *pages[16]; /* 16 gives a reasonable batch */
+ int write = !!(vma->vm_flags & VM_WRITE);
+ int nr_pages;
+ int ret = 0;
+
+ BUG_ON(start & ~PAGE_MASK || end & ~PAGE_MASK);
+ VM_BUG_ON(lock != !!(vma->vm_flags & VM_LOCKED));
+
+ if (vma->vm_flags & VM_IO)
+ return ret;
+
+ nr_pages = (end - start) / PAGE_SIZE;
+
+ while (nr_pages > 0) {
+ int i;
+
+ cond_resched();
+
+ /*
+ * get_user_pages makes pages present if we are
+ * setting mlock.
+ */
+ ret = get_user_pages(current, mm, addr,
+ min_t(int, nr_pages, ARRAY_SIZE(pages)),
+ write, 0, pages, NULL);
+ if (ret < 0)
+ break;
+ if (ret == 0) {
+ /*
+ * We know the vma is there, so the only time
+ * we cannot get a single page should be an
+ * error (ret < 0) case.
+ */
+ WARN_ON(1);
+ ret = -EFAULT;
+ break;
+ }
+
+ for (i = 0; i < ret; i++) {
+ struct page *page = pages[i];
+
+ lock_page(page);
+ if (lock)
+ mlock_vma_page(page);
+ else
+ munlock_vma_page(page);
+ unlock_page(page);
+ put_page(page); /* ref from get_user_pages() */
+
+ addr += PAGE_SIZE;
+ nr_pages--;
+ }
+ }
+ return ret;
+}
+
+#else /* CONFIG_NORECLAIM_MLOCK */
+
+/*
+ * Just make pages present
+ */
+int __mlock_vma_pages_range(struct vm_area_struct *vma,
+ unsigned long start, unsigned long end, int lock)
+{
+ int ret = 0;
+
+ if (vma->vm_flags & VM_IO)
+ return ret;
+
+ return make_pages_present(start, end);
+}
+#endif /* CONFIG_NORECLAIM_MLOCK */
+
static int mlock_fixup(struct vm_area_struct *vma, struct vm_area_struct **prev,
unsigned long start, unsigned long end, unsigned int newflags)
{
- struct mm_struct * mm = vma->vm_mm;
+ struct mm_struct *mm = vma->vm_mm;
pgoff_t pgoff;
- int pages;
+ int nr_pages;
int ret = 0;
+ int lock;
if (newflags == vma->vm_flags) {
*prev = vma;
goto out;
}
+//TODO: linear_page_index() ? non-linear pages?
pgoff = vma->vm_pgoff + ((start - vma->vm_start) >> PAGE_SHIFT);
*prev = vma_merge(mm, *prev, start, end, newflags, vma->anon_vma,
vma->vm_file, pgoff, vma_policy(vma));
@@ -59,24 +259,25 @@ static int mlock_fixup(struct vm_area_st
}
success:
+ lock = !!(newflags & VM_LOCKED);
+
+ /*
+ * Keep track of amount of locked VM.
+ */
+ nr_pages = (end - start) >> PAGE_SHIFT;
+ if (!lock)
+ nr_pages = -nr_pages;
+ mm->locked_vm += nr_pages;
+
/*
* vm_flags is protected by the mmap_sem held in write mode.
* It's okay if try_to_unmap_one unmaps a page just after we
- * set VM_LOCKED, make_pages_present below will bring it back.
+ * set VM_LOCKED, __mlock_vma_pages_range will bring it back.
*/
vma->vm_flags = newflags;
- /*
- * Keep track of amount of locked VM.
- */
- pages = (end - start) >> PAGE_SHIFT;
- if (newflags & VM_LOCKED) {
- pages = -pages;
- if (!(newflags & VM_IO))
- ret = make_pages_present(start, end);
- }
+ __mlock_vma_pages_range(vma, start, end, lock);
- mm->locked_vm -= pages;
out:
if (ret == -ENOMEM)
ret = -EAGAIN;
Index: Linux/mm/vmscan.c
===================================================================
--- Linux.orig/mm/vmscan.c 2007-09-14 10:23:53.000000000 -0400
+++ Linux/mm/vmscan.c 2007-09-14 10:23:55.000000000 -0400
@@ -2165,13 +2165,13 @@ int anon_vma_reclaim_limit = DEFAULT_ANO
*
* @page - page to test
* @vma - vm area in which page is/will be mapped. May be NULL.
- * If !NULL, called from fault path.
+ * If !NULL, called from fault path for a new page.
*
* Reasons page might not be reclaimable:
* 1) page's mapping marked non-reclaimable
* 2) anon/shmem/tmpfs page, but no swap space avail
* 3) anon_vma [if any] has too many related vmas
- * [more TBD. e.g., page mlocked, ...]
+ * 4) page is mlock'ed into memory.
*
* TODO: specify locking assumptions
*/
@@ -2201,7 +2201,10 @@ int page_reclaimable(struct page *page,
return 0;
}
#endif
- /* TODO: test page [!]reclaimable conditions */
+#ifdef CONFIG_NORECLAIM_MLOCK
+ if (PageMlocked(page) || (vma && is_mlocked_vma(vma, page)))
+ return 0;
+#endif
return 1;
}
Index: Linux/include/linux/page-flags.h
===================================================================
--- Linux.orig/include/linux/page-flags.h 2007-09-14 10:21:48.000000000 -0400
+++ Linux/include/linux/page-flags.h 2007-09-14 10:23:55.000000000 -0400
@@ -110,6 +110,7 @@
#define PG_uncached 31 /* Page has been mapped as uncached */
#define PG_noreclaim 30 /* Page is "non-reclaimable" */
+#define PG_mlocked 29 /* Page is vma mlocked */
#endif
/*
@@ -163,6 +164,8 @@ static inline void SetPageUptodate(struc
#define SetPageActive(page) set_bit(PG_active, &(page)->flags)
#define ClearPageActive(page) clear_bit(PG_active, &(page)->flags)
#define __ClearPageActive(page) __clear_bit(PG_active, &(page)->flags)
+#define TestSetPageActive(page) test_and_set_bit(PG_active, &(page)->flags)
+#define TestClearPageActive(page) test_and_clear_bit(PG_active, &(page)->flags)
#define PageSlab(page) test_bit(PG_slab, &(page)->flags)
#define __SetPageSlab(page) __set_bit(PG_slab, &(page)->flags)
@@ -269,8 +272,15 @@ static inline void __ClearPageTail(struc
#define SetPageNoreclaim(page) set_bit(PG_noreclaim, &(page)->flags)
#define ClearPageNoreclaim(page) clear_bit(PG_noreclaim, &(page)->flags)
#define __ClearPageNoreclaim(page) __clear_bit(PG_noreclaim, &(page)->flags)
-#define TestClearPageNoreclaim(page) test_and_clear_bit(PG_noreclaim, \
- &(page)->flags)
+#define TestClearPageNoreclaim(page) \
+ test_and_clear_bit(PG_noreclaim, &(page)->flags)
+#ifdef CONFIG_NORECLAIM_MLOCK
+#define PageMlocked(page) test_bit(PG_mlocked, &(page)->flags)
+#define SetPageMlocked(page) set_bit(PG_mlocked, &(page)->flags)
+#define ClearPageMlocked(page) clear_bit(PG_mlocked, &(page)->flags)
+#define __ClearPageMlocked(page) __clear_bit(PG_mlocked, &(page)->flags)
+#define TestSetPageMlocked(page) test_and_set_bit(PG_mlocked, &(page)->flags)
+#endif
#else
#define PageNoreclaim(page) 0
#define SetPageNoreclaim(page)
@@ -278,6 +288,13 @@ static inline void __ClearPageTail(struc
#define __ClearPageNoreclaim(page)
#define TestClearPageNoreclaim(page) 0
#endif
+#ifndef CONFIG_NORECLAIM_MLOCK
+#define PageMlocked(page) 0
+#define SetPageMlocked(page)
+#define ClearPageMlocked(page)
+#define __ClearPageMlocked(page)
+#define TestSetPageMlocked(page) 0
+#endif
#define PageUncached(page) test_bit(PG_uncached, &(page)->flags)
#define SetPageUncached(page) set_bit(PG_uncached, &(page)->flags)
Index: Linux/include/linux/rmap.h
===================================================================
--- Linux.orig/include/linux/rmap.h 2007-09-14 10:23:52.000000000 -0400
+++ Linux/include/linux/rmap.h 2007-09-14 10:23:55.000000000 -0400
@@ -171,6 +171,17 @@ unsigned long page_address_in_vma(struct
*/
int page_mkclean(struct page *);
+#ifdef CONFIG_NORECLAIM_MLOCK
+/*
+ * called in munlock()/munmap() path to check for other vmas holding
+ * the page mlocked.
+ */
+int try_to_unlock(struct page *);
+#define TRY_TO_UNLOCK 1
+#else
+#define TRY_TO_UNLOCK 0 /* for compiler -- dead code elimination */
+#endif
+
#else /* !CONFIG_MMU */
#define anon_vma_init() do {} while (0)
@@ -194,5 +205,6 @@ static inline int page_mkclean(struct pa
#define SWAP_SUCCESS 0
#define SWAP_AGAIN 1
#define SWAP_FAIL 2
+#define SWAP_MLOCK 3
#endif /* _LINUX_RMAP_H */
Index: Linux/mm/rmap.c
===================================================================
--- Linux.orig/mm/rmap.c 2007-09-14 10:23:52.000000000 -0400
+++ Linux/mm/rmap.c 2007-09-14 10:23:55.000000000 -0400
@@ -52,6 +52,8 @@
#include <asm/tlbflush.h>
+#include "internal.h"
+
struct kmem_cache *anon_vma_cachep;
/* This must be called under the mmap_sem. */
@@ -292,6 +294,14 @@ static int page_referenced_one(struct pa
if (!pte)
goto out;
+ /*
+ * Don't want to elevate referenced for mlocked page that gets this far,
+ * in order that it progresses to try_to_unmap and is moved to the
+ * noreclaim list.
+ */
+ if (vma->vm_flags & VM_LOCKED)
+ goto out_unmap;
+
if (ptep_clear_flush_young(vma, address, pte))
referenced++;
@@ -301,6 +311,7 @@ static int page_referenced_one(struct pa
rwsem_is_locked(&mm->mmap_sem))
referenced++;
+out_unmap:
(*mapcount)--;
pte_unmap_unlock(pte, ptl);
out:
@@ -389,11 +400,6 @@ static int page_referenced_file(struct p
*/
if (mem_cont && (mm_container(vma->vm_mm) != mem_cont))
continue;
- if ((vma->vm_flags & (VM_LOCKED|VM_MAYSHARE))
- == (VM_LOCKED|VM_MAYSHARE)) {
- referenced++;
- break;
- }
referenced += page_referenced_one(page, vma, &mapcount);
if (!mapcount)
break;
@@ -715,10 +721,15 @@ static int try_to_unmap_one(struct page
* If it's recently referenced (perhaps page_referenced
* skipped over this mm) then we should reactivate it.
*/
- if (!migration && ((vma->vm_flags & VM_LOCKED) ||
- (ptep_clear_flush_young(vma, address, pte)))) {
- ret = SWAP_FAIL;
- goto out_unmap;
+ if (!migration) {
+ if (vma->vm_flags & VM_LOCKED) {
+ ret = SWAP_MLOCK;
+ goto out_unmap;
+ }
+ if (ptep_clear_flush_young(vma, address, pte)) {
+ ret = SWAP_FAIL;
+ goto out_unmap;
+ }
}
/* Nuke the page table entry. */
@@ -800,6 +811,10 @@ out:
* For very sparsely populated VMAs this is a little inefficient - chances are
* there there won't be many ptes located within the scan cluster. In this case
* maybe we could scan further - to the end of the pte page, perhaps.
+ *
+TODO: still accurate with noreclaim infrastructure?
+ * Mlocked pages also aren't handled very well at the moment: they aren't
+ * moved off the LRU like they are for linear pages.
*/
#define CLUSTER_SIZE min(32*PAGE_SIZE, PMD_SIZE)
#define CLUSTER_MASK (~(CLUSTER_SIZE - 1))
@@ -871,10 +886,28 @@ static void try_to_unmap_cluster(unsigne
pte_unmap_unlock(pte - 1, ptl);
}
-static int try_to_unmap_anon(struct page *page, int migration)
+/**
+ * try_to_unmap_anon - unmap or unlock anonymous page using the object-based
+ * rmap method
+ * @page: the page to unmap/unlock
+ * @unlock: request for unlock rather than unmap [unlikely]
+ * @migration: unmapping for migration - ignored if @unlock
+ *
+ * Find all the mappings of a page using the mapping pointer and the vma chains
+ * contained in the anon_vma struct it points to.
+ *
+ * This function is only called from try_to_unmap/try_to_unlock for
+ * anonymous pages.
+ * When called from try_to_unlock(), the mmap_sem of the mm containing the vma
+ * where the page was found will be held for write. So, we won't recheck
+ * vm_flags for that VMA. That should be OK, because that vma shouldn't be
+ * 'LOCKED.
+ */
+static int try_to_unmap_anon(struct page *page, int unlock, int migration)
{
struct anon_vma *anon_vma;
struct vm_area_struct *vma;
+ unsigned int mlocked = 0;
int ret = SWAP_AGAIN;
anon_vma = page_lock_anon_vma(page);
@@ -882,25 +915,53 @@ static int try_to_unmap_anon(struct page
return ret;
list_for_each_entry(vma, &anon_vma->head, anon_vma_node) {
- ret = try_to_unmap_one(page, vma, migration);
+ if (TRY_TO_UNLOCK && unlikely(unlock)) {
+ if (!(vma->vm_flags & VM_LOCKED))
+ continue; /* must visit all vmas */
+ mlocked++;
+ break; /* no need to look further */
+ } else
+ ret = try_to_unmap_one(page, vma, migration);
if (ret == SWAP_FAIL || !page_mapped(page))
break;
+ if (ret == SWAP_MLOCK) {
+ if (down_read_trylock(&vma->vm_mm->mmap_sem)) {
+ if (vma->vm_flags & VM_LOCKED) {
+ mlock_vma_page(page);
+ mlocked++;
+ }
+ up_read(&vma->vm_mm->mmap_sem);
+ }
+ }
}
-
page_unlock_anon_vma(anon_vma);
+
+ if (mlocked)
+ ret = SWAP_MLOCK;
+ else if (ret == SWAP_MLOCK)
+ ret = SWAP_AGAIN;
+
return ret;
}
/**
- * try_to_unmap_file - unmap file page using the object-based rmap method
- * @page: the page to unmap
+ * try_to_unmap_file - unmap or unlock file page using the object-based
+ * rmap method
+ * @page: the page to unmap/unlock
+ * @unlock: request for unlock rather than unmap [unlikely]
+ * @migration: unmapping for migration - ignored if @unlock
*
* Find all the mappings of a page using the mapping pointer and the vma chains
* contained in the address_space struct it points to.
*
- * This function is only called from try_to_unmap for object-based pages.
+ * This function is only called from try_to_unmap/try_to_unlock for
+ * object-based pages.
+ * When called from try_to_unlock(), the mmap_sem of the mm containing the vma
+ * where the page was found will be held for write. So, we won't recheck
+ * vm_flags for that VMA. That should be OK, because that vma shouldn't be
+ * 'LOCKED.
*/
-static int try_to_unmap_file(struct page *page, int migration)
+static int try_to_unmap_file(struct page *page, int unlock, int migration)
{
struct address_space *mapping = page->mapping;
pgoff_t pgoff = page->index << (PAGE_CACHE_SHIFT - PAGE_SHIFT);
@@ -911,20 +972,47 @@ static int try_to_unmap_file(struct page
unsigned long max_nl_cursor = 0;
unsigned long max_nl_size = 0;
unsigned int mapcount;
+ unsigned int mlocked = 0;
read_lock(&mapping->i_mmap_lock);
vma_prio_tree_foreach(vma, &iter, &mapping->i_mmap, pgoff, pgoff) {
- ret = try_to_unmap_one(page, vma, migration);
+ if (TRY_TO_UNLOCK && unlikely(unlock)) {
+ if (!(vma->vm_flags & VM_LOCKED))
+ continue; /* must visit all vmas */
+ mlocked++;
+ break; /* no need to look further */
+ } else
+ ret = try_to_unmap_one(page, vma, migration);
if (ret == SWAP_FAIL || !page_mapped(page))
goto out;
+ if (ret == SWAP_MLOCK) {
+ if (down_read_trylock(&vma->vm_mm->mmap_sem)) {
+ if (vma->vm_flags & VM_LOCKED) {
+ mlock_vma_page(page);
+ mlocked++;
+ }
+ up_read(&vma->vm_mm->mmap_sem);
+ }
+ if (unlikely(unlock))
+ break; /* stop on 1st mlocked vma */
+ }
}
+ if (mlocked)
+ goto out;
+
if (list_empty(&mapping->i_mmap_nonlinear))
goto out;
list_for_each_entry(vma, &mapping->i_mmap_nonlinear,
shared.vm_set.list) {
- if ((vma->vm_flags & VM_LOCKED) && !migration)
+ if (TRY_TO_UNLOCK && unlikely(unlock)) {
+ if (!(vma->vm_flags & VM_LOCKED))
+ continue; /* must visit all vmas */
+ mlocked++;
+ goto out; /* no need to look further */
+ }
+ if (!migration && (vma->vm_flags & VM_LOCKED))
continue;
cursor = (unsigned long) vma->vm_private_data;
if (cursor > max_nl_cursor)
@@ -958,8 +1046,6 @@ static int try_to_unmap_file(struct page
do {
list_for_each_entry(vma, &mapping->i_mmap_nonlinear,
shared.vm_set.list) {
- if ((vma->vm_flags & VM_LOCKED) && !migration)
- continue;
cursor = (unsigned long) vma->vm_private_data;
while ( cursor < max_nl_cursor &&
cursor < vma->vm_end - vma->vm_start) {
@@ -984,6 +1070,10 @@ static int try_to_unmap_file(struct page
vma->vm_private_data = NULL;
out:
read_unlock(&mapping->i_mmap_lock);
+ if (mlocked)
+ ret = SWAP_MLOCK;
+ else if (ret == SWAP_MLOCK)
+ ret = SWAP_AGAIN;
return ret;
}
@@ -998,6 +1088,7 @@ out:
* SWAP_SUCCESS - we succeeded in removing all mappings
* SWAP_AGAIN - we missed a mapping, try again later
* SWAP_FAIL - the page is unswappable
+ * SWAP_MLOCK - page is mlocked.
*/
int try_to_unmap(struct page *page, int migration)
{
@@ -1006,12 +1097,40 @@ int try_to_unmap(struct page *page, int
BUG_ON(!PageLocked(page));
if (PageAnon(page))
- ret = try_to_unmap_anon(page, migration);
+ ret = try_to_unmap_anon(page, 0, migration);
else
- ret = try_to_unmap_file(page, migration);
-
- if (!page_mapped(page))
+ ret = try_to_unmap_file(page, 0, migration);
+ if (ret != SWAP_MLOCK && !page_mapped(page))
ret = SWAP_SUCCESS;
return ret;
}
+#ifdef CONFIG_NORECLAIM_MLOCK
+/**
+ * try_to_unlock - Check page's rmap for other vma's holding page locked.
+ * @page: the page to be unlocked. will be returned with PG_mlocked
+ * cleared if no vmas are VM_LOCKED.
+ *
+ * Return values are:
+ *
+ * SWAP_SUCCESS - no vma's holding page locked.
+ * SWAP_MLOCK - page is mlocked.
+ */
+int try_to_unlock(struct page *page)
+{
+ int ret;
+
+ BUG_ON(!PageLocked(page));
+
+ if (PageAnon(page))
+ ret = try_to_unmap_anon(page, 1, 0);
+ else
+ ret = try_to_unmap_file(page, 1, 0);
+
+ if (ret != SWAP_MLOCK) {
+ ClearPageMlocked(page); /* no VM_LOCKED vmas */
+ ret = SWAP_SUCCESS;
+ }
+ return ret;
+}
+#endif
Index: Linux/mm/migrate.c
===================================================================
--- Linux.orig/mm/migrate.c 2007-09-14 10:21:48.000000000 -0400
+++ Linux/mm/migrate.c 2007-09-14 10:23:55.000000000 -0400
@@ -354,6 +354,8 @@ static void migrate_page_copy(struct pag
SetPageActive(newpage);
} else if (PageNoreclaim(page))
SetPageNoreclaim(newpage);
+ if (PageMlocked(page))
+ SetPageMlocked(newpage);
if (PageChecked(page))
SetPageChecked(newpage);
if (PageMappedToDisk(page))
^ permalink raw reply [flat|nested] 77+ messages in thread
* [PATCH/RFC 13/14] Reclaim Scalability: Handle Mlock'ed pages during map/unmap and truncate
2007-09-14 20:53 [PATCH/RFC 0/14] Page Reclaim Scalability Lee Schermerhorn
` (11 preceding siblings ...)
2007-09-14 20:55 ` [PATCH/RFC 12/14] Reclaim Scalability: Non-reclaimable Mlock'ed pages Lee Schermerhorn
@ 2007-09-14 20:55 ` Lee Schermerhorn
2007-09-14 20:55 ` [PATCH/RFC 14/14] Reclaim Scalability: cull non-reclaimable anon pages in fault path Lee Schermerhorn
` (2 subsequent siblings)
15 siblings, 0 replies; 77+ messages in thread
From: Lee Schermerhorn @ 2007-09-14 20:55 UTC (permalink / raw)
To: linux-mm
Cc: akpm, mel, clameter, riel, balbir, andrea, a.p.zijlstra,
eric.whitney, npiggin
PATCH/RFC 13/14 Reclaim Scalability: Handle Mlock'ed pages during map/unmap and truncate
Against: 2.6.23-rc4-mm1
Rework of Nick Piggin's "mm: move mlocked pages off the LRU" patch
-- part 2 of 2.
Remove mlocked pages from the LRU using the "NoReclaim infrastructure"
during mmap()/mremap(). Try to move them back to the normal LRU lists
on munmap(), when the last locked mapping is removed. Remove the
PageMlocked() status when a page is truncated from its file.
Originally Signed-off-by: Nick Piggin <npiggin@suse.de>
Signed-off-by: Lee Schermerhorn <lee.schermerhorn@hp.com>
mm/filemap.c | 10 +++++++++-
mm/mmap.c | 34 +++++++++++++++++++++++++++++++---
mm/mremap.c | 8 +++++---
mm/truncate.c | 4 ++++
mm/vmscan.c | 4 ++++
5 files changed, 53 insertions(+), 7 deletions(-)
Index: Linux/mm/mmap.c
===================================================================
--- Linux.orig/mm/mmap.c 2007-09-14 10:09:41.000000000 -0400
+++ Linux/mm/mmap.c 2007-09-14 10:24:14.000000000 -0400
@@ -32,6 +32,8 @@
#include <asm/tlb.h>
#include <asm/mmu_context.h>
+#include "internal.h"
+
#ifndef arch_mmap_check
#define arch_mmap_check(addr, len, flags) (0)
#endif
@@ -1211,7 +1213,7 @@ out:
vm_stat_account(mm, vm_flags, file, len >> PAGE_SHIFT);
if (vm_flags & VM_LOCKED) {
mm->locked_vm += len >> PAGE_SHIFT;
- make_pages_present(addr, addr + len);
+ mlock_vma_pages_range(vma, addr, addr + len);
}
if ((flags & MAP_POPULATE) && !(flags & MAP_NONBLOCK))
make_pages_present(addr, addr + len);
@@ -1892,6 +1894,19 @@ int do_munmap(struct mm_struct *mm, unsi
vma = prev? prev->vm_next: mm->mmap;
/*
+ * unlock any mlock()ed ranges before detaching vmas
+ */
+ if (mm->locked_vm) {
+ struct vm_area_struct *tmp = vma;
+ while (tmp && tmp->vm_start < end) {
+ if (tmp->vm_flags & VM_LOCKED)
+ munlock_vma_pages_range(tmp,
+ tmp->vm_start, tmp->vm_end);
+ tmp = tmp->vm_next;
+ }
+ }
+
+ /*
* Remove the vma's, and unmap the actual pages
*/
detach_vmas_to_be_unmapped(mm, vma, prev, end);
@@ -2024,7 +2039,7 @@ out:
mm->total_vm += len >> PAGE_SHIFT;
if (flags & VM_LOCKED) {
mm->locked_vm += len >> PAGE_SHIFT;
- make_pages_present(addr, addr + len);
+ mlock_vma_pages_range(vma, addr, addr + len);
}
return addr;
}
@@ -2035,13 +2050,26 @@ EXPORT_SYMBOL(do_brk);
void exit_mmap(struct mm_struct *mm)
{
struct mmu_gather *tlb;
- struct vm_area_struct *vma = mm->mmap;
+ struct vm_area_struct *vma;
unsigned long nr_accounted = 0;
unsigned long end;
/* mm's last user has gone, and its about to be pulled down */
arch_exit_mmap(mm);
+ if (mm->locked_vm) {
+ vma = mm->mmap;
+ while (vma) {
+ if (vma->vm_flags & VM_LOCKED)
+ munlock_vma_pages_range(vma,
+ vma->vm_start, vma->vm_end);
+ vma = vma->vm_next;
+ }
+ }
+
+ vma = mm->mmap;
+
+
lru_add_drain();
flush_cache_mm(mm);
tlb = tlb_gather_mmu(mm, 1);
Index: Linux/mm/mremap.c
===================================================================
--- Linux.orig/mm/mremap.c 2007-09-14 10:09:41.000000000 -0400
+++ Linux/mm/mremap.c 2007-09-14 10:24:14.000000000 -0400
@@ -23,6 +23,8 @@
#include <asm/cacheflush.h>
#include <asm/tlbflush.h>
+#include "internal.h"
+
static pmd_t *get_old_pmd(struct mm_struct *mm, unsigned long addr)
{
pgd_t *pgd;
@@ -232,8 +234,8 @@ static unsigned long move_vma(struct vm_
if (vm_flags & VM_LOCKED) {
mm->locked_vm += new_len >> PAGE_SHIFT;
if (new_len > old_len)
- make_pages_present(new_addr + old_len,
- new_addr + new_len);
+ mlock_vma_pages_range(vma, new_addr + old_len,
+ new_addr + new_len);
}
return new_addr;
@@ -373,7 +375,7 @@ unsigned long do_mremap(unsigned long ad
vm_stat_account(mm, vma->vm_flags, vma->vm_file, pages);
if (vma->vm_flags & VM_LOCKED) {
mm->locked_vm += pages;
- make_pages_present(addr + old_len,
+ mlock_vma_pages_range(vma, addr + old_len,
addr + new_len);
}
ret = addr;
Index: Linux/mm/vmscan.c
===================================================================
--- Linux.orig/mm/vmscan.c 2007-09-14 10:23:55.000000000 -0400
+++ Linux/mm/vmscan.c 2007-09-14 10:24:14.000000000 -0400
@@ -543,6 +543,10 @@ static unsigned long shrink_page_list(st
goto activate_locked;
case SWAP_AGAIN:
goto keep_locked;
+ case SWAP_MLOCK:
+ ClearPageActive(page);
+ SetPageNoreclaim(page);
+ goto keep_locked; /* to noreclaim list */
case SWAP_SUCCESS:
; /* try to free the page below */
}
Index: Linux/mm/filemap.c
===================================================================
--- Linux.orig/mm/filemap.c 2007-09-14 10:09:41.000000000 -0400
+++ Linux/mm/filemap.c 2007-09-14 10:24:14.000000000 -0400
@@ -2497,8 +2497,16 @@ generic_file_direct_IO(int rw, struct ki
if (rw == WRITE) {
write_len = iov_length(iov, nr_segs);
end = (offset + write_len - 1) >> PAGE_CACHE_SHIFT;
- if (mapping_mapped(mapping))
+ if (mapping_mapped(mapping)) {
+ /*
+ * Calling unmap_mapping_range like this is wrong,
+ * because it can lead to mlocked pages being
+ * discarded (this is true even before the Noreclaim
+ * mlock work). direct-IO vs pagecache is a load of
+ * junk anyway, so who cares.
+ */
unmap_mapping_range(mapping, offset, write_len, 0);
+ }
}
retval = filemap_write_and_wait(mapping);
Index: Linux/mm/truncate.c
===================================================================
--- Linux.orig/mm/truncate.c 2007-09-14 10:09:41.000000000 -0400
+++ Linux/mm/truncate.c 2007-09-14 10:24:14.000000000 -0400
@@ -18,6 +18,7 @@
#include <linux/task_io_accounting_ops.h>
#include <linux/buffer_head.h> /* grr. try_to_release_page,
do_invalidatepage */
+#include "internal.h"
/**
@@ -102,6 +103,7 @@ truncate_complete_page(struct address_sp
do_invalidatepage(page, 0);
remove_from_page_cache(page);
+ clear_page_mlock(page);
ClearPageUptodate(page);
ClearPageMappedToDisk(page);
page_cache_release(page); /* pagecache ref */
@@ -126,6 +128,7 @@ invalidate_complete_page(struct address_
if (PagePrivate(page) && !try_to_release_page(page, 0))
return 0;
+ clear_page_mlock(page);
ret = remove_mapping(mapping, page);
return ret;
@@ -352,6 +355,7 @@ invalidate_complete_page2(struct address
if (PageDirty(page))
goto failed;
+ clear_page_mlock(page);
BUG_ON(PagePrivate(page));
__remove_from_page_cache(page);
write_unlock_irq(&mapping->tree_lock);
^ permalink raw reply [flat|nested] 77+ messages in thread
* [PATCH/RFC 14/14] Reclaim Scalability: cull non-reclaimable anon pages in fault path
2007-09-14 20:53 [PATCH/RFC 0/14] Page Reclaim Scalability Lee Schermerhorn
` (12 preceding siblings ...)
2007-09-14 20:55 ` [PATCH/RFC 13/14] Reclaim Scalability: Handle Mlock'ed pages during map/unmap and truncate Lee Schermerhorn
@ 2007-09-14 20:55 ` Lee Schermerhorn
2007-09-14 21:11 ` [PATCH/RFC 0/14] Page Reclaim Scalability Peter Zijlstra
2007-09-17 6:44 ` Balbir Singh
15 siblings, 0 replies; 77+ messages in thread
From: Lee Schermerhorn @ 2007-09-14 20:55 UTC (permalink / raw)
To: linux-mm
Cc: akpm, mel, clameter, riel, balbir, andrea, a.p.zijlstra,
eric.whitney, npiggin
PATCH/RFC 14/14 Reclaim Scalability: cull non-reclaimable anon pages in fault path
Against: 2.6.23-rc4-mm1
Optional part of "noreclaim infrastructure"
In the fault paths that install new anonymous pages, check whether
the page is reclaimable or not using lru_cache_add_active_or_noreclaim().
If the page is reclaimable, just add it to the active lru list [via
the pagevec cache], else add it to the noreclaim list.
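The helper named above is introduced by the earlier noreclaim
infrastructure patches and is not shown here; a plausible sketch of its
behavior, assuming a lru_cache_add_noreclaim() pagevec helper from
those patches:

/*
 * Sketch only -- not the posted implementation.  Route a freshly
 * allocated page either to the active LRU [via the pagevec cache] or
 * straight to the noreclaim list, based on page_reclaimable().
 */
void lru_cache_add_active_or_noreclaim(struct page *page,
					struct vm_area_struct *vma)
{
	if (page_reclaimable(page, vma)) {
		lru_cache_add_active(page);
	} else {
		SetPageNoreclaim(page);
		lru_cache_add_noreclaim(page);	/* assumed helper */
	}
}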
This "proactive" culling in the fault path mimics the handling of
mlocked pages in Nick Piggin's series to keep mlocked pages off
the lru lists.
Notes:
1) This patch is optional--e.g., if one is concerned about the
additional test in the fault path. We can defer the moving of
nonreclaimable pages until when vmscan [shrink_*_list()]
encounters them. Vmscan will only need to handle such pages
once.
2) I moved the call to page_add_new_anon_rmap() to before the test
for page_reclaimable() and thus before the calls to
lru_cache_add_{active|noreclaim}(), so that page_reclaimable()
could recognize the page as anon, thus obviating, I think, the
vma arg to page_reclaimable() for this purpose. Still needed for
culling mlocked pages in fault path [later patch].
TBD: I think this reordering is OK, but the previous order may
have existed to close some obscure race?
3) With this and other patches above installed, any anon pages
created before swap is added--e.g., init's anonymous memory--
will be declared non-reclaimable and placed on the noreclaim
LRU list. Need to add mechanism to bring such pages back when
swap becomes available.
Signed-off-by: Lee Schermerhorn <lee.schermerhorn@hp.com>
mm/memory.c | 6 +++---
mm/swap_state.c | 2 +-
2 files changed, 4 insertions(+), 4 deletions(-)
Index: Linux/mm/memory.c
===================================================================
--- Linux.orig/mm/memory.c 2007-09-13 15:43:27.000000000 -0400
+++ Linux/mm/memory.c 2007-09-13 15:51:53.000000000 -0400
@@ -1665,8 +1665,8 @@ gotten:
ptep_clear_flush(vma, address, page_table);
set_pte_at(mm, address, page_table, entry);
update_mmu_cache(vma, address, entry);
- lru_cache_add_active(new_page);
page_add_new_anon_rmap(new_page, vma, address);
+ lru_cache_add_active_or_noreclaim(new_page, vma);
/* Free the old page.. */
new_page = old_page;
@@ -2195,8 +2195,8 @@ static int do_anonymous_page(struct mm_s
if (!pte_none(*page_table))
goto release;
inc_mm_counter(mm, anon_rss);
- lru_cache_add_active(page);
page_add_new_anon_rmap(page, vma, address);
+ lru_cache_add_active_or_noreclaim(page, vma);
set_pte_at(mm, address, page_table, entry);
/* No need to invalidate - it was non-present before */
@@ -2346,8 +2346,8 @@ static int __do_fault(struct mm_struct *
set_pte_at(mm, address, page_table, entry);
if (anon) {
inc_mm_counter(mm, anon_rss);
- lru_cache_add_active(page);
page_add_new_anon_rmap(page, vma, address);
+ lru_cache_add_active_or_noreclaim(page, vma);
} else {
inc_mm_counter(mm, file_rss);
page_add_file_rmap(page);
Index: Linux/mm/swap_state.c
===================================================================
--- Linux.orig/mm/swap_state.c 2007-09-13 15:43:27.000000000 -0400
+++ Linux/mm/swap_state.c 2007-09-13 15:51:53.000000000 -0400
@@ -368,7 +368,7 @@ struct page *read_swap_cache_async(swp_e
/*
* Initiate read into locked page and return.
*/
- lru_cache_add_active(new_page);
+ lru_cache_add_active_or_noreclaim(new_page, vma);
swap_readpage(NULL, new_page);
return new_page;
}
^ permalink raw reply [flat|nested] 77+ messages in thread
* Re: [PATCH/RFC 0/14] Page Reclaim Scalability
2007-09-14 20:53 [PATCH/RFC 0/14] Page Reclaim Scalability Lee Schermerhorn
` (13 preceding siblings ...)
2007-09-14 20:55 ` [PATCH/RFC 14/14] Reclaim Scalability: cull non-reclaimable anon pages in fault path Lee Schermerhorn
@ 2007-09-14 21:11 ` Peter Zijlstra
2007-09-14 21:42 ` Linus Torvalds
2007-09-17 6:44 ` Balbir Singh
15 siblings, 1 reply; 77+ messages in thread
From: Peter Zijlstra @ 2007-09-14 21:11 UTC (permalink / raw)
To: Lee Schermerhorn
Cc: linux-mm, akpm, mel, clameter, riel, balbir, andrea, eric.whitney,
npiggin, Linus Torvalds
On Fri, 2007-09-14 at 16:53 -0400, Lee Schermerhorn wrote:
> 1) make-anon_vma-lock-rw
> 2) make-i_mmap_lock-rw
>
> The first two patches are not part of the noreclaim infrastructure.
> Rather, these patches improve parallelism in shrink_page_list()--
> specifically in page_referenced() and try_to_unmap()--by making the
> anon_vma lock and the i_mmap_lock reader/writer spinlocks.
Also at Cambridge, Linus said that rw-spinlocks are usually a mistake.
Their spinning nature can cause a lot of cacheline bouncing. If it turns
out these locks are still a benefit, it might make sense to just turn them
into sleeping locks.
That said, even sleeping rw locks have issues on large boxen, but they
sure give a little more breathing room than mutually exclusive locks.
^ permalink raw reply [flat|nested] 77+ messages in thread
* Re: [PATCH/RFC 0/14] Page Reclaim Scalability
2007-09-14 21:11 ` [PATCH/RFC 0/14] Page Reclaim Scalability Peter Zijlstra
@ 2007-09-14 21:42 ` Linus Torvalds
2007-09-14 22:02 ` Peter Zijlstra
0 siblings, 1 reply; 77+ messages in thread
From: Linus Torvalds @ 2007-09-14 21:42 UTC (permalink / raw)
To: Peter Zijlstra
Cc: Lee Schermerhorn, linux-mm, akpm, mel, clameter, riel, balbir,
andrea, eric.whitney, npiggin
On Fri, 14 Sep 2007, Peter Zijlstra wrote:
>
> Also at Cambridge, Linus said that rw-spinlocks are usually a mistake.
>
> Their spinning nature can cause a lot of cacheline bouncing.
They seem to tend to exacerbate any locking problems, at least on x86. The
rw-spinlocks are more expensive than regular spinlocks, and while in
theory they should allow nice parallel work by multiple readers, in
practice the serialization and cost of locking itself seems to just make
things worse.
But we do use them for some things. The tasklist_lock is one, and I don't
think we could/should make that one be a regular spinlock: the tasklist
lock is one of the most "outermost" locks we have, so we often have not
only various process list traversal inside of it, but we have other (much
better localized) spinlocks going on inside of it, and as a result we
actually do end up having real work with real parallelism.
[ But in the case of tasklist_lock, the bigger reason is likely that it
also has a semantic reason to prefer rwlocks: you can do reader locks
from interrupt context, without having to disable interrupts around
other reader locks.
So in the case of tasklist_lock, I think the *real* advantage is not any
amount of extra scalability, but the fact that rwlocks end up allowing
us to disable interrupts only for the few operations that need it for
writing! ]
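To make the bracketed point concrete, an illustration of the locking
pattern only (not code from any particular subsystem):

/*
 * Readers may take an rwlock from interrupt context while readers in
 * process context leave interrupts enabled; only writers must disable
 * interrupts to avoid deadlocking against an in-irq reader.
 */
static DEFINE_RWLOCK(example_lock);

void reader_from_irq(void)		/* e.g. an interrupt handler */
{
	read_lock(&example_lock);
	/* ... walk the protected list ... */
	read_unlock(&example_lock);
}

void reader_from_process_context(void)
{
	read_lock(&example_lock);	/* no irq disabling needed here */
	/* ... walk the protected list ... */
	read_unlock(&example_lock);
}

void writer(void)
{
	write_lock_irq(&example_lock);	/* excludes in-irq readers safely */
	/* ... modify the protected list ... */
	write_unlock_irq(&example_lock);
}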
So the rwlocks have certainly been successful at times. They just have
been less successful than people perhaps expected. They're certainly not
very "cheap", and not really scalable to many readers like a RCU read-lock
is.
So if you actually look for scalability to lots of CPU's, I think you'd
want RCU.
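For comparison, an illustration of an RCU read side over a list like
anon_vma->head (illustration only, not a proposal from this thread; the
writers would have to switch to the list_*_rcu() helpers and defer
freeing with call_rcu()):

static int count_anon_vma_mappings(struct anon_vma *anon_vma)
{
	struct vm_area_struct *vma;
	int count = 0;

	/* readers take no lock and scale to any number of CPUs */
	rcu_read_lock();
	list_for_each_entry_rcu(vma, &anon_vma->head, anon_vma_node)
		count++;
	rcu_read_unlock();

	return count;
}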
Linus
^ permalink raw reply [flat|nested] 77+ messages in thread
* Re: [PATCH/RFC 0/14] Page Reclaim Scalability
2007-09-14 21:42 ` Linus Torvalds
@ 2007-09-14 22:02 ` Peter Zijlstra
2007-09-15 0:07 ` Linus Torvalds
0 siblings, 1 reply; 77+ messages in thread
From: Peter Zijlstra @ 2007-09-14 22:02 UTC (permalink / raw)
To: Linus Torvalds
Cc: Lee Schermerhorn, linux-mm, akpm, mel, clameter, riel, balbir,
andrea, eric.whitney, npiggin
On Fri, 2007-09-14 at 14:42 -0700, Linus Torvalds wrote:
>
> On Fri, 14 Sep 2007, Peter Zijlstra wrote:
> >
> > Also at Cambridge, Linus said that rw-spinlocks are usually a mistake.
> >
> > Their spinning nature can cause a lot of cacheline bouncing.
>
> They seem to tend to exacerbate any locking problems, at least on x86. The
> rw-spinlocks are more expensive than regular spinlocks, and while in
> theory they should allow nice parallel work by multiple readers, in
> practice the serialization and cost of locking itself seems to just make
> things worse.
That was my understanding, so for an rwlock to be somewhat useful the
write side should be rare, and the reader paths longish, so as to win
back some of the serialisation costs.
When looking at the locking hierarchy as presented in rmap.c these two
locks are the first non sleeper locks (and from a quick look at the code
there are no IRQ troubles either), so changing them to a sleeping lock
is quite doable - if that turns out to have advantages over rwlock_t in
this case.
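A minimal sketch of what that conversion could look like for the
anon_vma case, assuming the list lock became a sleeping rw_semaphore in
an anon_vma->rwsem field (not part of the posted series):

static inline void anon_vma_read_lock(struct anon_vma *anon_vma)
{
	down_read(&anon_vma->rwsem);	/* sleeps instead of spinning */
}

static inline void anon_vma_read_unlock(struct anon_vma *anon_vma)
{
	up_read(&anon_vma->rwsem);
}

page_referenced_anon()/try_to_unmap_anon() would take the read side;
the vma link/unlink paths would use down_write()/up_write().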
<snip the tasklist_lock story>
> So the rwlocks have certainly been successful at times. They just have
> been less successful than people perhaps expected. They're certainly not
> very "cheap", and not really scalable to many readers like a RCU read-lock
> is.
>
> So if you actually look for scalability to lots of CPU's, I think you'd
> want RCU.
Certainly, although that might take a little more than a trivial change.
^ permalink raw reply [flat|nested] 77+ messages in thread
* Re: [PATCH/RFC 0/14] Page Reclaim Scalability
2007-09-14 22:02 ` Peter Zijlstra
@ 2007-09-15 0:07 ` Linus Torvalds
0 siblings, 0 replies; 77+ messages in thread
From: Linus Torvalds @ 2007-09-15 0:07 UTC (permalink / raw)
To: Peter Zijlstra
Cc: Lee Schermerhorn, linux-mm, akpm, mel, clameter, riel, balbir,
andrea, eric.whitney, npiggin
On Sat, 15 Sep 2007, Peter Zijlstra wrote:
>
> When looking at the locking hierarchy as presented in rmap.c these two
> locks are the first non sleeper locks (and from a quick look at the code
> there are no IRQ troubles either), so changing them to a sleeping lock
> is quite doable - if that turns out to have advantages over rwlock_t in
> this case.
Well, I don't really think that read-write sleeper locks are any better
than read-write spinlocks. They are even *more* expensive, and the only
advantage of the sleeper lock is if it allows you to do other things.
Which I don't think is the case here (nor do we necessarily *want* to make
the VM have more sleeping situations).
So when it comes to anon_vma lock and i_mmap_lock, maybe rwlocks are fine.
They do have some latency advantages if writers are really comparatively
rare and the critical region is bigger. And I could imagine that under
load it does get big, and the locking overhead is not a big deal.
That said - we *do* actually have things like
cond_resched_lock(&mapping->i_mmap_lock);
which at least to me tends to imply that maybe a sleeping lock really
might be the right thing, since latency has been a problem for these
things. It's certainly a sign of *something*.
> > So if you actually look for scalability to lots of CPU's, I think you'd
> > want RCU.
>
> Certainly, although that might take a little more than a trivial change.
Yeah, I can imagine..
Linus
^ permalink raw reply [flat|nested] 77+ messages in thread
* Re: [PATCH/RFC 0/14] Page Reclaim Scalability
2007-09-14 20:53 [PATCH/RFC 0/14] Page Reclaim Scalability Lee Schermerhorn
` (14 preceding siblings ...)
2007-09-14 21:11 ` [PATCH/RFC 0/14] Page Reclaim Scalability Peter Zijlstra
@ 2007-09-17 6:44 ` Balbir Singh
15 siblings, 0 replies; 77+ messages in thread
From: Balbir Singh @ 2007-09-17 6:44 UTC (permalink / raw)
To: Lee Schermerhorn
Cc: linux-mm, akpm, mel, clameter, riel, andrea, a.p.zijlstra,
eric.whitney, npiggin
[snip]
>
> Aside: I note that in 23-rc4-mm1, the memory controller has
> its own active and inactive list. It may also benefit from
> use of Christoph's patch. Further, we'll need to consider
> whether memory controllers should maintain separate noreclaim
> lists.
>
I need to look at the patches, but if the per-zone LRU is going
to benefit from the split, it is likely that the memory controller
will benefit from it as well. We plan to do an mlock() controller, so we will
definitely gain from the noreclaim lists. The mlock() controller
will put a limit on the amount of mlocked memory and reclaim in
general will benefit from noreclaim lists, especially if the locked
memory proportion is significant.
--
Warm Regards,
Balbir Singh
Linux Technology Center
IBM, ISTL
^ permalink raw reply [flat|nested] 77+ messages in thread