From: Lee Schermerhorn <lee.schermerhorn@hp.com>
To: linux-mm@kvack.org
Cc: akpm@linux-foundation.org, mel@csn.ul.ie, clameter@sgi.com,
	riel@redhat.com, balbir@linux.vnet.ibm.com, andrea@suse.de,
	a.p.zijlstra@chello.nl, eric.whitney@hp.com, npiggin@suse.de
Subject: [PATCH/RFC 10/14] Reclaim Scalability:  track anon_vma "related vmas"
Date: Fri, 14 Sep 2007 16:55:06 -0400
Message-ID: <20070914205506.6536.5170.sendpatchset@localhost>
In-Reply-To: <20070914205359.6536.98017.sendpatchset@localhost>

PATCH/RFC 10/14 Reclaim Scalability:  track anon_vma "related vmas"

Against:  2.6.23-rc4-mm1

When a single parent forks a large number [thousands, 10s of thousands]
of children, the anon_vma list of related vmas becomes very long.  In
reclaim, this list must be traversed twice--once in page_referenced_anon()
and once in try_to_unmap_anon()--under a spin lock to reclaim the page.
Multiple cpus can end up spinning behind the same anon_vma spinlock and
traversing the lists.  This patch, part of the "noreclaim" series, treats
anon pages with list lengths longer than a tunable threshold as non-
reclaimable.
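
To make the cost concrete, here is a minimal userspace sketch of the two
list walks described above.  It is only a model under stated assumptions:
struct vma_model, pass_referenced(), pass_unmap() and reclaim_walk_cost()
are hypothetical names, not kernel code, and the locking is omitted.  It
assumes the worst case, where every related vma maps the page, so both
passes visit the whole list.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for one "related" vma on an anon_vma's list. */
struct vma_model {
	struct vma_model *next;	/* next related vma on the list */
	int mapped;		/* models a pte that maps the page */
	int young;		/* models a set "accessed" bit in that pte */
};

/* First pass, loosely modeled on page_referenced_anon(). */
static size_t pass_referenced(struct vma_model *head, int *referenced)
{
	size_t visited = 0;

	assert(referenced);
	*referenced = 0;
	for (struct vma_model *v = head; v; v = v->next) {
		visited++;
		if (v->mapped && v->young) {
			v->young = 0;	/* test and clear */
			(*referenced)++;
		}
	}
	return visited;
}

/* Second pass, loosely modeled on try_to_unmap_anon(). */
static size_t pass_unmap(struct vma_model *head)
{
	size_t visited = 0;

	for (struct vma_model *v = head; v; v = v->next) {
		visited++;
		v->mapped = 0;	/* drop this mapping of the page */
	}
	return visited;
}

/* Number of vmas visited to reclaim ONE page in this worst-case model. */
static size_t reclaim_walk_cost(struct vma_model *head)
{
	int referenced;
	size_t visited = pass_referenced(head, &referenced);

	visited += pass_unmap(head);
	return visited;
}
```

In this model the cost of reclaiming a single page grows linearly with the
number of children, and in the kernel all of that work happens with the
anon_vma lock held.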

1) add mm Kconfig option NORECLAIM_ANON_VMA, dependent on NORECLAIM.

2) add a counter of related vmas to the anon_vma structure.  This won't
   increase the size of the structure on 64-bit systems, as it will fit
   in a padding slot.  
   TODO:  do we need a ref count > 4 billion?

3) In [__]anon_vma_[un]link(), track number of related vmas.  The
   count is only incremented/decremented while the anon_vma lock
   is held, so regular, non-atomic, increment/decrement is used.

4) in page_reclaimable(), check anon_vma count in vma's anon_vma, if
   vma supplied, or in page's anon_vma.  In fault path, new anon pages are
   placed on the LRU before adding the anon rmap, so we need to check
   the vma's anon_vma.  Fortunately, the vma is available at that point.
   In vmscan, we can just check the page's anon_vma for any anon pages
   that made it onto the [in]active list before the anon_vma list length
   became "excessive".

5) make the threshold tunable via /proc/sys/vm/anon_vma_reclaim_limit.
   Default value of 64 is totally arbitrary, but should be high enough
   that most applications won't hit it.
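
Steps 2) through 5) can be sketched in userspace roughly as follows.  This
is an illustration, not the patch itself: struct anon_vma_model and
anon_page_reclaimable() are hypothetical names, and the real counter is
only touched with the anon_vma lock held, which the sketch leaves out.
The helper names add_related_vma()/remove_related_vma() and the default
of 64 are taken from the patch.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-in for the kernel's struct anon_vma. */
struct anon_vma_model {
	int count;	/* number of "related" vmas on the list */
};

/* Models the sysctl tunable; 0 disables the check. */
static int anon_vma_reclaim_limit = 64;

/* Called when a vma is linked onto the anon_vma's list. */
static void add_related_vma(struct anon_vma_model *av)
{
	++av->count;
}

/* Called when a vma is unlinked from, or merged away on, the list. */
static void remove_related_vma(struct anon_vma_model *av)
{
	--av->count;
	assert(av->count >= 0);
}

/* Models the anon_vma test added to page_reclaimable(). */
static bool anon_page_reclaimable(const struct anon_vma_model *av)
{
	if (anon_vma_reclaim_limit && av->count > anon_vma_reclaim_limit)
		return false;	/* too many related vmas to unmap */
	return true;
}
```

Note that the comparison is strict: with the default limit, an anon_vma
with exactly 64 related vmas is still treated as reclaimable, and the 65th
link tips it over.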

Notes:

1) a separate patch makes the anon_vma lock a reader/writer lock.
   This allows some parallelism--different cpus can work on different 
   pages that reference the same anon_vma--but this does not address the
   problem of long lists and potentially many pte's to unmap.

2) TODO:  do same for file rmap in address_space with excessive number
   of mapping vmas?

3) Treating what are theoretically reclaimable pages as nonreclaimable
   [in practice they ARE nonreclaimable] will result in oom-kill of some
   tasks rather than system hang/livelock.  We can debate which is
   preferable.  However, with these patches, Andrea Arcangeli's oom-kill
   cleanups may become more important.

4) an alternate approach:  rather than treat these pages as nonreclaimable,
   we could track the anon_vma references and in fork() [dup_mmap()], when
   the count reaches some limit, give the anon_vma to the child and its
   siblings and their descendants, and allocate a new one for the parent.
   This requires breaking COW sharing of all anon pages [only the parent
   has complete enough state to do this at this point], as tasks can't
   share pages using different anon_vmas.  This will increase memory
   pressure and hasten the onset of reclaim.  I was working on this 
   alternate approach, but shelved it to try the noreclaim list approach.

Signed-off-by:  Lee Schermerhorn <lee.schermerhorn@hp.com>

 include/linux/rmap.h |   61 ++++++++++++++++++++++++++++++++++++++++++++++++++-
 include/linux/swap.h |    3 ++
 kernel/sysctl.c      |   12 ++++++++++
 mm/Kconfig           |   11 +++++++++
 mm/rmap.c            |   12 ++++++++--
 mm/vmscan.c          |   23 +++++++++++++++++--
 6 files changed, 117 insertions(+), 5 deletions(-)

Index: Linux/mm/Kconfig
===================================================================
--- Linux.orig/mm/Kconfig	2007-09-14 10:22:02.000000000 -0400
+++ Linux/mm/Kconfig	2007-09-14 10:23:52.000000000 -0400
@@ -204,3 +204,14 @@ config NORECLAIM
 	  may be non-reclaimable because:  they are locked into memory, they
 	  are anonymous pages for which no swap space exists, or they are anon
 	  pages that are expensive to unmap [long anon_vma "related vma" list.]
+
+config NORECLAIM_ANON_VMA
+	bool "Exclude pages with excessively long anon_vma lists"
+	depends on NORECLAIM
+	help
+	  Treats anonymous pages with excessively long anon_vma lists as
+	  non-reclaimable.  Long anon_vma lists result from fork()ing
+	  many [hundreds, thousands] of children from a single parent.  The
+	  anonymous pages in such tasks are very expensive [sometimes almost
+	  impossible] to reclaim.  Treating them as non-reclaimable avoids
+	  the overhead of attempting to reclaim them.
Index: Linux/include/linux/rmap.h
===================================================================
--- Linux.orig/include/linux/rmap.h	2007-09-14 10:22:02.000000000 -0400
+++ Linux/include/linux/rmap.h	2007-09-14 10:23:52.000000000 -0400
@@ -11,6 +11,20 @@
 #include <linux/memcontrol.h>
 
 /*
+ * Optionally, track the number of "related" vmas on the anon_vma list
+ * so that reclaim can compare it against anon_vma_reclaim_limit.  The
+ * count member sits where a 64-bit system w/o lock debugging would
+ * otherwise have padding.
+ */
+#ifdef CONFIG_NORECLAIM_ANON_VMA
+#define DEFAULT_ANON_VMA_RECLAIM_LIMIT 64
+#define TRACK_ANON_VMA_COUNT 1
+#else
+#define DEFAULT_ANON_VMA_RECLAIM_LIMIT 0
+#define TRACK_ANON_VMA_COUNT 0
+#endif
+
+/*
  * The anon_vma heads a list of private "related" vmas, to scan if
  * an anonymous page pointing to this anon_vma needs to be unmapped:
  * the vmas on the list will be related by forking, or by splitting.
@@ -26,6 +40,9 @@
  */
 struct anon_vma {
 	rwlock_t rwlock;	/* Serialize access to vma list */
+#ifdef CONFIG_NORECLAIM_ANON_VMA
+	int count;	/* number of "related" vmas */
+#endif
 	struct list_head head;	/* List of private "related" vmas */
 };
 
@@ -35,11 +52,20 @@ extern struct kmem_cache *anon_vma_cache
 
 static inline struct anon_vma *anon_vma_alloc(void)
 {
-	return kmem_cache_alloc(anon_vma_cachep, GFP_KERNEL);
+	struct anon_vma *anon_vma;
+
+	anon_vma = kmem_cache_alloc(anon_vma_cachep, GFP_KERNEL);
+#ifdef CONFIG_NORECLAIM_ANON_VMA
+	if (anon_vma)
+		anon_vma->count = 0;
+#endif
+	return anon_vma;
 }
 
 static inline void anon_vma_free(struct anon_vma *anon_vma)
 {
+	if (TRACK_ANON_VMA_COUNT)
+		VM_BUG_ON(anon_vma->count);
 	kmem_cache_free(anon_vma_cachep, anon_vma);
 }
 
@@ -60,6 +86,39 @@ static inline void anon_vma_unlock(struc
 		write_unlock(&anon_vma->rwlock);
 }
 
+#ifdef CONFIG_NORECLAIM_ANON_VMA
+
+/*
+ * Track number of "related" vmas on anon_vma list.
+ * Only called with anon_vma lock held.
+ * Note:  we track related vmas on fork() and splits, but
+ * only enforce the limit on fork().
+ */
+static inline void add_related_vma(struct anon_vma *anon_vma)
+{
+	++anon_vma->count;
+}
+
+static inline void remove_related_vma(struct anon_vma *anon_vma)
+{
+	--anon_vma->count;
+	VM_BUG_ON(anon_vma->count < 0);
+}
+
+static inline struct anon_vma *page_anon_vma(struct page *page)
+{
+	VM_BUG_ON(!PageAnon(page));
+	return (struct anon_vma *)((unsigned long)page->mapping &
+						~PAGE_MAPPING_ANON);
+}
+
+#else
+
+#define add_related_vma(A)
+#define remove_related_vma(A)
+
+#endif
+
 /*
  * anon_vma helper functions.
  */
Index: Linux/mm/rmap.c
===================================================================
--- Linux.orig/mm/rmap.c	2007-09-14 10:22:02.000000000 -0400
+++ Linux/mm/rmap.c	2007-09-14 10:23:52.000000000 -0400
@@ -82,6 +82,7 @@ int anon_vma_prepare(struct vm_area_stru
 		if (likely(!vma->anon_vma)) {
 			vma->anon_vma = anon_vma;
 			list_add_tail(&vma->anon_vma_node, &anon_vma->head);
+			add_related_vma(anon_vma);
 			allocated = NULL;
 		}
 		spin_unlock(&mm->page_table_lock);
@@ -96,16 +97,21 @@ int anon_vma_prepare(struct vm_area_stru
 
 void __anon_vma_merge(struct vm_area_struct *vma, struct vm_area_struct *next)
 {
-	BUG_ON(vma->anon_vma != next->anon_vma);
+	struct anon_vma *anon_vma = vma->anon_vma;
+
+	BUG_ON(anon_vma != next->anon_vma);
 	list_del(&next->anon_vma_node);
+	remove_related_vma(anon_vma);
 }
 
 void __anon_vma_link(struct vm_area_struct *vma)
 {
 	struct anon_vma *anon_vma = vma->anon_vma;
 
-	if (anon_vma)
+	if (anon_vma) {
 		list_add_tail(&vma->anon_vma_node, &anon_vma->head);
+		add_related_vma(anon_vma);
+	}
 }
 
 void anon_vma_link(struct vm_area_struct *vma)
@@ -115,6 +121,7 @@ void anon_vma_link(struct vm_area_struct
 	if (anon_vma) {
 		write_lock(&anon_vma->rwlock);
 		list_add_tail(&vma->anon_vma_node, &anon_vma->head);
+		add_related_vma(anon_vma);
 		write_unlock(&anon_vma->rwlock);
 	}
 }
@@ -129,6 +136,7 @@ void anon_vma_unlink(struct vm_area_stru
 
 	write_lock(&anon_vma->rwlock);
 	list_del(&vma->anon_vma_node);
+	remove_related_vma(anon_vma);
 
 	/* We must garbage collect the anon_vma if it's empty */
 	empty = list_empty(&anon_vma->head);
Index: Linux/include/linux/swap.h
===================================================================
--- Linux.orig/include/linux/swap.h	2007-09-14 10:22:02.000000000 -0400
+++ Linux/include/linux/swap.h	2007-09-14 10:23:52.000000000 -0400
@@ -227,6 +227,9 @@ static inline int zone_reclaim(struct zo
 #ifdef CONFIG_NORECLAIM
 extern int page_reclaimable(struct page *page, struct vm_area_struct *vma);
 extern void putback_all_noreclaim_pages(void);
+#ifdef CONFIG_NORECLAIM_ANON_VMA
+extern int anon_vma_reclaim_limit;
+#endif
 #else
 static inline int page_reclaimable(struct page *page,
 						struct vm_area_struct *vma)
Index: Linux/mm/vmscan.c
===================================================================
--- Linux.orig/mm/vmscan.c	2007-09-14 10:23:50.000000000 -0400
+++ Linux/mm/vmscan.c	2007-09-14 10:23:52.000000000 -0400
@@ -2154,6 +2154,10 @@ int zone_reclaim(struct zone *zone, gfp_
 #endif
 
 #ifdef CONFIG_NORECLAIM
+
+#ifdef CONFIG_NORECLAIM_ANON_VMA
+int anon_vma_reclaim_limit = DEFAULT_ANON_VMA_RECLAIM_LIMIT;
+#endif
 /*
  * page_reclaimable(struct page *page, struct vm_area_struct *vma)
  * Test whether page is reclaimable--i.e., should be placed on active/inactive
@@ -2164,8 +2168,9 @@ int zone_reclaim(struct zone *zone, gfp_
  *               If !NULL, called from fault path.
  *
  * Reasons page might not be reclaimable:
- * + page's mapping marked non-reclaimable
- * TODO - later patches
+ * 1) page's mapping marked non-reclaimable
+ * 2) anon_vma [if any] has too many related vmas
+ * [more TBD.  e.g., anon page and no swap available, page mlocked, ...]
  *
  * TODO:  specify locking assumptions
  */
@@ -2177,6 +2182,20 @@ int page_reclaimable(struct page *page, 
 	if (mapping_non_reclaimable(page_mapping(page)))
 		return 0;
 
+#ifdef CONFIG_NORECLAIM_ANON_VMA
+	if (PageAnon(page)) {
+		struct anon_vma *anon_vma;
+
+		/*
+		 * anon page with too many related vmas?
+		 */
+		anon_vma = page_anon_vma(page);
+		VM_BUG_ON(!anon_vma);
+		if (anon_vma_reclaim_limit &&
+			anon_vma->count > anon_vma_reclaim_limit)
+			return 0;
+	}
+#endif
 	/* TODO:  test page [!]reclaimable conditions */
 
 	return 1;
Index: Linux/kernel/sysctl.c
===================================================================
--- Linux.orig/kernel/sysctl.c	2007-09-14 10:22:02.000000000 -0400
+++ Linux/kernel/sysctl.c	2007-09-14 10:23:52.000000000 -0400
@@ -1060,6 +1060,18 @@ static struct ctl_table vm_table[] = {
 		.extra1		= &zero,
 	},
 #endif
+#ifdef CONFIG_NORECLAIM_ANON_VMA
+	{
+		.ctl_name	= CTL_UNNUMBERED,
+		.procname	= "anon_vma_reclaim_limit",
+		.data		= &anon_vma_reclaim_limit,
+		.maxlen		= sizeof(anon_vma_reclaim_limit),
+		.mode		= 0644,
+		.proc_handler	= &proc_dointvec,
+		.strategy	= &sysctl_intvec,
+		.extra1		= &zero,
+	},
+#endif
 /*
  * NOTE: do not add new entries to this table unless you have read
  * Documentation/sysctl/ctl_unnumbered.txt
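
For anyone testing this, the tunable added by the sysctl hunk can be read
from userspace like any other integer procfs entry.  A minimal sketch,
assuming a kernel built with CONFIG_NORECLAIM_ANON_VMA; read_int_tunable()
is a hypothetical helper, not part of the patch:

```c
#include <assert.h>
#include <stdio.h>

/*
 * Hypothetical helper: read one integer from a procfs-style file.
 * Returns 0 on success, -1 if the file is missing or unparsable.
 */
static int read_int_tunable(const char *path, int *value)
{
	FILE *fp;
	int ok;

	assert(path && value);
	fp = fopen(path, "r");
	if (!fp)
		return -1;	/* e.g. option not configured in this kernel */
	ok = (fscanf(fp, "%d", value) == 1);
	fclose(fp);
	return ok ? 0 : -1;
}
```

A caller would pass "/proc/sys/vm/anon_vma_reclaim_limit" as the path and
treat a -1 return as "option not built in".  Per the check in
page_reclaimable(), a value of 0 disables the test.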

