From: Lee Schermerhorn <lee.schermerhorn@hp.com>
To: linux-mm@kvack.org
Cc: akpm@linux-foundation.org, mel@csn.ul.ie, clameter@sgi.com,
riel@redhat.com, balbir@linux.vnet.ibm.com, andrea@suse.de,
a.p.zijlstra@chello.nl, eric.whitney@hp.com, npiggin@suse.de
Subject: [PATCH/RFC 10/14] Reclaim Scalability: track anon_vma "related vmas"
Date: Fri, 14 Sep 2007 16:55:06 -0400
Message-ID: <20070914205506.6536.5170.sendpatchset@localhost>
In-Reply-To: <20070914205359.6536.98017.sendpatchset@localhost>
PATCH/RFC 10/14 Reclaim Scalability: track anon_vma "related vmas"
Against: 2.6.23-rc4-mm1
When a single parent forks a large number [thousands, 10s of thousands]
of children, the anon_vma list of related vmas becomes very long. In
reclaim, this list must be traversed twice--once in page_referenced_anon()
and once in try_to_unmap_anon()--under a spin lock to reclaim the page.
Multiple cpus can end up spinning behind the same anon_vma spinlock and
traversing the lists. This patch, part of the "noreclaim" series, treats
anon pages with list lengths longer than a tunable threshold as non-
reclaimable.
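
To illustrate the cost described above, here is a minimal user-space sketch, not kernel code. All names (model_vma, model_anon_vma, model_fork, model_reclaim_one_page) are hypothetical stand-ins invented for this example, and locking is left out entirely. It only shows that reclaiming one page costs two full walks of a list that grows by one entry per forked child.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical user-space stand-ins; these are NOT the kernel structures. */
struct model_vma {
	struct model_vma *next;
};

struct model_anon_vma {
	struct model_vma *head;		/* list of "related" vmas */
};

/* Model fork(): each child links one more vma onto the shared anon_vma. */
static int model_fork(struct model_anon_vma *av)
{
	struct model_vma *vma = malloc(sizeof(*vma));

	if (!vma)
		return -1;
	vma->next = av->head;
	av->head = vma;
	return 0;
}

/* One full walk of the list; returns the number of vmas visited. */
static size_t model_walk(const struct model_anon_vma *av)
{
	size_t visited = 0;

	for (const struct model_vma *v = av->head; v; v = v->next)
		visited++;
	return visited;
}

/*
 * Model reclaiming one anon page: one walk stands in for the reference
 * check and a second walk stands in for the unmap. In the kernel both
 * walks happen with the anon_vma lock held; no lock is modeled here.
 */
static size_t model_reclaim_one_page(const struct model_anon_vma *av)
{
	return model_walk(av) + model_walk(av);
}

/* Release every vma on the list. */
static void model_free(struct model_anon_vma *av)
{
	while (av->head) {
		struct model_vma *next = av->head->next;

		free(av->head);
		av->head = next;
	}
}
```

With N children, every page that points at this anon_vma costs 2*N list visits per reclaim attempt, which is the scaling problem this patch tries to sidestep.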
1) add mm Kconfig option NORECLAIM_ANON_VMA, dependent on NORECLAIM.
2) add a counter of related vmas to the anon_vma structure. This won't
increase the size of the structure on 64-bit systems, as it will fit
in a padding slot.
TODO: do we need a ref count > 4 billion?
3) In [__]anon_vma_[un]link(), track number of related vmas. The
count is only incremented/decremented while the anon_vma lock
is held, so regular, non-atomic, increment/decrement is used.
4) in page_reclaimable(), check anon_vma count in vma's anon_vma, if
vma supplied, or in page's anon_vma. In fault path, new anon pages are
placed on the LRU before adding the anon rmap, so we need to check
the vma's anon_vma. Fortunately, the vma is available at that point.
In vmscan, we can just check the page's anon_vma for any anon pages
that made it onto the [in]active list before the anon_vma list length
became "excessive".
5) make the threshold tunable via /proc/sys/vm/anon_vma_reclaim_limit.
Default value of 64 is totally arbitrary, but should be high enough
that most applications won't hit it.
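
Steps 2 through 5 above can be sketched in user space roughly as follows. This is only an illustration under stated assumptions: the model_* names are hypothetical, the lock that protects the counter in the kernel is not modeled, and the vma-or-page selection follows the description in step 4 rather than any particular hunk of the diff. A limit of 0 is treated as "check disabled", matching the default when the config option is off.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-ins for illustration; not the kernel definitions. */
#define MODEL_DEFAULT_RECLAIM_LIMIT 64

struct model_anon_vma {
	int count;			/* number of "related" vmas */
};

struct model_vma {
	struct model_anon_vma *anon_vma;
};

struct model_page {
	struct model_anon_vma *anon_vma;
};

/* Stand-in for the tunable; 0 disables the check. */
static int model_reclaim_limit = MODEL_DEFAULT_RECLAIM_LIMIT;

/*
 * Plain (non-atomic) updates: in the kernel these run only with the
 * anon_vma lock held. The lock itself is not modeled here.
 */
static void model_link(struct model_anon_vma *av)
{
	++av->count;
}

static void model_unlink(struct model_anon_vma *av)
{
	--av->count;
	assert(av->count >= 0);
}

/*
 * Returns 1 if the page should be treated as reclaimable, 0 otherwise.
 * Prefer the vma's anon_vma when a vma is supplied (fault path),
 * else fall back to the page's own anon_vma (scan path).
 */
static int model_page_reclaimable(const struct model_page *page,
				  const struct model_vma *vma)
{
	const struct model_anon_vma *av = NULL;

	if (vma)
		av = vma->anon_vma;
	else if (page)
		av = page->anon_vma;

	if (av && model_reclaim_limit && av->count > model_reclaim_limit)
		return 0;
	return 1;
}
```

Note that the comparison is strictly greater-than, so an anon_vma with exactly 64 related vmas is still treated as reclaimable at the default limit.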
Notes:
1) a separate patch makes the anon_vma lock a reader/writer lock.
This allows some parallelism--different cpus can work on different
pages that reference the same anon_vma--but this does not address the
problem of long lists and potentially many pte's to unmap.
2) TODO: do same for file rmap in address_space with excessive number
of mapping vmas?
3) Treating what are theoretically reclaimable pages as nonreclaimable
[in practice they ARE nonreclaimable] will result in oom-kill of some
tasks rather than system hang/livelock. We can debate which is
preferable. However, with these patches, Andrea Arcangeli's oom-kill
cleanups may become more important.
4) an alternate approach: rather than treat these pages as nonreclaimable,
we could track the anon_vma references and in fork() [dup_mmap()], when
the count reaches some limit, give the anon_vma to the child and its
siblings and their descendants, and allocate a new one for the parent.
This requires breaking COW sharing of all anon pages [only the parent
has complete enough state to do this at this point], as tasks can't
share pages using different anon_vmas. This will increase memory
pressure and hasten the onset of reclaim. I was working on this
alternate approach, but shelved it to try the noreclaim list approach.
Signed-off-by: Lee Schermerhorn <lee.schermerhorn@hp.com>
include/linux/rmap.h | 61 ++++++++++++++++++++++++++++++++++++++++++++++++++-
include/linux/swap.h | 3 ++
kernel/sysctl.c | 12 ++++++++++
mm/Kconfig | 11 +++++++++
mm/rmap.c | 12 ++++++++--
mm/vmscan.c | 23 +++++++++++++++++--
6 files changed, 117 insertions(+), 5 deletions(-)
Index: Linux/mm/Kconfig
===================================================================
--- Linux.orig/mm/Kconfig 2007-09-14 10:22:02.000000000 -0400
+++ Linux/mm/Kconfig 2007-09-14 10:23:52.000000000 -0400
@@ -204,3 +204,14 @@ config NORECLAIM
may be non-reclaimable because: they are locked into memory, they
are anonymous pages for which no swap space exists, or they are anon
pages that are expensive to unmap [long anon_vma "related vma" list.]
+
+config NORECLAIM_ANON_VMA
+ bool "Exclude pages with excessively long anon_vma lists"
+ depends on NORECLAIM
+ help
+ Treats anonymous pages with excessively long anon_vma lists as
+ non-reclaimable. Long anon_vma lists result from fork()ing
+ many [hundreds, thousands] of children from a single parent. The
+ anonymous pages in such tasks are very expensive [sometimes almost
+ impossible] to reclaim. Treating them as non-reclaimable avoids
+ the overhead of attempting to reclaim them.
Index: Linux/include/linux/rmap.h
===================================================================
--- Linux.orig/include/linux/rmap.h 2007-09-14 10:22:02.000000000 -0400
+++ Linux/include/linux/rmap.h 2007-09-14 10:23:52.000000000 -0400
@@ -11,6 +11,20 @@
#include <linux/memcontrol.h>
/*
+ * Optionally, limit the growth of the anon_vma list of "related" vmas
+ * to ANON_VMA_LIST_LIMIT. Add a count member
+ * to the anon_vma structure where we'd have padding on a 64-bit
+ * system w/o lock debugging.
+ */
+#ifdef CONFIG_NORECLAIM_ANON_VMA
+#define DEFAULT_ANON_VMA_RECLAIM_LIMIT 64
+#define TRACK_ANON_VMA_COUNT 1
+#else
+#define DEFAULT_ANON_VMA_RECLAIM_LIMIT 0
+#define TRACK_ANON_VMA_COUNT 0
+#endif
+
+/*
* The anon_vma heads a list of private "related" vmas, to scan if
* an anonymous page pointing to this anon_vma needs to be unmapped:
* the vmas on the list will be related by forking, or by splitting.
@@ -26,6 +40,9 @@
*/
struct anon_vma {
rwlock_t rwlock; /* Serialize access to vma list */
+#ifdef CONFIG_NORECLAIM_ANON_VMA
+ int count; /* number of "related" vmas */
+#endif
struct list_head head; /* List of private "related" vmas */
};
@@ -35,11 +52,20 @@ extern struct kmem_cache *anon_vma_cache
static inline struct anon_vma *anon_vma_alloc(void)
{
- return kmem_cache_alloc(anon_vma_cachep, GFP_KERNEL);
+ struct anon_vma *anon_vma;
+
+ anon_vma = kmem_cache_alloc(anon_vma_cachep, GFP_KERNEL);
+#ifdef CONFIG_NORECLAIM_ANON_VMA
+ if (anon_vma)
+ anon_vma->count = 0;
+#endif
+ return anon_vma;
}
static inline void anon_vma_free(struct anon_vma *anon_vma)
{
+ if (TRACK_ANON_VMA_COUNT)
+ VM_BUG_ON(anon_vma->count);
kmem_cache_free(anon_vma_cachep, anon_vma);
}
@@ -60,6 +86,39 @@ static inline void anon_vma_unlock(struc
write_unlock(&anon_vma->rwlock);
}
+#ifdef CONFIG_NORECLAIM_ANON_VMA
+
+/*
+ * Track number of "related" vmas on anon_vma list.
+ * Only called with anon_vma lock held.
+ * Note: we track related vmas on fork() and splits, but
+ * only enforce the limit on fork().
+ */
+static inline void add_related_vma(struct anon_vma *anon_vma)
+{
+ ++anon_vma->count;
+}
+
+static inline void remove_related_vma(struct anon_vma *anon_vma)
+{
+ --anon_vma->count;
+ VM_BUG_ON(anon_vma->count < 0);
+}
+
+static inline struct anon_vma *page_anon_vma(struct page *page)
+{
+ VM_BUG_ON(!PageAnon(page));
+ return (struct anon_vma *)((unsigned long)page->mapping &
+ ~PAGE_MAPPING_ANON);
+}
+
+#else
+
+#define add_related_vma(A)
+#define remove_related_vma(A)
+
+#endif
+
/*
* anon_vma helper functions.
*/
Index: Linux/mm/rmap.c
===================================================================
--- Linux.orig/mm/rmap.c 2007-09-14 10:22:02.000000000 -0400
+++ Linux/mm/rmap.c 2007-09-14 10:23:52.000000000 -0400
@@ -82,6 +82,7 @@ int anon_vma_prepare(struct vm_area_stru
if (likely(!vma->anon_vma)) {
vma->anon_vma = anon_vma;
list_add_tail(&vma->anon_vma_node, &anon_vma->head);
+ add_related_vma(anon_vma);
allocated = NULL;
}
spin_unlock(&mm->page_table_lock);
@@ -96,16 +97,21 @@ int anon_vma_prepare(struct vm_area_stru
void __anon_vma_merge(struct vm_area_struct *vma, struct vm_area_struct *next)
{
- BUG_ON(vma->anon_vma != next->anon_vma);
+ struct anon_vma *anon_vma = vma->anon_vma;
+
+ BUG_ON(anon_vma != next->anon_vma);
list_del(&next->anon_vma_node);
+ remove_related_vma(anon_vma);
}
void __anon_vma_link(struct vm_area_struct *vma)
{
struct anon_vma *anon_vma = vma->anon_vma;
- if (anon_vma)
+ if (anon_vma) {
list_add_tail(&vma->anon_vma_node, &anon_vma->head);
+ add_related_vma(anon_vma);
+ }
}
void anon_vma_link(struct vm_area_struct *vma)
@@ -115,6 +121,7 @@ void anon_vma_link(struct vm_area_struct
if (anon_vma) {
write_lock(&anon_vma->rwlock);
list_add_tail(&vma->anon_vma_node, &anon_vma->head);
+ add_related_vma(anon_vma);
write_unlock(&anon_vma->rwlock);
}
}
@@ -129,6 +136,7 @@ void anon_vma_unlink(struct vm_area_stru
write_lock(&anon_vma->rwlock);
list_del(&vma->anon_vma_node);
+ remove_related_vma(anon_vma);
/* We must garbage collect the anon_vma if it's empty */
empty = list_empty(&anon_vma->head);
Index: Linux/include/linux/swap.h
===================================================================
--- Linux.orig/include/linux/swap.h 2007-09-14 10:22:02.000000000 -0400
+++ Linux/include/linux/swap.h 2007-09-14 10:23:52.000000000 -0400
@@ -227,6 +227,9 @@ static inline int zone_reclaim(struct zo
#ifdef CONFIG_NORECLAIM
extern int page_reclaimable(struct page *page, struct vm_area_struct *vma);
extern void putback_all_noreclaim_pages(void);
+#ifdef CONFIG_NORECLAIM_ANON_VMA
+extern int anon_vma_reclaim_limit;
+#endif
#else
static inline int page_reclaimable(struct page *page,
struct vm_area_struct *vma)
Index: Linux/mm/vmscan.c
===================================================================
--- Linux.orig/mm/vmscan.c 2007-09-14 10:23:50.000000000 -0400
+++ Linux/mm/vmscan.c 2007-09-14 10:23:52.000000000 -0400
@@ -2154,6 +2154,10 @@ int zone_reclaim(struct zone *zone, gfp_
#endif
#ifdef CONFIG_NORECLAIM
+
+#ifdef CONFIG_NORECLAIM_ANON_VMA
+int anon_vma_reclaim_limit = DEFAULT_ANON_VMA_RECLAIM_LIMIT;
+#endif
/*
* page_reclaimable(struct page *page, struct vm_area_struct *vma)
* Test whether page is reclaimable--i.e., should be placed on active/inactive
@@ -2164,8 +2168,9 @@ int zone_reclaim(struct zone *zone, gfp_
* If !NULL, called from fault path.
*
* Reasons page might not be reclaimable:
- * + page's mapping marked non-reclaimable
- * TODO - later patches
+ * 1) page's mapping marked non-reclaimable
+ * 2) anon_vma [if any] has too many related vmas
+ * [more TBD. e.g., anon page and no swap available, page mlocked, ...]
*
* TODO: specify locking assumptions
*/
@@ -2177,6 +2182,20 @@ int page_reclaimable(struct page *page,
if (mapping_non_reclaimable(page_mapping(page)))
return 0;
+#ifdef CONFIG_NORECLAIM_ANON_VMA
+ if (PageAnon(page)) {
+ struct anon_vma *anon_vma;
+
+ /*
+ * anon page with too many related vmas?
+ */
+ anon_vma = page_anon_vma(page);
+ VM_BUG_ON(!anon_vma);
+ if (anon_vma_reclaim_limit &&
+ anon_vma->count > anon_vma_reclaim_limit)
+ return 0;
+ }
+#endif
/* TODO: test page [!]reclaimable conditions */
return 1;
Index: Linux/kernel/sysctl.c
===================================================================
--- Linux.orig/kernel/sysctl.c 2007-09-14 10:22:02.000000000 -0400
+++ Linux/kernel/sysctl.c 2007-09-14 10:23:52.000000000 -0400
@@ -1060,6 +1060,18 @@ static struct ctl_table vm_table[] = {
.extra1 = &zero,
},
#endif
+#ifdef CONFIG_NORECLAIM_ANON_VMA
+ {
+ .ctl_name = CTL_UNNUMBERED,
+ .procname = "anon_vma_reclaim_limit",
+ .data = &anon_vma_reclaim_limit,
+ .maxlen = sizeof(anon_vma_reclaim_limit),
+ .mode = 0644,
+ .proc_handler = &proc_dointvec,
+ .strategy = &sysctl_intvec,
+ .extra1 = &zero,
+ },
+#endif
/*
* NOTE: do not add new entries to this table unless you have read
* Documentation/sysctl/ctl_unnumbered.txt