* [PATCH 0/6] [RFC] Fallocate Volatile Ranges v4
@ 2012-06-13 1:10 John Stultz
2012-06-13 1:10 ` [PATCH 1/6] [RFC] Interval tree implementation John Stultz
` (5 more replies)
0 siblings, 6 replies; 9+ messages in thread
From: John Stultz @ 2012-06-13 1:10 UTC (permalink / raw)
To: LKML
Cc: John Stultz, Andrew Morton, Android Kernel Team, Robert Love,
Mel Gorman, Hugh Dickins, Dave Hansen, Rik van Riel,
Dmitry Adamushko, Dave Chinner, Neil Brown, Andrea Righi,
Aneesh Kumar K.V, Taras Glek, Mike Hommey, Jan Kara,
KOSAKI Motohiro
So after sending out the v3 iteration of this patch set, I noticed
that the approach taken there of using writepage to trigger the
volatile range purging would not be functional, as writepage isn't
called on systems without swap.
Given this, for this iteration I've reverted the patch set back to
using a shrinker. I've also included a patch to convert ashmem to
use volatile ranges, which almost cuts the size of the ashmem
driver in half.
Now, I've also gotten feedback from Kosaki-san that the shrinker
interface along with tracking volatile ranges via an lru is a poor
choice, as it is not NUMA aware. I've spent some time trying to
learn more about the VM internals, and I've included two HACK
patches that try to propose a way to address these concerns.
The basic idea is to remove the volatile range lru management and
instead leverage the VM's page lru management, by deactivating the
entire range when marking it volatile. Then in the writepage code we
check if a page is part of a volatile range, and if so we purge it.
This provides LRU-like behavior for purging volatile
ranges (so we purge old volatile ranges before new ones), while
allowing the kernel to free memory first on a numa node that is
under pressure.
Of course, the issue I had w/ the v3 patchset still applies.
We never call shmem_writepage on a system without swap! So to
address this, I've created a quite hackish patch that tries to
manage the anonymous active and inactive lrus slightly differently.
The idea with this is that on systems without swap, there's not much
use in keeping inactive and active anonymous lrus separately, since we
can't swap out any of the anonymous memory (at least this is my
assumption, and I may be wrong). So when we don't have swap, keep
all anonymous pages active. Thus, only pages that are deactivated
will be moved to the inactive lru. Then we change shrink_lruvec()
to try to writeout inactive anonymous pages. Of course, writepage
won't write anything out if there's no swap, but it will now
check if the page is volatile and purge it if appropriate.
So... does that sound reasonable or terrible?
Are there other approaches I should be trying?
Maybe the original shrinker approach isn't so bad after all? :)
Many thanks again to Dave Hansen who's been fending off my
numerous questions on irc for the last few days.
thanks
-john
CC: Andrew Morton <akpm@linux-foundation.org>
CC: Android Kernel Team <kernel-team@android.com>
CC: Robert Love <rlove@google.com>
CC: Mel Gorman <mel@csn.ul.ie>
CC: Hugh Dickins <hughd@google.com>
CC: Dave Hansen <dave@linux.vnet.ibm.com>
CC: Rik van Riel <riel@redhat.com>
CC: Dmitry Adamushko <dmitry.adamushko@gmail.com>
CC: Dave Chinner <david@fromorbit.com>
CC: Neil Brown <neilb@suse.de>
CC: Andrea Righi <andrea@betterlinux.com>
CC: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
CC: Taras Glek <tgek@mozilla.com>
CC: Mike Hommey <mh@glandium.org>
CC: Jan Kara <jack@suse.cz>
CC: KOSAKI Motohiro <kosaki.motohiro@gmail.com>
John Stultz (6):
[RFC] Interval tree implementation
[RFC] Add volatile range management code
[RFC] tmpfs: Add FALLOC_FL_MARK_VOLATILE/UNMARK_VOLATILE handlers
[RFC] ashmem: Convert ashmem to use volatile ranges
[RFC][HACK] tmpfs: Purge volatile ranges on writepage instead of
using shrinker
[RFC][HACK] mm: Change memory management of anonymous pages on
swapless systems
drivers/staging/android/ashmem.c | 331 +--------------------------
fs/open.c | 3 +-
include/linux/falloc.h | 7 +-
include/linux/intervaltree.h | 55 +++++
include/linux/pagevec.h | 5 +-
include/linux/swap.h | 23 ++-
include/linux/volatile.h | 40 ++++
lib/Makefile | 2 +-
lib/intervaltree.c | 119 ++++++++++
mm/Makefile | 2 +-
mm/shmem.c | 102 +++++++++
mm/swap.c | 13 +-
mm/vmscan.c | 9 -
mm/volatile.c | 467 ++++++++++++++++++++++++++++++++++++++
14 files changed, 828 insertions(+), 350 deletions(-)
create mode 100644 include/linux/intervaltree.h
create mode 100644 include/linux/volatile.h
create mode 100644 lib/intervaltree.c
create mode 100644 mm/volatile.c
--
1.7.3.2.146.gca209
* [PATCH 1/6] [RFC] Interval tree implementation
2012-06-13 1:10 [PATCH 0/6] [RFC] Fallocate Volatile Ranges v4 John Stultz
@ 2012-06-13 1:10 ` John Stultz
2012-06-22 7:06 ` Michel Lespinasse
2012-06-13 1:10 ` [PATCH 2/6] [RFC] Add volatile range management code John Stultz
` (4 subsequent siblings)
5 siblings, 1 reply; 9+ messages in thread
From: John Stultz @ 2012-06-13 1:10 UTC (permalink / raw)
To: LKML
Cc: John Stultz, Andrew Morton, Android Kernel Team, Robert Love,
Mel Gorman, Hugh Dickins, Dave Hansen, Rik van Riel,
Dmitry Adamushko, Dave Chinner, Neil Brown, Andrea Righi,
Aneesh Kumar K.V, Taras Glek, Mike Hommey, Jan Kara,
KOSAKI Motohiro
After Andrew suggested something like his mumbletree idea
to better store a list of intervals, I worked on a few different
approaches, and this is what I've finally managed to get working.
The idea of storing intervals in a tree is nice, but has a number
of complications. When adding an interval, it's possible that a
large interval will consume and merge a number of smaller intervals.
When removing an interval, it's possible you may end up splitting an
existing interval, causing one interval to become two. This makes it
very difficult to provide generic list_head-like behavior, as
the parent structures would need to be duplicated and removed,
and that has lots of memory ownership issues.
So, this is a much simplified and more list_head-like
implementation. You can add a node to a tree, or remove a node
from a tree, but the generic implementation doesn't do the
merging or splitting for you. But it does provide helpers to
find overlapping and adjacent intervals.
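As a rough illustration of the intended usage, a caller that wants
merging would build it on top of these helpers roughly as follows
(sketch only; the function below is hypothetical, it assumes
<linux/kernel.h> and <linux/intervaltree.h> are included, and freeing
the structure that embeds a removed node is left to the caller):

/* Hypothetical caller-side merge of [start, end] into the tree */
static void example_add_merged(struct interval_tree_root *root,
			       struct interval_tree_node *new,
			       u64 start, u64 end)
{
	struct interval_tree_node *node, *next;

	interval_tree_node_init(new);

	/* consume every existing interval that intersects [start, end] */
	node = interval_tree_in_interval(root, start, end);
	while (node) {
		/* grow [start, end] to cover the interval we found */
		start = min(start, node->start);
		end = max(end, node->end);
		next = interval_tree_next_in_interval(node, start, end);
		interval_tree_remove(root, node);
		/* caller would free the structure embedding 'node' here */
		node = next;
	}

	new->start = start;
	new->end = end;
	interval_tree_add(root, new);
}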
Andrew also really wanted this interval-tree implementation to be
reusable so we don't duplicate the file locking logic. I'm not
totally convinced that the requirements between the volatile
intervals and file locking are really equivalent, but this reduced
implementation may make it possible.
Changelog:
v2:
* Reworked code to use an rbtree instead of splaying
v3:
* Added range_tree_next_in_range() to avoid having to start
lookups from the root every time.
* Fixed some comments and return NULL instead of 0, as suggested
by Aneesh Kumar K.V
v6:
* Fixed range_tree_in_range() so that it finds the earliest range,
rather than the first. This allows the next_in_range() function
to properly cover all the ranges in the tree.
* Minor cleanups to simplify some of the functions
v7:
* Changed terminology from rangetree to intervaltree as suggested
by Jan Kara
CC: Andrew Morton <akpm@linux-foundation.org>
CC: Android Kernel Team <kernel-team@android.com>
CC: Robert Love <rlove@google.com>
CC: Mel Gorman <mel@csn.ul.ie>
CC: Hugh Dickins <hughd@google.com>
CC: Dave Hansen <dave@linux.vnet.ibm.com>
CC: Rik van Riel <riel@redhat.com>
CC: Dmitry Adamushko <dmitry.adamushko@gmail.com>
CC: Dave Chinner <david@fromorbit.com>
CC: Neil Brown <neilb@suse.de>
CC: Andrea Righi <andrea@betterlinux.com>
CC: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
CC: Taras Glek <tgek@mozilla.com>
CC: Mike Hommey <mh@glandium.org>
CC: Jan Kara <jack@suse.cz>
CC: KOSAKI Motohiro <kosaki.motohiro@gmail.com>
Signed-off-by: John Stultz <john.stultz@linaro.org>
---
include/linux/intervaltree.h | 55 +++++++++++++++++++
lib/Makefile | 2 +-
lib/intervaltree.c | 119 ++++++++++++++++++++++++++++++++++++++++++
3 files changed, 175 insertions(+), 1 deletions(-)
create mode 100644 include/linux/intervaltree.h
create mode 100644 lib/intervaltree.c
diff --git a/include/linux/intervaltree.h b/include/linux/intervaltree.h
new file mode 100644
index 0000000..cfaa174
--- /dev/null
+++ b/include/linux/intervaltree.h
@@ -0,0 +1,55 @@
+#ifndef _LINUX_INTERVALTREE_H
+#define _LINUX_INTERVALTREE_H
+
+#include <linux/types.h>
+#include <linux/rbtree.h>
+
+struct interval_tree_node {
+ struct rb_node rb;
+ u64 start;
+ u64 end;
+};
+
+struct interval_tree_root {
+ struct rb_root head;
+};
+
+static inline void interval_tree_init(struct interval_tree_root *root)
+{
+ root->head = RB_ROOT;
+}
+
+static inline void interval_tree_node_init(struct interval_tree_node *node)
+{
+ rb_init_node(&node->rb);
+ node->start = 0;
+ node->end = 0;
+}
+
+static inline int interval_tree_empty(struct interval_tree_root *root)
+{
+ return RB_EMPTY_ROOT(&root->head);
+}
+
+static inline
+struct interval_tree_node *interval_tree_root_node(
+ struct interval_tree_root *root)
+{
+ struct interval_tree_node *ret;
+ ret = container_of(root->head.rb_node, struct interval_tree_node, rb);
+ return ret;
+}
+
+extern struct interval_tree_node *interval_tree_in_interval(
+ struct interval_tree_root *root,
+ u64 start, u64 end);
+extern struct interval_tree_node *interval_tree_next_in_interval(
+ struct interval_tree_node *node,
+ u64 start, u64 end);
+extern void interval_tree_add(struct interval_tree_root *root,
+ struct interval_tree_node *node);
+extern void interval_tree_remove(struct interval_tree_root *root,
+ struct interval_tree_node *node);
+#endif
+
+
diff --git a/lib/Makefile b/lib/Makefile
index 8c31a0c..2bbad25 100644
--- a/lib/Makefile
+++ b/lib/Makefile
@@ -12,7 +12,7 @@ lib-y := ctype.o string.o vsprintf.o cmdline.o \
idr.o int_sqrt.o extable.o prio_tree.o \
sha1.o md5.o irq_regs.o reciprocal_div.o argv_split.o \
proportions.o prio_heap.o ratelimit.o show_mem.o \
- is_single_threaded.o plist.o decompress.o
+ is_single_threaded.o plist.o decompress.o intervaltree.o
lib-$(CONFIG_MMU) += ioremap.o
lib-$(CONFIG_SMP) += cpumask.o
diff --git a/lib/intervaltree.c b/lib/intervaltree.c
new file mode 100644
index 0000000..47c52e0
--- /dev/null
+++ b/lib/intervaltree.c
@@ -0,0 +1,119 @@
+#include <linux/intervaltree.h>
+#include <linux/kernel.h>
+#include <linux/slab.h>
+
+/* This code implements a naive interval tree, which stores a series of
+ * non-intersecting intervals.
+ * More complex interval trees can be read about here:
+ * http://en.wikipedia.org/wiki/Interval_tree
+ */
+
+
+/**
+ * interval_tree_in_interval - Returns the first node that intersects with the
+ * given interval
+ * @root: interval_tree root
+ * @start: interval start
+ * @end: interval end
+ *
+ */
+struct interval_tree_node *interval_tree_in_interval(
+ struct interval_tree_root *root,
+ u64 start, u64 end)
+{
+ struct rb_node *p = root->head.rb_node;
+ struct interval_tree_node *candidate, *match = NULL;
+
+ while (p) {
+ candidate = rb_entry(p, struct interval_tree_node, rb);
+ if (end < candidate->start)
+ p = p->rb_left;
+ else if (start > candidate->end)
+ p = p->rb_right;
+ else {
+ /* We found one, but try to find an earlier match */
+ match = candidate;
+ p = p->rb_left;
+ }
+ }
+
+ return match;
+}
+
+
+/**
+ * interval_tree_next_in_interval - Return the next interval in an interval
+ * tree that intersects with a specified interval.
+ * @node: node returned by a previous lookup
+ * @start: interval start
+ * @end: interval end
+ *
+ */
+struct interval_tree_node *interval_tree_next_in_interval(
+ struct interval_tree_node *node,
+ u64 start, u64 end)
+{
+ struct rb_node *next;
+ struct interval_tree_node *candidate;
+ if (!node)
+ return NULL;
+ next = rb_next(&node->rb);
+ if (!next)
+ return NULL;
+
+ candidate = container_of(next, struct interval_tree_node, rb);
+
+ if ((candidate->start > end) || (candidate->end < start))
+ return NULL;
+
+ return candidate;
+}
+
+/**
+ * interval_tree_add - Add a node to an interval tree
+ * @root: interval tree to be added to
+ * @node: interval_tree_node to be added
+ *
+ * Adds a node to the interval tree. Added interval should not intersect with
+ * existing intervals in the tree.
+ */
+void interval_tree_add(struct interval_tree_root *root,
+ struct interval_tree_node *node)
+{
+ struct rb_node **p = &root->head.rb_node;
+ struct rb_node *parent = NULL;
+ struct interval_tree_node *ptr;
+
+ WARN_ON_ONCE(!RB_EMPTY_NODE(&node->rb));
+
+ /* XXX might want to conditionalize this on debugging checks */
+ WARN_ON_ONCE(!!interval_tree_in_interval(root, node->start, node->end));
+
+ while (*p) {
+ parent = *p;
+ ptr = rb_entry(parent, struct interval_tree_node, rb);
+ if (node->start < ptr->start)
+ p = &(*p)->rb_left;
+ else
+ p = &(*p)->rb_right;
+ }
+ rb_link_node(&node->rb, parent, p);
+ rb_insert_color(&node->rb, &root->head);
+}
+
+
+/**
+ * interval_tree_remove: Removes a given node from the tree
+ * @root: root of tree
+ * @node: Node to be removed
+ *
+ * Removes a node from the interval tree
+ */
+void interval_tree_remove(struct interval_tree_root *root,
+ struct interval_tree_node *node)
+{
+ WARN_ON_ONCE(RB_EMPTY_NODE(&node->rb));
+
+ rb_erase(&node->rb, &root->head);
+ RB_CLEAR_NODE(&node->rb);
+}
--
1.7.3.2.146.gca209
* [PATCH 2/6] [RFC] Add volatile range management code
2012-06-13 1:10 [PATCH 0/6] [RFC] Fallocate Volatile Ranges v4 John Stultz
2012-06-13 1:10 ` [PATCH 1/6] [RFC] Interval tree implementation John Stultz
@ 2012-06-13 1:10 ` John Stultz
2012-06-13 1:10 ` [PATCH 3/6] [RFC] tmpfs: Add FALLOC_FL_MARK_VOLATILE/UNMARK_VOLATILE handlers John Stultz
` (3 subsequent siblings)
5 siblings, 0 replies; 9+ messages in thread
From: John Stultz @ 2012-06-13 1:10 UTC (permalink / raw)
To: LKML
Cc: John Stultz, Andrew Morton, Android Kernel Team, Robert Love,
Mel Gorman, Hugh Dickins, Dave Hansen, Rik van Riel,
Dmitry Adamushko, Dave Chinner, Neil Brown, Andrea Righi,
Aneesh Kumar K.V, Taras Glek, Mike Hommey, Jan Kara,
KOSAKI Motohiro
This patch provides the volatile range management code
that filesystems can utilize when implementing
FALLOC_FL_MARK_VOLATILE.
It tracks a collection of page ranges against a mapping
stored in an interval-tree. This code handles coalescing
overlapping and adjacent ranges, as well as splitting
ranges when sub-chunks are removed.
The ranges can be marked purged or unpurged. And there is
a per-fs lru list that tracks all the unpurged ranges for
that fs.
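To give an idea of the intended calling pattern, a filesystem wires
this up roughly as follows (sketch only; the examplefs_* name is made
up, and the real tmpfs hookup is in the next patch):

static DEFINE_VOLATILE_FS_HEAD(examplefs_volatile_head);

static long examplefs_mark_volatile(struct address_space *mapping,
				    pgoff_t start, pgoff_t end)
{
	long purged;

	volatile_range_lock(&examplefs_volatile_head);
	purged = volatile_range_add(&examplefs_volatile_head,
				    mapping, start, end);
	volatile_range_unlock(&examplefs_volatile_head);

	/* > 0: coalesced with an already-purged range; < 0: error */
	return purged;
}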
v2:
* Fix bug in volatile_ranges_get_last_used returning bad
start,end values
* Rework for intervaltree renaming
* Optimize volatile_range_lru_size to avoid running through
lru list each time.
v3:
* Improve function name to make it clear what the
volatile_ranges_pluck_lru() code does.
* Drop volatile_range_lru_size and unpurged_page_count
management as it's now unused
v4:
* Re-add volatile_range_lru_size and unpurged_page_count
* Fix bug in range_remove: when we split ranges, add the new
overlapping range before resizing the existing range.
CC: Andrew Morton <akpm@linux-foundation.org>
CC: Android Kernel Team <kernel-team@android.com>
CC: Robert Love <rlove@google.com>
CC: Mel Gorman <mel@csn.ul.ie>
CC: Hugh Dickins <hughd@google.com>
CC: Dave Hansen <dave@linux.vnet.ibm.com>
CC: Rik van Riel <riel@redhat.com>
CC: Dmitry Adamushko <dmitry.adamushko@gmail.com>
CC: Dave Chinner <david@fromorbit.com>
CC: Neil Brown <neilb@suse.de>
CC: Andrea Righi <andrea@betterlinux.com>
CC: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
CC: Taras Glek <tgek@mozilla.com>
CC: Mike Hommey <mh@glandium.org>
CC: Jan Kara <jack@suse.cz>
CC: KOSAKI Motohiro <kosaki.motohiro@gmail.com>
Signed-off-by: John Stultz <john.stultz@linaro.org>
---
include/linux/volatile.h | 45 ++++
mm/Makefile | 2 +-
mm/volatile.c | 509 ++++++++++++++++++++++++++++++++++++++++++++++
3 files changed, 555 insertions(+), 1 deletions(-)
create mode 100644 include/linux/volatile.h
create mode 100644 mm/volatile.c
diff --git a/include/linux/volatile.h b/include/linux/volatile.h
new file mode 100644
index 0000000..51f3d6e
--- /dev/null
+++ b/include/linux/volatile.h
@@ -0,0 +1,45 @@
+#ifndef _LINUX_VOLATILE_H
+#define _LINUX_VOLATILE_H
+
+#include <linux/fs.h>
+
+struct volatile_fs_head {
+ struct mutex lock;
+ struct list_head lru_head;
+ s64 unpurged_page_count;
+};
+
+
+#define DEFINE_VOLATILE_FS_HEAD(name) struct volatile_fs_head name = { \
+ .lock = __MUTEX_INITIALIZER(name.lock), \
+ .lru_head = LIST_HEAD_INIT(name.lru_head), \
+ .unpurged_page_count = 0, \
+}
+
+
+static inline void volatile_range_lock(struct volatile_fs_head *head)
+{
+ mutex_lock(&head->lock);
+}
+
+static inline void volatile_range_unlock(struct volatile_fs_head *head)
+{
+ mutex_unlock(&head->lock);
+}
+
+extern long volatile_range_add(struct volatile_fs_head *head,
+ struct address_space *mapping,
+ pgoff_t start_index, pgoff_t end_index);
+extern long volatile_range_remove(struct volatile_fs_head *head,
+ struct address_space *mapping,
+ pgoff_t start_index, pgoff_t end_index);
+
+extern s64 volatile_range_lru_size(struct volatile_fs_head *head);
+
+extern void volatile_range_clear(struct volatile_fs_head *head,
+ struct address_space *mapping);
+
+extern s64 volatile_ranges_pluck_lru(struct volatile_fs_head *head,
+ struct address_space **mapping,
+ loff_t *start, loff_t *end);
+#endif /* _LINUX_VOLATILE_H */
diff --git a/mm/Makefile b/mm/Makefile
index 2e2fbbe..3e3cd6f 100644
--- a/mm/Makefile
+++ b/mm/Makefile
@@ -16,7 +16,7 @@ obj-y := filemap.o mempool.o oom_kill.o fadvise.o \
readahead.o swap.o truncate.o vmscan.o shmem.o \
prio_tree.o util.o mmzone.o vmstat.o backing-dev.o \
page_isolation.o mm_init.o mmu_context.o percpu.o \
- compaction.o $(mmu-y)
+ compaction.o volatile.o $(mmu-y)
obj-y += init-mm.o
ifdef CONFIG_NO_BOOTMEM
diff --git a/mm/volatile.c b/mm/volatile.c
new file mode 100644
index 0000000..f822ba2
--- /dev/null
+++ b/mm/volatile.c
@@ -0,0 +1,509 @@
+/* mm/volatile.c
+ *
+ * Volatile page range management.
+ * Copyright 2011 Linaro
+ *
+ * Based on mm/ashmem.c
+ * by Robert Love <rlove@google.com>
+ * Copyright (C) 2008 Google, Inc.
+ *
+ *
+ * This software is licensed under the terms of the GNU General Public
+ * License version 2, as published by the Free Software Foundation, and
+ * may be copied, distributed, and modified under those terms.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
+ * GNU General Public License for more details.
+ *
+ * The volatile range management is a helper layer on top of the range tree
+ * code, which is used to help filesystems manage page ranges that are volatile.
+ *
+ * These ranges are stored in a per-mapping range tree, storing both purged and
+ * unpurged ranges connected to that address_space. Unpurged ranges are also
+ * linked together in an lru list that is per-volatile-fs-head (basically
+ * per-filesystem).
+ *
+ * The goal behind volatile ranges is to allow applications to interact
+ * with the kernel's cache management infrastructure. In particular an
+ * application can say "this memory contains data that might be useful in
+ * the future, but can be reconstructed if necessary, so if the kernel
+ * needs, it can zap and reclaim this memory without having to swap it out.
+ *
+ * The proposed mechanism - at a high level - is for user-space to be able
+ * to say "This memory is volatile" and then later "this memory is no longer
+ * volatile". If the content of the memory is still available the second
+ * request succeeds. If not, the memory is marked non-volatile and an
+ * error is returned to denote that the contents have been lost.
+ *
+ * Credits to Neil Brown for the above description.
+ *
+ */
+
+#include <linux/kernel.h>
+#include <linux/fs.h>
+#include <linux/mm.h>
+#include <linux/slab.h>
+#include <linux/pagemap.h>
+#include <linux/volatile.h>
+#include <linux/intervaltree.h>
+#include <linux/hash.h>
+#include <linux/shmem_fs.h>
+
+
+struct volatile_range {
+ struct list_head lru;
+ struct interval_tree_node interval_node;
+ unsigned int purged;
+ struct address_space *mapping;
+};
+
+
+/*
+ * To avoid bloating the address_space structure, we use
+ * a hash structure to map from address_space mappings to
+ * the interval_tree root that stores volatile ranges
+ */
+static DEFINE_MUTEX(hash_mutex);
+static struct hlist_head *mapping_hash;
+static long mapping_hash_shift = 8;
+struct mapping_hash_entry {
+ struct interval_tree_root root;
+ struct address_space *mapping;
+ struct hlist_node hnode;
+};
+
+
+static inline
+struct interval_tree_root *__mapping_to_root(struct address_space *mapping)
+{
+ struct hlist_node *elem;
+ struct mapping_hash_entry *entry;
+ struct interval_tree_root *ret = NULL;
+
+ hlist_for_each_entry_rcu(entry, elem,
+ &mapping_hash[hash_ptr(mapping, mapping_hash_shift)],
+ hnode)
+ if (entry->mapping == mapping)
+ ret = &entry->root;
+
+ return ret;
+}
+
+
+static inline
+struct interval_tree_root *mapping_to_root(struct address_space *mapping)
+{
+ struct interval_tree_root *ret;
+
+ mutex_lock(&hash_mutex);
+ ret = __mapping_to_root(mapping);
+ mutex_unlock(&hash_mutex);
+ return ret;
+}
+
+
+static inline
+struct interval_tree_root *mapping_allocate_root(struct address_space *mapping)
+{
+ struct mapping_hash_entry *entry;
+ struct interval_tree_root *dblchk;
+ struct interval_tree_root *ret = NULL;
+
+ entry = kzalloc(sizeof(*entry), GFP_KERNEL);
+ if (!entry)
+ return NULL;
+
+ mutex_lock(&hash_mutex);
+ /* Since we dropped the lock, double check that no one has
+ * created the same hash entry.
+ */
+ dblchk = __mapping_to_root(mapping);
+ if (dblchk) {
+ kfree(entry);
+ ret = dblchk;
+ goto out;
+ }
+
+ INIT_HLIST_NODE(&entry->hnode);
+ entry->mapping = mapping;
+ interval_tree_init(&entry->root);
+
+ hlist_add_head_rcu(&entry->hnode,
+ &mapping_hash[hash_ptr(mapping, mapping_hash_shift)]);
+
+ ret = &entry->root;
+out:
+ mutex_unlock(&hash_mutex);
+ return ret;
+}
+
+
+static inline void mapping_free_root(struct interval_tree_root *root)
+{
+ struct mapping_hash_entry *entry;
+
+ mutex_lock(&hash_mutex);
+ entry = container_of(root, struct mapping_hash_entry, root);
+
+ hlist_del_rcu(&entry->hnode);
+ kfree(entry);
+ mutex_unlock(&hash_mutex);
+}
+
+
+/* volatile range helpers */
+static inline void vrange_resize(struct volatile_fs_head *head,
+ struct volatile_range *range,
+ pgoff_t start_index, pgoff_t end_index)
+{
+ s64 old_size, new_size;
+
+ old_size = range->interval_node.end - range->interval_node.start;
+ new_size = end_index-start_index;
+
+ if (!range->purged)
+ head->unpurged_page_count += new_size - old_size;
+
+ range->interval_node.start = start_index;
+ range->interval_node.end = end_index;
+}
+
+static struct volatile_range *vrange_alloc(void)
+{
+ struct volatile_range *new;
+
+ new = kzalloc(sizeof(struct volatile_range), GFP_KERNEL);
+ if (!new)
+ return NULL;
+ interval_tree_node_init(&new->interval_node);
+ return new;
+}
+
+static void vrange_del(struct volatile_fs_head *head,
+ struct interval_tree_root *root,
+ struct volatile_range *vrange)
+{
+ if (!vrange->purged) {
+ head->unpurged_page_count -=
+ vrange->interval_node.end - vrange->interval_node.start;
+ list_del(&vrange->lru);
+ }
+ interval_tree_remove(root, &vrange->interval_node);
+ kfree(vrange);
+}
+
+
+/**
+ * volatile_range_add: Marks a page interval as volatile
+ * @head: per-fs volatile head
+ * @mapping: address space whose range is being marked volatile
+ * @start_index: Starting page in range to be marked volatile
+ * @end_index: Ending page in range to be marked volatile
+ *
+ * Mark a region as volatile. Coalesces overlapping and neighboring regions.
+ *
+ * Must lock the volatile_fs_head before calling!
+ *
+ * Returns 1 if the range was coalesced with any purged ranges,
+ * 0 otherwise, or -ENOMEM if allocation fails.
+ */
+long volatile_range_add(struct volatile_fs_head *head,
+ struct address_space *mapping,
+ pgoff_t start_index, pgoff_t end_index)
+{
+ struct volatile_range *new;
+ struct interval_tree_node *node;
+ struct volatile_range *vrange;
+ struct interval_tree_root *root;
+ int purged = 0;
+ u64 start = (u64)start_index;
+ u64 end = (u64)end_index;
+
+ /* Make sure we're properly locked */
+ WARN_ON(!mutex_is_locked(&head->lock));
+
+ /*
+ * Because the lock might be held in a shrinker, release
+ * it during allocation.
+ */
+ mutex_unlock(&head->lock);
+ new = vrange_alloc();
+ mutex_lock(&head->lock);
+ if (!new)
+ return -ENOMEM;
+
+ root = mapping_to_root(mapping);
+ if (!root) {
+ mutex_unlock(&head->lock);
+ root = mapping_allocate_root(mapping);
+ mutex_lock(&head->lock);
+ if (!root) {
+ kfree(new);
+ return -ENOMEM;
+ }
+ }
+
+ /* First, find any existing intervals that overlap */
+ node = interval_tree_in_interval(root, start, end);
+ while (node) {
+ /* Already entirely marked volatile, so we're done */
+ if (node->start < start && node->end > end) {
+ /* don't need the allocated value */
+ kfree(new);
+ return purged;
+ }
+
+ /* Grab containing volatile range */
+ vrange = container_of(node, struct volatile_range,
+ interval_node);
+
+ /* Resize the new range to cover all overlapping ranges */
+ start = min_t(u64, start, node->start);
+ end = max_t(u64, end, node->end);
+
+ /* Inherit purged state from overlapping ranges */
+ purged |= vrange->purged;
+
+
+ node = interval_tree_next_in_interval(&vrange->interval_node,
+ start, end);
+ /* Delete the old range, as we consume it */
+ vrange_del(head, root, vrange);
+ }
+
+ /* Coalesce left-adjacent ranges */
+ node = interval_tree_in_interval(root, start-1, start);
+ if (node) {
+ vrange = container_of(node, struct volatile_range,
+ interval_node);
+ /* Only coalesce if both are either purged or unpurged */
+ if (vrange->purged == purged) {
+ /* resize new range */
+ start = min_t(u64, start, node->start);
+ end = max_t(u64, end, node->end);
+ /* delete old range */
+ vrange_del(head, root, vrange);
+ }
+ }
+
+ /* Coalesce right-adjacent ranges */
+ node = interval_tree_in_interval(root, end, end+1);
+ if (node) {
+ vrange = container_of(node, struct volatile_range,
+ interval_node);
+ /* Only coalesce if both are either purged or unpurged */
+ if (vrange->purged == purged) {
+ /* resize new range */
+ start = min_t(u64, start, node->start);
+ end = max_t(u64, end, node->end);
+ /* delete old range */
+ vrange_del(head, root, vrange);
+ }
+ }
+ /* Assign and store the new range in the range tree */
+ new->mapping = mapping;
+ new->interval_node.start = start;
+ new->interval_node.end = end;
+ new->purged = purged;
+ interval_tree_add(root, &new->interval_node);
+
+ /* Only add unpurged ranges to LRU */
+ if (!purged) {
+ head->unpurged_page_count += end - start;
+ list_add_tail(&new->lru, &head->lru_head);
+ }
+ return purged;
+}
+
+
+/**
+ * volatile_range_remove: Marks a page interval as nonvolatile
+ * @head: per-fs volatile head
+ * @mapping: address space whose range is being marked nonvolatile
+ * @start_index: Starting page in range to be marked nonvolatile
+ * @end_index: Ending page in range to be marked nonvolatile
+ *
+ * Mark a region as nonvolatile. And remove any contained pages
+ * from the volatile range tree.
+ *
+ * Must lock the volatile_fs_head before calling!
+ *
+ * Returns 1 if any portion of the range had been purged,
+ * 0 otherwise, or -ENOMEM if allocation fails.
+ */
+long volatile_range_remove(struct volatile_fs_head *head,
+ struct address_space *mapping,
+ pgoff_t start_index, pgoff_t end_index)
+{
+ struct volatile_range *new;
+ struct interval_tree_node *node;
+ struct interval_tree_root *root;
+ int ret = 0;
+ int used_new = 0;
+ u64 start = (u64)start_index;
+ u64 end = (u64)end_index;
+
+ /* Make sure we're properly locked */
+ WARN_ON(!mutex_is_locked(&head->lock));
+
+ /*
+ * Because the lock might be held in a shrinker, release
+ * it during allocation.
+ */
+ mutex_unlock(&head->lock);
+ new = vrange_alloc();
+ mutex_lock(&head->lock);
+ if (!new)
+ return -ENOMEM;
+
+ root = mapping_to_root(mapping);
+ if (!root)
+ goto out;
+
+
+ /* Find any overlapping ranges */
+ node = interval_tree_in_interval(root, start, end);
+ while (node) {
+ struct volatile_range *vrange;
+ vrange = container_of(node, struct volatile_range,
+ interval_node);
+
+ ret |= vrange->purged;
+
+ if (start <= node->start && end >= node->end) {
+ /* delete: volatile range is totally within range */
+ node = interval_tree_next_in_interval(
+ &vrange->interval_node,
+ start, end);
+ vrange_del(head, root, vrange);
+ } else if (node->start >= start) {
+ /* resize: volatile range right-overlaps range */
+ vrange_resize(head, vrange, end+1, node->end);
+ node = interval_tree_next_in_interval(
+ &vrange->interval_node,
+ start, end);
+
+ } else if (node->end <= end) {
+ /* resize: volatile range left-overlaps range */
+ vrange_resize(head, vrange, node->start, start-1);
+ node = interval_tree_next_in_interval(
+ &vrange->interval_node,
+ start, end);
+ } else {
+ /* split: range is totally within a volatile range */
+ used_new = 1; /* we only do this once */
+ new->mapping = mapping;
+ new->interval_node.start = end + 1;
+ new->interval_node.end = node->end;
+ new->purged = vrange->purged;
+ vrange_resize(head, vrange, node->start, start-1);
+ interval_tree_add(root, &new->interval_node);
+ if (!new->purged)
+ list_add_tail(&new->lru, &head->lru_head);
+
+ break;
+ }
+ }
+
+out:
+ if (!used_new)
+ kfree(new);
+
+ return ret;
+}
+
+/**
+ * volatile_range_lru_size: Returns the number of unpurged pages on the lru
+ * @head: per-fs volatile head
+ *
+ * Returns the number of unpurged pages on the LRU
+ *
+ * Must lock the volatile_fs_head before calling!
+ *
+ */
+s64 volatile_range_lru_size(struct volatile_fs_head *head)
+{
+ WARN_ON(!mutex_is_locked(&head->lock));
+ return head->unpurged_page_count;
+}
+
+
+/**
+ * volatile_ranges_pluck_lru: Returns mapping and size of lru unpurged range
+ * @head: per-fs volatile head
+ * @mapping: dbl pointer to mapping whose range is being purged
+ * @start: Pointer to starting address of range being purged
+ * @end: Pointer to ending address of range being purged
+ *
+ * Returns the mapping, start and end values of the least recently used
+ * range. Marks the range as purged and removes it from the LRU.
+ *
+ * Must lock the volatile_fs_head before calling!
+ *
+ * Returns 1 if a range was returned,
+ * 0 if no ranges were found.
+ */
+s64 volatile_ranges_pluck_lru(struct volatile_fs_head *head,
+ struct address_space **mapping,
+ loff_t *start, loff_t *end)
+{
+ struct volatile_range *range;
+
+ WARN_ON(!mutex_is_locked(&head->lock));
+
+ if (list_empty(&head->lru_head))
+ return 0;
+
+ range = list_first_entry(&head->lru_head, struct volatile_range, lru);
+
+ *start = range->interval_node.start;
+ *end = range->interval_node.end;
+ *mapping = range->mapping;
+
+ head->unpurged_page_count -= *end - *start;
+ list_del(&range->lru);
+ range->purged = 1;
+
+ return 1;
+}
+
+
+/*
+ * Cleans up any volatile ranges.
+ */
+void volatile_range_clear(struct volatile_fs_head *head,
+ struct address_space *mapping)
+{
+ struct volatile_range *tozap;
+ struct interval_tree_root *root;
+
+ WARN_ON(!mutex_is_locked(&head->lock));
+
+ root = mapping_to_root(mapping);
+ if (!root)
+ return;
+
+ while (!interval_tree_empty(root)) {
+ struct interval_tree_node *tmp;
+ tmp = interval_tree_root_node(root);
+ tozap = container_of(tmp, struct volatile_range, interval_node);
+ vrange_del(head, root, tozap);
+ }
+ mapping_free_root(root);
+}
+
+
+static int __init volatile_init(void)
+{
+ int i, size;
+
+ size = 1U << mapping_hash_shift;
+ mapping_hash = kzalloc(sizeof(*mapping_hash)*size, GFP_KERNEL);
+ for (i = 0; i < size; i++)
+ INIT_HLIST_HEAD(&mapping_hash[i]);
+
+ return 0;
+}
+arch_initcall(volatile_init);
--
1.7.3.2.146.gca209
* [PATCH 3/6] [RFC] tmpfs: Add FALLOC_FL_MARK_VOLATILE/UNMARK_VOLATILE handlers
2012-06-13 1:10 [PATCH 0/6] [RFC] Fallocate Volatile Ranges v4 John Stultz
2012-06-13 1:10 ` [PATCH 1/6] [RFC] Interval tree implementation John Stultz
2012-06-13 1:10 ` [PATCH 2/6] [RFC] Add volatile range management code John Stultz
@ 2012-06-13 1:10 ` John Stultz
2012-06-13 1:10 ` [PATCH 4/6] [RFC] ashmem: Convert ashmem to use volatile ranges John Stultz
` (2 subsequent siblings)
5 siblings, 0 replies; 9+ messages in thread
From: John Stultz @ 2012-06-13 1:10 UTC (permalink / raw)
To: LKML
Cc: John Stultz, Andrew Morton, Android Kernel Team, Robert Love,
Mel Gorman, Hugh Dickins, Dave Hansen, Rik van Riel,
Dmitry Adamushko, Dave Chinner, Neil Brown, Andrea Righi,
Aneesh Kumar K.V, Taras Glek, Mike Hommey, Jan Kara,
KOSAKI Motohiro
This patch enables FALLOC_FL_MARK_VOLATILE/UNMARK_VOLATILE
functionality for tmpfs making use of the volatile range
management code.
Conceptually, FALLOC_FL_MARK_VOLATILE is like a delayed
FALLOC_FL_PUNCH_HOLE. It allows an application whose data
caches can be re-created to tell the kernel that some memory
contains data that is useful in the future, but can be
recreated if needed, so that under pressure the kernel can
zap the memory without having to swap it out.
In use, applications use FALLOC_FL_MARK_VOLATILE to mark
page ranges as volatile when they are not in use. Then later
if they want to reuse the data, they use
FALLOC_FL_UNMARK_VOLATILE, which will return an error if the
data has been purged.
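For example, a user-space cache might drive it roughly like this
(illustrative only; the flag values are the ones this series defines,
the raw syscall is used so we don't have to assume anything about the
libc wrapper, and error handling is minimal):

#define _GNU_SOURCE
#include <sys/syscall.h>
#include <unistd.h>

/* Flag values as defined by this patch series */
#define FALLOC_FL_MARK_VOLATILE	0x04
#define FALLOC_FL_UNMARK_VOLATILE	0x08

/* fd is a tmpfs-backed file; [off, off + len) holds rebuildable data */
static long cache_mark_unused(int fd, off_t off, off_t len)
{
	/* the kernel may now purge this range instead of swapping it */
	return syscall(SYS_fallocate, fd, FALLOC_FL_MARK_VOLATILE, off, len);
}

static long cache_reuse(int fd, off_t off, off_t len)
{
	/*
	 * A positive return means part of the range was purged and the
	 * data must be regenerated; 0 means the contents are intact;
	 * -1 with errno set indicates an error.
	 */
	return syscall(SYS_fallocate, fd, FALLOC_FL_UNMARK_VOLATILE, off, len);
}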
This is very much influenced by the Android Ashmem interface by
Robert Love so credits to him and the Android developers.
In many cases the code & logic come directly from the ashmem patch.
The intent of this patch is to allow for ashmem-like behavior, but
embed the idea a little deeper into the VM code.
This is a reworked version of the fadvise volatile idea submitted
earlier to the list. Thanks to Dave Chinner for suggesting to
rework the idea in this fashion. Also thanks to Dmitry Adamushko
for continued review and bug reporting, and Dave Hansen for
help with the original design and mentoring me in the VM code.
v3:
* Fix off by one issue when truncating page ranges
* Use Dave Hansen's suggestion to use shmem_writepage to trigger
range purging instead of using a shrinker.
v4:
* Revert the shrinker removal, since writepage won't get called
if we don't have swap.
CC: Andrew Morton <akpm@linux-foundation.org>
CC: Android Kernel Team <kernel-team@android.com>
CC: Robert Love <rlove@google.com>
CC: Mel Gorman <mel@csn.ul.ie>
CC: Hugh Dickins <hughd@google.com>
CC: Dave Hansen <dave@linux.vnet.ibm.com>
CC: Rik van Riel <riel@redhat.com>
CC: Dmitry Adamushko <dmitry.adamushko@gmail.com>
CC: Dave Chinner <david@fromorbit.com>
CC: Neil Brown <neilb@suse.de>
CC: Andrea Righi <andrea@betterlinux.com>
CC: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
CC: Taras Glek <tgek@mozilla.com>
CC: Mike Hommey <mh@glandium.org>
CC: Jan Kara <jack@suse.cz>
CC: KOSAKI Motohiro <kosaki.motohiro@gmail.com>
Signed-off-by: John Stultz <john.stultz@linaro.org>
---
fs/open.c | 3 +-
include/linux/falloc.h | 7 ++-
mm/shmem.c | 112 ++++++++++++++++++++++++++++++++++++++++++++++++
3 files changed, 118 insertions(+), 4 deletions(-)
diff --git a/fs/open.c b/fs/open.c
index d6c79a0..c0531c0 100644
--- a/fs/open.c
+++ b/fs/open.c
@@ -223,7 +223,8 @@ int do_fallocate(struct file *file, int mode, loff_t offset, loff_t len)
return -EINVAL;
/* Return error if mode is not supported */
- if (mode & ~(FALLOC_FL_KEEP_SIZE | FALLOC_FL_PUNCH_HOLE))
+ if (mode & ~(FALLOC_FL_KEEP_SIZE | FALLOC_FL_PUNCH_HOLE |
+ FALLOC_FL_MARK_VOLATILE | FALLOC_FL_UNMARK_VOLATILE))
return -EOPNOTSUPP;
/* Punch hole must have keep size set */
diff --git a/include/linux/falloc.h b/include/linux/falloc.h
index 73e0b62..3e47ad5 100644
--- a/include/linux/falloc.h
+++ b/include/linux/falloc.h
@@ -1,9 +1,10 @@
#ifndef _FALLOC_H_
#define _FALLOC_H_
-#define FALLOC_FL_KEEP_SIZE 0x01 /* default is extend size */
-#define FALLOC_FL_PUNCH_HOLE 0x02 /* de-allocates range */
-
+#define FALLOC_FL_KEEP_SIZE 0x01 /* default is extend size */
+#define FALLOC_FL_PUNCH_HOLE 0x02 /* de-allocates range */
+#define FALLOC_FL_MARK_VOLATILE 0x04 /* mark range volatile */
+#define FALLOC_FL_UNMARK_VOLATILE 0x08 /* mark range non-volatile */
#ifdef __KERNEL__
/*
diff --git a/mm/shmem.c b/mm/shmem.c
index a15a466..0513816 100644
--- a/mm/shmem.c
+++ b/mm/shmem.c
@@ -64,6 +64,7 @@ static struct vfsmount *shm_mnt;
#include <linux/highmem.h>
#include <linux/seq_file.h>
#include <linux/magic.h>
+#include <linux/volatile.h>
#include <asm/uaccess.h>
#include <asm/pgtable.h>
@@ -624,6 +625,102 @@ static int shmem_setattr(struct dentry *dentry, struct iattr *attr)
return error;
}
+static DEFINE_VOLATILE_FS_HEAD(shmem_volatile_head);
+
+static int shmem_mark_volatile(struct inode *inode, loff_t offset, loff_t len)
+{
+ loff_t lstart, lend;
+ int ret;
+
+ lstart = offset >> PAGE_CACHE_SHIFT;
+ lend = (offset+len) >> PAGE_CACHE_SHIFT;
+
+ volatile_range_lock(&shmem_volatile_head);
+ ret = volatile_range_add(&shmem_volatile_head, &inode->i_data,
+ lstart, lend);
+ if (ret > 0) { /* immediately purge */
+ shmem_truncate_range(inode, lstart<<PAGE_CACHE_SHIFT,
+ ((lend+1)<<PAGE_CACHE_SHIFT)-1);
+ ret = 0;
+ }
+ volatile_range_unlock(&shmem_volatile_head);
+
+ return ret;
+}
+
+static int shmem_unmark_volatile(struct inode *inode, loff_t offset, loff_t len)
+{
+ loff_t lstart, lend;
+ int ret;
+
+ lstart = offset >> PAGE_CACHE_SHIFT;
+ lend = (offset+len) >> PAGE_CACHE_SHIFT;
+
+ volatile_range_lock(&shmem_volatile_head);
+ ret = volatile_range_remove(&shmem_volatile_head,
+ &inode->i_data,
+ lstart, lend);
+ volatile_range_unlock(&shmem_volatile_head);
+
+ return ret;
+}
+
+static void shmem_clear_volatile(struct inode *inode)
+{
+ volatile_range_lock(&shmem_volatile_head);
+ volatile_range_clear(&shmem_volatile_head, &inode->i_data);
+ volatile_range_unlock(&shmem_volatile_head);
+}
+
+static
+int shmem_volatile_shrink(struct shrinker *ignored, struct shrink_control *sc)
+{
+ s64 nr_to_scan = sc->nr_to_scan;
+ const gfp_t gfp_mask = sc->gfp_mask;
+ struct address_space *mapping;
+ loff_t start, end;
+ int ret;
+ s64 page_count;
+
+ if (nr_to_scan && !(gfp_mask & __GFP_FS))
+ return -1;
+
+ volatile_range_lock(&shmem_volatile_head);
+ page_count = volatile_range_lru_size(&shmem_volatile_head);
+ if (!nr_to_scan)
+ goto out;
+
+ do {
+ ret = volatile_ranges_pluck_lru(&shmem_volatile_head,
+ &mapping, &start, &end);
+ if (ret) {
+ shmem_truncate_range(mapping->host,
+ start<<PAGE_CACHE_SHIFT,
+ (end<<PAGE_CACHE_SHIFT)-1);
+ nr_to_scan -= end-start;
+ page_count -= end-start;
+ };
+ } while (ret && (nr_to_scan > 0));
+
+out:
+ volatile_range_unlock(&shmem_volatile_head);
+
+ return page_count;
+}
+
+static struct shrinker shmem_volatile_shrinker = {
+ .shrink = shmem_volatile_shrink,
+ .seeks = DEFAULT_SEEKS,
+};
+
+static int __init shmem_shrinker_init(void)
+{
+ register_shrinker(&shmem_volatile_shrinker);
+ return 0;
+}
+arch_initcall(shmem_shrinker_init);
+
+
static void shmem_evict_inode(struct inode *inode)
{
struct shmem_inode_info *info = SHMEM_I(inode);
@@ -1806,6 +1903,14 @@ static long shmem_fallocate(struct file *file, int mode, loff_t offset,
/* No need to unmap again: hole-punching leaves COWed pages */
error = 0;
goto out;
+ } else if (mode & FALLOC_FL_MARK_VOLATILE) {
+ /* Mark pages volatile, sort of delayed hole punching */
+ error = shmem_mark_volatile(inode, offset, len);
+ goto out;
+ } else if (mode & FALLOC_FL_UNMARK_VOLATILE) {
+ /* Mark pages non-volatile, return error if pages were purged */
+ error = shmem_unmark_volatile(inode, offset, len);
+ goto out;
}
/* We need to check rlimit even when FALLOC_FL_KEEP_SIZE */
@@ -1884,6 +1989,12 @@ out:
return error;
}
+static int shmem_release(struct inode *inode, struct file *file)
+{
+ shmem_clear_volatile(inode);
+ return 0;
+}
+
static int shmem_statfs(struct dentry *dentry, struct kstatfs *buf)
{
struct shmem_sb_info *sbinfo = SHMEM_SB(dentry->d_sb);
@@ -2795,6 +2906,7 @@ static const struct file_operations shmem_file_operations = {
.splice_read = shmem_file_splice_read,
.splice_write = generic_file_splice_write,
.fallocate = shmem_fallocate,
+ .release = shmem_release,
#endif
};
--
1.7.3.2.146.gca209
* [PATCH 4/6] [RFC] ashmem: Convert ashmem to use volatile ranges
2012-06-13 1:10 [PATCH 0/6] [RFC] Fallocate Volatile Ranges v4 John Stultz
` (2 preceding siblings ...)
2012-06-13 1:10 ` [PATCH 3/6] [RFC] tmpfs: Add FALLOC_FL_MARK_VOLATILE/UNMARK_VOLATILE handlers John Stultz
@ 2012-06-13 1:10 ` John Stultz
2012-06-13 1:11 ` [PATCH 5/6] [RFC][HACK] tmpfs: Purge volatile ranges on writepage instead of using shrinker John Stultz
2012-06-13 1:11 ` [PATCH 6/6] [RFC][HACK] mm: Change memory management of anonymous pages on swapless systems John Stultz
5 siblings, 0 replies; 9+ messages in thread
From: John Stultz @ 2012-06-13 1:10 UTC (permalink / raw)
To: LKML
Cc: John Stultz, Andrew Morton, Android Kernel Team, Robert Love,
Mel Gorman, Hugh Dickins, Dave Hansen, Rik van Riel,
Dmitry Adamushko, Dave Chinner, Neil Brown, Andrea Righi,
Aneesh Kumar K.V, Taras Glek, Mike Hommey, Jan Kara,
KOSAKI Motohiro
Rework of my first pass attempt at getting ashmem to utilize
the volatile range code, now using the fallocate interface.
In this implementation GET_PIN_STATUS is unimplemented, due to
the fact that adding an ISVOLATILE check wasn't considered
terribly useful in earlier reviews. It would be trivial to
re-add that functionality, but I wanted to check with the
Android developers to see how often GET_PIN_STATUS is actually
used.
Similarly the ashmem PURGE_ALL_CACHES ioctl does not function,
as the volatile range purging is no longer directly under its
control.
CC: Andrew Morton <akpm@linux-foundation.org>
CC: Android Kernel Team <kernel-team@android.com>
CC: Robert Love <rlove@google.com>
CC: Mel Gorman <mel@csn.ul.ie>
CC: Hugh Dickins <hughd@google.com>
CC: Dave Hansen <dave@linux.vnet.ibm.com>
CC: Rik van Riel <riel@redhat.com>
CC: Dmitry Adamushko <dmitry.adamushko@gmail.com>
CC: Dave Chinner <david@fromorbit.com>
CC: Neil Brown <neilb@suse.de>
CC: Andrea Righi <andrea@betterlinux.com>
CC: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
CC: Taras Glek <tgek@mozilla.com>
CC: Mike Hommey <mh@glandium.org>
CC: Jan Kara <jack@suse.cz>
CC: KOSAKI Motohiro <kosaki.motohiro@gmail.com>
Signed-off-by: John Stultz <john.stultz@linaro.org>
---
drivers/staging/android/ashmem.c | 331 +------------------------------------
1 files changed, 9 insertions(+), 322 deletions(-)
diff --git a/drivers/staging/android/ashmem.c b/drivers/staging/android/ashmem.c
index e84dbec..9e8fe37 100644
--- a/drivers/staging/android/ashmem.c
+++ b/drivers/staging/android/ashmem.c
@@ -50,26 +50,6 @@ struct ashmem_area {
};
/*
- * ashmem_range - represents an interval of unpinned (evictable) pages
- * Lifecycle: From unpin to pin
- * Locking: Protected by `ashmem_mutex'
- */
-struct ashmem_range {
- struct list_head lru; /* entry in LRU list */
- struct list_head unpinned; /* entry in its area's unpinned list */
- struct ashmem_area *asma; /* associated area */
- size_t pgstart; /* starting page, inclusive */
- size_t pgend; /* ending page, inclusive */
- unsigned int purged; /* ASHMEM_NOT or ASHMEM_WAS_PURGED */
-};
-
-/* LRU list of unpinned pages, protected by ashmem_mutex */
-static LIST_HEAD(ashmem_lru_list);
-
-/* Count of pages on our LRU list, protected by ashmem_mutex */
-static unsigned long lru_count;
-
-/*
* ashmem_mutex - protects the list of and each individual ashmem_area
*
* Lock Ordering: ashmex_mutex -> i_mutex -> i_alloc_sem
@@ -77,102 +57,9 @@ static unsigned long lru_count;
static DEFINE_MUTEX(ashmem_mutex);
static struct kmem_cache *ashmem_area_cachep __read_mostly;
-static struct kmem_cache *ashmem_range_cachep __read_mostly;
-
-#define range_size(range) \
- ((range)->pgend - (range)->pgstart + 1)
-
-#define range_on_lru(range) \
- ((range)->purged == ASHMEM_NOT_PURGED)
-
-#define page_range_subsumes_range(range, start, end) \
- (((range)->pgstart >= (start)) && ((range)->pgend <= (end)))
-
-#define page_range_subsumed_by_range(range, start, end) \
- (((range)->pgstart <= (start)) && ((range)->pgend >= (end)))
-
-#define page_in_range(range, page) \
- (((range)->pgstart <= (page)) && ((range)->pgend >= (page)))
-
-#define page_range_in_range(range, start, end) \
- (page_in_range(range, start) || page_in_range(range, end) || \
- page_range_subsumes_range(range, start, end))
-
-#define range_before_page(range, page) \
- ((range)->pgend < (page))
#define PROT_MASK (PROT_EXEC | PROT_READ | PROT_WRITE)
-static inline void lru_add(struct ashmem_range *range)
-{
- list_add_tail(&range->lru, &ashmem_lru_list);
- lru_count += range_size(range);
-}
-
-static inline void lru_del(struct ashmem_range *range)
-{
- list_del(&range->lru);
- lru_count -= range_size(range);
-}
-
-/*
- * range_alloc - allocate and initialize a new ashmem_range structure
- *
- * 'asma' - associated ashmem_area
- * 'prev_range' - the previous ashmem_range in the sorted asma->unpinned list
- * 'purged' - initial purge value (ASMEM_NOT_PURGED or ASHMEM_WAS_PURGED)
- * 'start' - starting page, inclusive
- * 'end' - ending page, inclusive
- *
- * Caller must hold ashmem_mutex.
- */
-static int range_alloc(struct ashmem_area *asma,
- struct ashmem_range *prev_range, unsigned int purged,
- size_t start, size_t end)
-{
- struct ashmem_range *range;
-
- range = kmem_cache_zalloc(ashmem_range_cachep, GFP_KERNEL);
- if (unlikely(!range))
- return -ENOMEM;
-
- range->asma = asma;
- range->pgstart = start;
- range->pgend = end;
- range->purged = purged;
-
- list_add_tail(&range->unpinned, &prev_range->unpinned);
-
- if (range_on_lru(range))
- lru_add(range);
-
- return 0;
-}
-
-static void range_del(struct ashmem_range *range)
-{
- list_del(&range->unpinned);
- if (range_on_lru(range))
- lru_del(range);
- kmem_cache_free(ashmem_range_cachep, range);
-}
-
-/*
- * range_shrink - shrinks a range
- *
- * Caller must hold ashmem_mutex.
- */
-static inline void range_shrink(struct ashmem_range *range,
- size_t start, size_t end)
-{
- size_t pre = range_size(range);
-
- range->pgstart = start;
- range->pgend = end;
-
- if (range_on_lru(range))
- lru_count -= pre - range_size(range);
-}
static int ashmem_open(struct inode *inode, struct file *file)
{
@@ -198,12 +85,6 @@ static int ashmem_open(struct inode *inode, struct file *file)
static int ashmem_release(struct inode *ignored, struct file *file)
{
struct ashmem_area *asma = file->private_data;
- struct ashmem_range *range, *next;
-
- mutex_lock(&ashmem_mutex);
- list_for_each_entry_safe(range, next, &asma->unpinned_list, unpinned)
- range_del(range);
- mutex_unlock(&ashmem_mutex);
if (asma->file)
fput(asma->file);
@@ -337,56 +218,6 @@ out:
return ret;
}
-/*
- * ashmem_shrink - our cache shrinker, called from mm/vmscan.c :: shrink_slab
- *
- * 'nr_to_scan' is the number of objects (pages) to prune, or 0 to query how
- * many objects (pages) we have in total.
- *
- * 'gfp_mask' is the mask of the allocation that got us into this mess.
- *
- * Return value is the number of objects (pages) remaining, or -1 if we cannot
- * proceed without risk of deadlock (due to gfp_mask).
- *
- * We approximate LRU via least-recently-unpinned, jettisoning unpinned partial
- * chunks of ashmem regions LRU-wise one-at-a-time until we hit 'nr_to_scan'
- * pages freed.
- */
-static int ashmem_shrink(struct shrinker *s, struct shrink_control *sc)
-{
- struct ashmem_range *range, *next;
-
- /* We might recurse into filesystem code, so bail out if necessary */
- if (sc->nr_to_scan && !(sc->gfp_mask & __GFP_FS))
- return -1;
- if (!sc->nr_to_scan)
- return lru_count;
-
- mutex_lock(&ashmem_mutex);
- list_for_each_entry_safe(range, next, &ashmem_lru_list, lru) {
- loff_t start = range->pgstart * PAGE_SIZE;
- loff_t end = (range->pgend + 1) * PAGE_SIZE;
-
- do_fallocate(range->asma->file,
- FALLOC_FL_PUNCH_HOLE | FALLOC_FL_KEEP_SIZE,
- start, end - start);
- range->purged = ASHMEM_WAS_PURGED;
- lru_del(range);
-
- sc->nr_to_scan -= range_size(range);
- if (sc->nr_to_scan <= 0)
- break;
- }
- mutex_unlock(&ashmem_mutex);
-
- return lru_count;
-}
-
-static struct shrinker ashmem_shrinker = {
- .shrink = ashmem_shrink,
- .seeks = DEFAULT_SEEKS * 4,
-};
-
static int set_prot_mask(struct ashmem_area *asma, unsigned long prot)
{
int ret = 0;
@@ -459,136 +290,10 @@ static int get_name(struct ashmem_area *asma, void __user *name)
return ret;
}
-/*
- * ashmem_pin - pin the given ashmem region, returning whether it was
- * previously purged (ASHMEM_WAS_PURGED) or not (ASHMEM_NOT_PURGED).
- *
- * Caller must hold ashmem_mutex.
- */
-static int ashmem_pin(struct ashmem_area *asma, size_t pgstart, size_t pgend)
-{
- struct ashmem_range *range, *next;
- int ret = ASHMEM_NOT_PURGED;
-
- list_for_each_entry_safe(range, next, &asma->unpinned_list, unpinned) {
- /* moved past last applicable page; we can short circuit */
- if (range_before_page(range, pgstart))
- break;
-
- /*
- * The user can ask us to pin pages that span multiple ranges,
- * or to pin pages that aren't even unpinned, so this is messy.
- *
- * Four cases:
- * 1. The requested range subsumes an existing range, so we
- * just remove the entire matching range.
- * 2. The requested range overlaps the start of an existing
- * range, so we just update that range.
- * 3. The requested range overlaps the end of an existing
- * range, so we just update that range.
- * 4. The requested range punches a hole in an existing range,
- * so we have to update one side of the range and then
- * create a new range for the other side.
- */
- if (page_range_in_range(range, pgstart, pgend)) {
- ret |= range->purged;
-
- /* Case #1: Easy. Just nuke the whole thing. */
- if (page_range_subsumes_range(range, pgstart, pgend)) {
- range_del(range);
- continue;
- }
-
- /* Case #2: We overlap from the start, so adjust it */
- if (range->pgstart >= pgstart) {
- range_shrink(range, pgend + 1, range->pgend);
- continue;
- }
-
- /* Case #3: We overlap from the rear, so adjust it */
- if (range->pgend <= pgend) {
- range_shrink(range, range->pgstart, pgstart-1);
- continue;
- }
-
- /*
- * Case #4: We eat a chunk out of the middle. A bit
- * more complicated, we allocate a new range for the
- * second half and adjust the first chunk's endpoint.
- */
- range_alloc(asma, range, range->purged,
- pgend + 1, range->pgend);
- range_shrink(range, range->pgstart, pgstart - 1);
- break;
- }
- }
-
- return ret;
-}
-
-/*
- * ashmem_unpin - unpin the given range of pages. Returns zero on success.
- *
- * Caller must hold ashmem_mutex.
- */
-static int ashmem_unpin(struct ashmem_area *asma, size_t pgstart, size_t pgend)
-{
- struct ashmem_range *range, *next;
- unsigned int purged = ASHMEM_NOT_PURGED;
-
-restart:
- list_for_each_entry_safe(range, next, &asma->unpinned_list, unpinned) {
- /* short circuit: this is our insertion point */
- if (range_before_page(range, pgstart))
- break;
-
- /*
- * The user can ask us to unpin pages that are already entirely
- * or partially pinned. We handle those two cases here.
- */
- if (page_range_subsumed_by_range(range, pgstart, pgend))
- return 0;
- if (page_range_in_range(range, pgstart, pgend)) {
- pgstart = min_t(size_t, range->pgstart, pgstart),
- pgend = max_t(size_t, range->pgend, pgend);
- purged |= range->purged;
- range_del(range);
- goto restart;
- }
- }
-
- return range_alloc(asma, range, purged, pgstart, pgend);
-}
-
-/*
- * ashmem_get_pin_status - Returns ASHMEM_IS_UNPINNED if _any_ pages in the
- * given interval are unpinned and ASHMEM_IS_PINNED otherwise.
- *
- * Caller must hold ashmem_mutex.
- */
-static int ashmem_get_pin_status(struct ashmem_area *asma, size_t pgstart,
- size_t pgend)
-{
- struct ashmem_range *range;
- int ret = ASHMEM_IS_PINNED;
-
- list_for_each_entry(range, &asma->unpinned_list, unpinned) {
- if (range_before_page(range, pgstart))
- break;
- if (page_range_in_range(range, pgstart, pgend)) {
- ret = ASHMEM_IS_UNPINNED;
- break;
- }
- }
-
- return ret;
-}
-
static int ashmem_pin_unpin(struct ashmem_area *asma, unsigned long cmd,
void __user *p)
{
struct ashmem_pin pin;
- size_t pgstart, pgend;
int ret = -EINVAL;
if (unlikely(!asma->file))
@@ -610,20 +315,24 @@ static int ashmem_pin_unpin(struct ashmem_area *asma, unsigned long cmd,
if (unlikely(PAGE_ALIGN(asma->size) < pin.offset + pin.len))
return -EINVAL;
- pgstart = pin.offset / PAGE_SIZE;
- pgend = pgstart + (pin.len / PAGE_SIZE) - 1;
mutex_lock(&ashmem_mutex);
switch (cmd) {
case ASHMEM_PIN:
- ret = ashmem_pin(asma, pgstart, pgend);
+ ret = do_fallocate(asma->file, FALLOC_FL_UNMARK_VOLATILE,
+ pin.offset, pin.len);
break;
case ASHMEM_UNPIN:
- ret = ashmem_unpin(asma, pgstart, pgend);
+ ret = do_fallocate(asma->file, FALLOC_FL_MARK_VOLATILE,
+ pin.offset, pin.len);
break;
case ASHMEM_GET_PIN_STATUS:
- ret = ashmem_get_pin_status(asma, pgstart, pgend);
+ /*
+ * XXX - volatile ranges currently don't provide status,
+ * due to questionable utility
+ */
+ ret = -EINVAL;
break;
}
@@ -667,15 +376,6 @@ static long ashmem_ioctl(struct file *file, unsigned int cmd, unsigned long arg)
break;
case ASHMEM_PURGE_ALL_CACHES:
ret = -EPERM;
- if (capable(CAP_SYS_ADMIN)) {
- struct shrink_control sc = {
- .gfp_mask = GFP_KERNEL,
- .nr_to_scan = 0,
- };
- ret = ashmem_shrink(&ashmem_shrinker, &sc);
- sc.nr_to_scan = ret;
- ashmem_shrink(&ashmem_shrinker, &sc);
- }
break;
}
@@ -711,22 +411,12 @@ static int __init ashmem_init(void)
return -ENOMEM;
}
- ashmem_range_cachep = kmem_cache_create("ashmem_range_cache",
- sizeof(struct ashmem_range),
- 0, 0, NULL);
- if (unlikely(!ashmem_range_cachep)) {
- printk(KERN_ERR "ashmem: failed to create slab cache\n");
- return -ENOMEM;
- }
-
ret = misc_register(&ashmem_misc);
if (unlikely(ret)) {
printk(KERN_ERR "ashmem: failed to register misc device!\n");
return ret;
}
- register_shrinker(&ashmem_shrinker);
-
printk(KERN_INFO "ashmem: initialized\n");
return 0;
@@ -736,13 +426,10 @@ static void __exit ashmem_exit(void)
{
int ret;
- unregister_shrinker(&ashmem_shrinker);
-
ret = misc_deregister(&ashmem_misc);
if (unlikely(ret))
printk(KERN_ERR "ashmem: failed to unregister misc device!\n");
- kmem_cache_destroy(ashmem_range_cachep);
kmem_cache_destroy(ashmem_area_cachep);
printk(KERN_INFO "ashmem: unloaded\n");
--
1.7.3.2.146.gca209
* [PATCH 5/6] [RFC][HACK] tmpfs: Purge volatile ranges on writepage instead of using shrinker
2012-06-13 1:10 [PATCH 0/6] [RFC] Fallocate Volatile Ranges v4 John Stultz
` (3 preceding siblings ...)
2012-06-13 1:10 ` [PATCH 4/6] [RFC] ashmem: Convert ashmem to use volatile ranges John Stultz
@ 2012-06-13 1:11 ` John Stultz
2012-06-13 1:11 ` [PATCH 6/6] [RFC][HACK] mm: Change memory management of anonymous pages on swapless systems John Stultz
5 siblings, 0 replies; 9+ messages in thread
From: John Stultz @ 2012-06-13 1:11 UTC (permalink / raw)
To: LKML
Cc: John Stultz, Andrew Morton, Android Kernel Team, Robert Love,
Mel Gorman, Hugh Dickins, Dave Hansen, Rik van Riel,
Dmitry Adamushko, Dave Chinner, Neil Brown, Andrea Righi,
Aneesh Kumar K.V, Taras Glek, Mike Hommey, Jan Kara,
KOSAKI Motohiro
In order to avoid using a shrinker, purge volatile ranges on writepage.
This requires deactivating/activating the page ranges together to
ensure relative LRU behavior in purging the volatile ranges.
This no longer requires the volatile range lru so remove that code.
Also add volatile range infrastructure to find a range that contains
a given page.
NOTE: On systems without swap, the VM won't shrink anonymous memory,
so writepage is never called and volatile ranges won't be purged.
This issue will be addressed in a following patch.
I've only been able to do minimal testing, so there's probably much
still wrong with this patch. But hopefully it will provide a concrete
approach for discussion.
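Conceptually the writepage-side check boils down to something like
the following (simplified sketch; the helper name is made up, the
index-to-byte conversion just mirrors what the old shrinker did, and
the interaction with the page lock writepage already holds is glossed
over):

/* Sketch: purge the containing volatile range instead of swapping */
static bool shmem_purge_if_volatile(struct page *page)
{
	struct inode *inode = page->mapping->host;
	loff_t start, end;
	bool purged = false;

	volatile_range_lock(&shmem_volatile_head);
	if (volatile_page_in_range(&shmem_volatile_head,
				   page, &start, &end)) {
		shmem_truncate_range(inode,
				     start << PAGE_CACHE_SHIFT,
				     (end << PAGE_CACHE_SHIFT) - 1);
		purged = true;
	}
	volatile_range_unlock(&shmem_volatile_head);

	return purged;
}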
CC: Andrew Morton <akpm@linux-foundation.org>
CC: Android Kernel Team <kernel-team@android.com>
CC: Robert Love <rlove@google.com>
CC: Mel Gorman <mel@csn.ul.ie>
CC: Hugh Dickins <hughd@google.com>
CC: Dave Hansen <dave@linux.vnet.ibm.com>
CC: Rik van Riel <riel@redhat.com>
CC: Dmitry Adamushko <dmitry.adamushko@gmail.com>
CC: Dave Chinner <david@fromorbit.com>
CC: Neil Brown <neilb@suse.de>
CC: Andrea Righi <andrea@betterlinux.com>
CC: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
CC: Taras Glek <tgek@mozilla.com>
CC: Mike Hommey <mh@glandium.org>
CC: Jan Kara <jack@suse.cz>
CC: KOSAKI Motohiro <kosaki.motohiro@gmail.com>
Signed-off-by: John Stultz <john.stultz@linaro.org>
---
include/linux/volatile.h | 11 ++----
mm/shmem.c | 90 ++++++++++++++++++++-------------------------
mm/volatile.c | 84 +++++++++++--------------------------------
3 files changed, 64 insertions(+), 121 deletions(-)
diff --git a/include/linux/volatile.h b/include/linux/volatile.h
index 51f3d6e..8edb5c1 100644
--- a/include/linux/volatile.h
+++ b/include/linux/volatile.h
@@ -5,15 +5,11 @@
struct volatile_fs_head {
struct mutex lock;
- struct list_head lru_head;
- s64 unpurged_page_count;
};
#define DEFINE_VOLATILE_FS_HEAD(name) struct volatile_fs_head name = { \
.lock = __MUTEX_INITIALIZER(name.lock), \
- .lru_head = LIST_HEAD_INIT(name.lru_head), \
- .unpurged_page_count = 0, \
}
@@ -34,12 +30,11 @@ extern long volatile_range_remove(struct volatile_fs_head *head,
struct address_space *mapping,
pgoff_t start_index, pgoff_t end_index);
-extern s64 volatile_range_lru_size(struct volatile_fs_head *head);
-
extern void volatile_range_clear(struct volatile_fs_head *head,
struct address_space *mapping);
-extern s64 volatile_ranges_pluck_lru(struct volatile_fs_head *head,
- struct address_space **mapping,
+int volatile_page_in_range(struct volatile_fs_head *head,
+ struct page *page,
loff_t *start, loff_t *end);
+
#endif /* _LINUX_VOLATILE_H */
diff --git a/mm/shmem.c b/mm/shmem.c
index 0513816..8dbb8ac 100644
--- a/mm/shmem.c
+++ b/mm/shmem.c
@@ -629,7 +629,7 @@ static DEFINE_VOLATILE_FS_HEAD(shmem_volatile_head);
static int shmem_mark_volatile(struct inode *inode, loff_t offset, loff_t len)
{
- loff_t lstart, lend;
+ loff_t lstart, lend, index;
int ret;
lstart = offset >> PAGE_CACHE_SHIFT;
@@ -643,6 +643,17 @@ static int shmem_mark_volatile(struct inode *inode, loff_t offset, loff_t len)
((lend+1)<<PAGE_CACHE_SHIFT)-1);
ret = 0;
}
+
+ for (index = lstart; index < lend; index++) {
+ struct page *page;
+ int error = shmem_getpage(inode, index, &page, SGP_FALLOC,NULL);
+ if (error)
+ break;
+
+ deactivate_page(page);
+ unlock_page(page);
+ page_cache_release(page);
+ }
volatile_range_unlock(&shmem_volatile_head);
return ret;
@@ -650,7 +661,7 @@ static int shmem_mark_volatile(struct inode *inode, loff_t offset, loff_t len)
static int shmem_unmark_volatile(struct inode *inode, loff_t offset, loff_t len)
{
- loff_t lstart, lend;
+ loff_t lstart, lend, index;
int ret;
lstart = offset >> PAGE_CACHE_SHIFT;
@@ -660,6 +671,18 @@ static int shmem_unmark_volatile(struct inode *inode, loff_t offset, loff_t len)
ret = volatile_range_remove(&shmem_volatile_head,
&inode->i_data,
lstart, lend);
+
+ for (index = lstart; index < lend; index++) {
+ struct page *page;
+ int error = shmem_getpage(inode, index, &page, SGP_FALLOC,NULL);
+ if (error)
+ break;
+
+ activate_page(page);
+ unlock_page(page);
+ page_cache_release(page);
+ }
+
volatile_range_unlock(&shmem_volatile_head);
return ret;
@@ -672,54 +695,6 @@ static void shmem_clear_volatile(struct inode *inode)
volatile_range_unlock(&shmem_volatile_head);
}
-static
-int shmem_volatile_shrink(struct shrinker *ignored, struct shrink_control *sc)
-{
- s64 nr_to_scan = sc->nr_to_scan;
- const gfp_t gfp_mask = sc->gfp_mask;
- struct address_space *mapping;
- loff_t start, end;
- int ret;
- s64 page_count;
-
- if (nr_to_scan && !(gfp_mask & __GFP_FS))
- return -1;
-
- volatile_range_lock(&shmem_volatile_head);
- page_count = volatile_range_lru_size(&shmem_volatile_head);
- if (!nr_to_scan)
- goto out;
-
- do {
- ret = volatile_ranges_pluck_lru(&shmem_volatile_head,
- &mapping, &start, &end);
- if (ret) {
- shmem_truncate_range(mapping->host,
- start<<PAGE_CACHE_SHIFT,
- (end<<PAGE_CACHE_SHIFT)-1);
- nr_to_scan -= end-start;
- page_count -= end-start;
- };
- } while (ret && (nr_to_scan > 0));
-
-out:
- volatile_range_unlock(&shmem_volatile_head);
-
- return page_count;
-}
-
-static struct shrinker shmem_volatile_shrinker = {
- .shrink = shmem_volatile_shrink,
- .seeks = DEFAULT_SEEKS,
-};
-
-static int __init shmem_shrinker_init(void)
-{
- register_shrinker(&shmem_volatile_shrinker);
- return 0;
-}
-arch_initcall(shmem_shrinker_init);
-
static void shmem_evict_inode(struct inode *inode)
{
@@ -884,14 +859,29 @@ static int shmem_writepage(struct page *page, struct writeback_control *wbc)
struct inode *inode;
swp_entry_t swap;
pgoff_t index;
+ loff_t start, end;
BUG_ON(!PageLocked(page));
mapping = page->mapping;
index = page->index;
inode = mapping->host;
info = SHMEM_I(inode);
+
if (info->flags & VM_LOCKED)
goto redirty;
+
+ volatile_range_lock(&shmem_volatile_head);
+ if (volatile_page_in_range(&shmem_volatile_head, page, &start, &end)) {
+ unlock_page(page);
+ volatile_range_unlock(&shmem_volatile_head);
+ shmem_truncate_range(inode, start<<PAGE_CACHE_SHIFT,
+ (end<<PAGE_CACHE_SHIFT)-1);
+
+ return 0;
+ }
+ volatile_range_unlock(&shmem_volatile_head);
+
+
if (!total_swap_pages)
goto redirty;
diff --git a/mm/volatile.c b/mm/volatile.c
index f822ba2..5d5c2cd 100644
--- a/mm/volatile.c
+++ b/mm/volatile.c
@@ -158,14 +158,6 @@ static inline void vrange_resize(struct volatile_fs_head *head,
struct volatile_range *range,
pgoff_t start_index, pgoff_t end_index)
{
- s64 old_size, new_size;
-
- old_size = range->interval_node.end - range->interval_node.start;
- new_size = end_index-start_index;
-
- if (!range->purged)
- head->unpurged_page_count += new_size - old_size;
-
range->interval_node.start = start_index;
range->interval_node.end = end_index;
}
@@ -185,11 +177,6 @@ static void vrange_del(struct volatile_fs_head *head,
struct interval_tree_root *root,
struct volatile_range *vrange)
{
- if (!vrange->purged) {
- head->unpurged_page_count -=
- vrange->interval_node.end - vrange->interval_node.start;
- list_del(&vrange->lru);
- }
interval_tree_remove(root, &vrange->interval_node);
kfree(vrange);
}
@@ -309,11 +296,6 @@ long volatile_range_add(struct volatile_fs_head *head,
new->purged = purged;
interval_tree_add(root, &new->interval_node);
- /* Only add unpurged ranges to LRU */
- if (!purged) {
- head->unpurged_page_count += end - start;
- list_add_tail(&new->lru, &head->lru_head);
- }
return purged;
}
@@ -400,8 +382,6 @@ long volatile_range_remove(struct volatile_fs_head *head,
new->purged = vrange->purged;
vrange_resize(head, vrange, node->start, start-1);
interval_tree_add(root, &new->interval_node);
- if (!new->purged)
- list_add_tail(&new->lru, &head->lru_head);
break;
}
@@ -414,62 +394,40 @@ out:
return ret;
}
-/**
- * volatile_range_lru_size: Returns the number of unpurged pages on the lru
- * @head: per-fs volatile head
- *
- * Returns the number of unpurged pages on the LRU
- *
- * Must lock the volatile_fs_head before calling!
- *
- */
-s64 volatile_range_lru_size(struct volatile_fs_head *head)
-{
- WARN_ON(!mutex_is_locked(&head->lock));
- return head->unpurged_page_count;
-}
-
-/**
- * volatile_ranges_pluck_lru: Returns mapping and size of lru unpurged range
- * @head: per-fs volatile head
- * @mapping: dbl pointer to mapping who's range is being purged
- * @start: Pointer to starting address of range being purged
- * @end: Pointer to ending address of range being purged
- *
- * Returns the mapping, start and end values of the least recently used
- * range. Marks the range as purged and removes it from the LRU.
- *
- * Must lock the volatile_fs_head before calling!
- *
- * Returns 1 on success if a range was returned
- * Return 0 if no ranges were found.
- */
-s64 volatile_ranges_pluck_lru(struct volatile_fs_head *head,
- struct address_space **mapping,
+int volatile_page_in_range(struct volatile_fs_head *head,
+ struct page *page,
loff_t *start, loff_t *end)
{
- struct volatile_range *range;
+ struct interval_tree_root *root;
+ struct interval_tree_node *node;
+ int ret = 0;
WARN_ON(!mutex_is_locked(&head->lock));
- if (list_empty(&head->lru_head))
+ root = mapping_to_root(page->mapping);
+ if (!root)
return 0;
- range = list_first_entry(&head->lru_head, struct volatile_range, lru);
-
- *start = range->interval_node.start;
- *end = range->interval_node.end;
- *mapping = range->mapping;
+ node = interval_tree_in_interval(root, page->index, page->index);
+ if (node) {
+ struct volatile_range *vrange;
- head->unpurged_page_count -= *end - *start;
- list_del(&range->lru);
- range->purged = 1;
+ vrange = container_of(node, struct volatile_range,
+ interval_node);
- return 1;
+ if (!vrange->purged) {
+ *start = vrange->interval_node.start;
+ *end = vrange->interval_node.end;
+ vrange->purged = 1;
+ ret = 1;
+ }
+ }
+ return ret;
}
+
/*
* Cleans up any volatile ranges.
*/
--
1.7.3.2.146.gca209
* [PATCH 6/6] [RFC][HACK] mm: Change memory management of anonymous pages on swapless systems
2012-06-13 1:10 [PATCH 0/6] [RFC] Fallocate Volatile Ranges v4 John Stultz
` (4 preceding siblings ...)
2012-06-13 1:11 ` [PATCH 5/6] [RFC][HACK] tmpfs: Purge volatile ranges on writepage instead of using shrinker John Stultz
@ 2012-06-13 1:11 ` John Stultz
2012-06-13 4:36 ` Minchan Kim
5 siblings, 1 reply; 9+ messages in thread
From: John Stultz @ 2012-06-13 1:11 UTC (permalink / raw)
To: LKML
Cc: John Stultz, Andrew Morton, Android Kernel Team, Robert Love,
Mel Gorman, Hugh Dickins, Dave Hansen, Rik van Riel,
Dmitry Adamushko, Dave Chinner, Neil Brown, Andrea Righi,
Aneesh Kumar K.V, Taras Glek, Mike Hommey, Jan Kara,
KOSAKI Motohiro
Due to my newbie-ness, the following may not be precise, but
I think it conveys the intent of what I'm trying to do here.
Anonymous memory is tracked on two LRU lists: LRU_INACTIVE_ANON
and LRU_ACTIVE_ANON. This split is useful when we need to free
up pages and are trying to decide what to swap out.
However, on systems that do not have swap, this partition is less
clear. In many cases the code avoids aging active anonymous pages
onto the inactive list. However, in some cases pages do get moved
to the inactive list, but we never call writepage, as there isn't
anything to swap out.
This patch changes some of the active/inactive list management of
anonymous memory when there is no swap. In that case pages are
always added to the active lru. The intent is that since anonymous
pages cannot be swapped out, they all should be active.
The one exception is volatile pages, which can be moved to
the inactive lru by calling deactivate_page().
In addition, I've changed the logic so that we also try to shrink
the inactive anonymous lru and call writepage. This should only
be done if there are volatile pages on the inactive lru.
This allows us to purge volatile pages in writepage when the system
does not have swap.
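Concretely, the placement decision reduces to this (condensed from the
include/linux/swap.h hunk below; __pagevec_lru_add_anon() in mm/swap.c
applies the same rule for pagevecs):
	/* New anonymous pages only start on the inactive list if swap exists */
	static inline void lru_cache_add_anon(struct page *page)
	{
		int lru = LRU_INACTIVE_ANON;
		if (!total_swap_pages)
			lru = LRU_ACTIVE_ANON;
		__lru_cache_add(page, lru);
	}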
CC: Andrew Morton <akpm@linux-foundation.org>
CC: Android Kernel Team <kernel-team@android.com>
CC: Robert Love <rlove@google.com>
CC: Mel Gorman <mel@csn.ul.ie>
CC: Hugh Dickins <hughd@google.com>
CC: Dave Hansen <dave@linux.vnet.ibm.com>
CC: Rik van Riel <riel@redhat.com>
CC: Dmitry Adamushko <dmitry.adamushko@gmail.com>
CC: Dave Chinner <david@fromorbit.com>
CC: Neil Brown <neilb@suse.de>
CC: Andrea Righi <andrea@betterlinux.com>
CC: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
CC: Taras Glek <tgek@mozilla.com>
CC: Mike Hommey <mh@glandium.org>
CC: Jan Kara <jack@suse.cz>
CC: KOSAKI Motohiro <kosaki.motohiro@gmail.com>
Signed-off-by: John Stultz <john.stultz@linaro.org>
---
include/linux/pagevec.h | 5 +----
include/linux/swap.h | 23 +++++++++++++++--------
mm/swap.c | 13 ++++++++++++-
mm/vmscan.c | 9 ---------
4 files changed, 28 insertions(+), 22 deletions(-)
diff --git a/include/linux/pagevec.h b/include/linux/pagevec.h
index 2aa12b8..e1312a5 100644
--- a/include/linux/pagevec.h
+++ b/include/linux/pagevec.h
@@ -22,6 +22,7 @@ struct pagevec {
void __pagevec_release(struct pagevec *pvec);
void __pagevec_lru_add(struct pagevec *pvec, enum lru_list lru);
+void __pagevec_lru_add_anon(struct pagevec *pvec);
unsigned pagevec_lookup(struct pagevec *pvec, struct address_space *mapping,
pgoff_t start, unsigned nr_pages);
unsigned pagevec_lookup_tag(struct pagevec *pvec,
@@ -64,10 +65,6 @@ static inline void pagevec_release(struct pagevec *pvec)
__pagevec_release(pvec);
}
-static inline void __pagevec_lru_add_anon(struct pagevec *pvec)
-{
- __pagevec_lru_add(pvec, LRU_INACTIVE_ANON);
-}
static inline void __pagevec_lru_add_active_anon(struct pagevec *pvec)
{
diff --git a/include/linux/swap.h b/include/linux/swap.h
index c84ec68..639936f 100644
--- a/include/linux/swap.h
+++ b/include/linux/swap.h
@@ -238,14 +238,6 @@ extern void swap_setup(void);
extern void add_page_to_unevictable_list(struct page *page);
-/**
- * lru_cache_add: add a page to the page lists
- * @page: the page to add
- */
-static inline void lru_cache_add_anon(struct page *page)
-{
- __lru_cache_add(page, LRU_INACTIVE_ANON);
-}
static inline void lru_cache_add_file(struct page *page)
{
@@ -474,5 +466,20 @@ mem_cgroup_uncharge_swapcache(struct page *page, swp_entry_t ent)
}
#endif /* CONFIG_SWAP */
+
+/**
+ * lru_cache_add: add a page to the page lists
+ * @page: the page to add
+ */
+static inline void lru_cache_add_anon(struct page *page)
+{
+ int lru = LRU_INACTIVE_ANON;
+ if (!total_swap_pages)
+ lru = LRU_ACTIVE_ANON;
+
+ __lru_cache_add(page, lru);
+}
+
+
#endif /* __KERNEL__*/
#endif /* _LINUX_SWAP_H */
diff --git a/mm/swap.c b/mm/swap.c
index 4e7e2ec..f35df46 100644
--- a/mm/swap.c
+++ b/mm/swap.c
@@ -691,7 +691,7 @@ void lru_add_page_tail(struct page *page, struct page *page_tail,
SetPageLRU(page_tail);
if (page_evictable(page_tail, NULL)) {
- if (PageActive(page)) {
+ if (PageActive(page) || !total_swap_pages) {
SetPageActive(page_tail);
active = 1;
lru = LRU_ACTIVE_ANON;
@@ -755,6 +755,17 @@ void __pagevec_lru_add(struct pagevec *pvec, enum lru_list lru)
}
EXPORT_SYMBOL(__pagevec_lru_add);
+
+void __pagevec_lru_add_anon(struct pagevec *pvec)
+{
+ if (!total_swap_pages)
+ __pagevec_lru_add(pvec, LRU_ACTIVE_ANON);
+ else
+ __pagevec_lru_add(pvec, LRU_INACTIVE_ANON);
+}
+EXPORT_SYMBOL(__pagevec_lru_add_anon);
+
+
/**
* pagevec_lookup - gang pagecache lookup
* @pvec: Where the resulting pages are placed
diff --git a/mm/vmscan.c b/mm/vmscan.c
index eeb3bc9..52d8ad9 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -1597,15 +1597,6 @@ static void get_scan_count(struct lruvec *lruvec, struct scan_control *sc,
if (!global_reclaim(sc))
force_scan = true;
- /* If we have no swap space, do not bother scanning anon pages. */
- if (!sc->may_swap || (nr_swap_pages <= 0)) {
- noswap = 1;
- fraction[0] = 0;
- fraction[1] = 1;
- denominator = 1;
- goto out;
- }
-
anon = get_lru_size(lruvec, LRU_ACTIVE_ANON) +
get_lru_size(lruvec, LRU_INACTIVE_ANON);
file = get_lru_size(lruvec, LRU_ACTIVE_FILE) +
--
1.7.3.2.146.gca209
* Re: [PATCH 6/6] [RFC][HACK] mm: Change memory management of anonymous pages on swapless systems
2012-06-13 1:11 ` [PATCH 6/6] [RFC][HACK] mm: Change memory management of anonymous pages on swapless systems John Stultz
@ 2012-06-13 4:36 ` Minchan Kim
0 siblings, 0 replies; 9+ messages in thread
From: Minchan Kim @ 2012-06-13 4:36 UTC (permalink / raw)
To: John Stultz
Cc: LKML, Andrew Morton, Android Kernel Team, Robert Love, Mel Gorman,
Hugh Dickins, Dave Hansen, Rik van Riel, Dmitry Adamushko,
Dave Chinner, Neil Brown, Andrea Righi, Aneesh Kumar K.V,
Taras Glek, Mike Hommey, Jan Kara, KOSAKI Motohiro
Hi John,
On 06/13/2012 10:11 AM, John Stultz wrote:
> Due to my newbie-ness, the following may not be precise, but
> I think it conveys the intent of what I'm trying to do here.
>
> Anonymous memory is tracked on two LRU lists: LRU_INACTIVE_ANON
> and LRU_ACTIVE_ANON. This split is useful when we need to free
> up pages and are trying to decide what to swap out.
>
> However, on systems that do not have swap, this partition is less
> clear. In many cases the code avoids aging active anonymous pages
> onto the inactive list. However, in some cases pages do get moved
> to the inactive list, but we never call writepage, as there isn't
> anything to swap out.
I confess I haven't looked at your code yet, so I might be wrong.
Reading your comment brings back some old history.
A long time ago I tried to prevent anon aging entirely when there is no swap, but the patch was dropped after Rik's comments.
We should treat swap-off systems and swapless (!CONFIG_SWAP) systems separately.
Of course, on a swapless system your code would work; we might not even need the anon lru lists at all.
But on a swap-off system the user can run swapon at any time, so we need to keep the aging.
That was why Rik didn't like my patch at that time.
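To make the distinction concrete, here is an illustrative sketch (not
code from this series; the two helper names are made up, only
CONFIG_SWAP and total_swap_pages are real):
	#include <linux/swap.h>
	/* Swapless: swap support is compiled out and can never appear. */
	static inline bool swap_compiled_out(void)
	{
	#ifdef CONFIG_SWAP
		return false;
	#else
		return true;
	#endif
	}
	/*
	 * Swap-off: CONFIG_SWAP=y but no swap device is active right now.
	 * The admin may still run swapon later, so anon aging information
	 * can become useful again.
	 */
	static inline bool swap_currently_off(void)
	{
		return total_swap_pages == 0;
	}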
>
> This patch changes some of the active/inactive list management of
> anonymous memory when there is no swap. In that case pages are
> always added to the active lru. The intent is that since anonymous
> pages cannot be swapped out, they all should be active.
>
> The one exception is volatile pages, which can be moved to
> the inactive lru by calling deactivate_page().
>
> In addition, I've changed the logic so that we also try to shrink
> the inactive anonymous lru and call writepage. This should only
> be done if there are volatile pages on the inactive lru.
>
> This allows us to purge volatile pages in writepage when the system
> does not have swap.
>
> CC: Andrew Morton <akpm@linux-foundation.org>
> CC: Android Kernel Team <kernel-team@android.com>
> CC: Robert Love <rlove@google.com>
> CC: Mel Gorman <mel@csn.ul.ie>
> CC: Hugh Dickins <hughd@google.com>
> CC: Dave Hansen <dave@linux.vnet.ibm.com>
> CC: Rik van Riel <riel@redhat.com>
> CC: Dmitry Adamushko <dmitry.adamushko@gmail.com>
> CC: Dave Chinner <david@fromorbit.com>
> CC: Neil Brown <neilb@suse.de>
> CC: Andrea Righi <andrea@betterlinux.com>
> CC: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
> CC: Taras Glek <tgek@mozilla.com>
> CC: Mike Hommey <mh@glandium.org>
> CC: Jan Kara <jack@suse.cz>
> CC: KOSAKI Motohiro <kosaki.motohiro@gmail.com>
> Signed-off-by: John Stultz <john.stultz@linaro.org>
> ---
> include/linux/pagevec.h | 5 +----
> include/linux/swap.h | 23 +++++++++++++++--------
> mm/swap.c | 13 ++++++++++++-
> mm/vmscan.c | 9 ---------
> 4 files changed, 28 insertions(+), 22 deletions(-)
>
> diff --git a/include/linux/pagevec.h b/include/linux/pagevec.h
> index 2aa12b8..e1312a5 100644
> --- a/include/linux/pagevec.h
> +++ b/include/linux/pagevec.h
> @@ -22,6 +22,7 @@ struct pagevec {
>
> void __pagevec_release(struct pagevec *pvec);
> void __pagevec_lru_add(struct pagevec *pvec, enum lru_list lru);
> +void __pagevec_lru_add_anon(struct pagevec *pvec);
> unsigned pagevec_lookup(struct pagevec *pvec, struct address_space *mapping,
> pgoff_t start, unsigned nr_pages);
> unsigned pagevec_lookup_tag(struct pagevec *pvec,
> @@ -64,10 +65,6 @@ static inline void pagevec_release(struct pagevec *pvec)
> __pagevec_release(pvec);
> }
>
> -static inline void __pagevec_lru_add_anon(struct pagevec *pvec)
> -{
> - __pagevec_lru_add(pvec, LRU_INACTIVE_ANON);
> -}
>
> static inline void __pagevec_lru_add_active_anon(struct pagevec *pvec)
> {
> diff --git a/include/linux/swap.h b/include/linux/swap.h
> index c84ec68..639936f 100644
> --- a/include/linux/swap.h
> +++ b/include/linux/swap.h
> @@ -238,14 +238,6 @@ extern void swap_setup(void);
>
> extern void add_page_to_unevictable_list(struct page *page);
>
> -/**
> - * lru_cache_add: add a page to the page lists
> - * @page: the page to add
> - */
> -static inline void lru_cache_add_anon(struct page *page)
> -{
> - __lru_cache_add(page, LRU_INACTIVE_ANON);
> -}
>
> static inline void lru_cache_add_file(struct page *page)
> {
> @@ -474,5 +466,20 @@ mem_cgroup_uncharge_swapcache(struct page *page, swp_entry_t ent)
> }
>
> #endif /* CONFIG_SWAP */
> +
> +/**
> + * lru_cache_add: add a page to the page lists
> + * @page: the page to add
> + */
> +static inline void lru_cache_add_anon(struct page *page)
> +{
> + int lru = LRU_INACTIVE_ANON;
> + if (!total_swap_pages)
> + lru = LRU_ACTIVE_ANON;
> +
> + __lru_cache_add(page, lru);
> +}
> +
> +
> #endif /* __KERNEL__*/
> #endif /* _LINUX_SWAP_H */
> diff --git a/mm/swap.c b/mm/swap.c
> index 4e7e2ec..f35df46 100644
> --- a/mm/swap.c
> +++ b/mm/swap.c
> @@ -691,7 +691,7 @@ void lru_add_page_tail(struct page *page, struct page *page_tail,
> SetPageLRU(page_tail);
>
> if (page_evictable(page_tail, NULL)) {
> - if (PageActive(page)) {
> + if (PageActive(page) || !total_swap_pages) {
> SetPageActive(page_tail);
> active = 1;
> lru = LRU_ACTIVE_ANON;
> @@ -755,6 +755,17 @@ void __pagevec_lru_add(struct pagevec *pvec, enum lru_list lru)
> }
> EXPORT_SYMBOL(__pagevec_lru_add);
>
> +
> +void __pagevec_lru_add_anon(struct pagevec *pvec)
> +{
> + if (!total_swap_pages)
> + __pagevec_lru_add(pvec, LRU_ACTIVE_ANON);
> + else
> + __pagevec_lru_add(pvec, LRU_INACTIVE_ANON);
> +}
> +EXPORT_SYMBOL(__pagevec_lru_add_anon);
> +
> +
> /**
> * pagevec_lookup - gang pagecache lookup
> * @pvec: Where the resulting pages are placed
> diff --git a/mm/vmscan.c b/mm/vmscan.c
> index eeb3bc9..52d8ad9 100644
> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
> @@ -1597,15 +1597,6 @@ static void get_scan_count(struct lruvec *lruvec, struct scan_control *sc,
> if (!global_reclaim(sc))
> force_scan = true;
>
> - /* If we have no swap space, do not bother scanning anon pages. */
> - if (!sc->may_swap || (nr_swap_pages <= 0)) {
> - noswap = 1;
> - fraction[0] = 0;
> - fraction[1] = 1;
> - denominator = 1;
> - goto out;
> - }
> -
> anon = get_lru_size(lruvec, LRU_ACTIVE_ANON) +
> get_lru_size(lruvec, LRU_INACTIVE_ANON);
> file = get_lru_size(lruvec, LRU_ACTIVE_FILE) +
--
Kind regards,
Minchan Kim
* Re: [PATCH 1/6] [RFC] Interval tree implementation
2012-06-13 1:10 ` [PATCH 1/6] [RFC] Interval tree implementation John Stultz
@ 2012-06-22 7:06 ` Michel Lespinasse
0 siblings, 0 replies; 9+ messages in thread
From: Michel Lespinasse @ 2012-06-22 7:06 UTC (permalink / raw)
To: John Stultz
Cc: LKML, Andrew Morton, Android Kernel Team, Robert Love, Mel Gorman,
Hugh Dickins, Dave Hansen, Rik van Riel, Dmitry Adamushko,
Dave Chinner, Neil Brown, Andrea Righi, Aneesh Kumar K.V,
Taras Glek, Mike Hommey, Jan Kara, KOSAKI Motohiro
On Tue, Jun 12, 2012 at 6:10 PM, John Stultz <john.stultz@linaro.org> wrote:
> After Andrew suggested something like his mumbletree idea
> to better store a list of intervals, I worked on a few different
> approaches, and this is what I've finally managed to get working.
>
> The idea of storing intervals in a tree is nice, but has a number
> of complications. When adding an interval, it's possible that a
> large interval will consume and merge a number of smaller intervals.
> When removing an interval, it's possible you may end up splitting an
> existing interval, causing one interval to become two. This makes it
> very difficult to provide generic list_head like behavior, as
> the parent structures would need to be duplicated and removed,
> and that has lots of memory ownership issues.
>
> So, this is a much simplified and more list_head like
> implementation. You can add a node to a tree, or remove a node
> from a tree, but the generic implementation doesn't do the
> merging or splitting for you. But it does provide helpers to
> find overlapping and adjacent intervals.
>
> Andrew also really wanted this interval-tree implementation to be
> reusable so we don't duplicate the file locking logic. I'm not
> totally convinced that the requirements between the volatile
> intervals and file locking are really equivalent, but this reduced
> implementation may make it possible.
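(To make the quoted point concrete: with the helpers visible in the
mm/volatile.c hunks earlier in this thread, merge-on-add ends up being
the caller's job, roughly like the sketch below. my_alloc_node() and
my_free_node() are placeholders for whatever embedding structure the
caller actually uses.)
	static void add_merged(struct interval_tree_root *root,
			       pgoff_t start, pgoff_t end)
	{
		struct interval_tree_node *node, *new = my_alloc_node();
		new->start = start;
		new->end = end;
		/* absorb every existing node overlapping [start, end] */
		while ((node = interval_tree_in_interval(root, start, end))) {
			if (node->start < new->start)
				new->start = node->start;
			if (node->end > new->end)
				new->end = node->end;
			interval_tree_remove(root, node);
			my_free_node(node);
		}
		interval_tree_add(root, new);
	}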
Well, we already have several implementations of interval trees:
* lib/prio_tree.c seems to do what you want. It's often used with
VMAs, but there is also a 'raw' node structure, as used in
mm/kmemleak.c, which supports all the operations you are proposing.
* lib/rbtree.c has some support for the "augmented tree" structure,
with which you can implement interval trees as is done in
arch/x86/mm/pat_rbtree.c
I don't think it makes sense to have two implementations already, and
I am looking into possibly collapsing these into one. Please don't add
a third one :)
Also, the name "interval tree" typically implies a data structure that
is capable of representing overlapping intervals. Since your
implementation does not support that (it requires its callers to make
sure the intervals are non-overlapping), I think calling it "interval
tree" could cause confusion.
--
Michel "Walken" Lespinasse
A program is never fully debugged until the last user dies.