* [PATCH 0/3] Volatile Ranges
@ 2012-04-24 17:49 John Stultz
2012-04-24 17:49 ` [PATCH 1/3] Range tree implementation John Stultz
` (2 more replies)
0 siblings, 3 replies; 29+ messages in thread
From: John Stultz @ 2012-04-24 17:49 UTC (permalink / raw)
To: LKML
Cc: John Stultz, Andrew Morton, Android Kernel Team, Robert Love,
Mel Gorman, Hugh Dickins, Dave Hansen, Rik van Riel,
Dmitry Adamushko, Dave Chinner, Neil Brown, Andrea Righi,
Aneesh Kumar K.V
The last RFC iteration of this patch set didn't get any feedback,
so I wanted to properly submit the volatile range functionality for
inclusion. I suspect this will bring some useful critiques out of
the woodwork, which I'd welcome, but maybe not, and this can simply
be queued. Here's to optimism. :)
In addition to the two patches for volatile ranges, I've also
included an RFC patch converting the ashmem driver to use volatile
ranges.
Thoughts or feedback on it would be appreciated.
thanks
-john
CC: Andrew Morton <akpm@linux-foundation.org>
CC: Android Kernel Team <kernel-team@android.com>
CC: Robert Love <rlove@google.com>
CC: Mel Gorman <mel@csn.ul.ie>
CC: Hugh Dickins <hughd@google.com>
CC: Dave Hansen <dave@linux.vnet.ibm.com>
CC: Rik van Riel <riel@redhat.com>
CC: Dmitry Adamushko <dmitry.adamushko@gmail.com>
CC: Dave Chinner <david@fromorbit.com>
CC: Neil Brown <neilb@suse.de>
CC: Andrea Righi <andrea@betterlinux.com>
CC: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
John Stultz (3):
Range tree implementation
fadvise: Add _VOLATILE,_ISVOLATILE, and _NONVOLATILE flags
[RFC] ashmem: Convert ashmem to use volatile ranges
drivers/staging/android/ashmem.c | 327 +--------------------------
fs/file_table.c | 4 +
include/linux/fadvise.h | 5 +
include/linux/rangetree.h | 56 +++++
include/linux/volatile.h | 12 +
lib/Makefile | 2 +-
lib/rangetree.c | 128 +++++++++++
mm/Makefile | 2 +-
mm/fadvise.c | 16 ++-
mm/volatile.c | 457 ++++++++++++++++++++++++++++++++++++++
10 files changed, 689 insertions(+), 320 deletions(-)
create mode 100644 include/linux/rangetree.h
create mode 100644 include/linux/volatile.h
create mode 100644 lib/rangetree.c
create mode 100644 mm/volatile.c
--
1.7.3.2.146.gca209
^ permalink raw reply [flat|nested] 29+ messages in thread
* [PATCH 1/3] Range tree implementation
2012-04-24 17:49 [PATCH 0/3] Volatile Ranges John Stultz
@ 2012-04-24 17:49 ` John Stultz
2012-04-24 19:14 ` Peter Zijlstra
2012-04-25 12:16 ` Dmitry Adamushko
2012-04-24 17:49 ` [PATCH 2/3] fadvise: Add _VOLATILE,_ISVOLATILE, and _NONVOLATILE flags John Stultz
2012-04-24 17:49 ` [PATCH 3/3] [RFC] ashmem: Convert ashmem to use volatile ranges John Stultz
2 siblings, 2 replies; 29+ messages in thread
From: John Stultz @ 2012-04-24 17:49 UTC (permalink / raw)
To: LKML
Cc: John Stultz, Andrew Morton, Android Kernel Team, Robert Love,
Mel Gorman, Hugh Dickins, Dave Hansen, Rik van Riel,
Dmitry Adamushko, Dave Chinner, Neil Brown, Andrea Righi,
Aneesh Kumar K.V
After Andrew suggested something like his mumbletree idea
to better store a list of ranges, I worked on a few different
approaches, and this is what I've finally managed to get working.
I suspect range-tree isn't a totally accurate name, but I
couldn't quite make out the difference between range trees
and interval trees, so I just picked one to call it. Do
let me know if you have a better name.
The idea of storing ranges in a tree is nice, but has a number
of complications. When adding a range, it's possible that a
large range will consume and merge a number of smaller ranges.
When removing a range, it's possible you may end up splitting an
existing range, causing one range to become two. This makes it
very difficult to provide generic list_head like behavior, as
the parent structures would need to be duplicated and removed,
and that has lots of memory ownership issues.
So, this is a much simplified and more list_head like
implementation. You can add a node to a tree, or remove a node
from a tree, but the generic implementation doesn't do the
merging or splitting for you. But it does provide helpers to
find overlapping and adjacent ranges.
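As a rough illustration (not part of this patch; the embedding
structure and the absorb-overlaps merge policy below are made up for
the example), the intended calling pattern looks something like this:

/* Assumes <linux/rangetree.h>, <linux/slab.h> and <linux/kernel.h>. */
struct my_range {
        struct range_tree_node node;    /* embedded, list_head style */
        /* caller-owned payload goes here */
};

/* my_root is assumed to already be set up with range_tree_init() */
static struct range_tree_root my_root;

static int my_mark_range(u64 start, u64 end)
{
        struct range_tree_node *overlap;
        struct my_range *r = kzalloc(sizeof(*r), GFP_KERNEL);

        if (!r)
                return -ENOMEM;
        range_tree_node_init(&r->node);

        /* this caller's merge policy: absorb any overlapping ranges */
        overlap = range_tree_in_range(&my_root, start, end);
        while (overlap) {
                struct my_range *old;

                old = container_of(overlap, struct my_range, node);
                start = min(start, overlap->start);
                end = max(end, overlap->end);
                overlap = range_tree_next_in_range(overlap, start, end);
                range_tree_remove(&my_root, &old->node);
                kfree(old);
        }

        r->node.start = start;
        r->node.end = end;
        range_tree_add(&my_root, &r->node);
        return 0;
}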
Andrew also really wanted this range-tree implementation to be
reusable so we don't duplicate the file locking logic. I'm not
totally convinced that the requirements between the volatile
ranges and file locking are really equivalent, but this reduced
implementation may make it possible.
Changelog:
v2:
* Reworked code to use an rbtree instead of splaying
v3:
* Added range_tree_next_in_range() to avoid having to start
lookups from the root every time.
* Fixed some comments and return NULL instead of 0, as suggested
by Aneesh Kumar K.V
v6:
* Fixed range_tree_in_range() so that it finds the earliest range,
rather than the first. This allows the next_in_range() function
to properly cover all the ranges in the tree.
* Minor cleanups to simplify some of the functions
CC: Andrew Morton <akpm@linux-foundation.org>
CC: Android Kernel Team <kernel-team@android.com>
CC: Robert Love <rlove@google.com>
CC: Mel Gorman <mel@csn.ul.ie>
CC: Hugh Dickins <hughd@google.com>
CC: Dave Hansen <dave@linux.vnet.ibm.com>
CC: Rik van Riel <riel@redhat.com>
CC: Dmitry Adamushko <dmitry.adamushko@gmail.com>
CC: Dave Chinner <david@fromorbit.com>
CC: Neil Brown <neilb@suse.de>
CC: Andrea Righi <andrea@betterlinux.com>
CC: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Signed-off-by: John Stultz <john.stultz@linaro.org>
---
include/linux/rangetree.h | 56 ++++++++++++++++++++
lib/Makefile | 2 +-
lib/rangetree.c | 128 +++++++++++++++++++++++++++++++++++++++++++++
3 files changed, 185 insertions(+), 1 deletions(-)
create mode 100644 include/linux/rangetree.h
create mode 100644 lib/rangetree.c
diff --git a/include/linux/rangetree.h b/include/linux/rangetree.h
new file mode 100644
index 0000000..c61ce7c
--- /dev/null
+++ b/include/linux/rangetree.h
@@ -0,0 +1,56 @@
+#ifndef _LINUX_RANGETREE_H
+#define _LINUX_RANGETREE_H
+
+#include <linux/types.h>
+#include <linux/rbtree.h>
+
+struct range_tree_node {
+ struct rb_node rb;
+ u64 start;
+ u64 end;
+};
+
+struct range_tree_root {
+ struct rb_root head;
+};
+
+static inline void range_tree_init(struct range_tree_root *root)
+{
+ root->head = RB_ROOT;
+}
+
+static inline void range_tree_node_init(struct range_tree_node *node)
+{
+ rb_init_node(&node->rb);
+ node->start = 0;
+ node->end = 0;
+}
+
+static inline int range_tree_empty(struct range_tree_root *root)
+{
+ return RB_EMPTY_ROOT(&root->head);
+}
+
+static inline
+struct range_tree_node *range_tree_root_node(struct range_tree_root *root)
+{
+ struct range_tree_node *ret;
+ ret = container_of(root->head.rb_node, struct range_tree_node, rb);
+ return ret;
+}
+
+extern struct range_tree_node *range_tree_in_range(struct range_tree_root *root,
+ u64 start, u64 end);
+extern struct range_tree_node *range_tree_in_range_adjacent(
+ struct range_tree_root *root,
+ u64 start, u64 end);
+extern struct range_tree_node *range_tree_next_in_range(
+ struct range_tree_node *node,
+ u64 start, u64 end);
+extern void range_tree_add(struct range_tree_root *root,
+ struct range_tree_node *node);
+extern void range_tree_remove(struct range_tree_root *root,
+ struct range_tree_node *node);
+#endif
+
+
diff --git a/lib/Makefile b/lib/Makefile
index 18515f0..f43ef0d 100644
--- a/lib/Makefile
+++ b/lib/Makefile
@@ -12,7 +12,7 @@ lib-y := ctype.o string.o vsprintf.o cmdline.o \
idr.o int_sqrt.o extable.o prio_tree.o \
sha1.o md5.o irq_regs.o reciprocal_div.o argv_split.o \
proportions.o prio_heap.o ratelimit.o show_mem.o \
- is_single_threaded.o plist.o decompress.o
+ is_single_threaded.o plist.o decompress.o rangetree.o
lib-$(CONFIG_MMU) += ioremap.o
lib-$(CONFIG_SMP) += cpumask.o
diff --git a/lib/rangetree.c b/lib/rangetree.c
new file mode 100644
index 0000000..08185bc
--- /dev/null
+++ b/lib/rangetree.c
@@ -0,0 +1,128 @@
+#include <linux/rangetree.h>
+#include <linux/kernel.h>
+#include <linux/slab.h>
+
+
+
+/**
+ * range_tree_in_range - Returns the first node that overlaps with the
+ * given range
+ * @root: range_tree root
+ * @start: range start
+ * @end: range end
+ *
+ */
+struct range_tree_node *range_tree_in_range(struct range_tree_root *root,
+ u64 start, u64 end)
+{
+ struct rb_node *p = root->head.rb_node;
+ struct range_tree_node *candidate, *match = NULL;
+
+ while (p) {
+ candidate = rb_entry(p, struct range_tree_node, rb);
+ if (end < candidate->start)
+ p = p->rb_left;
+ else if (start > candidate->end)
+ p = p->rb_right;
+ else {
+ /* We found one, but try to find an earlier match */
+ match = candidate;
+ p = p->rb_left;
+ }
+ }
+
+ return match;
+}
+
+
+/**
+ * range_tree_in_range_adjacent - Returns the first node that overlaps or
+ * is adjacent with the given range.
+ * @root: range_tree root
+ * @start: range start
+ * @end: range end
+ *
+ */
+struct range_tree_node *range_tree_in_range_adjacent(
+ struct range_tree_root *root,
+ u64 start, u64 end)
+{
+ struct rb_node *p = root->head.rb_node;
+ struct range_tree_node *candidate;
+
+ while (p) {
+ candidate = rb_entry(p, struct range_tree_node, rb);
+ if (end+1 < candidate->start)
+ p = p->rb_left;
+ else if (start > candidate->end + 1)
+ p = p->rb_right;
+ else
+ return candidate;
+ }
+ return NULL;
+}
+
+struct range_tree_node *range_tree_next_in_range(struct range_tree_node *node,
+ u64 start, u64 end)
+{
+ struct rb_node *next;
+ struct range_tree_node *candidate;
+ if (!node)
+ return NULL;
+ next = rb_next(&node->rb);
+ if (!next)
+ return NULL;
+
+ candidate = container_of(next, struct range_tree_node, rb);
+
+ if ((candidate->start > end) || (candidate->end < start))
+ return NULL;
+
+ return candidate;
+}
+
+/**
+ * range_tree_add - Add a node to a range tree
+ * @root: range tree to be added to
+ * @node: range_tree_node to be added
+ *
+ * Adds a node to the range tree.
+ */
+void range_tree_add(struct range_tree_root *root,
+ struct range_tree_node *node)
+{
+ struct rb_node **p = &root->head.rb_node;
+ struct rb_node *parent = NULL;
+ struct range_tree_node *ptr;
+
+ WARN_ON_ONCE(!RB_EMPTY_NODE(&node->rb));
+
+ while (*p) {
+ parent = *p;
+ ptr = rb_entry(parent, struct range_tree_node, rb);
+ if (node->start < ptr->start)
+ p = &(*p)->rb_left;
+ else
+ p = &(*p)->rb_right;
+ }
+ rb_link_node(&node->rb, parent, p);
+ rb_insert_color(&node->rb, &root->head);
+
+}
+
+
+/**
+ * range_tree_remove: Removes a given node from the tree
+ * @root: root of tree
+ * @node: Node to be removed
+ *
+ * Removes a node from the range tree.
+ */
+void range_tree_remove(struct range_tree_root *root,
+ struct range_tree_node *node)
+{
+ WARN_ON_ONCE(RB_EMPTY_NODE(&node->rb));
+
+ rb_erase(&node->rb, &root->head);
+ RB_CLEAR_NODE(&node->rb);
+}
--
1.7.3.2.146.gca209
^ permalink raw reply related [flat|nested] 29+ messages in thread
* Re: [PATCH 1/3] Range tree implementation
2012-04-24 17:49 ` [PATCH 1/3] Range tree implementation John Stultz
@ 2012-04-24 19:14 ` Peter Zijlstra
2012-04-24 19:25 ` John Stultz
2012-04-25 12:16 ` Dmitry Adamushko
1 sibling, 1 reply; 29+ messages in thread
From: Peter Zijlstra @ 2012-04-24 19:14 UTC (permalink / raw)
To: John Stultz
Cc: LKML, Andrew Morton, Android Kernel Team, Robert Love, Mel Gorman,
Hugh Dickins, Dave Hansen, Rik van Riel, Dmitry Adamushko,
Dave Chinner, Neil Brown, Andrea Righi, Aneesh Kumar K.V
On Tue, 2012-04-24 at 10:49 -0700, John Stultz wrote:
> This makes it
> very difficult to provide generic list_head like behavior, as
> the parent structures would need to be duplicated and removed,
> and that has lots of memory ownership issues.
You can in fact modify the rb-tree to have O(1) iteration by using the
empty leaf pointers to keep pointers to next/prev nodes.
It's a bit of a bother since you'd need to wrap ->rb_left and ->rb_right
in functions.. but now that we have coccinelle that shouldn't actually
be too hard.
^ permalink raw reply [flat|nested] 29+ messages in thread
* Re: [PATCH 1/3] Range tree implementation
2012-04-24 19:14 ` Peter Zijlstra
@ 2012-04-24 19:25 ` John Stultz
2012-04-24 19:33 ` Peter Zijlstra
0 siblings, 1 reply; 29+ messages in thread
From: John Stultz @ 2012-04-24 19:25 UTC (permalink / raw)
To: Peter Zijlstra
Cc: LKML, Andrew Morton, Android Kernel Team, Robert Love, Mel Gorman,
Hugh Dickins, Dave Hansen, Rik van Riel, Dmitry Adamushko,
Dave Chinner, Neil Brown, Andrea Righi, Aneesh Kumar K.V
On 04/24/2012 12:14 PM, Peter Zijlstra wrote:
> On Tue, 2012-04-24 at 10:49 -0700, John Stultz wrote:
>> This makes it
>> very difficult to provide generic list_head like behavior, as
>> the parent structures would need to be duplicated and removed,
>> and that has lots of memory ownership issues.
> You can in fact modify the rb-tree to have O(1) iteration by using the
> empty leaf pointers to keep pointers to next/prev nodes.
>
> It's a bit of a bother since you'd need to wrap ->rb_left and ->rb_right
> in functions.. but now that we have coccinelle that shouldn't actually
> be too hard.
>
Sorry, I'm not sure I'm following you.
My point above was that a generic range-tree implementation that manages
the splitting and coalescing of ranges internally is difficult, due to
memory ownership issues. This makes it hard to have a generic list_head
style structure that you can use in your own structures. Thus in a way
similar to how the rb_tree leaves the insert and search implementation
to the users, there is a range_tree_node structure, and the splitting
and coalescing logic is left to the range-tree user.
Does your suggestion address the ownership issue differently? Or is it
just a general optimization improvement?
thanks
-john
^ permalink raw reply [flat|nested] 29+ messages in thread
* Re: [PATCH 1/3] Range tree implementation
2012-04-24 19:25 ` John Stultz
@ 2012-04-24 19:33 ` Peter Zijlstra
0 siblings, 0 replies; 29+ messages in thread
From: Peter Zijlstra @ 2012-04-24 19:33 UTC (permalink / raw)
To: John Stultz
Cc: LKML, Andrew Morton, Android Kernel Team, Robert Love, Mel Gorman,
Hugh Dickins, Dave Hansen, Rik van Riel, Dmitry Adamushko,
Dave Chinner, Neil Brown, Andrea Righi, Aneesh Kumar K.V
On Tue, 2012-04-24 at 12:25 -0700, John Stultz wrote:
> On 04/24/2012 12:14 PM, Peter Zijlstra wrote:
> > On Tue, 2012-04-24 at 10:49 -0700, John Stultz wrote:
> >> This makes it
> >> very difficult to provide generic list_head like behavior, as
> >> the parent structures would need to be duplicated and removed,
> >> and that has lots of memory ownership issues.
> > You can in fact modify the rb-tree to have O(1) iteration by using the
> > empty leaf pointers to keep pointers to next/prev nodes.
> >
> > It's a bit of a bother since you'd need to wrap ->rb_left and ->rb_right
> > in functions.. but now that we have coccinelle that shouldn't actually
> > be too hard.
> >
> Sorry, I'm not sure I'm following you.
>
> My point above was that a generic range-tree implementation that manages
> the splitting and coalescing of ranges internally is difficult, due to
> memory ownership issues. This makes it hard to have a generic list_head
> style structure that you can use in your own structures. Thus in a way
> similar to how the rb_tree leaves the insert and search implementation
> to the users, there is a range_tree_node structure, and the splitting
> and coalescing logic is left to the range-tree user.
>
> Does your suggestion address the ownership issue differently? Or is it
> just a general optimization improvement?
Oh, I thought you also wanted a list_head to aid in the traversal
required to find adjacent ranges etc..
Brain completely failed to grasp what you were trying to say, sorry for
the noise.
^ permalink raw reply [flat|nested] 29+ messages in thread
* Re: [PATCH 1/3] Range tree implementation
2012-04-24 17:49 ` [PATCH 1/3] Range tree implementation John Stultz
2012-04-24 19:14 ` Peter Zijlstra
@ 2012-04-25 12:16 ` Dmitry Adamushko
2012-04-25 16:19 ` John Stultz
1 sibling, 1 reply; 29+ messages in thread
From: Dmitry Adamushko @ 2012-04-25 12:16 UTC (permalink / raw)
To: John Stultz
Cc: LKML, Andrew Morton, Android Kernel Team, Robert Love, Mel Gorman,
Hugh Dickins, Dave Hansen, Rik van Riel, Dave Chinner, Neil Brown,
Andrea Righi, Aneesh Kumar K.V
Hi John,
range_tree_in_range_adjacent() is not used in your code, and it
doesn't seem to be very useful in the general case. range_tree_in_range()
can do the same thing (and you use it that way in the 2nd patch) and
is more flexible (can be paired with range_tree_next_in_range()). So I
think it can be dropped altogether.
Now, I'm wondering whether it actually makes sense to make a dedicated
interface out of the remaining bits.
Almost everything is common rb_tree-handling code that can be found in
any place where rb-trees are used (hard-coded for flexibility,
performance or whatever other reasons). So my feeling is that it
should not be different here.
-- Dmitry
^ permalink raw reply [flat|nested] 29+ messages in thread
* Re: [PATCH 1/3] Range tree implementation
2012-04-25 12:16 ` Dmitry Adamushko
@ 2012-04-25 16:19 ` John Stultz
2012-04-26 10:00 ` Dmitry Adamushko
0 siblings, 1 reply; 29+ messages in thread
From: John Stultz @ 2012-04-25 16:19 UTC (permalink / raw)
To: Dmitry Adamushko
Cc: LKML, Andrew Morton, Android Kernel Team, Robert Love, Mel Gorman,
Hugh Dickins, Dave Hansen, Rik van Riel, Dave Chinner, Neil Brown,
Andrea Righi, Aneesh Kumar K.V
On 04/25/2012 05:16 AM, Dmitry Adamushko wrote:
> Hi John,
>
> range_tree_in_range_adjacent() is not used in your code, and it
> doesn't seem to be very useful in general case. range_tree_in_range()
> can do the same thing (and you use it that way in the 2nd patch) and
> is more flexible (can be paired with range_tree_next_in_range()). So I
> think it can be dropped altogether.
Agreed. I actually at one point meant to do this and forgot. Thanks for
pointing it out!
> Now, I'm wondering whether it actually makes sense to make a dedicated
> interface out of the remaining bits.
>
> Almost everything is common rb_tree-handling code that can be found in
> any place where rb-trees are used (hard-coded for flexibility,
> performance or whatever other reasons). So my feeling is that it
> should not be different here.
>
Sorry, not sure I quite understand what you're suggesting. Are you
saying it doesn't make sense to have a generic range tree
implementation, since really it's just a small shim over the rbtree
code? So instead range-tree users should just implement them
themselves? Or something else?
thanks
-john
^ permalink raw reply [flat|nested] 29+ messages in thread
* Re: [PATCH 1/3] Range tree implementation
2012-04-25 16:19 ` John Stultz
@ 2012-04-26 10:00 ` Dmitry Adamushko
2012-04-27 19:34 ` John Stultz
0 siblings, 1 reply; 29+ messages in thread
From: Dmitry Adamushko @ 2012-04-26 10:00 UTC (permalink / raw)
To: John Stultz
Cc: LKML, Andrew Morton, Android Kernel Team, Robert Love, Mel Gorman,
Hugh Dickins, Dave Hansen, Rik van Riel, Dave Chinner, Neil Brown,
Andrea Righi, Aneesh Kumar K.V
>> [ ... ]
>>
>> Almost everything is common rb_tree-handling code that can be found in
>> any place where rb-trees are used (hard-coded for flexibility,
>> performance or whatever other reasons). So my feeling is that it
>> should not be different here.
>>
> Sorry, not sure I quite understand what you're suggesting. Are you saying it
> doesn't make sense to have a generic range tree implementation, since really
> it's just a small shim over the rbtree code? So instead range-tree users
> should just implement them themselves?
Exactly. It's not much different from other rbtree
search-insert-delete implementations out there.
What are the generic use cases that we want to support with this interface?
Is the current notion of the 'overlapping range' as taken by
range_tree_in_range() common enough? What if another use-case requires
_only_ the ranges that are strictly inside the [ start, end ] range?
In this case, we might be better off sticking to the same
'implement-your-own-search' approach taken by the generic rbtree
interface.
-- Dmitry
^ permalink raw reply [flat|nested] 29+ messages in thread
* Re: [PATCH 1/3] Range tree implementation
2012-04-26 10:00 ` Dmitry Adamushko
@ 2012-04-27 19:34 ` John Stultz
0 siblings, 0 replies; 29+ messages in thread
From: John Stultz @ 2012-04-27 19:34 UTC (permalink / raw)
To: Dmitry Adamushko
Cc: LKML, Andrew Morton, Android Kernel Team, Robert Love, Mel Gorman,
Hugh Dickins, Dave Hansen, Rik van Riel, Dave Chinner, Neil Brown,
Andrea Righi, Aneesh Kumar K.V
On 04/26/2012 03:00 AM, Dmitry Adamushko wrote:
>>> [ ... ]
>>>
>>> Almost everything is common rb_tree-handling code that can be found in
>>> any place where rb-trees are used (hard-coded for flexibility,
>>> performance or whatever other reasons). So my feeling is that it
>>> should not be different here.
>>>
>> Sorry, not sure I quite understand what you're suggesting. Are you saying it
>> doesn't make sense to have a generic range tree implementation, since really
>> it's just a small shim over the rbtree code? So instead range-tree users
>> should just implement them themselves?
> Exactly. It's not much different from other rbtree
> search-insert-delete implementations out there.
>
> What are the generic use cases that we want to support with this interface?
Well, Andrew really wants to see posix file locking reuse something
like this (which is the whole reason I split it out separately). And
Aneesh has similar range management needs for his hugetlb region tracking.
> Is the current notion of the 'overlapping range' as taken by
> range_tree_in_range() common enough? What if another use-case requires
> _only_ the ranges that are strictly inside the [ start, end ] range?
> In this case, we might be better off sticking to the same
> 'implement-your-own-search' approach taken by the generic rbtree
> interface.
Or we could extend the interface for more specific requests?
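For instance, something like the following (just a sketch, not in this
series; it assumes, as the volatile range code does, that the stored
ranges never overlap one another):

/* Find the leftmost stored range lying wholly inside [start, end] */
static struct range_tree_node *
range_tree_first_contained(struct range_tree_root *root, u64 start, u64 end)
{
        struct range_tree_node *node = range_tree_in_range(root, start, end);

        /* skip ranges that only overlap the edges of [start, end] */
        while (node && (node->start < start || node->end > end))
                node = range_tree_next_in_range(node, start, end);

        return node;
}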
So it's true that the range-tree construct is a pretty thin layer over
the rbtree code, but even so, we've had a few places in the kernel where
folks basically are duplicating the same logic over and over again, so
it might make sense to consolidate the obvious use cases, even just to
avoid some of the simpler bugs that can happen with rbtree logic (for
instance, I sent a simple fix to an rbtree related thinko in Rafael's
recent userspace wakelock api earlier this week).
Another example is the timerqueue structure I added earlier, which again
is a thin shim over the rbtree, but allows a few different users to
share code for quick insert ordered list behavior.
So yeah, even though it's fairly thin, I think it still has value for reuse.
thanks
-john
^ permalink raw reply [flat|nested] 29+ messages in thread
* [PATCH 2/3] fadvise: Add _VOLATILE,_ISVOLATILE, and _NONVOLATILE flags
2012-04-24 17:49 [PATCH 0/3] Volatile Ranges John Stultz
2012-04-24 17:49 ` [PATCH 1/3] Range tree implementation John Stultz
@ 2012-04-24 17:49 ` John Stultz
2012-04-24 19:20 ` Peter Zijlstra
2012-04-27 0:39 ` Dave Chinner
2012-04-24 17:49 ` [PATCH 3/3] [RFC] ashmem: Convert ashmem to use volatile ranges John Stultz
2 siblings, 2 replies; 29+ messages in thread
From: John Stultz @ 2012-04-24 17:49 UTC (permalink / raw)
To: LKML
Cc: John Stultz, Andrew Morton, Android Kernel Team, Robert Love,
Mel Gorman, Hugh Dickins, Dave Hansen, Rik van Riel,
Dmitry Adamushko, Dave Chinner, Neil Brown, Andrea Righi,
Aneesh Kumar K.V
This patch provides new fadvise flags that can be used to mark
file pages as volatile, which will allow them to be discarded if the
kernel wants to reclaim memory.
This is useful for userspace to allocate things like caches, and lets
the kernel destructively (but safely) reclaim them when there's memory
pressure.
It's different from FADV_DONTNEED since the pages are not immediately
discarded; they are only discarded under pressure.
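As a rough sketch of the intended usage from userspace (not part of
this patch: the cache helpers are hypothetical, the flag values are
copied from the new include/linux/fadvise.h since libc headers don't
define them, and the raw x86_64 fadvise64 syscall is used since a libc
posix_fadvise() wrapper wouldn't be expected to pass the positive
"contents were purged" return value through; error handling omitted):

#include <sys/syscall.h>
#include <unistd.h>

#define POSIX_FADV_VOLATILE     8   /* values from include/linux/fadvise.h */
#define POSIX_FADV_NONVOLATILE  9

/* hypothetical application helpers */
extern void regenerate_cache(int fd, long long off, long long len);
extern void use_cache_data(int fd, long long off, long long len);

static long fadvise(int fd, long long off, long long len, int advice)
{
        return syscall(SYS_fadvise64, fd, off, len, advice);
}

static void with_cache(int fd, long long off, long long len)
{
        /* "I need this again": unmark it and see whether it survived */
        long purged = fadvise(fd, off, len, POSIX_FADV_NONVOLATILE);

        if (purged > 0)
                regenerate_cache(fd, off, len); /* contents were purged */

        use_cache_data(fd, off, len);

        /* "Done for now": the kernel may purge this range under pressure */
        fadvise(fd, off, len, POSIX_FADV_VOLATILE);
}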
This is very much influenced by the Android Ashmem interface by
Robert Love so credits to him and the Android developers.
In many cases the code & logic come directly from the ashmem patch.
The intent of this patch is to allow for ashmem-like behavior, but
to embed the idea a little deeper into the VM code, instead of
isolating it in a specific driver.
Also many thanks to Dave Hansen who helped design and develop the
initial version of this patch, and has provided continued review and
mentoring for me in the VM code. And thanks to Dmitry Adamushko for
continued review and bug reporting.
v2:
* After the valid critique that just dropping pages would poke holes
in volatile ranges, and instead we should zap an entire range if we
drop any of it, I changed the code to more closely mimic the ashmem
implementation, which zaps entire ranges via a shrinker using an lru
list that tracks which range has been marked volatile the longest.
v3:
* Reworked to use range tree implementation.
v4:
* Renamed functions to avoid confusion.
* More consistent PAGE_CACHE_SHIFT usage, suggested by Dmitry
Adamushko
* Fixes exit without unlocking issue found by Dmitry Adamushko
* Migrate to rbtree based rangetree implementation
* Simplified locking to use global lock (we were grabbing global
lru lock every time anyway).
* Avoid ENOMEM issues by allocating before we get into complicated
code.
* Add some documentation to the volatile.c file from Neil Brown
v5:
* More fixes suggested by Dmitry Adamushko
* Improve range coalescing so that we don't coalesce neighboring
purged ranges.
* Utilize range_tree_next_in_range to avoid doing every lookup
from the tree's root.
v6:
* Immediately zap range if we coalesce overlapping purged range.
* Use hash table to do mapping->rangetree lookup instead of
bloating the address_space structure
v7:
* Race fix noted by Dmitry
* Clear volatile ranges on fput, so volatile ranges don't persist
if no one has the file open
* Made it tmpfs only, using shmem_truncate_range() instead of
vmtruncate_range(). This avoids the lockdep warnings caused
by calling vmtruncate_range() from the shrinker. Seems to
work ok, but I'd not be surprised if this isn't correct.
Extra eyes would be appreciated here.
CC: Andrew Morton <akpm@linux-foundation.org>
CC: Android Kernel Team <kernel-team@android.com>
CC: Robert Love <rlove@google.com>
CC: Mel Gorman <mel@csn.ul.ie>
CC: Hugh Dickins <hughd@google.com>
CC: Dave Hansen <dave@linux.vnet.ibm.com>
CC: Rik van Riel <riel@redhat.com>
CC: Dmitry Adamushko <dmitry.adamushko@gmail.com>
CC: Dave Chinner <david@fromorbit.com>
CC: Neil Brown <neilb@suse.de>
CC: Andrea Righi <andrea@betterlinux.com>
CC: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Signed-off-by: John Stultz <john.stultz@linaro.org>
---
fs/file_table.c | 4 +
include/linux/fadvise.h | 5 +
include/linux/volatile.h | 12 ++
mm/Makefile | 2 +-
mm/fadvise.c | 16 ++-
mm/volatile.c | 457 ++++++++++++++++++++++++++++++++++++++++++++++
6 files changed, 494 insertions(+), 2 deletions(-)
create mode 100644 include/linux/volatile.h
create mode 100644 mm/volatile.c
diff --git a/fs/file_table.c b/fs/file_table.c
index 70f2a0f..ada2c88 100644
--- a/fs/file_table.c
+++ b/fs/file_table.c
@@ -24,6 +24,7 @@
#include <linux/percpu_counter.h>
#include <linux/percpu.h>
#include <linux/ima.h>
+#include <linux/volatile.h>
#include <linux/atomic.h>
@@ -238,6 +239,9 @@ static void __fput(struct file *file)
eventpoll_release(file);
locks_remove_flock(file);
+ /* Volatile ranges should not persist after all fds are closed */
+ mapping_clear_volatile_ranges(&inode->i_data);
+
if (unlikely(file->f_flags & FASYNC)) {
if (file->f_op && file->f_op->fasync)
file->f_op->fasync(-1, file, 0);
diff --git a/include/linux/fadvise.h b/include/linux/fadvise.h
index e8e7471..443951c 100644
--- a/include/linux/fadvise.h
+++ b/include/linux/fadvise.h
@@ -18,4 +18,9 @@
#define POSIX_FADV_NOREUSE 5 /* Data will be accessed once. */
#endif
+#define POSIX_FADV_VOLATILE 8 /* _can_ toss, but don't toss now */
+#define POSIX_FADV_NONVOLATILE 9 /* Remove VOLATILE flag */
+
+
+
#endif /* FADVISE_H_INCLUDED */
diff --git a/include/linux/volatile.h b/include/linux/volatile.h
new file mode 100644
index 0000000..85a9249
--- /dev/null
+++ b/include/linux/volatile.h
@@ -0,0 +1,12 @@
+#ifndef _LINUX_VOLATILE_H
+#define _LINUX_VOLATILE_H
+
+#include <linux/fs.h>
+
+extern long mapping_range_volatile(struct address_space *mapping,
+ pgoff_t start_index, pgoff_t end_index);
+extern long mapping_range_nonvolatile(struct address_space *mapping,
+ pgoff_t start_index, pgoff_t end_index);
+extern void mapping_clear_volatile_ranges(struct address_space *mapping);
+
+#endif /* _LINUX_VOLATILE_H */
diff --git a/mm/Makefile b/mm/Makefile
index 50ec00e..7b6c7a8 100644
--- a/mm/Makefile
+++ b/mm/Makefile
@@ -13,7 +13,7 @@ obj-y := filemap.o mempool.o oom_kill.o fadvise.o \
readahead.o swap.o truncate.o vmscan.o shmem.o \
prio_tree.o util.o mmzone.o vmstat.o backing-dev.o \
page_isolation.o mm_init.o mmu_context.o percpu.o \
- $(mmu-y)
+ volatile.o $(mmu-y)
obj-y += init-mm.o
ifdef CONFIG_NO_BOOTMEM
diff --git a/mm/fadvise.c b/mm/fadvise.c
index 469491e0..3e33845 100644
--- a/mm/fadvise.c
+++ b/mm/fadvise.c
@@ -17,6 +17,7 @@
#include <linux/fadvise.h>
#include <linux/writeback.h>
#include <linux/syscalls.h>
+#include <linux/volatile.h>
#include <asm/unistd.h>
@@ -106,7 +107,7 @@ SYSCALL_DEFINE(fadvise64_64)(int fd, loff_t offset, loff_t len, int advice)
nrpages = end_index - start_index + 1;
if (!nrpages)
nrpages = ~0UL;
-
+
ret = force_page_cache_readahead(mapping, file,
start_index,
nrpages);
@@ -128,6 +129,19 @@ SYSCALL_DEFINE(fadvise64_64)(int fd, loff_t offset, loff_t len, int advice)
invalidate_mapping_pages(mapping, start_index,
end_index);
break;
+ case POSIX_FADV_VOLATILE:
+ /* First and last PARTIAL page! */
+ start_index = offset >> PAGE_CACHE_SHIFT;
+ end_index = endbyte >> PAGE_CACHE_SHIFT;
+ ret = mapping_range_volatile(mapping, start_index, end_index);
+ break;
+ case POSIX_FADV_NONVOLATILE:
+ /* First and last PARTIAL page! */
+ start_index = offset >> PAGE_CACHE_SHIFT;
+ end_index = endbyte >> PAGE_CACHE_SHIFT;
+ ret = mapping_range_nonvolatile(mapping, start_index,
+ end_index);
+ break;
default:
ret = -EINVAL;
}
diff --git a/mm/volatile.c b/mm/volatile.c
new file mode 100644
index 0000000..e94e980
--- /dev/null
+++ b/mm/volatile.c
@@ -0,0 +1,457 @@
+/* mm/volatile.c
+ *
+ * Volatile page range management.
+ * Copyright 2011 Linaro
+ *
+ * Based on mm/ashmem.c
+ * by Robert Love <rlove@google.com>
+ * Copyright (C) 2008 Google, Inc.
+ *
+ *
+ * This software is licensed under the terms of the GNU General Public
+ * License version 2, as published by the Free Software Foundation, and
+ * may be copied, distributed, and modified under those terms.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
+ * GNU General Public License for more details.
+ *
+ *
+ * The goal behind volatile ranges is to allow applications to interact
+ * with the kernel's cache management infrastructure. In particular an
+ * application can say "this memory contains data that might be useful in
+ * the future, but can be reconstructed if necessary, so if the kernel
+ * needs, it can zap and reclaim this memory without having to swap it out."
+ *
+ * The proposed mechanism - at a high level - is for user-space to be able
+ * to say "This memory is volatile" and then later "this memory is no longer
+ * volatile". If the content of the memory is still available the second
+ * request succeeds. If not, the memory is marked non-volatile and an
+ * error is returned to denote that the contents have been lost.
+ *
+ * Credits to Neil Brown for the above description.
+ *
+ */
+
+#include <linux/kernel.h>
+#include <linux/fs.h>
+#include <linux/mm.h>
+#include <linux/slab.h>
+#include <linux/pagemap.h>
+#include <linux/volatile.h>
+#include <linux/rangetree.h>
+#include <linux/hash.h>
+#include <linux/shmem_fs.h>
+
+static DEFINE_MUTEX(volatile_mutex);
+
+struct volatile_range {
+ struct list_head lru;
+ struct range_tree_node range_node;
+ unsigned int purged;
+ struct address_space *mapping;
+};
+
+/* LRU list of volatile page ranges */
+static LIST_HEAD(volatile_lru_list);
+
+/* Count of pages on our LRU list */
+static u64 lru_count;
+
+
+/*
+ * To avoid bloating the address_space structure, we use
+ * a hash structure to map from address_space mappings to
+ * the range_tree root that stores volatile ranges
+ */
+static struct hlist_head *mapping_hash;
+static long mapping_hash_shift = 8;
+
+struct mapping_hash_entry {
+ struct range_tree_root root;
+ struct address_space *mapping;
+ struct hlist_node hnode;
+};
+
+static inline
+struct range_tree_root *mapping_to_root(struct address_space *mapping)
+{
+ struct hlist_node *elem;
+ struct mapping_hash_entry *entry;
+
+ hlist_for_each_entry_rcu(entry, elem,
+ &mapping_hash[hash_ptr(mapping, mapping_hash_shift)],
+ hnode)
+ if (entry->mapping == mapping)
+ return &entry->root;
+
+ return NULL;
+}
+
+static inline
+struct range_tree_root *mapping_allocate_root(struct address_space *mapping)
+{
+ struct mapping_hash_entry *entry;
+ struct range_tree_root *dblchk;
+
+ /* Drop the volatile_mutex to avoid lockdep deadlock warnings */
+ mutex_unlock(&volatile_mutex);
+ entry = kzalloc(sizeof(*entry), GFP_KERNEL);
+ mutex_lock(&volatile_mutex);
+
+ /* Since we dropped the lock, double check that no one has
+ * created the same hash entry.
+ */
+ dblchk = mapping_to_root(mapping);
+ if (dblchk) {
+ kfree(entry);
+ return dblchk;
+ }
+
+ INIT_HLIST_NODE(&entry->hnode);
+ entry->mapping = mapping;
+ range_tree_init(&entry->root);
+
+ hlist_add_head_rcu(&entry->hnode,
+ &mapping_hash[hash_ptr(mapping, mapping_hash_shift)]);
+
+ return &entry->root;
+}
+
+static inline void mapping_free_root(struct range_tree_root *root)
+{
+ struct mapping_hash_entry *entry;
+
+ entry = container_of(root, struct mapping_hash_entry, root);
+
+ hlist_del_rcu(&entry->hnode);
+ kfree(entry);
+}
+
+
+/* Range tree helpers */
+static inline u64 range_size(struct volatile_range *range)
+{
+ return range->range_node.end - range->range_node.start + 1;
+}
+
+static inline void lru_add(struct volatile_range *range)
+{
+ list_add_tail(&range->lru, &volatile_lru_list);
+ lru_count += range_size(range);
+}
+
+static inline void lru_del(struct volatile_range *range)
+{
+ list_del(&range->lru);
+ lru_count -= range_size(range);
+}
+
+#define range_on_lru(range) (!(range)->purged)
+
+
+static inline void volatile_range_resize(struct volatile_range *range,
+ pgoff_t start_index, pgoff_t end_index)
+{
+ size_t pre = range_size(range);
+
+ range->range_node.start = start_index;
+ range->range_node.end = end_index;
+
+ if (range_on_lru(range))
+ lru_count -= pre - range_size(range);
+}
+
+static struct volatile_range *vrange_alloc(void)
+{
+ struct volatile_range *new;
+
+ new = kzalloc(sizeof(struct volatile_range), GFP_KERNEL);
+ if (!new)
+ return NULL;
+ range_tree_node_init(&new->range_node);
+ return new;
+}
+
+static void vrange_del(struct range_tree_root *root,
+ struct volatile_range *vrange)
+{
+ if (range_on_lru(vrange))
+ lru_del(vrange);
+ range_tree_remove(root, &vrange->range_node);
+ kfree(vrange);
+}
+
+
+
+/*
+ * Mark a region as volatile, allowing dirty pages to be purged
+ * under memory pressure
+ */
+long mapping_range_volatile(struct address_space *mapping,
+ pgoff_t start_index, pgoff_t end_index)
+{
+ struct volatile_range *new;
+ struct range_tree_node *node;
+ struct volatile_range *vrange;
+ struct range_tree_root *root;
+ u64 start, end;
+ int purged = 0;
+ start = (u64)start_index;
+ end = (u64)end_index;
+
+ if (strncmp(mapping->host->i_sb->s_type->name, "tmpfs",
+ strlen("tmpfs")))
+ return -EINVAL;
+
+ new = vrange_alloc();
+ if (!new)
+ return -ENOMEM;
+
+ mutex_lock(&volatile_mutex);
+
+
+ root = mapping_to_root(mapping);
+ if (!root)
+ root = mapping_allocate_root(mapping);
+
+ /* Find any existing ranges that overlap */
+ node = range_tree_in_range(root, start, end);
+ while (node) {
+ /* Already entirely marked volatile, so we're done */
+ if (node->start < start && node->end > end) {
+ /* don't need the allocated value */
+ kfree(new);
+ goto out;
+ }
+
+ /* Grab containing volatile range */
+ vrange = container_of(node, struct volatile_range, range_node);
+
+ /* resize range */
+ start = min_t(u64, start, node->start);
+ end = max_t(u64, end, node->end);
+ purged |= vrange->purged;
+
+ node = range_tree_next_in_range(&vrange->range_node,
+ start, end);
+ vrange_del(root, vrange);
+ }
+
+ /* Coalesce left-adjacent ranges */
+ node = range_tree_in_range(root, start-1, start);
+ if (node) {
+ vrange = container_of(node, struct volatile_range, range_node);
+ /* Only coalesce if both are either purged or unpurged */
+ if (vrange->purged == purged) {
+ /* resize range */
+ start = min_t(u64, start, node->start);
+ end = max_t(u64, end, node->end);
+ vrange_del(root, vrange);
+ }
+ }
+
+ /* Coalesce right-adjacent ranges */
+ node = range_tree_in_range(root, end, end+1);
+ if (node) {
+ vrange = container_of(node, struct volatile_range, range_node);
+ /* Only coalesce if both are either purged or unpurged */
+ if (vrange->purged == purged) {
+ /* resize range */
+ start = min_t(u64, start, node->start);
+ end = max_t(u64, end, node->end);
+ vrange_del(root, vrange);
+ }
+ }
+
+ new->mapping = mapping;
+ new->range_node.start = start;
+ new->range_node.end = end;
+ new->purged = purged;
+
+ if (purged) {
+ struct inode *inode;
+ loff_t pstart, pend;
+
+ inode = mapping->host;
+ pstart = start << PAGE_CACHE_SHIFT;
+ pend = ((end + 1) << PAGE_CACHE_SHIFT) - 1;
+ vmtruncate_range(inode, pstart, pend);
+ }
+ range_tree_add(root, &new->range_node);
+ if (range_on_lru(new))
+ lru_add(new);
+
+out:
+ mutex_unlock(&volatile_mutex);
+
+ return 0;
+}
+
+/*
+ * Mark a region as nonvolatile, returns 1 if any pages in the region
+ * were purged.
+ */
+long mapping_range_nonvolatile(struct address_space *mapping,
+ pgoff_t start_index, pgoff_t end_index)
+{
+ struct volatile_range *new;
+ struct range_tree_node *node;
+ struct range_tree_root *root;
+ int ret = 0;
+ u64 start, end;
+ int used_new = 0;
+
+ start = (u64)start_index;
+ end = (u64)end_index;
+
+ if (strncmp(mapping->host->i_sb->s_type->name, "tmpfs",
+ strlen("tmpfs")))
+ return -EINVAL;
+
+ /* create new node */
+ new = vrange_alloc();
+ if (!new)
+ return -ENOMEM;
+
+ mutex_lock(&volatile_mutex);
+ root = mapping_to_root(mapping);
+ if (!root)
+ goto out; /* if no range tree root, there's nothing to unmark */
+
+ node = range_tree_in_range(root, start, end);
+ while (node) {
+ struct volatile_range *vrange;
+ vrange = container_of(node, struct volatile_range, range_node);
+
+ ret |= vrange->purged;
+
+ if (start <= node->start && end >= node->end) {
+ /* delete: volatile range is totally within range */
+ node = range_tree_next_in_range(&vrange->range_node,
+ start, end);
+ vrange_del(root, vrange);
+ } else if (node->start >= start) {
+ /* resize: volatile range right-overlaps range */
+ volatile_range_resize(vrange, end+1, node->end);
+ node = range_tree_next_in_range(&vrange->range_node,
+ start, end);
+
+ } else if (node->end <= end) {
+ /* resize: volatile range left-overlaps range */
+ volatile_range_resize(vrange, node->start, start-1);
+ node = range_tree_next_in_range(&vrange->range_node,
+ start, end);
+ } else {
+ /* split: range is totally within a volatile range */
+ used_new = 1; /* we only do this once */
+ new->mapping = mapping;
+ new->range_node.start = end + 1;
+ new->range_node.end = node->end;
+ new->purged = vrange->purged;
+ range_tree_add(root, &new->range_node);
+ if (range_on_lru(new))
+ lru_add(new);
+ volatile_range_resize(vrange, node->start, start-1);
+
+ break;
+ }
+ }
+
+out:
+ mutex_unlock(&volatile_mutex);
+
+ if (!used_new)
+ kfree(new);
+
+ return ret;
+}
+
+
+/*
+ * Cleans up any volatile ranges.
+ */
+void mapping_clear_volatile_ranges(struct address_space *mapping)
+{
+ struct volatile_range *tozap;
+ struct range_tree_root *root;
+
+ mutex_lock(&volatile_mutex);
+
+ root = mapping_to_root(mapping);
+ if (!root)
+ goto out;
+
+ while (!range_tree_empty(root)) {
+ struct range_tree_node *tmp;
+ tmp = range_tree_root_node(root);
+ tozap = container_of(tmp, struct volatile_range, range_node);
+ vrange_del(root, tozap);
+ }
+ mapping_free_root(root);
+out:
+ mutex_unlock(&volatile_mutex);
+}
+
+/*
+ * Purges volatile ranges when under memory pressure
+ */
+static int volatile_shrink(struct shrinker *ignored, struct shrink_control *sc)
+{
+ struct volatile_range *range, *next;
+ s64 nr_to_scan = sc->nr_to_scan;
+ const gfp_t gfp_mask = sc->gfp_mask;
+
+ if (nr_to_scan && !(gfp_mask & __GFP_FS))
+ return -1;
+ if (!nr_to_scan)
+ return lru_count;
+
+ mutex_lock(&volatile_mutex);
+ list_for_each_entry_safe(range, next, &volatile_lru_list, lru) {
+ struct inode *inode;
+ loff_t start, end;
+
+ inode = range->mapping->host;
+
+ start = range->range_node.start << PAGE_CACHE_SHIFT;
+ end = ((range->range_node.end + 1) << PAGE_CACHE_SHIFT) - 1;
+
+ shmem_truncate_range(inode, start, end);
+
+ lru_del(range);
+ range->purged = 1;
+ nr_to_scan -= range_size(range);
+
+ if (nr_to_scan <= 0)
+ break;
+ }
+ mutex_unlock(&volatile_mutex);
+
+ return lru_count;
+}
+
+static struct shrinker volatile_shrinker = {
+ .shrink = volatile_shrink,
+ .seeks = DEFAULT_SEEKS,
+};
+
+static int __init volatile_init(void)
+{
+ int i, size;
+
+ size = 1U << mapping_hash_shift;
+
+ mapping_hash = kzalloc(sizeof(mapping_hash)*size, GFP_KERNEL);
+
+ for (i = 0; i < size; i++)
+ INIT_HLIST_HEAD(&mapping_hash[i]);
+
+ register_shrinker(&volatile_shrinker);
+
+
+ return 0;
+}
+
+arch_initcall(volatile_init);
--
1.7.3.2.146.gca209
^ permalink raw reply related [flat|nested] 29+ messages in thread
* Re: [PATCH 2/3] fadvise: Add _VOLATILE,_ISVOLATILE, and _NONVOLATILE flags
2012-04-24 17:49 ` [PATCH 2/3] fadvise: Add _VOLATILE,_ISVOLATILE, and _NONVOLATILE flags John Stultz
@ 2012-04-24 19:20 ` Peter Zijlstra
2012-04-24 19:50 ` John Stultz
2012-04-27 0:39 ` Dave Chinner
1 sibling, 1 reply; 29+ messages in thread
From: Peter Zijlstra @ 2012-04-24 19:20 UTC (permalink / raw)
To: John Stultz
Cc: LKML, Andrew Morton, Android Kernel Team, Robert Love, Mel Gorman,
Hugh Dickins, Dave Hansen, Rik van Riel, Dmitry Adamushko,
Dave Chinner, Neil Brown, Andrea Righi, Aneesh Kumar K.V
On Tue, 2012-04-24 at 10:49 -0700, John Stultz wrote:
> +/*
> + * Purges volatile ranges when under memory pressure
> + */
> +static int volatile_shrink(struct shrinker *ignored, struct shrink_control *sc)
> +{
Hmm, I would have expected regular page reclaim to know about this
instead of using a shrinker interface. Is this done specifically to
avoid growing small holes in all ranges and instead dump entire ranges
thereby keeping other ranges usable?
^ permalink raw reply [flat|nested] 29+ messages in thread
* Re: [PATCH 2/3] fadvise: Add _VOLATILE,_ISVOLATILE, and _NONVOLATILE flags
2012-04-24 19:20 ` Peter Zijlstra
@ 2012-04-24 19:50 ` John Stultz
0 siblings, 0 replies; 29+ messages in thread
From: John Stultz @ 2012-04-24 19:50 UTC (permalink / raw)
To: Peter Zijlstra
Cc: LKML, Andrew Morton, Android Kernel Team, Robert Love, Mel Gorman,
Hugh Dickins, Dave Hansen, Rik van Riel, Dmitry Adamushko,
Dave Chinner, Neil Brown, Andrea Righi, Aneesh Kumar K.V
On 04/24/2012 12:20 PM, Peter Zijlstra wrote:
> On Tue, 2012-04-24 at 10:49 -0700, John Stultz wrote:
>> +/*
>> + * Purges volatile ranges when under memory pressure
>> + */
>> +static int volatile_shrink(struct shrinker *ignored, struct shrink_control *sc)
>> +{
> Hmm, I would have expected regular page reclaim to know about this
> instead of using a shrinker interface. Is this done specifically to
> avoid growing small holes in all ranges and instead dump entire ranges
> thereby keeping other ranges usable?
>
So yes. We don't want to have single pages purged from the ranges. My
first attempt used the standard writeout method to do the purging.
However, since the ranges in their entirety should be purged at once
(since if any single page is purged, the entire range must be
reconstructed), keeping track of the least-recently-unpinned range is
more useful than the least-recently-used page.
Now, Dave Chinner suggested using tags to mark pages as part of a
volatile range, as well as tracking their purged status, then iterating
over adjacent pages on writeout to also purge the entire range. That
could be done, but it seemed more costly to iterate over each page
updating the tags when marking or unmarking a range as volatile. This
suggestion was also (as I understood it) mainly aimed at the memory
overhead of the early implementation, which added a pointer to each
address_space structure. That issue was resolved by using the hash
table for the address_space -> range_tree mappings.
thanks
-john
^ permalink raw reply [flat|nested] 29+ messages in thread
* Re: [PATCH 2/3] fadvise: Add _VOLATILE,_ISVOLATILE, and _NONVOLATILE flags
2012-04-24 17:49 ` [PATCH 2/3] fadvise: Add _VOLATILE,_ISVOLATILE, and _NONVOLATILE flags John Stultz
2012-04-24 19:20 ` Peter Zijlstra
@ 2012-04-27 0:39 ` Dave Chinner
2012-04-27 15:25 ` Dave Hansen
2012-04-27 19:14 ` John Stultz
1 sibling, 2 replies; 29+ messages in thread
From: Dave Chinner @ 2012-04-27 0:39 UTC (permalink / raw)
To: John Stultz
Cc: LKML, Andrew Morton, Android Kernel Team, Robert Love, Mel Gorman,
Hugh Dickins, Dave Hansen, Rik van Riel, Dmitry Adamushko,
Neil Brown, Andrea Righi, Aneesh Kumar K.V
On Tue, Apr 24, 2012 at 10:49:46AM -0700, John Stultz wrote:
> This patch provides new fadvise flags that can be used to mark
> file pages as volatile, which will allow it to be discarded if the
> kernel wants to reclaim memory.
.....
> @@ -18,4 +18,9 @@
> #define POSIX_FADV_NOREUSE 5 /* Data will be accessed once. */
> #endif
>
> +#define POSIX_FADV_VOLATILE 8 /* _can_ toss, but don't toss now */
> +#define POSIX_FADV_NONVOLATILE 9 /* Remove VOLATILE flag */
These aren't POSIX standards, so I don't think they should have the
POSIX_ prefix. Besides....
....
> @@ -128,6 +129,19 @@ SYSCALL_DEFINE(fadvise64_64)(int fd, loff_t offset, loff_t len, int advice)
> invalidate_mapping_pages(mapping, start_index,
> end_index);
> break;
> + case POSIX_FADV_VOLATILE:
> + /* First and last PARTIAL page! */
> + start_index = offset >> PAGE_CACHE_SHIFT;
> + end_index = endbyte >> PAGE_CACHE_SHIFT;
> + ret = mapping_range_volatile(mapping, start_index, end_index);
> + break;
> + case POSIX_FADV_NONVOLATILE:
> + /* First and last PARTIAL page! */
> + start_index = offset >> PAGE_CACHE_SHIFT;
> + end_index = endbyte >> PAGE_CACHE_SHIFT;
> + ret = mapping_range_nonvolatile(mapping, start_index,
> + end_index);
As it is, I'm still not sold on these being an fadvise() interface
because all it really is is a delayed hole punching interface whose
functionality is currently specific to tmpfs. The behaviour cannot
be implemented sanely by anything else at this point.
> + * The goal behind volatile ranges is to allow applications to interact
> + * with the kernel's cache management infrastructure. In particular an
> + * application can say "this memory contains data that might be useful in
> + * the future, but can be reconstructed if necessary, so if the kernel
> + * needs, it can zap and reclaim this memory without having to swap it out.
This is what I mean - the definition of volatility is specific to a
filesystem implementation - one that doesn't store persistent data.
> + * The proposed mechanism - at a high level - is for user-space to be able
> + * to say "This memory is volatile" and then later "this memory is no longer
> + * volatile". If the content of the memory is still available the second
> + * request succeeds. If not, the memory is marked non-volatile and an
> + * error is returned to denote that the contents have been lost.
For a filesystem, it's not "memory" that is volatile - it is the
*data* that we have to consider that these hints apply to, and that
implies both in memory and on stable storage. Because you are
targeting a filesystem without persistent storage, you are using
"memory" interchangeably with "data". That basically results in an
interface that can only be used by non-persistent filesystems.
However, for managing on-disk caches of fixed sizes, being able to
mark regions as volatile or not is just as helpful to them as it is
to memory based caches on tmpfs....
So why can't you implement this as fallocate() flags, and then make
the tmpfs implementation of those fallocate flags do the right
things? I think fallocate is the right interface, because this is
simply an extension of the existing hole punching implementation.
IOWs, the specification you are describing means that FADV_VOLATILE
could be correctly implemented as an immediate hole punch by every
filesystem that supports hole punching.
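To sketch that (offset/len here standing in for the byte range an
application would otherwise mark volatile), the immediate punch is just
the existing interface:

#define _GNU_SOURCE
#include <fcntl.h>
#include <linux/falloc.h>

/* punch the range out now: later reads return zeros, blocks are freed */
static int punch_range(int fd, off_t offset, off_t len)
{
        return fallocate(fd, FALLOC_FL_PUNCH_HOLE | FALLOC_FL_KEEP_SIZE,
                         offset, len);
}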
This probably won't perform wonderfully, which is where the range
tracking and delayed punching (and the implied memory freeing)
optimisation comes into play. Sure, for tmpfs this can be implemented
as a shrinker, but for real filesystems that have to punch blocks a
shrinker is really the wrong context to be running such
transactions. However, using the fallocate() interface allows each
filesystem to optimise the delayed hole punching as they see best,
something that cannot be done with this fadvise() interface.
It's all great that this can replace a single function in ashmem,
but focussing purely on ashmem misses the point that this
functionality has wider use, and that using a different interface
allows independently tailored and optimised implementations of that
functionality....
Cheers,
Dave.
--
Dave Chinner
david@fromorbit.com
^ permalink raw reply [flat|nested] 29+ messages in thread
* Re: [PATCH 2/3] fadvise: Add _VOLATILE,_ISVOLATILE, and _NONVOLATILE flags
2012-04-27 0:39 ` Dave Chinner
@ 2012-04-27 15:25 ` Dave Hansen
2012-04-28 1:36 ` Dave Chinner
2012-04-27 19:14 ` John Stultz
1 sibling, 1 reply; 29+ messages in thread
From: Dave Hansen @ 2012-04-27 15:25 UTC (permalink / raw)
To: Dave Chinner
Cc: John Stultz, LKML, Andrew Morton, Android Kernel Team,
Robert Love, Mel Gorman, Hugh Dickins, Rik van Riel,
Dmitry Adamushko, Neil Brown, Andrea Righi, Aneesh Kumar K.V
On 04/26/2012 05:39 PM, Dave Chinner wrote:
> As it is, I'm still not sold on these being an fadvise() interface
> because all it really is is a delayed hole punching interface whose
> functionality is currently specific to tmpfs. The behaviour cannot
> be implemented sanely by anything else at this point.
...
> IOWs, the specification you are describing means that FADV_VOLATILE
> could be correctly implemented as an immediate hole punch by every
> filesystem that supports hole punching.
Ahhh, I think I see where you're going with this.
1. Data written to a file somehow (mmap(), write()) and is on disk
2. mmap() the data, and fault it in
3. Do a small write to it
4. set FADV_VOLATILE on it
Now we've got a dirty page which can theoretically be tossed out. But,
we've got old data on the disk and no real way to tell that it was old
if it got faulted in again. It's a much cleaner situation to just drop
that data off the disk (hole punch) than to leave it around. Is that
the concern?
But, we have other APIs that act this way, tossing out dirty data
without reflecting that on-disk (MADV_DONTNEED at least). Is it really
a stretch to define the FADV_VOLATILE to behave the same way? IOW,
Should the behavior _really_ be hole punching? That'll cost us I/O to
throw away data during memory reclaim since we have to go write the
information about the hole. Seems like a much more appropriate thing to
just toss the data out since the app can handle it.
^ permalink raw reply [flat|nested] 29+ messages in thread
* Re: [PATCH 2/3] fadvise: Add _VOLATILE,_ISVOLATILE, and _NONVOLATILE flags
2012-04-27 15:25 ` Dave Hansen
@ 2012-04-28 1:36 ` Dave Chinner
2012-04-30 21:07 ` John Stultz
0 siblings, 1 reply; 29+ messages in thread
From: Dave Chinner @ 2012-04-28 1:36 UTC (permalink / raw)
To: Dave Hansen
Cc: John Stultz, LKML, Andrew Morton, Android Kernel Team,
Robert Love, Mel Gorman, Hugh Dickins, Rik van Riel,
Dmitry Adamushko, Neil Brown, Andrea Righi, Aneesh Kumar K.V
On Fri, Apr 27, 2012 at 08:25:40AM -0700, Dave Hansen wrote:
> On 04/26/2012 05:39 PM, Dave Chinner wrote:
> > As it is, I'm still not sold on these being an fadvise() interface
> > because all it really is is a delayed hole punching interface whose
> > functionality is currently specific to tmpfs. The behaviour cannot
> > be implemented sanely by anything else at this point.
> ...
> > IOWs, the specification you are describing means that FADV_VOLATILE
> > could be correctly implemented as an immediate hole punch by every
> > filesystem that supports hole punching.
>
> Ahhh, I think I see where you're going with this.
>
> 1. Data written to a file somehow (mmap(), write()) and is on disk
> 2. mmap() the data, and fault it in
> 3. Do a small write to it
> 4. set FADV_VOLATILE on it
>
> Now we've got a dirty page which can theoretically be tossed out. But,
> we've got old data on the disk and no real way to tell that it was old
> if it got faulted in again. It's a much cleaner situation to just drop
> that data off the disk (hole punch) than to leave it around. Is that
> the concern?
Right - if you are using a persistent filesystem then just dropping
the pages in memory does not do what the interface expects it to do.
Rather than having a hole in the file, there is stale data. That
sounds like a recipe for security problems to me.
> But, we have other APIs that act this way, tossing out dirty data
> without reflecting that on-disk (MADV_DONTNEED at least).
But that's a mmap() operation, not a file operation. mmap()/
madvise() implies you are working with memory pages, fadvise()
implies you are working with *file data*. The mmap() functionality is
designed really for anonymous memory, to be able to toss it out when
finished with, not for manipulating data files. Yes, it works with
data files, but that leaves unknown stale data behind on disk....
> Is it really
> a stretch to define the FADV_VOLATILE to behave the same way? IOW,
> Should the behavior _really_ be hole punching? That'll cost us I/O to
> throw away data during memory reclaim since we have to go write the
> information about the hole.
Not for tmpfs, which is what this is initially aimed at. And for
filesystems, we already do IO in shrinkers during reclaim, so at a
stretch hole punching in a shrinker is possible, though undesirable
and possible to avoid.
> Seems like a much more appropriate thing to
> just toss the data out since the app can handle it.
On tmpfs, re-reading the data range that was tossed will return
zeros. Easy to detect and handle, and has no security implications.
On a real filesystem, it will return stale data. Not only is that
much harder to detect - it may not even be possible depending on how
the application tracks data in its files. If the filesystem punches
holes, it will return zeros just like tmpfs will.
That's my concern - that persistent filesystems will have different
behaviour to in-memory filesystems. They *must* be consistent in
behaviour w.r.t. to stale data exposure, otherwise we are in a world
of pain when applications start to use this. Quite frankly, I don't
care about performance of VOLATILE ranges, but I care greatly
about ensuring filesystems don't expose stale data to user
applications....
Cheers,
Dave.
--
Dave Chinner
david@fromorbit.com
^ permalink raw reply [flat|nested] 29+ messages in thread
* Re: [PATCH 2/3] fadvise: Add _VOLATILE,_ISVOLATILE, and _NONVOLATILE flags
2012-04-28 1:36 ` Dave Chinner
@ 2012-04-30 21:07 ` John Stultz
2012-05-01 0:08 ` Dave Chinner
0 siblings, 1 reply; 29+ messages in thread
From: John Stultz @ 2012-04-30 21:07 UTC (permalink / raw)
To: Dave Chinner
Cc: Dave Hansen, LKML, Andrew Morton, Android Kernel Team,
Robert Love, Mel Gorman, Hugh Dickins, Rik van Riel,
Dmitry Adamushko, Neil Brown, Andrea Righi, Aneesh Kumar K.V
On 04/27/2012 06:36 PM, Dave Chinner wrote:
> That's my concern - that persistent filesystems will have different
> behaviour to in-memory filesystems. They *must* be consistent in
> behaviour w.r.t. to stale data exposure, otherwise we are in a world
> of pain when applications start to use this. Quite frankly, I don't
> care about performance of VOLATILE ranges, but I care greatly
> about ensuring filesystems don't expose stale data to user
> applications....
>
I think we're in agreement with the rest of this email, but I do want to
stress that the performance of volatile ranges will become more
critical, as in order for folks to effectively use them, they need to
be able to mark and unmark ranges any time they're not using the data.
No application likely wants its data to be purged, but volunteering
data this way lets applications help the kernel cope with low-memory
situations and improves overall system behavior.
So if the overhead is too great for marking and unmarking pages,
applications will be less likely to "help out". :)
thanks
-john
^ permalink raw reply [flat|nested] 29+ messages in thread
* Re: [PATCH 2/3] fadvise: Add _VOLATILE,_ISVOLATILE, and _NONVOLATILE flags
2012-04-30 21:07 ` John Stultz
@ 2012-05-01 0:08 ` Dave Chinner
2012-05-01 0:46 ` John Stultz
0 siblings, 1 reply; 29+ messages in thread
From: Dave Chinner @ 2012-05-01 0:08 UTC (permalink / raw)
To: John Stultz
Cc: Dave Hansen, LKML, Andrew Morton, Android Kernel Team,
Robert Love, Mel Gorman, Hugh Dickins, Rik van Riel,
Dmitry Adamushko, Neil Brown, Andrea Righi, Aneesh Kumar K.V
On Mon, Apr 30, 2012 at 02:07:16PM -0700, John Stultz wrote:
> On 04/27/2012 06:36 PM, Dave Chinner wrote:
> >That's my concern - that persistent filesystems will have different
> >behaviour to in-memory filesystems. They *must* be consistent in
> >behaviour w.r.t. to stale data exposure, otherwise we are in a world
> >of pain when applications start to use this. Quite frankly, I don't
> >care about performance of VOLATILE ranges, but I care greatly
> >about ensuring filesystems don't expose stale data to user
> >applications....
> >
> I think we're in agreement with the rest of this email, but I do
> want to stress that the performance of volatile ranges will become
> more ciritical, as in order for folks to effectively use them, they
> need to be able to mark and unmark ranges any time they're not using
> the data.
Performance is far less important than data security. Make it safe
first, then optimise performance. As it is, the initial target of
tmpfs - by its very nature of returning zeros for regions not
backed by pages - is safe w.r.t. stale data exposure, so it will not
be slowed down by using a fallocate "best effort" hole-punching
interface. The performance of other filesystems is something that
the relevant filesystem developers can worry about....
> So if the overhead is too great for marking and unmarking pages,
> applications will be less likely to "help out". :)
Devil's Advocate: If the benefit of managing caches in such a manner
is this marginal, then why add the complexity to the kernel?
Cheers,
Dave.
--
Dave Chinner
david@fromorbit.com
^ permalink raw reply [flat|nested] 29+ messages in thread
* Re: [PATCH 2/3] fadvise: Add _VOLATILE,_ISVOLATILE, and _NONVOLATILE flags
2012-05-01 0:08 ` Dave Chinner
@ 2012-05-01 0:46 ` John Stultz
2012-05-01 1:28 ` Dave Chinner
0 siblings, 1 reply; 29+ messages in thread
From: John Stultz @ 2012-05-01 0:46 UTC (permalink / raw)
To: Dave Chinner
Cc: Dave Hansen, LKML, Andrew Morton, Android Kernel Team,
Robert Love, Mel Gorman, Hugh Dickins, Rik van Riel,
Dmitry Adamushko, Neil Brown, Andrea Righi, Aneesh Kumar K.V
On 04/30/2012 05:08 PM, Dave Chinner wrote:
> On Mon, Apr 30, 2012 at 02:07:16PM -0700, John Stultz wrote:
>> On 04/27/2012 06:36 PM, Dave Chinner wrote:
>>> That's my concern - that persistent filesystems will have different
>>> behaviour to in-memory filesystems. They *must* be consistent in
>>> behaviour w.r.t. to stale data exposure, otherwise we are in a world
>>> of pain when applications start to use this. Quite frankly, I don't
>>> care about performance of VOLATILE ranges, but I care greatly
>>> about ensuring filesystems don't expose stale data to user
>>> applications....
>>>
>> I think we're in agreement with the rest of this email, but I do
>> want to stress that the performance of volatile ranges will become
>> more ciritical, as in order for folks to effectively use them, they
>> need to be able to mark and unmark ranges any time they're not using
>> the data.
> Performance is far less important than data security. Make it safe
> first, then optimise performance. As it is, the initial target of
> tmpfs - by it's very nature of returning zeros for regions not
> backed by pages - is safe w.r.t. stale data exposure, so it will not
> be slowed down by using an fallocate "best effort" hole-punching
> interface. The performance of other filesystems is something that
> the relevant filesystem developers can worry about....
Again, I think we're quite in agreement about the issue of stale data.
I just want to make sure you understand that the marking and unmarking
paths will need to be fast if they are to attract users.
>> So if the overhead is too great for marking and unmarking pages,
>> applications will be less likely to "help out". :)
> Devil's Advocate: If the benefit of managing caches in such a manner
> is this marginal, then why add the complexity to the kernel?
>
I'm not saying the benefit is marginal. When we are resource constrained
(no swap) and we need to free memory, having regions pre-marked by
applications is a great benefit, as we can immediately take those marked
volatile ranges (as opposed to memory notifiers, where we request
applications to free memory themselves). Being able to free chunks of
application memory, rather than killing the application, provides a
better experience/overall system performance. However, if applications
feel the marking and unmarking is too costly, they are less likely to
mark the freeable ranges as volatile.
So it's only if no consideration is given to performance that there'd
be no benefit to adding the interface.
thanks
-john
^ permalink raw reply [flat|nested] 29+ messages in thread
* Re: [PATCH 2/3] fadvise: Add _VOLATILE,_ISVOLATILE, and _NONVOLATILE flags
2012-05-01 0:46 ` John Stultz
@ 2012-05-01 1:28 ` Dave Chinner
0 siblings, 0 replies; 29+ messages in thread
From: Dave Chinner @ 2012-05-01 1:28 UTC (permalink / raw)
To: John Stultz
Cc: Dave Hansen, LKML, Andrew Morton, Android Kernel Team,
Robert Love, Mel Gorman, Hugh Dickins, Rik van Riel,
Dmitry Adamushko, Neil Brown, Andrea Righi, Aneesh Kumar K.V
On Mon, Apr 30, 2012 at 05:46:12PM -0700, John Stultz wrote:
> On 04/30/2012 05:08 PM, Dave Chinner wrote:
> >On Mon, Apr 30, 2012 at 02:07:16PM -0700, John Stultz wrote:
> >>On 04/27/2012 06:36 PM, Dave Chinner wrote:
> >>So if the overhead is too great for marking and unmarking pages,
> >>applications will be less likely to "help out". :)
> >Devil's Advocate: If the benefit of managing caches in such a manner
> >is this marginal, then why add the complexity to the kernel?
> >
> I'm not saying the benefit is marginal. When we are resource
> constrained (no swap) and we need to free memory, having regions
> pre-marked by applications is a great benefit, as we can immediately
> take those marked volatile ranges (as opposed to memory notifiers,
> where we request applications to free memory themselves).
This isn't a performance problem - this is an "avoid a hard fail"
problem. OOM tends to be fatal, and when you get to those situations
performance is already compromised. Once again, resiliency in low
memory situations is more important than absolute performance.
> Being
> able to free chunks of application memory, rather then killing the
> application provides a better experience/overall system performance.
Exactly. The choice you are making here is better resiliency at an
edge case vs hard fail. The result is better overall behaviour, but
what you are trying to solve is not a performance problem. Hence
making the argument that performance is critical or no-one will use
it is, IMO, way off the mark. Developers will use it, otherwise it
is their application that will get killed by the OS rather than the
competing app that uses the interface and allows the kernel to
shrink its memory usage.....
> However, if applications feel the marking and unmarking is too
> costly, they are less likely to mark the freeable ranges as
> volatile.
Tracking ranges is not going to be a performance problem. For
real filesystems, it's the hole punching to guarantee data security
that will cause performance issues. In the case you are most
interested in (tmpfs) there is no hole punching overhead when
freeing ranges, so the only overhead is the range tracking. That's
code you are writing, so I don't see how it would end up unfit for
your purpose. :)
> So only if no consideration for performance is given, in that case
> there'd be no benefit to adding the interface.
I did not say that no consideration should be given to performance,
just that data safety comes first. For your specific application on
tmpfs, data safety comes for free due to the nature of the tmpfs
implementation. However, that data safety needs to be explicit in
the API, not a result of an undocumented side effect of a specific
implementation. For real filesystems there will be overhead as a
result of fulfilling the data safety requirement, but solving those
performance issues is the responsibility of the filesystem
developers, not you....
Cheers,
Dave.
--
Dave Chinner
david@fromorbit.com
^ permalink raw reply [flat|nested] 29+ messages in thread
* Re: [PATCH 2/3] fadvise: Add _VOLATILE,_ISVOLATILE, and _NONVOLATILE flags
2012-04-27 0:39 ` Dave Chinner
2012-04-27 15:25 ` Dave Hansen
@ 2012-04-27 19:14 ` John Stultz
2012-04-28 2:04 ` Dave Chinner
1 sibling, 1 reply; 29+ messages in thread
From: John Stultz @ 2012-04-27 19:14 UTC (permalink / raw)
To: Dave Chinner
Cc: LKML, Andrew Morton, Android Kernel Team, Robert Love, Mel Gorman,
Hugh Dickins, Dave Hansen, Rik van Riel, Dmitry Adamushko,
Neil Brown, Andrea Righi, Aneesh Kumar K.V
On 04/26/2012 05:39 PM, Dave Chinner wrote:
> On Tue, Apr 24, 2012 at 10:49:46AM -0700, John Stultz wrote:
>> @@ -128,6 +129,19 @@ SYSCALL_DEFINE(fadvise64_64)(int fd, loff_t offset, loff_t len, int advice)
>> invalidate_mapping_pages(mapping, start_index,
>> end_index);
>> break;
>> + case POSIX_FADV_VOLATILE:
>> + /* First and last PARTIAL page! */
>> + start_index = offset>> PAGE_CACHE_SHIFT;
>> + end_index = endbyte>> PAGE_CACHE_SHIFT;
>> + ret = mapping_range_volatile(mapping, start_index, end_index);
>> + break;
>> + case POSIX_FADV_NONVOLATILE:
>> + /* First and last PARTIAL page! */
>> + start_index = offset>> PAGE_CACHE_SHIFT;
>> + end_index = endbyte>> PAGE_CACHE_SHIFT;
>> + ret = mapping_range_nonvolatile(mapping, start_index,
>> + end_index);
> As it is, I'm still not sold on these being an fadvise() interface
> because all it really is a delayed hole punching interface whose
> functionailty is currently specific to tmpfs. The behaviour cannot
> be implemented sanely by anything else at this point.
Yea. So I spent some time looking at the various hole punching
mechanisms and they aren't altogether consistent across filesystems.
For instance, on some filesystems (ext4 and mostly disk backed fs) you
have to use fallocate(fd, FALLOC_FL_PUNCH_HOLE, ...), while on tmpfs,
it's madvise(..., MADV_REMOVE). So in a way, currently,
POSIX_FADV_VOLATILE is closer to a delayed MADV_REMOVE.
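(For concreteness, the two existing interfaces look roughly like this
from userspace; a purely illustrative sketch, with error handling and
page alignment glossed over:)

#define _GNU_SOURCE
#include <fcntl.h>      /* fallocate(), FALLOC_FL_* */
#include <sys/mman.h>   /* mmap(), madvise(), munmap() */

/* Punch out [off, off + len) with whichever interface the fs supports. */
static int punch_range(int fd, off_t off, size_t len, int fs_is_tmpfs)
{
	void *addr;
	int ret;

	if (!fs_is_tmpfs)
		/* ext4, xfs, ...: fallocate()-based hole punch */
		return fallocate(fd, FALLOC_FL_PUNCH_HOLE | FALLOC_FL_KEEP_SIZE,
				 off, len);

	/* tmpfs: no fallocate() support, so the range has to be mapped
	 * and released with madvise(MADV_REMOVE) instead */
	addr = mmap(NULL, len, PROT_READ | PROT_WRITE, MAP_SHARED, fd, off);
	if (addr == MAP_FAILED)
		return -1;
	ret = madvise(addr, len, MADV_REMOVE);
	munmap(addr, len);
	return ret;
}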
>> + * The goal behind volatile ranges is to allow applications to interact
>> + * with the kernel's cache management infrastructure. In particular an
>> + * application can say "this memory contains data that might be useful in
>> + * the future, but can be reconstructed if necessary, so if the kernel
>> + * needs, it can zap and reclaim this memory without having to swap it out.
> This is what I mean - the definition of volatility is specific to a
> filesystem implementation - one that doesn't store persistent data.
Well, I'd like to think that it could be extended to do delayed hole
punching on disk backed persistent files, but again, currently there's
no unified way to punch holes across the disk and memory backed
filesystems.
If other filesystems implemented vmtruncate_range for hole punching, we
could (modulo the circular mutex lock issue of calling vmtruncate_range
from a shrinker) support this on other filesystems.
Are there inherent reasons why vmtruncate_range isn't implemented (or
can't be sanely implemented) by non-tmpfs filesystems?
>> + * The proposed mechanism - at a high level - is for user-space to be able
>> + * to say "This memory is volatile" and then later "this memory is no longer
>> + * volatile". If the content of the memory is still available the second
>> + * request succeeds. If not, the memory is marked non-volatile and an
>> + * error is returned to denote that the contents have been lost.
> For a filesystem, it's not "memory" that is volatile - it is the
> *data* that we have to consider that these hints apply to, and that
> implies both in memory and on stable storage. because you are
> targetting a filesystem without persisten storage, you are using
> "memory" interchangably with "data". That basically results in an
> interface that can only be used by non-persistent filesystems.
> However, for managing on-disk caches of fixed sizes, being able to
> mark regions as volatile or not is just as helpful to them as it is
> to memory based caches on tmpfs....
>
> So why can't you implement this as fallocate() flags, and then make
> the tmpfs implementation of those fallocate flags do the right
> things? I think fallocate is the right interface, because this is
> simply an extension of the existing hole punching implementation.
> IOWs, the specification you are describing means that FADV_VOLATILE
> could be correctly implemented as an immediate hole punch by every
> filesystem that supports hole punching.
So yea, I'm fine with changing the interface as long as fallocate is where
the consensus is. I'm not sure I fully understand the subtlety of the
interface differences, and it doesn't necessarily seem more intuitive to
me (as it seems more advisory than allocation based). But I can give it a
shot.
Another way we could go is using madvise, somewhat mimicking the
MADV_REMOVE call, which again, is not implemented everywhere.
Although as DaveH said, doing the hole punch on disk is extra overhead.
But I agree it makes more sense from a least-surprise approach (no data
is less surprising than old data after a purge).
As for your immediate hole punch thought, that could work, although
FADV_VOLATILE would be just as correctly implemented by not purging any
data on disk backed files. Either way, a difference might be
slightly confusing for users (since either way changes the global LRU
purge behavior).
> This probably won't perform wonderfully, which is where the range
> tracking and delayed punching (and the implied memory freeing)
> optimiation comes into play. Sure, for tmpfs this can be implemented
> as a shrinker, but for real filesystems that have to punch blocks a
> shrinker is really the wrong context to be running such
> transactions. However, using the fallocate() interface allows each
> filesytsem to optimise the delayed hole punching as they see best,
> something that cannot be done with this fadvise() interface.
So if a shrinker isn't the right context, what would be a good context
for delayed hole punching?
> It's all great that this can replace a single function in ashmem,
> but focussing purely on ashmem misses the point that this
> functionality has wider use, and that using a different interface
> allows independently tailored and optimised implementations of that
> functionality....
Very much agreed, I'd like this to be more generically usable as well.
Thanks again for the helpful feedback! Let me know your thoughts on my
questions above, and I'll start working on seeing what is required to
switch over to fallocate().
thanks
-john
^ permalink raw reply [flat|nested] 29+ messages in thread
* Re: [PATCH 2/3] fadvise: Add _VOLATILE,_ISVOLATILE, and _NONVOLATILE flags
2012-04-27 19:14 ` John Stultz
@ 2012-04-28 2:04 ` Dave Chinner
2012-04-30 19:40 ` John Stultz
0 siblings, 1 reply; 29+ messages in thread
From: Dave Chinner @ 2012-04-28 2:04 UTC (permalink / raw)
To: John Stultz
Cc: LKML, Andrew Morton, Android Kernel Team, Robert Love, Mel Gorman,
Hugh Dickins, Dave Hansen, Rik van Riel, Dmitry Adamushko,
Neil Brown, Andrea Righi, Aneesh Kumar K.V
On Fri, Apr 27, 2012 at 12:14:18PM -0700, John Stultz wrote:
> On 04/26/2012 05:39 PM, Dave Chinner wrote:
> >On Tue, Apr 24, 2012 at 10:49:46AM -0700, John Stultz wrote:
> >>@@ -128,6 +129,19 @@ SYSCALL_DEFINE(fadvise64_64)(int fd, loff_t offset, loff_t len, int advice)
> >> invalidate_mapping_pages(mapping, start_index,
> >> end_index);
> >> break;
> >>+ case POSIX_FADV_VOLATILE:
> >>+ /* First and last PARTIAL page! */
> >>+ start_index = offset>> PAGE_CACHE_SHIFT;
> >>+ end_index = endbyte>> PAGE_CACHE_SHIFT;
> >>+ ret = mapping_range_volatile(mapping, start_index, end_index);
> >>+ break;
> >>+ case POSIX_FADV_NONVOLATILE:
> >>+ /* First and last PARTIAL page! */
> >>+ start_index = offset>> PAGE_CACHE_SHIFT;
> >>+ end_index = endbyte>> PAGE_CACHE_SHIFT;
> >>+ ret = mapping_range_nonvolatile(mapping, start_index,
> >>+ end_index);
> >As it is, I'm still not sold on these being an fadvise() interface
> >because all it really is a delayed hole punching interface whose
> >functionailty is currently specific to tmpfs. The behaviour cannot
> >be implemented sanely by anything else at this point.
> Yea. So I spent some time looking at the various hole punching
> mechanisms and they aren't all together consistent across
> filesystems. For instance, on some filesystems (ext4 and mostly disk
> backed fs) you have to use fallocate(fd,
> |FALLOC_FL_PUNCH_HOLE,...)|, while on tmpfs, its
> madvise(...,MADV_REMOVE). So in a way, currently, the
> FADVISE_VOLATILE is closer to a delayed MADVISE_REMOVE.
The MADVISE_REMOVE functionality for hole punching works *only* for
tmpfs - no other filesystem implements the .truncate_range() method.
In fact, several filesystems *can't* implement .truncate_range()
because there is no callout from the page cache truncation code to
allow filesystems to punch out the underlying blocks. The
vmtruncate() code is deprecated for this reason (and various others
like a lack of error handling), and .truncate_range() is just as
nasty. .truncate_range() needs to die, IMO.
So, rather than building more infrastructure on a nasty, filesystem
specific mmap() hack, implement .fallocate() on tmpfs and use the
same interface that every other filesystem uses for punching holes.
> >>+ * The goal behind volatile ranges is to allow applications to interact
> >>+ * with the kernel's cache management infrastructure. In particular an
> >>+ * application can say "this memory contains data that might be useful in
> >>+ * the future, but can be reconstructed if necessary, so if the kernel
> >>+ * needs, it can zap and reclaim this memory without having to swap it out.
> >This is what I mean - the definition of volatility is specific to a
> >filesystem implementation - one that doesn't store persistent data.
> Well, I'd like to think that it could be extended to do delayed
> hole punching on disk backed persistent files, but again, currently
> there's no unified way to punch holes across the disk and memory
> backed filesystems.
>
> If other filesystems implemented vmtruncate_range for hole punching,
> we could (modulo the circular mutex lock issue of calling
> vmtruncate_range from a shrinker) support this on other filesystems.
See above - vmtruncate() is *deprecated*.
> Are there inherent reasons why vmtruncate_range isn't implemented
> (or can't be sanely implemented) by non-tmpfs filesystems?
See above.
> >>+ * The proposed mechanism - at a high level - is for user-space to be able
> >>+ * to say "This memory is volatile" and then later "this memory is no longer
> >>+ * volatile". If the content of the memory is still available the second
> >>+ * request succeeds. If not, the memory is marked non-volatile and an
> >>+ * error is returned to denote that the contents have been lost.
> >For a filesystem, it's not "memory" that is volatile - it is the
> >*data* that we have to consider that these hints apply to, and that
> >implies both in memory and on stable storage. because you are
> >targetting a filesystem without persisten storage, you are using
> >"memory" interchangably with "data". That basically results in an
> >interface that can only be used by non-persistent filesystems.
> >However, for managing on-disk caches of fixed sizes, being able to
> >mark regions as volatile or not is just as helpful to them as it is
> >to memory based caches on tmpfs....
> >
> >So why can't you implement this as fallocate() flags, and then make
> >the tmpfs implementation of those fallocate flags do the right
> >things? I think fallocate is the right interface, because this is
> >simply an extension of the existing hole punching implementation.
> >IOWs, the specification you are describing means that FADV_VOLATILE
> >could be correctly implemented as an immediate hole punch by every
> >filesystem that supports hole punching.
>
> So yea, I'm fine with changing interface as long as fallocate is
> where the consensus is. I'm not sure I maybe understand the
> subtlety of the interface differences, and it doesn't necessarily
> seem more intuitive to me (as seems more advisory then allocation
> based). But I can give it a shot.
>
> Another way we could go is using madvise, somewhat mimicing the
> MADVISE_REMOVE call, which again, is not implemented everywhere.
MADV_REMOVE is another tmpfs-specific interface, because it falls
down to vmtruncate_range(). fallocate()-based hole punching is the
only way this can be implemented on normal filesystems.
> Although as DaveH said, doing the hole punch on disk is extra
> overhead. But I agree it makes more sense from a least-surprise
> approach (no data is less surprising then old data after a purge).
Exactly my point, though security rather than least-surprise is the angle
I see this from.
> As for your immediate hole punch thought, that could work, although
> FADV_VOLATILE would be just as correctly implemented by not purging
> any of data on disk backed files. Either way, a difference might be
> slightly confusing for users (since either way changes the global
> LRU purge behavior).
Right, it is equally valid to ignore them, which is why an initial
fallocate() based implementation wouldn't need to modify a single
filesystem. i.e. those that don't support fallocate behave as
expected (do nothing, not even purge the page cache), those that
support hole punching behave as expected (zeros for volatile
ranges), and tmpfs behaves as expected (zeros, really fast).
> >This probably won't perform wonderfully, which is where the range
> >tracking and delayed punching (and the implied memory freeing)
> >optimiation comes into play. Sure, for tmpfs this can be implemented
> >as a shrinker, but for real filesystems that have to punch blocks a
> >shrinker is really the wrong context to be running such
> >transactions. However, using the fallocate() interface allows each
> >filesytsem to optimise the delayed hole punching as they see best,
> >something that cannot be done with this fadvise() interface.
>
> So if a shrinker isn't the right context, what would be a good
> context for delayed hole punching?
Like we do in XFS for inode reclaim. We have a background workqueue
that frees aged inodes periodically in the fastest manner possible
(i.e. all async, no blocking on locks, etc), and the shrinker, when
run, kicks that background thread first and then enters into
synchronous reclaim. By the time a single sync reclaim cycle has run
and throttled reclaim sufficiently, the background thread has done a
great deal more work.
A similar mechanism can be used for this functionality within XFS.
Indeed, we could efficiently track which inodes have volatile ranges
on them via a bit in the radix trees that index the inode cache,
just like we do for reclaimable inodes. If we then used a bit in the
page cache radix tree index to indicate volatile pages, we could
then easily find the ranges we need to punch out without requiring
some new tree and more per-inode memory.
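In very rough pseudo-code, that pattern looks something like the
following (sketch only - none of these names are existing XFS code,
it's just the shape of the thing):

/* background worker does the bulk of the punching, fully async */
static void volatile_punch_worker(struct work_struct *work)
{
	punch_volatile_ranges(false);	/* hypothetical: async, non-blocking */
}
static DECLARE_WORK(volatile_punch_work, volatile_punch_worker);

/* the shrinker kicks the worker first, then does one throttled sync pass */
static int volatile_shrink(struct shrinker *s, struct shrink_control *sc)
{
	if (!sc->nr_to_scan)
		return volatile_range_count();	/* hypothetical counter */

	schedule_work(&volatile_punch_work);
	punch_volatile_ranges(true);		/* hypothetical: sync pass */
	return volatile_range_count();
}

static struct shrinker volatile_shrinker = {
	.shrink	= volatile_shrink,
	.seeks	= DEFAULT_SEEKS,
};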
That's a very filesystem specific implementation - it's vastly
different to your tmpfs implementation - but this is exactly what I
mean about using fallocate to allow filesystems to optimise the
implementation in the most suitable manner for them....
Cheers,
Dave.
--
Dave Chinner
david@fromorbit.com
^ permalink raw reply [flat|nested] 29+ messages in thread
* Re: [PATCH 2/3] fadvise: Add _VOLATILE,_ISVOLATILE, and _NONVOLATILE flags
2012-04-28 2:04 ` Dave Chinner
@ 2012-04-30 19:40 ` John Stultz
2012-05-01 0:28 ` Dave Chinner
0 siblings, 1 reply; 29+ messages in thread
From: John Stultz @ 2012-04-30 19:40 UTC (permalink / raw)
To: Dave Chinner
Cc: LKML, Andrew Morton, Android Kernel Team, Robert Love, Mel Gorman,
Hugh Dickins, Dave Hansen, Rik van Riel, Dmitry Adamushko,
Neil Brown, Andrea Righi, Aneesh Kumar K.V
On 04/27/2012 07:04 PM, Dave Chinner wrote:
> On Fri, Apr 27, 2012 at 12:14:18PM -0700, John Stultz wrote:
>> On 04/26/2012 05:39 PM, Dave Chinner wrote:
>>> On Tue, Apr 24, 2012 at 10:49:46AM -0700, John Stultz wrote:
>>>> @@ -128,6 +129,19 @@ SYSCALL_DEFINE(fadvise64_64)(int fd, loff_t offset, loff_t len, int advice)
>>>> invalidate_mapping_pages(mapping, start_index,
>>>> end_index);
>>>> break;
>>>> + case POSIX_FADV_VOLATILE:
>>>> + /* First and last PARTIAL page! */
>>>> + start_index = offset>> PAGE_CACHE_SHIFT;
>>>> + end_index = endbyte>> PAGE_CACHE_SHIFT;
>>>> + ret = mapping_range_volatile(mapping, start_index, end_index);
>>>> + break;
>>>> + case POSIX_FADV_NONVOLATILE:
>>>> + /* First and last PARTIAL page! */
>>>> + start_index = offset>> PAGE_CACHE_SHIFT;
>>>> + end_index = endbyte>> PAGE_CACHE_SHIFT;
>>>> + ret = mapping_range_nonvolatile(mapping, start_index,
>>>> + end_index);
>>> As it is, I'm still not sold on these being an fadvise() interface
>>> because all it really is a delayed hole punching interface whose
>>> functionailty is currently specific to tmpfs. The behaviour cannot
>>> be implemented sanely by anything else at this point.
>> Yea. So I spent some time looking at the various hole punching
>> mechanisms and they aren't all together consistent across
>> filesystems. For instance, on some filesystems (ext4 and mostly disk
>> backed fs) you have to use fallocate(fd,
>> |FALLOC_FL_PUNCH_HOLE,...)|, while on tmpfs, its
>> madvise(...,MADV_REMOVE). So in a way, currently, the
>> FADVISE_VOLATILE is closer to a delayed MADVISE_REMOVE.
> The MADVISE_REMOVE functionality for hole punching works *only* for
> tmpfs - no other filesystem implements the .truncate_range() method.
> In fact, several filesystems *can't* implement .truncate_range()
> because there is no callout from the page cache truncation code to
> allow filesystems to punch out the underlying blocks. The
> vmtruncate() code is deprecated for this reason (and various others
> like a lack of error handling), and .truncate_range() is just as
> nasty. .truncate_range() needs to die, IMO.
>
> So, rather than building more infrastructure on a nasty, filesystem
> specific mmap() hack, implement .fallocate() on tmpfs and use the
> same interface that every other filesystem uses for punching holes.
Ah. Ok. I wasn't aware that vmtruncate was deprecated. Thanks for
cluing me in here!
>>> This probably won't perform wonderfully, which is where the range
>>> tracking and delayed punching (and the implied memory freeing)
>>> optimiation comes into play. Sure, for tmpfs this can be implemented
>>> as a shrinker, but for real filesystems that have to punch blocks a
>>> shrinker is really the wrong context to be running such
>>> transactions. However, using the fallocate() interface allows each
>>> filesytsem to optimise the delayed hole punching as they see best,
>>> something that cannot be done with this fadvise() interface.
>> So if a shrinker isn't the right context, what would be a good
>> context for delayed hole punching?
> Like we in XFs for inode reclaim. We have a background workqueue
> that frees aged inodes periodically in the fastest manner possible
> (i.e. all async, no blocking on locks, etc), and the shrinker, when
> run kicks that background thread first, and then enters into
> synchronous reclaim. By the time a single sync reclaim cycle is run
> and throttled reclaim sufficiently, the background thread has done a
> great deal more work.
>
> A similar mechanism can be used for this functionality within XFS.
> Indeed, we could efficiently track which inodes have volatile ranges
> on them via a bit in the radix trees than index the inode cache,
> just like we do for reclaimable inodes. If we then used a bit in the
> page cache radix tree index to indicate volatile pages, we could
> then easily find the ranges we need to punch out without requiring
> some new tree and more per-inode memory.
>
> That's a very filesystem specific implementation - it's vastly
> different to you tmpfs implementation - but this is exactly what I
> mean about using fallocate to allow filesystems to optimise the
> implementation in the most suitable manner for them....
>
So, just to make sure I'm following you, you're suggesting that there
would be a filesystem specific implementation at the top level.
Something like a mark_volatile(struct inode *, bool, loff_t, loff_t)
inode operation? And the filesystem would then be responsible for
managing the ranges and appropriately purging them?
Thanks again for the feedback, I'll continue looking into this.
thanks
-john
^ permalink raw reply [flat|nested] 29+ messages in thread
* Re: [PATCH 2/3] fadvise: Add _VOLATILE,_ISVOLATILE, and _NONVOLATILE flags
2012-04-30 19:40 ` John Stultz
@ 2012-05-01 0:28 ` Dave Chinner
2012-05-01 1:15 ` John Stultz
0 siblings, 1 reply; 29+ messages in thread
From: Dave Chinner @ 2012-05-01 0:28 UTC (permalink / raw)
To: John Stultz
Cc: LKML, Andrew Morton, Android Kernel Team, Robert Love, Mel Gorman,
Hugh Dickins, Dave Hansen, Rik van Riel, Dmitry Adamushko,
Neil Brown, Andrea Righi, Aneesh Kumar K.V
On Mon, Apr 30, 2012 at 12:40:13PM -0700, John Stultz wrote:
> On 04/27/2012 07:04 PM, Dave Chinner wrote:
> >On Fri, Apr 27, 2012 at 12:14:18PM -0700, John Stultz wrote:
> >>On 04/26/2012 05:39 PM, Dave Chinner wrote:
> >>>This probably won't perform wonderfully, which is where the range
> >>>tracking and delayed punching (and the implied memory freeing)
> >>>optimiation comes into play. Sure, for tmpfs this can be implemented
> >>>as a shrinker, but for real filesystems that have to punch blocks a
> >>>shrinker is really the wrong context to be running such
> >>>transactions. However, using the fallocate() interface allows each
> >>>filesytsem to optimise the delayed hole punching as they see best,
> >>>something that cannot be done with this fadvise() interface.
> >>So if a shrinker isn't the right context, what would be a good
> >>context for delayed hole punching?
> >Like we in XFs for inode reclaim. We have a background workqueue
> >that frees aged inodes periodically in the fastest manner possible
> >(i.e. all async, no blocking on locks, etc), and the shrinker, when
> >run kicks that background thread first, and then enters into
> >synchronous reclaim. By the time a single sync reclaim cycle is run
> >and throttled reclaim sufficiently, the background thread has done a
> >great deal more work.
> >
> >A similar mechanism can be used for this functionality within XFS.
> >Indeed, we could efficiently track which inodes have volatile ranges
> >on them via a bit in the radix trees than index the inode cache,
> >just like we do for reclaimable inodes. If we then used a bit in the
> >page cache radix tree index to indicate volatile pages, we could
> >then easily find the ranges we need to punch out without requiring
> >some new tree and more per-inode memory.
> >
> >That's a very filesystem specific implementation - it's vastly
> >different to you tmpfs implementation - but this is exactly what I
> >mean about using fallocate to allow filesystems to optimise the
> >implementation in the most suitable manner for them....
> >
>
> So, just to make sure I'm folloiwng you, you're suggesting that
> there would be a filesystem specific implementation at the top
> level. Something like a mark_volatile(struct inode *, bool, loff_t,
> loff_t) inode operation? And the filesystem would then be
> responsible for managing the ranges and appropriately purging them?
Not quite. I'm suggesting that you use the .fallocate() file
operation to call into the filesystem specific code, and from there
the filesystem code either calls a generic helper function to mark
ranges as volatile and provides a callback for implementing the
shrinker functionality, or it implements it all itself.
i.e. userspace would do:
err = fallocate(fd, FALLOC_FL_MARK_VOLATILE, off, len);
err = fallocate(fd, FALLOC_FL_CLEAR_VOLATILE, off, len);
and that will get passed to the filesystem implementation of
.fallocate (from do_fallocate()). The filesystem callout for this:
0 btrfs/file.c 1898 .fallocate = btrfs_fallocate,
1 ext4/file.c 247 .fallocate = ext4_fallocate,
2 gfs2/file.c 1015 .fallocate = gfs2_fallocate,
3 gfs2/file.c 1045 .fallocate = gfs2_fallocate,
4 ocfs2/file.c 2727 .fallocate = ocfs2_fallocate,
5 ocfs2/file.c 2774 .fallocate = ocfs2_fallocate,
6 xfs/xfs_file.c 1026 .fallocate = xfs_file_fallocate,
can then call a generic helper like, say:
filemap_mark_volatile_range(inode, off, len);
filemap_clear_volatile_range(inode, off, len);
to be able to use the range tree tracking you have written for this
purpose. The filesystem is also free to track ranges however it
pleases.
The filesystem will need to be able to store a tree/list root for
tracking all its inodes that have volatile ranges, and register a
shrinker to walk that list and do the work necessary when memory
becomes low, but that is simple to do for a basic implementation.
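A minimal tmpfs-side sketch of that callout might look something like
this (FALLOC_FL_MARK_VOLATILE, FALLOC_FL_CLEAR_VOLATILE and the
filemap_*_volatile_range() helpers are the proposal above, not code
that exists today):

static long shmem_fallocate(struct file *file, int mode,
			    loff_t offset, loff_t len)
{
	struct inode *inode = file->f_path.dentry->d_inode;

	/* dispatch the proposed flags to the generic range-tracking helpers */
	if (mode & FALLOC_FL_MARK_VOLATILE)
		return filemap_mark_volatile_range(inode, offset, len);
	if (mode & FALLOC_FL_CLEAR_VOLATILE)
		return filemap_clear_volatile_range(inode, offset, len);

	/* preallocation and hole punching are not handled in this sketch */
	return -EOPNOTSUPP;
}

with .fallocate = shmem_fallocate wired up in shmem_file_operations.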
Cheers,
Dave.
--
Dave Chinner
david@fromorbit.com
^ permalink raw reply [flat|nested] 29+ messages in thread
* Re: [PATCH 2/3] fadvise: Add _VOLATILE,_ISVOLATILE, and _NONVOLATILE flags
2012-05-01 0:28 ` Dave Chinner
@ 2012-05-01 1:15 ` John Stultz
2012-05-01 1:51 ` Dave Chinner
0 siblings, 1 reply; 29+ messages in thread
From: John Stultz @ 2012-05-01 1:15 UTC (permalink / raw)
To: Dave Chinner
Cc: LKML, Andrew Morton, Android Kernel Team, Robert Love, Mel Gorman,
Hugh Dickins, Dave Hansen, Rik van Riel, Dmitry Adamushko,
Neil Brown, Andrea Righi, Aneesh Kumar K.V
On 04/30/2012 05:28 PM, Dave Chinner wrote:
> On Mon, Apr 30, 2012 at 12:40:13PM -0700, John Stultz wrote:
>> On 04/27/2012 07:04 PM, Dave Chinner wrote:
>>> On Fri, Apr 27, 2012 at 12:14:18PM -0700, John Stultz wrote:
>>>> On 04/26/2012 05:39 PM, Dave Chinner wrote:
>>>>> This probably won't perform wonderfully, which is where the range
>>>>> tracking and delayed punching (and the implied memory freeing)
>>>>> optimiation comes into play. Sure, for tmpfs this can be implemented
>>>>> as a shrinker, but for real filesystems that have to punch blocks a
>>>>> shrinker is really the wrong context to be running such
>>>>> transactions. However, using the fallocate() interface allows each
>>>>> filesytsem to optimise the delayed hole punching as they see best,
>>>>> something that cannot be done with this fadvise() interface.
>>>> So if a shrinker isn't the right context, what would be a good
>>>> context for delayed hole punching?
>>> Like we in XFs for inode reclaim. We have a background workqueue
>>> that frees aged inodes periodically in the fastest manner possible
>>> (i.e. all async, no blocking on locks, etc), and the shrinker, when
>>> run kicks that background thread first, and then enters into
>>> synchronous reclaim. By the time a single sync reclaim cycle is run
>>> and throttled reclaim sufficiently, the background thread has done a
>>> great deal more work.
>>>
>>> A similar mechanism can be used for this functionality within XFS.
>>> Indeed, we could efficiently track which inodes have volatile ranges
>>> on them via a bit in the radix trees than index the inode cache,
>>> just like we do for reclaimable inodes. If we then used a bit in the
>>> page cache radix tree index to indicate volatile pages, we could
>>> then easily find the ranges we need to punch out without requiring
>>> some new tree and more per-inode memory.
>>>
>>> That's a very filesystem specific implementation - it's vastly
>>> different to you tmpfs implementation - but this is exactly what I
>>> mean about using fallocate to allow filesystems to optimise the
>>> implementation in the most suitable manner for them....
>>>
>> So, just to make sure I'm folloiwng you, you're suggesting that
>> there would be a filesystem specific implementation at the top
>> level. Something like a mark_volatile(struct inode *, bool, loff_t,
>> loff_t) inode operation? And the filesystem would then be
>> responsible for managing the ranges and appropriately purging them?
> Not quite. I'm suggesting that you use the .fallocate() file
> operation to call into the filesystem specific code, and from there
> the filesystem code either calls a generic helper function to mark
> ranges as volatile and provides a callback for implementing the
> shrinker functionailty, or it implements it all itself.
>
> i.e. userspace would do:
>
> err = fallocate(fd, FALLOC_FL_MARK_VOLATILE, off, len);
> err = fallocate(fd, FALLOC_FL_CLEAR_VOLATILE, off, len);
>
> and that will get passed to the filesystem implementation of
> .fallocate (from do_fallocate()). The filesystem callout for this:
>
> 0 btrfs/file.c 1898 .fallocate = btrfs_fallocate,
> 1 ext4/file.c 247 .fallocate = ext4_fallocate,
> 2 gfs2/file.c 1015 .fallocate = gfs2_fallocate,
> 3 gfs2/file.c 1045 .fallocate = gfs2_fallocate,
> 4 ocfs2/file.c 2727 .fallocate = ocfs2_fallocate,
> 5 ocfs2/file.c 2774 .fallocate = ocfs2_fallocate,
> 6 xfs/xfs_file.c 1026 .fallocate = xfs_file_fallocate,
Ok.
Although noting that tmpfs doesn't implement fallocate, this isn't just
a ruse to get me to implement fallocate for tmpfs, right? ;)
Out of curiosity: While for my uses, a tmpfs exclusive implementation
isn't a problem for me, I *really* appreciate your feedback and help
focusing this into a more generic fs solution. But to get more context
for your insights, I'm curious if you have any use cases in mind
that are directing this work to be more generic?
Again, you've been helpfully skeptical of the work, and also at the same
time pushing it in a certain direction, and I want to make sure I better
understand where you're coming from. You've clarified the file security
concern well enough, but is that all? I just want to make sure that
going the generic fallocate route is really worth it, as opposed to
moving to madvise() and just failing on file-data backed memory.
> can then call a generic helper like, say:
>
> filemap_mark_volatile_range(inode, off, len);
> filemap_clear_volatile_range(inode, off, len);
>
> to be able to use the range tree tracking you have written for this
> purpose. The filesystem is also free to track ranges however it
> pleases.
>
> The filesystem will need to be able to store a tree/list root for
> tracking all it's inodes that have volatile ranges, and register a
> shrinker to walk that list and do the work necessary when memory
> becomes low, but that is simple to do for a basic implementation.
>
Hrmm.. Currently I'm using a per-mapping range-tree along with a global
LRU list that ties all the ranges together.
We could go for a per-filesystem, filesystem-controlled LRU as you
suggest, although that, along with the filesystem-managed range-tree
roots, would make the generic helpers a little complex. But it could
probably be worked out.
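For reference, the two bookkeeping layouts being weighed, in very rough
form (all names here are made up purely for illustration):

/* a volatile range: node in a per-mapping range tree, linked on an LRU */
struct volatile_range {
	struct range_tree_node	node;	/* in the mapping's range tree */
	struct list_head	lru;	/* in a purge LRU */
	unsigned int		purged;
};

/* current approach: one global LRU (and one shrinker) for all mappings */
static LIST_HEAD(volatile_lru_list);

/* alternative: each filesystem keeps its own LRU of volatile ranges and
 * registers its own shrinker to walk it */
struct volatile_fs_head {
	struct list_head	lru;
	struct shrinker		shrinker;
};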
I'll let you know when I have a first pass implementation or if I hit
any walls.
Once again, thanks so much for all the feedback and guidance!
-john
^ permalink raw reply [flat|nested] 29+ messages in thread
* Re: [PATCH 2/3] fadvise: Add _VOLATILE,_ISVOLATILE, and _NONVOLATILE flags
2012-05-01 1:15 ` John Stultz
@ 2012-05-01 1:51 ` Dave Chinner
0 siblings, 0 replies; 29+ messages in thread
From: Dave Chinner @ 2012-05-01 1:51 UTC (permalink / raw)
To: John Stultz
Cc: LKML, Andrew Morton, Android Kernel Team, Robert Love, Mel Gorman,
Hugh Dickins, Dave Hansen, Rik van Riel, Dmitry Adamushko,
Neil Brown, Andrea Righi, Aneesh Kumar K.V
On Mon, Apr 30, 2012 at 06:15:44PM -0700, John Stultz wrote:
> On 04/30/2012 05:28 PM, Dave Chinner wrote:
> >On Mon, Apr 30, 2012 at 12:40:13PM -0700, John Stultz wrote:
> >>On 04/27/2012 07:04 PM, Dave Chinner wrote:
> >>>On Fri, Apr 27, 2012 at 12:14:18PM -0700, John Stultz wrote:
> >>>>On 04/26/2012 05:39 PM, Dave Chinner wrote:
> >>>>>This probably won't perform wonderfully, which is where the range
> >>>>>tracking and delayed punching (and the implied memory freeing)
> >>>>>optimiation comes into play. Sure, for tmpfs this can be implemented
> >>>>>as a shrinker, but for real filesystems that have to punch blocks a
> >>>>>shrinker is really the wrong context to be running such
> >>>>>transactions. However, using the fallocate() interface allows each
> >>>>>filesytsem to optimise the delayed hole punching as they see best,
> >>>>>something that cannot be done with this fadvise() interface.
> >>>>So if a shrinker isn't the right context, what would be a good
> >>>>context for delayed hole punching?
> >>>Like we in XFs for inode reclaim. We have a background workqueue
> >>>that frees aged inodes periodically in the fastest manner possible
> >>>(i.e. all async, no blocking on locks, etc), and the shrinker, when
> >>>run kicks that background thread first, and then enters into
> >>>synchronous reclaim. By the time a single sync reclaim cycle is run
> >>>and throttled reclaim sufficiently, the background thread has done a
> >>>great deal more work.
> >>>
> >>>A similar mechanism can be used for this functionality within XFS.
> >>>Indeed, we could efficiently track which inodes have volatile ranges
> >>>on them via a bit in the radix trees than index the inode cache,
> >>>just like we do for reclaimable inodes. If we then used a bit in the
> >>>page cache radix tree index to indicate volatile pages, we could
> >>>then easily find the ranges we need to punch out without requiring
> >>>some new tree and more per-inode memory.
> >>>
> >>>That's a very filesystem specific implementation - it's vastly
> >>>different to you tmpfs implementation - but this is exactly what I
> >>>mean about using fallocate to allow filesystems to optimise the
> >>>implementation in the most suitable manner for them....
> >>>
> >>So, just to make sure I'm folloiwng you, you're suggesting that
> >>there would be a filesystem specific implementation at the top
> >>level. Something like a mark_volatile(struct inode *, bool, loff_t,
> >>loff_t) inode operation? And the filesystem would then be
> >>responsible for managing the ranges and appropriately purging them?
> >Not quite. I'm suggesting that you use the .fallocate() file
> >operation to call into the filesystem specific code, and from there
> >the filesystem code either calls a generic helper function to mark
> >ranges as volatile and provides a callback for implementing the
> >shrinker functionailty, or it implements it all itself.
> >
> >i.e. userspace would do:
> >
> > err = fallocate(fd, FALLOC_FL_MARK_VOLATILE, off, len);
> > err = fallocate(fd, FALLOC_FL_CLEAR_VOLATILE, off, len);
> >
> >and that will get passed to the filesystem implementation of
> >.fallocate (from do_fallocate()). The filesystem callout for this:
> >
> >0 btrfs/file.c 1898 .fallocate = btrfs_fallocate,
> >1 ext4/file.c 247 .fallocate = ext4_fallocate,
> >2 gfs2/file.c 1015 .fallocate = gfs2_fallocate,
> >3 gfs2/file.c 1045 .fallocate = gfs2_fallocate,
> >4 ocfs2/file.c 2727 .fallocate = ocfs2_fallocate,
> >5 ocfs2/file.c 2774 .fallocate = ocfs2_fallocate,
> >6 xfs/xfs_file.c 1026 .fallocate = xfs_file_fallocate,
> Ok.
>
> Although noting that tmpfs doesn't implement fallocate, this isn't
> just a ruse to get me to implement fallocate for tmpfs, right? ;)
No, I don't care about what tmpfs supports in most cases, and in
reality fallocate has not been implemented because tmpfs is generally
hacked on by MM people, not filesystem people. That's why it has all
the special interfaces that filesystems can't actually use...
> Out of curiosity: While for my uses, a tmpfs exclusive
> implementation isn't a problem for me, I *really* appreciate your
> feedback and help focusing this into a more generic fs solution. But
> to get more context for your insights, I'm curious if you have any
> use cases in your mind that is directing this work to be more
> generic?
Managing large scale disk caches has exactly the same problems of
determining what to evict and/or move to secondary storage when
space is low. Being able to mark ranges of files as "remove this
first" would be very advantageous for pro-active management of ENOSPC
conditions in the cache...
And being able to do space-demand hole-punching for stuff like
managing VM images would be really cool. For example, the loopback
device uses hole punching to implement TRIM commands, so turning
those into VOLATILE ranges for background dispatch will speed those
operations up immensely and avoid silly same block "TRIM - write -
TRIM - write" cyclic hole-punch/allocation in the backing file. KVM
could do the same for implementing TRIM on image based block
devices...
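To sketch the loop device example (not actual loop.c code, just the
shape of it):

	/* today: each discard request is punched out synchronously */
	ret = file->f_op->fallocate(file,
			FALLOC_FL_PUNCH_HOLE | FALLOC_FL_KEEP_SIZE,
			pos, len);

	/* with the proposed flag: mark the range volatile and let the
	 * filesystem punch it out in the background, coalescing the
	 * repeated TRIM/write cycles on the same blocks */
	ret = file->f_op->fallocate(file, FALLOC_FL_MARK_VOLATILE, pos, len);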
There's lots of things that can be done with a generic advisory,
asynchronous hole-punching interface.
> Again, you've been helpfully skeptical of the work, and also at the
> same time pushing it in a certain direction, and I want to make sure
> I better understand where you're coming from. You've clarified the
> file security concern well enough, but is that all? I just want to
> make sure that going the generic fallocate route is really worth it,
> as opposed to moving to madvise() and just failing file-data backed
> memory.
>
> >can then call a generic helper like, say:
> >
> > filemap_mark_volatile_range(inode, off, len);
> > filemap_clear_volatile_range(inode, off, len);
> >
> >to be able to use the range tree tracking you have written for this
> >purpose. The filesystem is also free to track ranges however it
> >pleases.
> >
> >The filesystem will need to be able to store a tree/list root for
> >tracking all it's inodes that have volatile ranges, and register a
> >shrinker to walk that list and do the work necessary when memory
> >becomes low, but that is simple to do for a basic implementation.
> >
> Hrmm.. Currently I'm using a per-mapping range-tree along with a
> global LRU list that ties all the ranges together.
/me smells another candidate for his in-progress generic
LRU-with-integrated-shrinker work.
Cheers,
Dave.
--
Dave Chinner
david@fromorbit.com
^ permalink raw reply [flat|nested] 29+ messages in thread
* [PATCH 3/3] [RFC] ashmem: Convert ashmem to use volatile ranges
2012-04-24 17:49 [PATCH 0/3] Volatile Ranges John Stultz
2012-04-24 17:49 ` [PATCH 1/3] Range tree implementation John Stultz
2012-04-24 17:49 ` [PATCH 2/3] fadvise: Add _VOLATILE,_ISVOLATILE, and _NONVOLATILE flags John Stultz
@ 2012-04-24 17:49 ` John Stultz
2012-04-24 19:21 ` Peter Zijlstra
2 siblings, 1 reply; 29+ messages in thread
From: John Stultz @ 2012-04-24 17:49 UTC (permalink / raw)
To: LKML
Cc: John Stultz, Andrew Morton, Android Kernel Team, Robert Love,
Mel Gorman, Hugh Dickins, Dave Hansen, Rik van Riel,
Dmitry Adamushko, Dave Chinner, Neil Brown, Andrea Righi,
Aneesh Kumar K.V
First pass attempt at getting ashmem to utilize the volatile range
code.
In this implementation GET_PIN_STATUS is unimplemented, due to
the fact that adding an ISVOLATILE check wasn't considered
terribly useful in earlier reviews. It would be trivial to
re-add that functionality, but I wanted to check w/ the
Android developers to see how often GET_PIN_STATUS is actually
used?
Similarly the ashmem PURGE_ALL_CACHES ioctl does not function,
as the volatile range purging is no longer directly under its
control.
CC: Andrew Morton <akpm@linux-foundation.org>
CC: Android Kernel Team <kernel-team@android.com>
CC: Robert Love <rlove@google.com>
CC: Mel Gorman <mel@csn.ul.ie>
CC: Hugh Dickins <hughd@google.com>
CC: Dave Hansen <dave@linux.vnet.ibm.com>
CC: Rik van Riel <riel@redhat.com>
CC: Dmitry Adamushko <dmitry.adamushko@gmail.com>
CC: Dave Chinner <david@fromorbit.com>
CC: Neil Brown <neilb@suse.de>
CC: Andrea Righi <andrea@betterlinux.com>
CC: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Signed-off-by: John Stultz <john.stultz@linaro.org>
---
drivers/staging/android/ashmem.c | 327 ++------------------------------------
1 files changed, 10 insertions(+), 317 deletions(-)
diff --git a/drivers/staging/android/ashmem.c b/drivers/staging/android/ashmem.c
index 9f1f27e..4b3c9e8 100644
--- a/drivers/staging/android/ashmem.c
+++ b/drivers/staging/android/ashmem.c
@@ -28,6 +28,7 @@
#include <linux/bitops.h>
#include <linux/mutex.h>
#include <linux/shmem_fs.h>
+#include <linux/volatile.h>
#include "ashmem.h"
#define ASHMEM_NAME_PREFIX "dev/ashmem/"
@@ -49,26 +50,6 @@ struct ashmem_area {
};
/*
- * ashmem_range - represents an interval of unpinned (evictable) pages
- * Lifecycle: From unpin to pin
- * Locking: Protected by `ashmem_mutex'
- */
-struct ashmem_range {
- struct list_head lru; /* entry in LRU list */
- struct list_head unpinned; /* entry in its area's unpinned list */
- struct ashmem_area *asma; /* associated area */
- size_t pgstart; /* starting page, inclusive */
- size_t pgend; /* ending page, inclusive */
- unsigned int purged; /* ASHMEM_NOT or ASHMEM_WAS_PURGED */
-};
-
-/* LRU list of unpinned pages, protected by ashmem_mutex */
-static LIST_HEAD(ashmem_lru_list);
-
-/* Count of pages on our LRU list, protected by ashmem_mutex */
-static unsigned long lru_count;
-
-/*
* ashmem_mutex - protects the list of and each individual ashmem_area
*
* Lock Ordering: ashmex_mutex -> i_mutex -> i_alloc_sem
@@ -76,102 +57,9 @@ static unsigned long lru_count;
static DEFINE_MUTEX(ashmem_mutex);
static struct kmem_cache *ashmem_area_cachep __read_mostly;
-static struct kmem_cache *ashmem_range_cachep __read_mostly;
-
-#define range_size(range) \
- ((range)->pgend - (range)->pgstart + 1)
-
-#define range_on_lru(range) \
- ((range)->purged == ASHMEM_NOT_PURGED)
-
-#define page_range_subsumes_range(range, start, end) \
- (((range)->pgstart >= (start)) && ((range)->pgend <= (end)))
-
-#define page_range_subsumed_by_range(range, start, end) \
- (((range)->pgstart <= (start)) && ((range)->pgend >= (end)))
-
-#define page_in_range(range, page) \
- (((range)->pgstart <= (page)) && ((range)->pgend >= (page)))
-
-#define page_range_in_range(range, start, end) \
- (page_in_range(range, start) || page_in_range(range, end) || \
- page_range_subsumes_range(range, start, end))
-
-#define range_before_page(range, page) \
- ((range)->pgend < (page))
#define PROT_MASK (PROT_EXEC | PROT_READ | PROT_WRITE)
-static inline void lru_add(struct ashmem_range *range)
-{
- list_add_tail(&range->lru, &ashmem_lru_list);
- lru_count += range_size(range);
-}
-
-static inline void lru_del(struct ashmem_range *range)
-{
- list_del(&range->lru);
- lru_count -= range_size(range);
-}
-
-/*
- * range_alloc - allocate and initialize a new ashmem_range structure
- *
- * 'asma' - associated ashmem_area
- * 'prev_range' - the previous ashmem_range in the sorted asma->unpinned list
- * 'purged' - initial purge value (ASMEM_NOT_PURGED or ASHMEM_WAS_PURGED)
- * 'start' - starting page, inclusive
- * 'end' - ending page, inclusive
- *
- * Caller must hold ashmem_mutex.
- */
-static int range_alloc(struct ashmem_area *asma,
- struct ashmem_range *prev_range, unsigned int purged,
- size_t start, size_t end)
-{
- struct ashmem_range *range;
-
- range = kmem_cache_zalloc(ashmem_range_cachep, GFP_KERNEL);
- if (unlikely(!range))
- return -ENOMEM;
-
- range->asma = asma;
- range->pgstart = start;
- range->pgend = end;
- range->purged = purged;
-
- list_add_tail(&range->unpinned, &prev_range->unpinned);
-
- if (range_on_lru(range))
- lru_add(range);
-
- return 0;
-}
-
-static void range_del(struct ashmem_range *range)
-{
- list_del(&range->unpinned);
- if (range_on_lru(range))
- lru_del(range);
- kmem_cache_free(ashmem_range_cachep, range);
-}
-
-/*
- * range_shrink - shrinks a range
- *
- * Caller must hold ashmem_mutex.
- */
-static inline void range_shrink(struct ashmem_range *range,
- size_t start, size_t end)
-{
- size_t pre = range_size(range);
-
- range->pgstart = start;
- range->pgend = end;
-
- if (range_on_lru(range))
- lru_count -= pre - range_size(range);
-}
static int ashmem_open(struct inode *inode, struct file *file)
{
@@ -197,12 +85,6 @@ static int ashmem_open(struct inode *inode, struct file *file)
static int ashmem_release(struct inode *ignored, struct file *file)
{
struct ashmem_area *asma = file->private_data;
- struct ashmem_range *range, *next;
-
- mutex_lock(&ashmem_mutex);
- list_for_each_entry_safe(range, next, &asma->unpinned_list, unpinned)
- range_del(range);
- mutex_unlock(&ashmem_mutex);
if (asma->file)
fput(asma->file);
@@ -336,54 +218,6 @@ out:
return ret;
}
-/*
- * ashmem_shrink - our cache shrinker, called from mm/vmscan.c :: shrink_slab
- *
- * 'nr_to_scan' is the number of objects (pages) to prune, or 0 to query how
- * many objects (pages) we have in total.
- *
- * 'gfp_mask' is the mask of the allocation that got us into this mess.
- *
- * Return value is the number of objects (pages) remaining, or -1 if we cannot
- * proceed without risk of deadlock (due to gfp_mask).
- *
- * We approximate LRU via least-recently-unpinned, jettisoning unpinned partial
- * chunks of ashmem regions LRU-wise one-at-a-time until we hit 'nr_to_scan'
- * pages freed.
- */
-static int ashmem_shrink(struct shrinker *s, struct shrink_control *sc)
-{
- struct ashmem_range *range, *next;
-
- /* We might recurse into filesystem code, so bail out if necessary */
- if (sc->nr_to_scan && !(sc->gfp_mask & __GFP_FS))
- return -1;
- if (!sc->nr_to_scan)
- return lru_count;
-
- mutex_lock(&ashmem_mutex);
- list_for_each_entry_safe(range, next, &ashmem_lru_list, lru) {
- struct inode *inode = range->asma->file->f_dentry->d_inode;
- loff_t start = range->pgstart * PAGE_SIZE;
- loff_t end = (range->pgend + 1) * PAGE_SIZE - 1;
-
- vmtruncate_range(inode, start, end);
- range->purged = ASHMEM_WAS_PURGED;
- lru_del(range);
-
- sc->nr_to_scan -= range_size(range);
- if (sc->nr_to_scan <= 0)
- break;
- }
- mutex_unlock(&ashmem_mutex);
-
- return lru_count;
-}
-
-static struct shrinker ashmem_shrinker = {
- .shrink = ashmem_shrink,
- .seeks = DEFAULT_SEEKS * 4,
-};
static int set_prot_mask(struct ashmem_area *asma, unsigned long prot)
{
@@ -457,131 +291,6 @@ static int get_name(struct ashmem_area *asma, void __user *name)
return ret;
}
-/*
- * ashmem_pin - pin the given ashmem region, returning whether it was
- * previously purged (ASHMEM_WAS_PURGED) or not (ASHMEM_NOT_PURGED).
- *
- * Caller must hold ashmem_mutex.
- */
-static int ashmem_pin(struct ashmem_area *asma, size_t pgstart, size_t pgend)
-{
- struct ashmem_range *range, *next;
- int ret = ASHMEM_NOT_PURGED;
-
- list_for_each_entry_safe(range, next, &asma->unpinned_list, unpinned) {
- /* moved past last applicable page; we can short circuit */
- if (range_before_page(range, pgstart))
- break;
-
- /*
- * The user can ask us to pin pages that span multiple ranges,
- * or to pin pages that aren't even unpinned, so this is messy.
- *
- * Four cases:
- * 1. The requested range subsumes an existing range, so we
- * just remove the entire matching range.
- * 2. The requested range overlaps the start of an existing
- * range, so we just update that range.
- * 3. The requested range overlaps the end of an existing
- * range, so we just update that range.
- * 4. The requested range punches a hole in an existing range,
- * so we have to update one side of the range and then
- * create a new range for the other side.
- */
- if (page_range_in_range(range, pgstart, pgend)) {
- ret |= range->purged;
-
- /* Case #1: Easy. Just nuke the whole thing. */
- if (page_range_subsumes_range(range, pgstart, pgend)) {
- range_del(range);
- continue;
- }
-
- /* Case #2: We overlap from the start, so adjust it */
- if (range->pgstart >= pgstart) {
- range_shrink(range, pgend + 1, range->pgend);
- continue;
- }
-
- /* Case #3: We overlap from the rear, so adjust it */
- if (range->pgend <= pgend) {
- range_shrink(range, range->pgstart, pgstart-1);
- continue;
- }
-
- /*
- * Case #4: We eat a chunk out of the middle. A bit
- * more complicated, we allocate a new range for the
- * second half and adjust the first chunk's endpoint.
- */
- range_alloc(asma, range, range->purged,
- pgend + 1, range->pgend);
- range_shrink(range, range->pgstart, pgstart - 1);
- break;
- }
- }
-
- return ret;
-}
-
-/*
- * ashmem_unpin - unpin the given range of pages. Returns zero on success.
- *
- * Caller must hold ashmem_mutex.
- */
-static int ashmem_unpin(struct ashmem_area *asma, size_t pgstart, size_t pgend)
-{
- struct ashmem_range *range, *next;
- unsigned int purged = ASHMEM_NOT_PURGED;
-
-restart:
- list_for_each_entry_safe(range, next, &asma->unpinned_list, unpinned) {
- /* short circuit: this is our insertion point */
- if (range_before_page(range, pgstart))
- break;
-
- /*
- * The user can ask us to unpin pages that are already entirely
- * or partially pinned. We handle those two cases here.
- */
- if (page_range_subsumed_by_range(range, pgstart, pgend))
- return 0;
- if (page_range_in_range(range, pgstart, pgend)) {
- pgstart = min_t(size_t, range->pgstart, pgstart),
- pgend = max_t(size_t, range->pgend, pgend);
- purged |= range->purged;
- range_del(range);
- goto restart;
- }
- }
-
- return range_alloc(asma, range, purged, pgstart, pgend);
-}
-
-/*
- * ashmem_get_pin_status - Returns ASHMEM_IS_UNPINNED if _any_ pages in the
- * given interval are unpinned and ASHMEM_IS_PINNED otherwise.
- *
- * Caller must hold ashmem_mutex.
- */
-static int ashmem_get_pin_status(struct ashmem_area *asma, size_t pgstart,
- size_t pgend)
-{
- struct ashmem_range *range;
- int ret = ASHMEM_IS_PINNED;
-
- list_for_each_entry(range, &asma->unpinned_list, unpinned) {
- if (range_before_page(range, pgstart))
- break;
- if (page_range_in_range(range, pgstart, pgend)) {
- ret = ASHMEM_IS_UNPINNED;
- break;
- }
- }
-
- return ret;
-}
-
static int ashmem_pin_unpin(struct ashmem_area *asma, unsigned long cmd,
void __user *p)
{
@@ -615,13 +324,19 @@ static int ashmem_pin_unpin(struct ashmem_area *asma, unsigned long cmd,
switch (cmd) {
case ASHMEM_PIN:
- ret = ashmem_pin(asma, pgstart, pgend);
+ ret = mapping_range_nonvolatile(asma->file->f_mapping, pgstart,
+ pgend);
break;
case ASHMEM_UNPIN:
- ret = ashmem_unpin(asma, pgstart, pgend);
+ ret = mapping_range_volatile(asma->file->f_mapping, pgstart,
+ pgend);
break;
case ASHMEM_GET_PIN_STATUS:
- ret = ashmem_get_pin_status(asma, pgstart, pgend);
+ /*
+ * XXX - volatile ranges currently don't provide status,
+ * due to questionable utility
+ */
+ ret = -EINVAL;
break;
}
@@ -665,15 +380,6 @@ static long ashmem_ioctl(struct file *file, unsigned int cmd, unsigned long arg)
break;
case ASHMEM_PURGE_ALL_CACHES:
ret = -EPERM;
- if (capable(CAP_SYS_ADMIN)) {
- struct shrink_control sc = {
- .gfp_mask = GFP_KERNEL,
- .nr_to_scan = 0,
- };
- ret = ashmem_shrink(&ashmem_shrinker, &sc);
- sc.nr_to_scan = ret;
- ashmem_shrink(&ashmem_shrinker, &sc);
- }
break;
}
@@ -709,22 +415,12 @@ static int __init ashmem_init(void)
return -ENOMEM;
}
- ashmem_range_cachep = kmem_cache_create("ashmem_range_cache",
- sizeof(struct ashmem_range),
- 0, 0, NULL);
- if (unlikely(!ashmem_range_cachep)) {
- printk(KERN_ERR "ashmem: failed to create slab cache\n");
- return -ENOMEM;
- }
-
ret = misc_register(&ashmem_misc);
if (unlikely(ret)) {
printk(KERN_ERR "ashmem: failed to register misc device!\n");
return ret;
}
- register_shrinker(&ashmem_shrinker);
-
printk(KERN_INFO "ashmem: initialized\n");
return 0;
@@ -734,13 +430,10 @@ static void __exit ashmem_exit(void)
{
int ret;
- unregister_shrinker(&ashmem_shrinker);
-
ret = misc_deregister(&ashmem_misc);
if (unlikely(ret))
printk(KERN_ERR "ashmem: failed to unregister misc device!\n");
- kmem_cache_destroy(ashmem_range_cachep);
kmem_cache_destroy(ashmem_area_cachep);
printk(KERN_INFO "ashmem: unloaded\n");
--
1.7.3.2.146.gca209
^ permalink raw reply related [flat|nested] 29+ messages in thread

* Re: [PATCH 3/3] [RFC] ashmem: Convert ashmem to use volatile ranges
2012-04-24 17:49 ` [PATCH 3/3] [RFC] ashmem: Convert ashmem to use volatile ranges John Stultz
@ 2012-04-24 19:21 ` Peter Zijlstra
2012-04-24 19:42 ` John Stultz
0 siblings, 1 reply; 29+ messages in thread
From: Peter Zijlstra @ 2012-04-24 19:21 UTC (permalink / raw)
To: John Stultz
Cc: LKML, Andrew Morton, Android Kernel Team, Robert Love, Mel Gorman,
Hugh Dickins, Dave Hansen, Rik van Riel, Dmitry Adamushko,
Dave Chinner, Neil Brown, Andrea Righi, Aneesh Kumar K.V
On Tue, 2012-04-24 at 10:49 -0700, John Stultz wrote:
> First pass attempt at getting ashmem to utilize the volatile range
> code.
One would think the purpose of the previous patches was to entirely do
away with the ashmem interface.. this patch is a temporary band-aid to
assist in testing without having to modify userspace?
^ permalink raw reply [flat|nested] 29+ messages in thread
* Re: [PATCH 3/3] [RFC] ashmem: Convert ashmem to use volatile ranges
2012-04-24 19:21 ` Peter Zijlstra
@ 2012-04-24 19:42 ` John Stultz
0 siblings, 0 replies; 29+ messages in thread
From: John Stultz @ 2012-04-24 19:42 UTC (permalink / raw)
To: Peter Zijlstra
Cc: LKML, Andrew Morton, Android Kernel Team, Robert Love, Mel Gorman,
Hugh Dickins, Dave Hansen, Rik van Riel, Dmitry Adamushko,
Dave Chinner, Neil Brown, Andrea Righi, Aneesh Kumar K.V
On 04/24/2012 12:21 PM, Peter Zijlstra wrote:
> On Tue, 2012-04-24 at 10:49 -0700, John Stultz wrote:
>> First pass attempt at getting ashmem to utilize the volatile range
>> code.
> One would think the purpose of the previous patches was to entirely do
> away with the ashmem interface.. this patch is a temporary band-aid to
> assist in testing without having to modify userspace?
>
So in a way, yes. I'm trying to iteratively chip away at the staging
drivers, providing a smooth transition path. This patch is just to show
(or determine, depending on whether GET_PIN_STATUS is critical) that the
fadvise volatile functionality is sufficient to replace the ashmem
unpinning feature.
Part of the trouble with pushing the Android drivers upstream is that
each driver doesn't just solve one issue, but usually handles a number
of problems. Thus discussion about alternative solutions gets muddled as
no one alternative is sufficient.
For the ashmem driver, one big use case is a way to provide fds to
anonymous memory that can be shared between processes. This is the same
as an unlinked tmpfs file, but allows them to not have tmpfs mounts,
which could potentially be cluttered with poorly written or malicious
applications. There is also the feature of being able to unpin memory
ranges of that fd, which I thought was interesting for wider use.
The fadvise volatile work provides that unpinning feature, but doesn't
do anything to try to provide atomically unlinked tmpfs fds. To solve
that problem, we'll have to revisit if the simplified ashmem interface
makes sense, or if there should be something like a permissions-based
userspace solution with a daemon that does the create/unlink and hands
the fds out. But hopefully that discussion will be more clear, as it
will be only about one feature.
thanks
-john
^ permalink raw reply [flat|nested] 29+ messages in thread
* [PATCH 0/3] Volatile Ranges (v7) & Lots of words
@ 2012-09-29 3:16 John Stultz
2012-09-29 3:16 ` [PATCH 3/3] [RFC] ashmem: Convert ashmem to use volatile ranges John Stultz
0 siblings, 1 reply; 29+ messages in thread
From: John Stultz @ 2012-09-29 3:16 UTC (permalink / raw)
To: LKML
Cc: John Stultz, Andrew Morton, Android Kernel Team, Robert Love,
Mel Gorman, Hugh Dickins, Dave Hansen, Rik van Riel,
Dmitry Adamushko, Dave Chinner, Neil Brown, Andrea Righi,
Aneesh Kumar K.V, Mike Hommey, Taras Glek, Jan Kara,
KOSAKI Motohiro, Michel Lespinasse, Minchan Kim,
linux-mm@kvack.org
After Kernel Summit and Plumbers, I wanted to consider all the various
side-discussions and try to summarize my current thoughts here along
with sending out my current implementation for review.
Also: I'm going on four weeks of paternity leave in the very near
(but non-deterministic) future. So while I hope I still have time
for some discussion, I may have to deal with fussier complaints
than yours. :) In any case, you'll have more time to chew on
the idea and come up with amazing suggestions. :)
General Interface semantics:
----------------------------------------------
The high level interface I've been pushing has so far stayed fairly
consistent:
Application marks a range of data as volatile. Volatile data may
be purged at any time. Accessing volatile data is undefined, so
applications should not do so. If the application wants to access
data in a volatile range, it should mark it as non-volatile. If any
of the pages in the range being marked non-volatile had been purged,
the kernel will return an error, notifying the application that the
data was lost.
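As a rough userspace sketch of that explicit pattern (this is not code
from the patches; the flag names are the fallocate-based ones this
series proposes, described further below, and rebuild_cache() is a
hypothetical application helper):

	#include <fcntl.h>

	/* Data is idle: tell the kernel it may purge it under pressure. */
	void cache_done_with(int fd, off_t off, off_t len)
	{
		fallocate(fd, FALLOC_FL_MARK_VOLATILE, off, len);
	}

	/* Pin the data again before use; a failure here means some of it
	 * was purged and must be regenerated (the exact return convention
	 * is whatever the series settles on). */
	int cache_about_to_use(int fd, off_t off, off_t len)
	{
		if (fallocate(fd, FALLOC_FL_UNMARK_VOLATILE, off, len) != 0)
			return rebuild_cache(fd, off, len);
		return 0;
	}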
But one interesting new tweak on this design, suggested by Taras
Glek and others at Mozilla, is as follows:
Instead of leaving volatile data access undefined, when
accessing volatile data, either the data expected will be returned
if it has not been purged, or the application will get a SIGBUS when
it accesses volatile data that has been purged.
Everything else remains the same (error on marking non-volatile
if data was purged, etc). This model allows applications to avoid
having to unmark volatile data when it wants to access it, then
immediately re-mark it as volatile when it's done. It is in effect
"lazy" with its marking, allowing the kernel to hit it with a signal
when it gets unlucky and touches purged data. From the signal handler,
the application can note the address it faulted on, unmark the range,
and regenerate the needed data before returning to execution.
Since this approach avoids the more explicit unmark/access/mark
pattern, it avoids the extra overhead required to ensure data is
non-volatile before being accessed.
However, if applications don't want to deal with handling the
SIGBUS, they can use the more straightforward (but more costly)
unmark/access/mark pattern in the same way as my earlier proposals.
This allows folks to balance the cost vs complexity in their
application appropriately.
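As a rough illustration of the lazy pattern (again, not code from this
series: it assumes the fallocate-based flags described below, that the
handler can see the mapping's fd and base address, and a hypothetical
regenerate() helper; the signal-handler plumbing is simplified):

	#include <signal.h>
	#include <fcntl.h>
	#include <unistd.h>

	static int cache_fd;		/* tmpfs-backed fd, mmap'd at cache_base */
	static char *cache_base;
	static long page_size;

	static void purged_data_handler(int sig, siginfo_t *info, void *ctx)
	{
		off_t off = (char *)info->si_addr - cache_base;
		off_t page = off & ~((off_t)page_size - 1);

		/* Make the faulting page non-volatile again... */
		fallocate(cache_fd, FALLOC_FL_UNMARK_VOLATILE, page, page_size);
		/* ...rebuild its contents, then return to retry the access. */
		regenerate(cache_base + page, page_size);
	}

	static void install_purged_handler(void)
	{
		struct sigaction sa = {
			.sa_sigaction	= purged_data_handler,
			.sa_flags	= SA_SIGINFO,
		};
		page_size = sysconf(_SC_PAGESIZE);
		sigaction(SIGBUS, &sa, NULL);
	}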
So that's a general overview of how the idea I'm proposing could
be used.
Specific Interface semantics:
---------------------------------------------
Here are some of the open question about how the user interface
should look:
fadvise vs fallocate:
So while originally I used fadvise, currently my
implementation uses fallocate(fd, FALLOC_FL_MARK_VOLATILE,
start, len) to mark a range as volatile and fallocate(fd,
FALLOC_FL_UNMARK_VOLATILE, start, len) to unmark ranges.
During kernel summit, the question was raised whether fallocate
was really the right interface to be using, and whether fadvise
would be better. To me fadvise makes a little more sense,
but earlier it was pointed out that marking data ranges as
volatile could also be seen as a type of cancellable and lazy
hole-punching, so from that perspective fallocate might make
more sense. This is still an open question and I'd appreciate
further input here.
tmpfs vs non-shmem filesystems:
Android's ashmem primarily provides a way to get unlinked
tmpfs fds that can be shared between applications. It's
just an additional feature that those pages can be "unpinned",
or marked volatile in my terminology. Thus in implementing
volatile ranges, I've focused on getting it to work on tmpfs
file descriptors. However, there has been some interest in
using volatile ranges with more traditional filesystems. The
semantics for how volatile range purging would work on a
real filesystem are not well established, and I can't say I
understand the utility quite yet, but there may be a case for
having data that you know won't be committed to disk until it
is marked as non-volatile. However, returning EINVAL on
non-tmpfs filesystems until such a use is established should
be fine.
fd based interfaces vs madvise:
In talking with Taras Glek, he pointed out that for his
needs, the fd based interface is a little annoying, as it
requires having to get access to tmpfs file and mmap it in,
then instead of just referencing a pointer to the data he
wants to mark volatile, he has to calculate the offset from
start of the mmap and pass those file offsets to the interface.
Instead he mentioned that using something like madvise would be
much nicer, since they could just pass a pointer to the object
in memory they want to make volatile and avoid the extra work.
I'm not opposed to adding an madvise interface for this as
well, but since we have an existing use case with Android's
ashmem, I want to make sure we support this existing behavior.
Specifically, as with ashmem, applications can be sharing
these tmpfs fds, and so file-relative volatile ranges make
more sense if you need to coordinate what data is volatile
between two applications.
Also, while I agree that having an madvise interface for
volatile ranges would be nice, it does open up some more
complex implementation issues, since with files, there is a
fixed relationship between pages and the file's address_space
mapping, where you can't have pages shared between different
mappings. This makes it easy to hang the volatile-range tree
off of the mapping (well, indirectly via a hash table). With
general anonymous memory, pages can be shared between multiple
processes, and as far as I understand, don't have any grouping
structure we could use to determine if the page is in a
volatile range or not. We would also need to determine more
complex questions like: What are the semantics of volatility
with copy-on-write pages? I'm hoping to investigate this
idea more deeply soon so I can be sure whatever is pushed has
a clear plan of how to address this idea. Further thoughts
here would be appreciated.
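To illustrate the ergonomic difference being discussed, the two calling
styles would look roughly like this (the fallocate flag is the one this
series proposes; MADV_MARK_VOLATILE is purely hypothetical and only
stands in for whatever an madvise-style flag would be named):

	#include <sys/mman.h>
	#include <fcntl.h>

	/* fd/offset style: the caller has to translate the pointer back
	 * into an offset within the mapped file. */
	static void mark_volatile_by_offset(int fd, char *map_base,
					    void *obj, size_t len)
	{
		fallocate(fd, FALLOC_FL_MARK_VOLATILE,
			  (char *)obj - map_base, len);
	}

	/* pointer style (hypothetical): just name the memory directly. */
	static void mark_volatile_by_pointer(void *obj, size_t len)
	{
		madvise(obj, len, MADV_MARK_VOLATILE);
	}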
It would really be great to get any thoughts on these issues, as they
are a higher priority to me than diving into the details of how we
implement this internally, which can shift over time.
Implementation Considerations:
---------------------------------------------
How best to manage volatile ranges internally in the kernel is still
an open question.
With this patch set, I'm really wanting to provide a proof of concept
of the general interface semantics above. This allows applications to
play with the idea and validate that it would work for them, allowing
further discussion to continue on how best to implement it, or how to
best allow the implementation to evolve in the kernel.
Even so, I'm very interested in any discussion about how to manage
volatile ranges optimally.
Before describing the different management approaches I've tried,
there are some abstract properties and considerations that need to
be kept in mind:
* Range-granular Purging:
Since volatile ranges can be reclaimed at any time, the
question of how the kernel should reclaim volatile data
needs to be addressed. When a large data range is marked
as volatile, if any single page in that range is purged,
the application will get an error when it marks the range
as non-volatile. Thus when any single page in a range
is purged, the "value" of the entire range is destroyed.
Because of this property, it makes most sense to purge the
entire range together.
* Coalescing of adjacent volatile ranges:
With volatile ranges, any overlapping ranges are always
coalesced. However, there is an open question of what to
do with neighboring ranges. With Android's approach, any
neighboring ranges were coalesced into one range. I've since
tweaked this so that adjacent ranges are coalesced only if
both have not yet been purged (or both are already purged).
This avoids throwing away fine data just because it's next
to data that has already been tossed. Not coalescing
non-overlapping ranges is also an option I've considered,
as it better follows the application's wishes: since
the application is providing these non-overlapping ranges
separately, we should probably also purge them separately.
The one complication here is that, for userland's sake, we
manage volatile ranges at a byte level. So if an application
marks one and a half pages of data as volatile, we only purge
pages that are entirely volatile. This avoids accidentally
purging non-volatile data on the rest of the page. However,
if an array of sub-page-sized data is marked volatile one by
one, coalescing the ranges allows us to purge a page that
consists entirely of multiple volatile ranges. So for now
I'm still coalescing assuming the neighbors are both unpurged,
but this behavior is open to being tweaked.
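A minimal sketch of that coalescing rule (the names are made up here
for illustration; this is not the tree-management code from the
patches):

	#include <stdbool.h>

	struct vrange {
		unsigned long start;	/* byte offsets, [start, end) */
		unsigned long end;
		bool purged;
	};

	static bool vrange_should_coalesce(const struct vrange *a,
					   const struct vrange *b)
	{
		/* Overlapping ranges are always coalesced... */
		if (a->end > b->start && b->end > a->start)
			return true;
		/* ...adjacent ones only when their purged state matches,
		 * so unpurged data isn't lumped in with purged data. */
		if (a->end == b->start || b->end == a->start)
			return a->purged == b->purged;
		return false;
	}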
* Purging order between volatile ranges:
Again, since it makes sense to purge all the complete
pages in a range at the same time, we need to consider the
subtle difference between the least-recently-used pages vs
least-recently-used ranges. A single range could contain very
frequently accessed data, as well as rarely accessed data.
One must also consider that the act of marking a range as
volatile may not actually touch the underlying pages. Thus
purging ranges based on a least-recently-used page may also
result in purging the most-recently used page.
Android addressed the purging order question by purging ranges
in the order they were marked volatile. Thus the oldest
volatile range is the first range to be purged. This works
well in the Android model, as applications aren't supposed
to access volatile data, so the least-recently-marked-volatile
order maps well to the least-recently-used-range.
However, this assumption doesn't hold with the lazy SIGBUS
notification method, as pages in a volatile range may continue
to be accessed after the range is marked volatile. So the
question as to what is the best order of purging volatile
ranges is definitely open.
Abstractly, the ideal solution might be to evaluate the
most-recently-used page in each range, and to purge the range
whose most-recently-used page is oldest, but I suspect this is
not something that could be calculated efficiently.
Additionally, in my conversations with Taras, he pointed out
that if we are using a one-application-at-a-time UI model,
it would be ideal to discourage purging volatile data used by
the current application, instead prioritizing volatile ranges
from applications that aren't active. However, I'm not sure
what mechanism could be used to prioritize range purging in
this fashion, especially considering volatile ranges can be
on data that is shared between applications.
* Volatile range purging order relative to non-volatile pages:
Initially I had proposed that since applications had offered
data up as unused, volatile ranges should be purged before we
try to free any other pages in the system. At Plumbers, Andrea
pointed out that this doesn't make much sense, as there may be
inactive file pages from some streaming file data which are not
going to be used any time soon, and would be better candidates
to free than an application's volatile pages. This sounded
quite reasonable, so its likely we need to balance volatile
purging with freeing other pages in the system. However, I do
think it is advantageous to purge volatile pages before we
free any dirty pages that must be laundered, as part of the
goal of volatile pages is to avoid extra io. Although from
my reading of shrink_page_list in vmscan.c I'm not sure I see
if/how we prioritize freeing clean pages prior to dirty ones.
So with that background covered, on to discussing actual
implementations.
Implementation Details:
---------------------------------------------
There are two rough approaches that I have tried so far:
1) Managing volatile range objects, in a tree or list, which are then
purged using a shrinker
2) Page based management, where pages marked volatile are moved to
a new LRU list and are purged from there.
1) This patchset takes the shrinker-based approach. In many ways it
is simpler, but it does have a few drawbacks. Basically when marking a
range as volatile, we create a range object, and add it to an rbtree.
This allows us to be able to quickly find ranges, given an address in
the file. We also add each range object to the tail of a filesystem
global linked list, which acts as an LRU allowing us to quickly find
the least recently created volatile range. We then use a shrinker
callback to trigger purging, where we'll select the range on the head
of the LRU list, purge the data, mark the range object as purged,
and remove it from the lru list.
This allows fairly efficient behavior, as marking and unmarking
a range are both O(log n) operations with respect to the number of
ranges, since we just insert into and remove from the tree. Selecting
a range to purge is O(1), and we purge the entire range in
least-recently-marked-volatile order.
The drawback with this approach is that it uses a shrinker, and is thus
NUMA-unaware. We track the pages by their offsets in the file,
so we don't have a sense of what physical pages we're using, nor
which node those pages may be on. So it's possible on a multi-node
system that when one node is under pressure, we'd purge volatile
ranges that are all on a different node, in effect throwing data away
without helping anything. This is clearly non-ideal for NUMA systems.
One idea I discussed with Michel Lespinasse is that this might be
something we could improve by providing the shrinker some node context,
then keeping track in each range of which node its first page is on. That
way we would be sure to at least free up one page on the node under
pressure when purging that range.
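To make the shape of this concrete, here is a very rough sketch of the
bookkeeping just described -- a tree node plus an LRU hook per range,
drained oldest-first by a shrinker. The names (volatile_range,
purge_range(), etc.) are illustrative only, locking is omitted, and
this is not the actual mm/volatile.c code:

	struct volatile_range {
		struct rb_node node;	/* in the per-mapping range tree */
		struct list_head lru;	/* in the global volatile LRU */
		pgoff_t start, end;	/* page offsets within the file */
		bool purged;
	};

	static LIST_HEAD(volatile_lru_list);	/* oldest-marked at the head */
	static unsigned long volatile_lru_count;	/* pages on the list */

	static int volatile_shrink(struct shrinker *s, struct shrink_control *sc)
	{
		unsigned long freed = 0;

		if (!sc->nr_to_scan)
			return volatile_lru_count;

		while (freed < sc->nr_to_scan &&
		       !list_empty(&volatile_lru_list)) {
			struct volatile_range *range;

			range = list_first_entry(&volatile_lru_list,
						 struct volatile_range, lru);
			purge_range(range);	/* hypothetical: punch out the pages */
			range->purged = true;
			list_del(&range->lru);
			freed += range->end - range->start + 1;
			volatile_lru_count -= range->end - range->start + 1;
		}
		return volatile_lru_count;
	}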
2) The second approach, which was more page based, was also tried. In
this case when we marked a range as volatile, the pages in that range
were moved to a new LRU list, LRU_VOLATILE, in vmscan.c. This provided
a page lru list that could be used to free pages before looking at
the LRU_INACTIVE_FILE/ANONYMOUS lists.
This integrates the feature deeper in the mm code, which is nice,
especially as we have an LRU_VOLATILE list for each NUMA node. Thus
under pressure we won't purge ranges that are entirely on a different
node, as is possible with the other approach.
However, this approach is more costly. When marking a range
as volatile, we have to migrate every page in that range to the
LRU_VOLATILE list, and similarly on unmarking we have to move each
page back. This ends up being O(n) with respect to the number of
pages in the range we're marking or unmarking. Similarly when purging,
we let the scanning code select a page off the lru, then we have to
map it back to the volatile range so we can purge the entire range,
making purging a more expensive O(log n) operation with respect to
the number of ranges.
This is a particular concern as applications that want to mark and
unmark data as volatile with fine granularity will likely be calling
these operations frequently, adding quite a bit of overhead. This
makes it less likely that applications will choose to volunteer data
as volatile to the system.
However, with the new lazy SIGBUS notification, applications using
the SIGBUS method would avoid having to mark and unmark data when
accessing it, so this overhead may be less of a concern. Still, for
cases where applications don't want to deal with the SIGBUS and would
rather have the more deterministic behavior of the unmark/access/mark
pattern, the performance is a concern.
Additionally, there may be ways to defer and batch the page migration
so that applications don't suffer the extra cost, but this solution
may be limited or could cause some strange behavior, as we can't
defer the unmark operation, since we don't want pages to be purged
after the application thinks they were unmarked.
Whew, that was long...
Anyway, if you got this far and are still interested, I'd greatly
appreciate hearing of any other suggested implementations, or ways
around the drawbacks of the already tried approaches.
thanks
-john
For this v7 patchset revision the changes are as follows:
* Dropped the LRU_VOLATILE approach for now so we can focus on
getting the general interface semantics agreed upon
* Converted to using byte ranges rather than page ranges to make
userland's life easier.
* Add SIGBUS on purged page access behavior, allowing for access
of volatile data without having to unmark it.
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Android Kernel Team <kernel-team@android.com>
Cc: Robert Love <rlove@google.com>
Cc: Mel Gorman <mel@csn.ul.ie>
Cc: Hugh Dickins <hughd@google.com>
Cc: Dave Hansen <dave@linux.vnet.ibm.com>
Cc: Rik van Riel <riel@redhat.com>
Cc: Dmitry Adamushko <dmitry.adamushko@gmail.com>
Cc: Dave Chinner <david@fromorbit.com>
Cc: Neil Brown <neilb@suse.de>
Cc: Andrea Righi <andrea@betterlinux.com>
Cc: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Cc: Mike Hommey <mh@glandium.org>
Cc: Taras Glek <tglek@mozilla.com>
Cc: Jan Kara <jack@suse.cz>
Cc: KOSAKI Motohiro <kosaki.motohiro@gmail.com>
Cc: Michel Lespinasse <walken@google.com>
Cc: Minchan Kim <minchan@kernel.org>
Cc: linux-mm@kvack.org <linux-mm@kvack.org>
John Stultz (3):
[RFC] Add volatile range management code
[RFC] tmpfs: Add FALLOC_FL_MARK_VOLATILE/UNMARK_VOLATILE handlers
[RFC] ashmem: Convert ashmem to use volatile ranges
drivers/staging/android/ashmem.c | 335 +---------------------
fs/open.c | 3 +-
include/linux/falloc.h | 7 +-
include/linux/volatile.h | 46 +++
mm/Makefile | 2 +-
mm/shmem.c | 120 ++++++++
mm/volatile.c | 580 ++++++++++++++++++++++++++++++++++++++
7 files changed, 763 insertions(+), 330 deletions(-)
create mode 100644 include/linux/volatile.h
create mode 100644 mm/volatile.c
--
1.7.9.5
^ permalink raw reply [flat|nested] 29+ messages in thread
* [PATCH 3/3] [RFC] ashmem: Convert ashmem to use volatile ranges
2012-09-29 3:16 [PATCH 0/3] Volatile Ranges (v7) & Lots of words John Stultz
@ 2012-09-29 3:16 ` John Stultz
0 siblings, 0 replies; 29+ messages in thread
From: John Stultz @ 2012-09-29 3:16 UTC (permalink / raw)
To: LKML
Cc: John Stultz, Andrew Morton, Android Kernel Team, Robert Love,
Mel Gorman, Hugh Dickins, Dave Hansen, Rik van Riel,
Dmitry Adamushko, Dave Chinner, Neil Brown, Andrea Righi,
Aneesh Kumar K.V, Mike Hommey, Taras Glek, Jan Kara,
KOSAKI Motohiro, Michel Lespinasse, Minchan Kim,
linux-mm@kvack.org
Rework of my first pass attempt at getting ashmem to utilize
the volatile range code, now using the fallocate interface.
In this implementation GET_PIN_STATUS is unimplemented, due to
the fact that adding an ISVOLATILE check wasn't considered
terribly useful in earlier reviews. It would be trivial to
re-add that functionality, but I wanted to check w/ the
Android developers to see how often GET_PIN_STATUS is actually
used?
Similarly the ashmem PURGE_ALL_CACHES ioctl does not function,
as the volatile range purging is no longer directly under its
control.
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Android Kernel Team <kernel-team@android.com>
Cc: Robert Love <rlove@google.com>
Cc: Mel Gorman <mel@csn.ul.ie>
Cc: Hugh Dickins <hughd@google.com>
Cc: Dave Hansen <dave@linux.vnet.ibm.com>
Cc: Rik van Riel <riel@redhat.com>
Cc: Dmitry Adamushko <dmitry.adamushko@gmail.com>
Cc: Dave Chinner <david@fromorbit.com>
Cc: Neil Brown <neilb@suse.de>
Cc: Andrea Righi <andrea@betterlinux.com>
Cc: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Cc: Mike Hommey <mh@glandium.org>
Cc: Taras Glek <tglek@mozilla.com>
Cc: Jan Kara <jack@suse.cz>
Cc: KOSAKI Motohiro <kosaki.motohiro@gmail.com>
Cc: Michel Lespinasse <walken@google.com>
Cc: Minchan Kim <minchan@kernel.org>
Cc: linux-mm@kvack.org <linux-mm@kvack.org>
Signed-off-by: John Stultz <john.stultz@linaro.org>
---
drivers/staging/android/ashmem.c | 335 ++------------------------------------
1 file changed, 10 insertions(+), 325 deletions(-)
diff --git a/drivers/staging/android/ashmem.c b/drivers/staging/android/ashmem.c
index 69cf2db..9f9654c 100644
--- a/drivers/staging/android/ashmem.c
+++ b/drivers/staging/android/ashmem.c
@@ -52,26 +52,6 @@ struct ashmem_area {
};
/*
- * ashmem_range - represents an interval of unpinned (evictable) pages
- * Lifecycle: From unpin to pin
- * Locking: Protected by `ashmem_mutex'
- */
-struct ashmem_range {
- struct list_head lru; /* entry in LRU list */
- struct list_head unpinned; /* entry in its area's unpinned list */
- struct ashmem_area *asma; /* associated area */
- size_t pgstart; /* starting page, inclusive */
- size_t pgend; /* ending page, inclusive */
- unsigned int purged; /* ASHMEM_NOT or ASHMEM_WAS_PURGED */
-};
-
-/* LRU list of unpinned pages, protected by ashmem_mutex */
-static LIST_HEAD(ashmem_lru_list);
-
-/* Count of pages on our LRU list, protected by ashmem_mutex */
-static unsigned long lru_count;
-
-/*
* ashmem_mutex - protects the list of and each individual ashmem_area
*
* Lock Ordering: ashmex_mutex -> i_mutex -> i_alloc_sem
@@ -79,102 +59,9 @@ static unsigned long lru_count;
static DEFINE_MUTEX(ashmem_mutex);
static struct kmem_cache *ashmem_area_cachep __read_mostly;
-static struct kmem_cache *ashmem_range_cachep __read_mostly;
-
-#define range_size(range) \
- ((range)->pgend - (range)->pgstart + 1)
-
-#define range_on_lru(range) \
- ((range)->purged == ASHMEM_NOT_PURGED)
-
-#define page_range_subsumes_range(range, start, end) \
- (((range)->pgstart >= (start)) && ((range)->pgend <= (end)))
-
-#define page_range_subsumed_by_range(range, start, end) \
- (((range)->pgstart <= (start)) && ((range)->pgend >= (end)))
-
-#define page_in_range(range, page) \
- (((range)->pgstart <= (page)) && ((range)->pgend >= (page)))
-
-#define page_range_in_range(range, start, end) \
- (page_in_range(range, start) || page_in_range(range, end) || \
- page_range_subsumes_range(range, start, end))
-
-#define range_before_page(range, page) \
- ((range)->pgend < (page))
#define PROT_MASK (PROT_EXEC | PROT_READ | PROT_WRITE)
-static inline void lru_add(struct ashmem_range *range)
-{
- list_add_tail(&range->lru, &ashmem_lru_list);
- lru_count += range_size(range);
-}
-
-static inline void lru_del(struct ashmem_range *range)
-{
- list_del(&range->lru);
- lru_count -= range_size(range);
-}
-
-/*
- * range_alloc - allocate and initialize a new ashmem_range structure
- *
- * 'asma' - associated ashmem_area
- * 'prev_range' - the previous ashmem_range in the sorted asma->unpinned list
- * 'purged' - initial purge value (ASMEM_NOT_PURGED or ASHMEM_WAS_PURGED)
- * 'start' - starting page, inclusive
- * 'end' - ending page, inclusive
- *
- * Caller must hold ashmem_mutex.
- */
-static int range_alloc(struct ashmem_area *asma,
- struct ashmem_range *prev_range, unsigned int purged,
- size_t start, size_t end)
-{
- struct ashmem_range *range;
-
- range = kmem_cache_zalloc(ashmem_range_cachep, GFP_KERNEL);
- if (unlikely(!range))
- return -ENOMEM;
-
- range->asma = asma;
- range->pgstart = start;
- range->pgend = end;
- range->purged = purged;
-
- list_add_tail(&range->unpinned, &prev_range->unpinned);
-
- if (range_on_lru(range))
- lru_add(range);
-
- return 0;
-}
-
-static void range_del(struct ashmem_range *range)
-{
- list_del(&range->unpinned);
- if (range_on_lru(range))
- lru_del(range);
- kmem_cache_free(ashmem_range_cachep, range);
-}
-
-/*
- * range_shrink - shrinks a range
- *
- * Caller must hold ashmem_mutex.
- */
-static inline void range_shrink(struct ashmem_range *range,
- size_t start, size_t end)
-{
- size_t pre = range_size(range);
-
- range->pgstart = start;
- range->pgend = end;
-
- if (range_on_lru(range))
- lru_count -= pre - range_size(range);
-}
static int ashmem_open(struct inode *inode, struct file *file)
{
@@ -200,12 +87,6 @@ static int ashmem_open(struct inode *inode, struct file *file)
static int ashmem_release(struct inode *ignored, struct file *file)
{
struct ashmem_area *asma = file->private_data;
- struct ashmem_range *range, *next;
-
- mutex_lock(&ashmem_mutex);
- list_for_each_entry_safe(range, next, &asma->unpinned_list, unpinned)
- range_del(range);
- mutex_unlock(&ashmem_mutex);
if (asma->file)
fput(asma->file);
@@ -339,56 +220,6 @@ out:
return ret;
}
-/*
- * ashmem_shrink - our cache shrinker, called from mm/vmscan.c :: shrink_slab
- *
- * 'nr_to_scan' is the number of objects (pages) to prune, or 0 to query how
- * many objects (pages) we have in total.
- *
- * 'gfp_mask' is the mask of the allocation that got us into this mess.
- *
- * Return value is the number of objects (pages) remaining, or -1 if we cannot
- * proceed without risk of deadlock (due to gfp_mask).
- *
- * We approximate LRU via least-recently-unpinned, jettisoning unpinned partial
- * chunks of ashmem regions LRU-wise one-at-a-time until we hit 'nr_to_scan'
- * pages freed.
- */
-static int ashmem_shrink(struct shrinker *s, struct shrink_control *sc)
-{
- struct ashmem_range *range, *next;
-
- /* We might recurse into filesystem code, so bail out if necessary */
- if (sc->nr_to_scan && !(sc->gfp_mask & __GFP_FS))
- return -1;
- if (!sc->nr_to_scan)
- return lru_count;
-
- mutex_lock(&ashmem_mutex);
- list_for_each_entry_safe(range, next, &ashmem_lru_list, lru) {
- loff_t start = range->pgstart * PAGE_SIZE;
- loff_t end = (range->pgend + 1) * PAGE_SIZE;
-
- do_fallocate(range->asma->file,
- FALLOC_FL_PUNCH_HOLE | FALLOC_FL_KEEP_SIZE,
- start, end - start);
- range->purged = ASHMEM_WAS_PURGED;
- lru_del(range);
-
- sc->nr_to_scan -= range_size(range);
- if (sc->nr_to_scan <= 0)
- break;
- }
- mutex_unlock(&ashmem_mutex);
-
- return lru_count;
-}
-
-static struct shrinker ashmem_shrinker = {
- .shrink = ashmem_shrink,
- .seeks = DEFAULT_SEEKS * 4,
-};
-
static int set_prot_mask(struct ashmem_area *asma, unsigned long prot)
{
int ret = 0;
@@ -461,136 +292,10 @@ static int get_name(struct ashmem_area *asma, void __user *name)
return ret;
}
-/*
- * ashmem_pin - pin the given ashmem region, returning whether it was
- * previously purged (ASHMEM_WAS_PURGED) or not (ASHMEM_NOT_PURGED).
- *
- * Caller must hold ashmem_mutex.
- */
-static int ashmem_pin(struct ashmem_area *asma, size_t pgstart, size_t pgend)
-{
- struct ashmem_range *range, *next;
- int ret = ASHMEM_NOT_PURGED;
-
- list_for_each_entry_safe(range, next, &asma->unpinned_list, unpinned) {
- /* moved past last applicable page; we can short circuit */
- if (range_before_page(range, pgstart))
- break;
-
- /*
- * The user can ask us to pin pages that span multiple ranges,
- * or to pin pages that aren't even unpinned, so this is messy.
- *
- * Four cases:
- * 1. The requested range subsumes an existing range, so we
- * just remove the entire matching range.
- * 2. The requested range overlaps the start of an existing
- * range, so we just update that range.
- * 3. The requested range overlaps the end of an existing
- * range, so we just update that range.
- * 4. The requested range punches a hole in an existing range,
- * so we have to update one side of the range and then
- * create a new range for the other side.
- */
- if (page_range_in_range(range, pgstart, pgend)) {
- ret |= range->purged;
-
- /* Case #1: Easy. Just nuke the whole thing. */
- if (page_range_subsumes_range(range, pgstart, pgend)) {
- range_del(range);
- continue;
- }
-
- /* Case #2: We overlap from the start, so adjust it */
- if (range->pgstart >= pgstart) {
- range_shrink(range, pgend + 1, range->pgend);
- continue;
- }
-
- /* Case #3: We overlap from the rear, so adjust it */
- if (range->pgend <= pgend) {
- range_shrink(range, range->pgstart, pgstart-1);
- continue;
- }
-
- /*
- * Case #4: We eat a chunk out of the middle. A bit
- * more complicated, we allocate a new range for the
- * second half and adjust the first chunk's endpoint.
- */
- range_alloc(asma, range, range->purged,
- pgend + 1, range->pgend);
- range_shrink(range, range->pgstart, pgstart - 1);
- break;
- }
- }
-
- return ret;
-}
-
-/*
- * ashmem_unpin - unpin the given range of pages. Returns zero on success.
- *
- * Caller must hold ashmem_mutex.
- */
-static int ashmem_unpin(struct ashmem_area *asma, size_t pgstart, size_t pgend)
-{
- struct ashmem_range *range, *next;
- unsigned int purged = ASHMEM_NOT_PURGED;
-
-restart:
- list_for_each_entry_safe(range, next, &asma->unpinned_list, unpinned) {
- /* short circuit: this is our insertion point */
- if (range_before_page(range, pgstart))
- break;
-
- /*
- * The user can ask us to unpin pages that are already entirely
- * or partially pinned. We handle those two cases here.
- */
- if (page_range_subsumed_by_range(range, pgstart, pgend))
- return 0;
- if (page_range_in_range(range, pgstart, pgend)) {
- pgstart = min_t(size_t, range->pgstart, pgstart),
- pgend = max_t(size_t, range->pgend, pgend);
- purged |= range->purged;
- range_del(range);
- goto restart;
- }
- }
-
- return range_alloc(asma, range, purged, pgstart, pgend);
-}
-
-/*
- * ashmem_get_pin_status - Returns ASHMEM_IS_UNPINNED if _any_ pages in the
- * given interval are unpinned and ASHMEM_IS_PINNED otherwise.
- *
- * Caller must hold ashmem_mutex.
- */
-static int ashmem_get_pin_status(struct ashmem_area *asma, size_t pgstart,
- size_t pgend)
-{
- struct ashmem_range *range;
- int ret = ASHMEM_IS_PINNED;
-
- list_for_each_entry(range, &asma->unpinned_list, unpinned) {
- if (range_before_page(range, pgstart))
- break;
- if (page_range_in_range(range, pgstart, pgend)) {
- ret = ASHMEM_IS_UNPINNED;
- break;
- }
- }
-
- return ret;
-}
-
static int ashmem_pin_unpin(struct ashmem_area *asma, unsigned long cmd,
void __user *p)
{
struct ashmem_pin pin;
- size_t pgstart, pgend;
int ret = -EINVAL;
if (unlikely(!asma->file))
@@ -612,25 +317,25 @@ static int ashmem_pin_unpin(struct ashmem_area *asma, unsigned long cmd,
if (unlikely(PAGE_ALIGN(asma->size) < pin.offset + pin.len))
return -EINVAL;
- pgstart = pin.offset / PAGE_SIZE;
- pgend = pgstart + (pin.len / PAGE_SIZE) - 1;
-
- mutex_lock(&ashmem_mutex);
switch (cmd) {
case ASHMEM_PIN:
- ret = ashmem_pin(asma, pgstart, pgend);
+ ret = do_fallocate(asma->file, FALLOC_FL_UNMARK_VOLATILE,
+ pin.offset, pin.len);
break;
case ASHMEM_UNPIN:
- ret = ashmem_unpin(asma, pgstart, pgend);
+ ret = do_fallocate(asma->file, FALLOC_FL_MARK_VOLATILE,
+ pin.offset, pin.len);
break;
case ASHMEM_GET_PIN_STATUS:
- ret = ashmem_get_pin_status(asma, pgstart, pgend);
+ /*
+ * XXX - volatile ranges currently don't provide status,
+ * due to questionable utility
+ */
+ ret = -EINVAL;
break;
}
- mutex_unlock(&ashmem_mutex);
-
return ret;
}
@@ -669,15 +374,6 @@ static long ashmem_ioctl(struct file *file, unsigned int cmd, unsigned long arg)
break;
case ASHMEM_PURGE_ALL_CACHES:
ret = -EPERM;
- if (capable(CAP_SYS_ADMIN)) {
- struct shrink_control sc = {
- .gfp_mask = GFP_KERNEL,
- .nr_to_scan = 0,
- };
- ret = ashmem_shrink(&ashmem_shrinker, &sc);
- sc.nr_to_scan = ret;
- ashmem_shrink(&ashmem_shrinker, &sc);
- }
break;
}
@@ -713,21 +409,13 @@ static int __init ashmem_init(void)
return -ENOMEM;
}
- ashmem_range_cachep = kmem_cache_create("ashmem_range_cache",
- sizeof(struct ashmem_range),
- 0, 0, NULL);
- if (unlikely(!ashmem_range_cachep)) {
- pr_err("failed to create slab cache\n");
- return -ENOMEM;
- }
-
ret = misc_register(&ashmem_misc);
if (unlikely(ret)) {
pr_err("failed to register misc device!\n");
return ret;
}
- register_shrinker(&ashmem_shrinker);
+
pr_info("initialized\n");
@@ -738,13 +426,10 @@ static void __exit ashmem_exit(void)
{
int ret;
- unregister_shrinker(&ashmem_shrinker);
-
ret = misc_deregister(&ashmem_misc);
if (unlikely(ret))
pr_err("failed to unregister misc device!\n");
- kmem_cache_destroy(ashmem_range_cachep);
kmem_cache_destroy(ashmem_area_cachep);
pr_info("unloaded\n");
--
1.7.9.5
^ permalink raw reply related [flat|nested] 29+ messages in thread
Thread overview: 29+ messages
2012-04-24 17:49 [PATCH 0/3] Volatile Ranges John Stultz
2012-04-24 17:49 ` [PATCH 1/3] Range tree implementation John Stultz
2012-04-24 19:14 ` Peter Zijlstra
2012-04-24 19:25 ` John Stultz
2012-04-24 19:33 ` Peter Zijlstra
2012-04-25 12:16 ` Dmitry Adamushko
2012-04-25 16:19 ` John Stultz
2012-04-26 10:00 ` Dmitry Adamushko
2012-04-27 19:34 ` John Stultz
2012-04-24 17:49 ` [PATCH 2/3] fadvise: Add _VOLATILE,_ISVOLATILE, and _NONVOLATILE flags John Stultz
2012-04-24 19:20 ` Peter Zijlstra
2012-04-24 19:50 ` John Stultz
2012-04-27 0:39 ` Dave Chinner
2012-04-27 15:25 ` Dave Hansen
2012-04-28 1:36 ` Dave Chinner
2012-04-30 21:07 ` John Stultz
2012-05-01 0:08 ` Dave Chinner
2012-05-01 0:46 ` John Stultz
2012-05-01 1:28 ` Dave Chinner
2012-04-27 19:14 ` John Stultz
2012-04-28 2:04 ` Dave Chinner
2012-04-30 19:40 ` John Stultz
2012-05-01 0:28 ` Dave Chinner
2012-05-01 1:15 ` John Stultz
2012-05-01 1:51 ` Dave Chinner
2012-04-24 17:49 ` [PATCH 3/3] [RFC] ashmem: Convert ashmem to use volatile ranges John Stultz
2012-04-24 19:21 ` Peter Zijlstra
2012-04-24 19:42 ` John Stultz
-- strict thread matches above, loose matches on Subject: below --
2012-09-29 3:16 [PATCH 0/3] Volatile Ranges (v7) & Lots of words John Stultz
2012-09-29 3:16 ` [PATCH 3/3] [RFC] ashmem: Convert ashmem to use volatile ranges John Stultz