* [PATCH 0/2] [RFC] Volatile ranges (v4)
@ 2012-03-16 22:51 John Stultz
  2012-03-16 22:51 ` [PATCH 1/2] [RFC] Range tree implementation John Stultz
  ` (2 more replies)
  0 siblings, 3 replies; 31+ messages in thread

From: John Stultz @ 2012-03-16 22:51 UTC (permalink / raw)
To: linux-kernel
Cc: John Stultz, Andrew Morton, Android Kernel Team, Robert Love,
    Mel Gorman, Hugh Dickins, Dave Hansen, Rik van Riel,
    Dmitry Adamushko, Dave Chinner, Neil Brown, Andrea Righi,
    Aneesh Kumar K.V

Ok. So here's another iteration of the fadvise volatile range code. I realize this is still a way off from being ready, but I wanted to post what I have to share with folks working on the various range/interval management ideas, as well as to update folks who've provided feedback on the volatile range code.

So just on the premise: Ideally, I want delayed-reclaim-based hole punching. The application has a possibly shared mmapped cache file, chunks of which it can mark volatile or nonvolatile as it uses them. If the kernel needs memory, it can zap any ranges that are currently marked volatile. Some examples would be rendering of images or web pages that are not on-screen. This allows the application to volunteer memory for reclaiming, and the kernel to grab it only when it needs to.

This differs from some of the memory notification schemes, in that it allows the kernel to immediately reclaim what it needs, rather than having to ask applications to give up memory (which may add further memory load) until enough is free. However, unlike the notification schemes, it does require applications to mark and unmark pages as volatile as they use them.

Current use cases (i.e. users of Android's ashmem) only use shmfs/tmpfs. However, I don't see right off why it should be limited to shm. As long as punching a hole in a file can be done with minimal memory overhead, this could be useful and have somewhat sane behavior. We could also only zap the page cache, not writing any dirty data out. However, with non-shm files, discarding dirty data without hole punching would obviously leave persistent files in a non-coherent state. This may further reinforce that the design should be shm only if we don't do hole punching.

On the topic of hole punching, the kernel doesn't seem completely unified here either. As I understand it, there are two methods to do hole punching: FALLOC_FL_PUNCH_HOLE vs MADV_REMOVE, and they don't necessarily overlap in support. For the most part, it seems persistent filesystems require FALLOC_FL_PUNCH_HOLE, whereas shmfs/tmpfs uses MADV_REMOVE. But I may be misunderstanding the subtle difference here, so if anyone wants to clarify this, it would be great.

One concern was that if the design is shm only, fadvise might not be the right interface, as it should be generic. The madvise(MADV_REMOVE,...) interface gives some precedent for shmfs/tmpfs-only calls, but I'd like to get some further feedback as to what folks think of this. If we are shm/tmpfs only, I could rework this design to use madvise instead of fadvise if folks would prefer.

Also, there's still the issue that lockdep doesn't like me calling vmtruncate_range() from the shrinker, since any allocation done while the i_mutex is held could cause the shrinker to run and need the i_mutex again. I did try using invalidate_inode_pages2_range(), but it always returns EBUSY in this context, so I suspect I want something else. I'm currently reading shmem_truncate_range() and zap_page_range() to get a better idea of how this might best be accomplished.
There was also feedback suggesting I drop the LRU of ranges and instead keep the volatile/purged state in radix-tree tags, managing things at writeout time. My concern there is that the LRU behavior would then apply to the entire range from the moment it was marked volatile, rather than from the last actual page access (a range can contain both frequently and infrequently used areas). Sorting out how to evict the entire range when one page is dropped might also be awkward. I'll likely revisit this soon, but I didn't get to it for this iteration. I also still have the issue of bloating the address_space structure to deal with, and if I continue with this approach I suspect I'll use a separate hash table to store the range-tree roots in my next revision.

Anyway, thanks for the continued advice and feedback!
-john

CC: Andrew Morton <akpm@linux-foundation.org>
CC: Android Kernel Team <kernel-team@android.com>
CC: Robert Love <rlove@google.com>
CC: Mel Gorman <mel@csn.ul.ie>
CC: Hugh Dickins <hughd@google.com>
CC: Dave Hansen <dave@linux.vnet.ibm.com>
CC: Rik van Riel <riel@redhat.com>
CC: Dmitry Adamushko <dmitry.adamushko@gmail.com>
CC: Dave Chinner <david@fromorbit.com>
CC: Neil Brown <neilb@suse.de>
CC: Andrea Righi <andrea@betterlinux.com>
CC: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>

John Stultz (2):
  [RFC] Range tree implementation
  [RFC] fadvise: Add _VOLATILE,_ISVOLATILE, and _NONVOLATILE flags

 fs/inode.c                |    4 +
 include/linux/fadvise.h   |    5 +
 include/linux/fs.h        |    2 +
 include/linux/rangetree.h |   53 ++++++++
 include/linux/volatile.h  |   14 ++
 lib/Makefile              |    2 +-
 lib/rangetree.c           |  105 +++++++++++++++
 mm/Makefile               |    2 +-
 mm/fadvise.c              |   16 ++-
 mm/volatile.c             |  313 +++++++++++++++++++++++++++++++++++++++
 10 files changed, 513 insertions(+), 3 deletions(-)
 create mode 100644 include/linux/rangetree.h
 create mode 100644 include/linux/volatile.h
 create mode 100644 lib/rangetree.c
 create mode 100644 mm/volatile.c

-- 
1.7.3.2.146.gca209

^ permalink raw reply	[flat|nested] 31+ messages in thread
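For illustration of the hole-punching question raised in the cover letter above, here is a minimal userspace sketch (not part of the posted series) of the two interfaces being contrasted. It assumes a Linux system whose libc exposes fallocate() under _GNU_SOURCE; which filesystems accept which call is exactly the open question above.

#define _GNU_SOURCE
#include <fcntl.h>          /* fallocate() */
#include <linux/falloc.h>   /* FALLOC_FL_PUNCH_HOLE, FALLOC_FL_KEEP_SIZE */
#include <sys/mman.h>       /* madvise(), MADV_REMOVE */

/* Path most persistent filesystems provide: punch via fallocate(). */
static int punch_hole_fallocate(int fd, off_t offset, off_t len)
{
        return fallocate(fd, FALLOC_FL_PUNCH_HOLE | FALLOC_FL_KEEP_SIZE,
                         offset, len);
}

/* Path shmfs/tmpfs provides: punch a mapped region via madvise(). */
static int punch_hole_madvise(void *addr, size_t len)
{
        /* addr and len must be page aligned */
        return madvise(addr, len, MADV_REMOVE);
}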
* [PATCH 1/2] [RFC] Range tree implementation
  2012-03-16 22:51 [PATCH 0/2] [RFC] Volatile ranges (v4) John Stultz
@ 2012-03-16 22:51 ` John Stultz
  2012-03-20 10:00   ` Dmitry Adamushko
  2012-03-20 16:44   ` Aneesh Kumar K.V
  2012-03-16 22:51 ` [PATCH 2/2] [RFC] fadvise: Add _VOLATILE,_ISVOLATILE, and _NONVOLATILE flags John Stultz
  2012-07-19 10:13 ` [PATCH 0/2] [RFC] Volatile ranges (v4) Dmitry Vyukov
  2 siblings, 2 replies; 31+ messages in thread

From: John Stultz @ 2012-03-16 22:51 UTC (permalink / raw)
To: linux-kernel
Cc: John Stultz, Andrew Morton, Android Kernel Team, Robert Love,
    Mel Gorman, Hugh Dickins, Dave Hansen, Rik van Riel,
    Dmitry Adamushko, Dave Chinner, Neil Brown, Andrea Righi,
    Aneesh Kumar K.V

After Andrew suggested something like his mumbletree idea to better store a list of ranges, I worked on a few different approaches, and this is what I've finally managed to get working.

I suspect range-tree isn't a totally accurate name, but I couldn't quite make out the difference between range trees and interval trees, so I just picked one to call it. Do let me know if you have a better name.

The idea of storing ranges in a tree is nice, but has a number of complications. When adding a range, it's possible that a large range will consume and merge a number of smaller ranges. When removing a range, it's possible you may end up splitting an existing range, causing one range to become two. This makes it very difficult to provide generic list_head-like behavior, as the parent structures would need to be duplicated and removed, and that has lots of memory ownership issues.

So, this is a much simplified and more list_head-like implementation. You can add a node to a tree, or remove a node from a tree, but the generic implementation doesn't do the merging or splitting for you. It does, however, provide helpers to find overlapping and adjacent ranges.

Andrew also really wanted this range-tree implementation to be reusable so we don't duplicate the file locking logic. I'm not totally convinced that the requirements of volatile ranges and file locking are really equivalent, but this reduced implementation may make it possible.

Do let me know what you think, or if you have other ideas for better ways to do the same.
Changelog: v2: * Reworked code to use an rbtree instead of splaying CC: Andrew Morton <akpm@linux-foundation.org> CC: Android Kernel Team <kernel-team@android.com> CC: Robert Love <rlove@google.com> CC: Mel Gorman <mel@csn.ul.ie> CC: Hugh Dickins <hughd@google.com> CC: Dave Hansen <dave@linux.vnet.ibm.com> CC: Rik van Riel <riel@redhat.com> CC: Dmitry Adamushko <dmitry.adamushko@gmail.com> CC: Dave Chinner <david@fromorbit.com> CC: Neil Brown <neilb@suse.de> CC: Andrea Righi <andrea@betterlinux.com> CC: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com> Signed-off-by: John Stultz <john.stultz@linaro.org> --- include/linux/rangetree.h | 53 +++++++++++++++++++++++ lib/Makefile | 2 +- lib/rangetree.c | 105 +++++++++++++++++++++++++++++++++++++++++++++ 3 files changed, 159 insertions(+), 1 deletions(-) create mode 100644 include/linux/rangetree.h create mode 100644 lib/rangetree.c diff --git a/include/linux/rangetree.h b/include/linux/rangetree.h new file mode 100644 index 0000000..ca03821 --- /dev/null +++ b/include/linux/rangetree.h @@ -0,0 +1,53 @@ +#ifndef _LINUX_RANGETREE_H +#define _LINUX_RANGETREE_H + +#include <linux/types.h> +#include <linux/rbtree.h> + +struct range_tree_node { + struct rb_node rb; + u64 start; + u64 end; +}; + +struct range_tree_root { + struct rb_root head; +}; + +static inline void range_tree_init(struct range_tree_root *root) +{ + root->head = RB_ROOT; +} + +static inline void range_tree_node_init(struct range_tree_node *node) +{ + rb_init_node(&node->rb); + node->start = 0; + node->end = 0; +} + +static inline int range_tree_empty(struct range_tree_root *root) +{ + return RB_EMPTY_ROOT(&root->head); +} + +static inline +struct range_tree_node *range_tree_root_node(struct range_tree_root *root) +{ + struct range_tree_node *ret; + ret = container_of(root->head.rb_node, struct range_tree_node, rb); + return ret; +} + +extern struct range_tree_node *range_tree_in_range(struct range_tree_root *root, + u64 start, u64 end); +extern struct range_tree_node *range_tree_in_range_adjacent( + struct range_tree_root *root, + u64 start, u64 end); +extern void range_tree_add(struct range_tree_root *root, + struct range_tree_node *node); +extern void range_tree_remove(struct range_tree_root *root, + struct range_tree_node *node); +#endif + + diff --git a/lib/Makefile b/lib/Makefile index 18515f0..f43ef0d 100644 --- a/lib/Makefile +++ b/lib/Makefile @@ -12,7 +12,7 @@ lib-y := ctype.o string.o vsprintf.o cmdline.o \ idr.o int_sqrt.o extable.o prio_tree.o \ sha1.o md5.o irq_regs.o reciprocal_div.o argv_split.o \ proportions.o prio_heap.o ratelimit.o show_mem.o \ - is_single_threaded.o plist.o decompress.o + is_single_threaded.o plist.o decompress.o rangetree.o lib-$(CONFIG_MMU) += ioremap.o lib-$(CONFIG_SMP) += cpumask.o diff --git a/lib/rangetree.c b/lib/rangetree.c new file mode 100644 index 0000000..0f6208a --- /dev/null +++ b/lib/rangetree.c @@ -0,0 +1,105 @@ +#include <linux/rangetree.h> +#include <linux/kernel.h> +#include <linux/slab.h> + + +/** + * range_tree_in_range - Returns the first node that overlaps with the + * given range + * @root: range_tree root + * @start: range start + * @end: range end + * + */ +struct range_tree_node *range_tree_in_range(struct range_tree_root *root, + u64 start, u64 end) +{ + struct rb_node **p = &root->head.rb_node; + struct range_tree_node *candidate; + + while (*p) { + candidate = rb_entry(*p, struct range_tree_node, rb); + if (end < candidate->start) + p = &(*p)->rb_left; + else if (start > candidate->end) + p = &(*p)->rb_right; + else 
+ return candidate; + } + + return 0; +} + + +/** + * range_tree_in_range - Returns the first node that overlaps or is adjacent + * with the given range + * @root: range_tree root + * @start: range start + * @end: range end + * + */ +struct range_tree_node *range_tree_in_range_adjacent( + struct range_tree_root *root, + u64 start, u64 end) +{ + struct rb_node **p = &root->head.rb_node; + struct range_tree_node *candidate; + + while (*p) { + candidate = rb_entry(*p, struct range_tree_node, rb); + if (end+1 < candidate->start) + p = &(*p)->rb_left; + else if (start > candidate->end + 1) + p = &(*p)->rb_right; + else + return candidate; + } + return 0; +} + +/** + * range_tree_add - Add a node to a range tree + * @root: range tree to be added to + * @node: range_tree_node to be added + * + * Adds a node to the range tree. + */ +void range_tree_add(struct range_tree_root *root, + struct range_tree_node *node) +{ + struct rb_node **p = &root->head.rb_node; + struct rb_node *parent = NULL; + struct range_tree_node *ptr; + + WARN_ON_ONCE(!RB_EMPTY_NODE(&node->rb)); + + while (*p) { + parent = *p; + ptr = rb_entry(parent, struct range_tree_node, rb); + if (node->start < ptr->start) + p = &(*p)->rb_left; + else + p = &(*p)->rb_right; + } + rb_link_node(&node->rb, parent, p); + rb_insert_color(&node->rb, &root->head); + +} + + +/** + * range_tree_remove: Removes a given node from the tree + * @root: root of tree + * @node: Node to be removed + * + * Removes a node and splays the tree + */ +void range_tree_remove(struct range_tree_root *root, + struct range_tree_node *node) +{ + WARN_ON_ONCE(RB_EMPTY_NODE(&node->rb)); + + rb_erase(&node->rb, &root->head); + RB_CLEAR_NODE(&node->rb); +} -- 1.7.3.2.146.gca209 ^ permalink raw reply related [flat|nested] 31+ messages in thread
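As a rough caller-side sketch (not part of the patch), this is how the helpers above are intended to be combined, since range_tree_add() deliberately does no merging itself: the caller widens the new range over anything it overlaps or touches, unlinks those nodes, and inserts a single merged node. The function name here is illustrative, and locking plus freeing of the displaced nodes are left to the caller.

#include <linux/kernel.h>
#include <linux/rangetree.h>

/* Illustrative only: insert 'new' covering [start, end], coalescing any
 * overlapping or adjacent nodes already in the tree. */
static void range_tree_add_merged(struct range_tree_root *root,
                                  struct range_tree_node *new,
                                  u64 start, u64 end)
{
        struct range_tree_node *node;

        /* Each pass removes one node, so the loop terminates. */
        while ((node = range_tree_in_range_adjacent(root, start, end))) {
                start = min(start, node->start);
                end = max(end, node->end);
                range_tree_remove(root, node);
                /* caller owns 'node' again here and may free it */
        }

        new->start = start;
        new->end = end;
        range_tree_add(root, new);
}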
* Re: [PATCH 1/2] [RFC] Range tree implementation 2012-03-16 22:51 ` [PATCH 1/2] [RFC] Range tree implementation John Stultz @ 2012-03-20 10:00 ` Dmitry Adamushko 2012-03-20 18:04 ` John Stultz 2012-03-20 16:44 ` Aneesh Kumar K.V 1 sibling, 1 reply; 31+ messages in thread From: Dmitry Adamushko @ 2012-03-20 10:00 UTC (permalink / raw) To: John Stultz Cc: linux-kernel, Andrew Morton, Android Kernel Team, Robert Love, Mel Gorman, Hugh Dickins, Dave Hansen, Rik van Riel, Dave Chinner, Neil Brown, Andrea Righi, Aneesh Kumar K.V Hi John, On 16 March 2012 23:51, John Stultz <john.stultz@linaro.org> wrote: > After Andrew suggested something like his mumbletree idea > to better store a list of ranges, I worked on a few different > approaches, and this is what I've finally managed to get working. > > I suspect range-tree isn't a totally accurate name, but I > couldn't quite make out the difference between range trees > and interval trees, so I just picked one to call it. Do > let me know if you have a better name. > > The idea of storing ranges in a tree is nice, but has a number > of complications. When adding a range, its possible that a > large range will consume and merge a number of smaller ranges. Have you considered using 'prio_tree' (include/linux/prio_tree.h)? If we aim at addressing a wide range of possible use-cases (different patterns of adding/removing volatile ranges), then, at first glance, prio_tree looks like a better approach. e.g. for the "consume and merge a number of smaller ranges" scenario above, prio_tree gives O(log n) [ O(log n + m) ] behavior iso O(m log n) in your case. --Dmitry ^ permalink raw reply [flat|nested] 31+ messages in thread
* Re: [PATCH 1/2] [RFC] Range tree implementation 2012-03-20 10:00 ` Dmitry Adamushko @ 2012-03-20 18:04 ` John Stultz 0 siblings, 0 replies; 31+ messages in thread From: John Stultz @ 2012-03-20 18:04 UTC (permalink / raw) To: Dmitry Adamushko Cc: linux-kernel, Andrew Morton, Android Kernel Team, Robert Love, Mel Gorman, Hugh Dickins, Dave Hansen, Rik van Riel, Dave Chinner, Neil Brown, Andrea Righi, Aneesh Kumar K.V On 03/20/2012 03:00 AM, Dmitry Adamushko wrote: > Hi John, > > On 16 March 2012 23:51, John Stultz<john.stultz@linaro.org> wrote: >> After Andrew suggested something like his mumbletree idea >> to better store a list of ranges, I worked on a few different >> approaches, and this is what I've finally managed to get working. >> >> I suspect range-tree isn't a totally accurate name, but I >> couldn't quite make out the difference between range trees >> and interval trees, so I just picked one to call it. Do >> let me know if you have a better name. >> >> The idea of storing ranges in a tree is nice, but has a number >> of complications. When adding a range, its possible that a >> large range will consume and merge a number of smaller ranges. > Have you considered using 'prio_tree' (include/linux/prio_tree.h)? If > we aim at addressing a wide range of possible use-cases (different > patterns of adding/removing volatile ranges), then, at first glance, > prio_tree looks like a better approach. I'll take a closer look at that! > e.g. for the "consume and merge a number of smaller ranges" scenario > above, prio_tree gives O(log n) [ O(log n + m) ] behavior iso O(m log > n) in your case. Yea, one of the items I was looking at yesterday was to improve the range insert/remove usage, since I end up starting each lookup from the root node over and over. I'm thinking of adding a iterate-next type call so that we don't re-start the lookup each iteration of the loop once we've found an item. Thanks again for the great feedback! -john ^ permalink raw reply [flat|nested] 31+ messages in thread
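A hypothetical sketch of the iterate-next style helper discussed above, built on rb_next() over the existing rbtree (which is keyed by start, so forward iteration visits nodes in ascending start order). The name and exact semantics are assumptions, not something from the posted patches.

#include <linux/rbtree.h>
#include <linux/rangetree.h>

/* Continue a range lookup from an already-found node instead of walking
 * down from the root again. Returns the next node overlapping
 * [start, end], or NULL if there is none. */
static struct range_tree_node *
range_tree_next_in_range(struct range_tree_node *node, u64 start, u64 end)
{
        struct rb_node *next;
        struct range_tree_node *candidate;

        for (next = rb_next(&node->rb); next; next = rb_next(next)) {
                candidate = rb_entry(next, struct range_tree_node, rb);
                if (candidate->start > end)
                        break;          /* sorted by start: no more overlaps */
                if (candidate->end >= start)
                        return candidate;
        }
        return NULL;
}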
* Re: [PATCH 1/2] [RFC] Range tree implementation 2012-03-16 22:51 ` [PATCH 1/2] [RFC] Range tree implementation John Stultz 2012-03-20 10:00 ` Dmitry Adamushko @ 2012-03-20 16:44 ` Aneesh Kumar K.V 1 sibling, 0 replies; 31+ messages in thread From: Aneesh Kumar K.V @ 2012-03-20 16:44 UTC (permalink / raw) To: John Stultz, linux-kernel Cc: John Stultz, Andrew Morton, Android Kernel Team, Robert Love, Mel Gorman, Hugh Dickins, Dave Hansen, Rik van Riel, Dmitry Adamushko, Dave Chinner, Neil Brown, Andrea Righi John Stultz <john.stultz@linaro.org> writes: .... > +/** > + * range_tree_in_range - Returns the first node that overlaps with the > + * given range > + * @root: range_tree root > + * @start: range start > + * @end: range end > + * > + */ > +struct range_tree_node *range_tree_in_range(struct range_tree_root *root, > + u64 start, u64 end) > +{ > + struct rb_node **p = &root->head.rb_node; > + struct range_tree_node *candidate; > + > + while (*p) { > + candidate = rb_entry(*p, struct range_tree_node, rb); > + if (end < candidate->start) > + p = &(*p)->rb_left; > + else if (start > candidate->end) > + p = &(*p)->rb_right; > + else > + return candidate; > + } > + > + return 0; return NULL ? > +} > + > + > +/** > + * range_tree_in_range - Returns the first node that overlaps or is adjacent > + * with the given range > + * @root: range_tree root > + * @start: range start > + * @end: range end > + * > + */ The comment needs update > +struct range_tree_node *range_tree_in_range_adjacent( > + struct range_tree_root *root, > + u64 start, u64 end) > +{ > + struct rb_node **p = &root->head.rb_node; > + struct range_tree_node *candidate; > + > + while (*p) { > + candidate = rb_entry(*p, struct range_tree_node, rb); > + if (end+1 < candidate->start) > + p = &(*p)->rb_left; > + else if (start > candidate->end + 1) > + p = &(*p)->rb_right; > + else > + return candidate; > + } > + return 0; > +} > + Below is my hack to get hugetlbfs code converted. The patch compiles. Will test and send a signed-off-by version later. not-signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com> fs/hugetlbfs/inode.c | 3 +- include/linux/hugetlb.h | 2 + mm/hugetlb.c | 291 ++++++++++++++++++++++------------------------- 3 files changed, 139 insertions(+), 157 deletions(-) diff --git a/fs/hugetlbfs/inode.c b/fs/hugetlbfs/inode.c index ca4fa70..8309f5e 100644 --- a/fs/hugetlbfs/inode.c +++ b/fs/hugetlbfs/inode.c @@ -455,6 +455,7 @@ static struct inode *hugetlbfs_get_root(struct super_block *sb, inode->i_atime = inode->i_mtime = inode->i_ctime = CURRENT_TIME; info = HUGETLBFS_I(inode); mpol_shared_policy_init(&info->policy, NULL); + range_tree_init(&info->rg_root); inode->i_op = &hugetlbfs_dir_inode_operations; inode->i_fop = &simple_dir_operations; /* directory inodes start off with i_nlink == 2 (for "." entry) */ @@ -478,7 +479,6 @@ static struct inode *hugetlbfs_get_inode(struct super_block *sb, inode->i_mapping->a_ops = &hugetlbfs_aops; inode->i_mapping->backing_dev_info =&hugetlbfs_backing_dev_info; inode->i_atime = inode->i_mtime = inode->i_ctime = CURRENT_TIME; - INIT_LIST_HEAD(&inode->i_mapping->private_list); info = HUGETLBFS_I(inode); /* * The policy is initialized here even if we are creating a @@ -488,6 +488,7 @@ static struct inode *hugetlbfs_get_inode(struct super_block *sb, * the rb tree will still be empty. 
*/ mpol_shared_policy_init(&info->policy, NULL); + range_tree_init(&info->rg_root); switch (mode & S_IFMT) { default: init_special_inode(inode, mode, dev); diff --git a/include/linux/hugetlb.h b/include/linux/hugetlb.h index 32e948c..b785541a 100644 --- a/include/linux/hugetlb.h +++ b/include/linux/hugetlb.h @@ -5,6 +5,7 @@ #include <linux/fs.h> #include <linux/hugetlb_inline.h> #include <linux/cgroup.h> +#include <linux/rangetree.h> struct ctl_table; struct user_struct; @@ -150,6 +151,7 @@ struct hugetlbfs_sb_info { struct hugetlbfs_inode_info { struct shared_policy policy; + struct range_tree_root rg_root; struct inode vfs_inode; }; diff --git a/mm/hugetlb.c b/mm/hugetlb.c index 4e1462d..a83727d 100644 --- a/mm/hugetlb.c +++ b/mm/hugetlb.c @@ -69,148 +69,94 @@ static DEFINE_SPINLOCK(hugetlb_lock); * down_read(&mm->mmap_sem); * mutex_lock(&hugetlb_instantiation_mutex); */ -struct file_region { - struct list_head link; - long from; - long to; -}; - -static long region_add(struct list_head *head, long f, long t) +static long region_chg(struct range_tree_root *rg_root, long start, long end, + struct range_tree_node **rg_nodep) { - struct file_region *rg, *nrg, *trg; + long chg = 0; + struct range_tree_node *rg_node; - /* Locate the region we are either in or before. */ - list_for_each_entry(rg, head, link) - if (f <= rg->to) - break; + rg_node = range_tree_in_range_adjacent(rg_root, start, end); + /* + * If we need to allocate a new range_tree_node, we return a charge + * with NULL *rg_node; + */ + if (rg_node == NULL) + return end - start; - /* Round our left edge to the current segment if it encloses us. */ - if (f > rg->from) - f = rg->from; + if (start < rg_node->start) + chg += rg_node->start - start; + if (rg_node->end < end) + chg += end - rg_node->end; - /* Check for and consume any regions we now overlap with. */ - nrg = rg; - list_for_each_entry_safe(rg, trg, rg->link.prev, link) { - if (&rg->link == head) - break; - if (rg->from > t) - break; - - /* If this area reaches higher then extend our area to - * include it completely. If this is not the first area - * which we intend to reuse, free it. */ - if (rg->to > t) - t = rg->to; - if (rg != nrg) { - list_del(&rg->link); - kfree(rg); - } - } - nrg->from = f; - nrg->to = t; - return 0; + *rg_nodep = rg_node; + return chg; } -static long region_chg(struct list_head *head, long f, long t) +static void region_add(struct range_tree_root *rg_root, long start, long end, + struct range_tree_node *rg_node) { - struct file_region *rg, *nrg; - long chg = 0; - - /* Locate the region we are before or in. */ - list_for_each_entry(rg, head, link) - if (f <= rg->to) - break; + if (rg_node == NULL) + return; - /* If we are below the current region then a new region is required. - * Subtle, allocate a new region at the position but make it zero - * size such that we can guarantee to record the reservation. */ - if (&rg->link == head || t < rg->from) { - nrg = kmalloc(sizeof(*nrg), GFP_KERNEL); - if (!nrg) - return -ENOMEM; - nrg->from = f; - nrg->to = f; - INIT_LIST_HEAD(&nrg->link); - list_add(&nrg->link, rg->link.prev); + if (start < rg_node->start) + rg_node->start = start; - return t - f; - } + if (end > rg_node->end) + rg_node->end = end; - /* Round our left edge to the current segment if it encloses us. */ - if (f > rg->from) - f = rg->from; - chg = t - f; - - /* Check for and consume any regions we now overlap with. 
*/ - list_for_each_entry(rg, rg->link.prev, link) { - if (&rg->link == head) - break; - if (rg->from > t) - return chg; - - /* We overlap with this area, if it extends further than - * us then we must extend ourselves. Account for its - * existing reservation. */ - if (rg->to > t) { - chg += rg->to - t; - t = rg->to; - } - chg -= rg->to - rg->from; - } - return chg; + range_tree_add(rg_root, rg_node); } -static long region_truncate(struct list_head *head, long end) +static long region_truncate(struct range_tree_root *rg_root, long off) { - struct file_region *rg, *trg; long chg = 0; - - /* Locate the region we are either in or before. */ - list_for_each_entry(rg, head, link) - if (end <= rg->to) - break; - if (&rg->link == head) - return 0; - - /* If we are in the middle of a region then adjust it. */ - if (end > rg->from) { - chg = rg->to - end; - rg->to = end; - rg = list_entry(rg->link.next, typeof(*rg), link); - } - - /* Drop any remaining regions. */ - list_for_each_entry_safe(rg, trg, rg->link.prev, link) { - if (&rg->link == head) - break; - chg += rg->to - rg->from; - list_del(&rg->link); - kfree(rg); + struct rb_node *rb_node; + +restart: + rb_node = rb_first(&rg_root->head); + while (rb_node) { + struct range_tree_node *rg_node; + rg_node = rb_entry(rb_node, struct range_tree_node, rb); + if (rg_node->end <= off) { + rb_node = rb_next(rb_node); + continue; + } + if (rg_node->start < off) { + chg += rg_node->end - off; + rg_node->end = off; + rb_node = rb_next(rb_node); + continue; + } + chg += rg_node->end - rg_node->start; + rb_erase(rb_node, &rg_root->head); + goto restart; } return chg; } -static long region_count(struct list_head *head, long f, long t) +static long region_count(struct range_tree_root *rg_root, long start, long end) { - struct file_region *rg; long chg = 0; + struct rb_node *rb_node; - /* Locate each segment we overlap with, and count that overlap. */ - list_for_each_entry(rg, head, link) { - int seg_from; - int seg_to; + rb_node = rb_first(&rg_root->head); + while (rb_node) { + int seg_from, seg_to; + struct range_tree_node *rg_node; - if (rg->to <= f) + rg_node = rb_entry(rb_node, struct range_tree_node, rb); + if (rg_node->end <= start) { + rb_node = rb_next(rb_node); continue; - if (rg->from >= t) + } + if (rg_node->start >= end) break; - - seg_from = max(rg->from, f); - seg_to = min(rg->to, t); + seg_from = max(rg_node->start, (u64)start); + seg_to = min(rg_node->end, (u64)end); chg += seg_to - seg_from; + rb_node = rb_next(rb_node); } - return chg; } @@ -302,7 +248,7 @@ static void set_vma_private_data(struct vm_area_struct *vma, struct resv_map { struct kref refs; - struct list_head regions; + struct range_tree_root rg_root; }; static struct resv_map *resv_map_alloc(void) @@ -312,7 +258,7 @@ static struct resv_map *resv_map_alloc(void) return NULL; kref_init(&resv_map->refs); - INIT_LIST_HEAD(&resv_map->regions); + range_tree_init(&resv_map->rg_root); return resv_map; } @@ -322,7 +268,7 @@ static void resv_map_release(struct kref *ref) struct resv_map *resv_map = container_of(ref, struct resv_map, refs); /* Clear out any active regions before we release the map. */ - region_truncate(&resv_map->regions, 0); + region_truncate(&resv_map->rg_root, 0); kfree(resv_map); } @@ -980,16 +926,19 @@ static void return_unused_surplus_pages(struct hstate *h, * No action is required on failure. 
*/ static long vma_needs_reservation(struct hstate *h, - struct vm_area_struct *vma, unsigned long addr) + struct vm_area_struct *vma, + unsigned long addr, + struct range_tree_node **rg_node) { struct address_space *mapping = vma->vm_file->f_mapping; struct inode *inode = mapping->host; + struct hugetlbfs_inode_info *hinfo = HUGETLBFS_I(inode); + *rg_node = NULL; if (vma->vm_flags & VM_MAYSHARE) { pgoff_t idx = vma_hugecache_offset(h, vma, addr); - return region_chg(&inode->i_mapping->private_list, - idx, idx + 1); + return region_chg(&hinfo->rg_root, idx, idx + 1, rg_node); } else if (!is_vma_resv_set(vma, HPAGE_RESV_OWNER)) { return 1; @@ -998,28 +947,34 @@ static long vma_needs_reservation(struct hstate *h, pgoff_t idx = vma_hugecache_offset(h, vma, addr); struct resv_map *reservations = vma_resv_map(vma); - err = region_chg(&reservations->regions, idx, idx + 1); + err = region_chg(&reservations->rg_root, idx, idx + 1, rg_node); if (err < 0) return err; return 0; } } static void vma_commit_reservation(struct hstate *h, - struct vm_area_struct *vma, unsigned long addr) + struct vm_area_struct *vma, + unsigned long addr, + struct range_tree_node *rg_node) { struct address_space *mapping = vma->vm_file->f_mapping; struct inode *inode = mapping->host; + struct hugetlbfs_inode_info *hinfo = HUGETLBFS_I(inode); + + if (rg_node == NULL) + return; if (vma->vm_flags & VM_MAYSHARE) { pgoff_t idx = vma_hugecache_offset(h, vma, addr); - region_add(&inode->i_mapping->private_list, idx, idx + 1); + region_add(&hinfo->rg_root, idx, idx + 1, rg_node); } else if (is_vma_resv_set(vma, HPAGE_RESV_OWNER)) { pgoff_t idx = vma_hugecache_offset(h, vma, addr); struct resv_map *reservations = vma_resv_map(vma); /* Mark this page used in the map. */ - region_add(&reservations->regions, idx, idx + 1); + region_add(&reservations->rg_root, idx, idx + 1, rg_node); } } @@ -1027,6 +982,7 @@ static struct page *alloc_huge_page(struct vm_area_struct *vma, unsigned long addr, int avoid_reserve) { int ret, idx; + struct range_tree_node *rg_node; struct hstate *h = hstate_vma(vma); struct page *page; struct mem_cgroup *memcg; @@ -1042,18 +998,24 @@ static struct page *alloc_huge_page(struct vm_area_struct *vma, * MAP_NORESERVE mappings may also need pages and quota allocated * if no reserve mapping overlaps. 
*/ - chg = vma_needs_reservation(h, vma, addr); - if (chg < 0) - return ERR_PTR(-ENOMEM); - if (chg) - if (hugetlb_get_quota(inode->i_mapping, chg)) - return ERR_PTR(-ENOSPC); + chg = vma_needs_reservation(h, vma, addr, &rg_node); + if (chg > 0 && rg_node == NULL) { + rg_node = kzalloc(sizeof(*rg_node), GFP_KERNEL); + if (rg_node == NULL) + return ERR_PTR(-ENOMEM); + } + if (chg) { + if (hugetlb_get_quota(inode->i_mapping, chg)) { + ret = -ENOSPC; + goto err_out; + } + } ret = mem_cgroup_hugetlb_charge_page(idx, pages_per_huge_page(h), &memcg); if (ret) { - hugetlb_put_quota(inode->i_mapping, chg); - return ERR_PTR(-ENOSPC); + ret = -ENOSPC; + goto err_out_quota; } spin_lock(&hugetlb_lock); page = dequeue_huge_page_vma(h, vma, addr, avoid_reserve); @@ -1062,21 +1024,26 @@ static struct page *alloc_huge_page(struct vm_area_struct *vma, if (!page) { page = alloc_buddy_huge_page(h, NUMA_NO_NODE); if (!page) { - mem_cgroup_hugetlb_uncharge_memcg(idx, - pages_per_huge_page(h), - memcg); - hugetlb_put_quota(inode->i_mapping, chg); - return ERR_PTR(-ENOSPC); + ret = -ENOSPC; + goto err_out_uncharge; } } set_page_private(page, (unsigned long) mapping); - vma_commit_reservation(h, vma, addr); + vma_commit_reservation(h, vma, addr, rg_node); /* update page cgroup details */ mem_cgroup_hugetlb_commit_charge(idx, pages_per_huge_page(h), memcg, page); return page; + +err_out_uncharge: + mem_cgroup_hugetlb_uncharge_memcg(idx, pages_per_huge_page(h), memcg); +err_out_quota: + hugetlb_put_quota(inode->i_mapping, chg); +err_out: + kfree(rg_node); + return ERR_PTR(ret); } int __weak alloc_bootmem_huge_page(struct hstate *h) @@ -2170,7 +2137,7 @@ static void hugetlb_vm_op_close(struct vm_area_struct *vma) end = vma_hugecache_offset(h, vma, vma->vm_end); reserve = (end - start) - - region_count(&reservations->regions, start, end); + region_count(&reservations->rg_root, start, end); kref_put(&reservations->refs, resv_map_release); @@ -2697,11 +2664,13 @@ retry: * any allocations necessary to record that reservation occur outside * the spinlock. */ - if ((flags & FAULT_FLAG_WRITE) && !(vma->vm_flags & VM_SHARED)) - if (vma_needs_reservation(h, vma, address) < 0) { + if ((flags & FAULT_FLAG_WRITE) && !(vma->vm_flags & VM_SHARED)) { + struct range_tree_node *rg_node; + if (vma_needs_reservation(h, vma, address, &rg_node) < 0) { ret = VM_FAULT_OOM; goto backout_unlocked; } + } spin_lock(&mm->page_table_lock); size = i_size_read(mapping->host) >> huge_page_shift(h); @@ -2789,7 +2758,8 @@ int hugetlb_fault(struct mm_struct *mm, struct vm_area_struct *vma, * consumed. */ if ((flags & FAULT_FLAG_WRITE) && !pte_write(entry)) { - if (vma_needs_reservation(h, vma, address) < 0) { + struct range_tree_node *rg_node; + if (vma_needs_reservation(h, vma, address, &rg_node) < 0) { ret = VM_FAULT_OOM; goto out_mutex; } @@ -2975,7 +2945,9 @@ int hugetlb_reserve_pages(struct inode *inode, vm_flags_t vm_flags) { long ret, chg; + struct range_tree_node *rg_node; struct hstate *h = hstate_inode(inode); + struct hugetlbfs_inode_info *hinfo = HUGETLBFS_I(inode); /* * Only apply hugepage reservation if asked. At fault time, an @@ -2992,25 +2964,27 @@ int hugetlb_reserve_pages(struct inode *inode, * called to make the mapping read-write. 
Assume !vma is a shm mapping */ if (!vma || vma->vm_flags & VM_MAYSHARE) - chg = region_chg(&inode->i_mapping->private_list, from, to); + chg = region_chg(&hinfo->rg_root, from, to, &rg_node); else { struct resv_map *resv_map = resv_map_alloc(); if (!resv_map) return -ENOMEM; - chg = to - from; - + chg = region_chg(&resv_map->rg_root, from, to, &rg_node); set_vma_resv_map(vma, resv_map); set_vma_resv_flags(vma, HPAGE_RESV_OWNER); } - - if (chg < 0) - return chg; - + if (chg > 0 && rg_node == NULL ) { + /* We need allocate a new node */ + rg_node = kzalloc(sizeof(*rg_node), GFP_KERNEL); + if (rg_node == NULL) + return -ENOMEM; + } /* There must be enough filesystem quota for the mapping */ - if (hugetlb_get_quota(inode->i_mapping, chg)) - return -ENOSPC; - + if (hugetlb_get_quota(inode->i_mapping, chg)) { + ret = -ENOSPC; + goto err_out; + } /* * Check enough hugepages are available for the reservation. * Hand back the quota if there are not @@ -3018,7 +2992,7 @@ int hugetlb_reserve_pages(struct inode *inode, ret = hugetlb_acct_memory(h, chg); if (ret < 0) { hugetlb_put_quota(inode->i_mapping, chg); - return ret; + goto err_out; } /* @@ -3033,14 +3007,19 @@ int hugetlb_reserve_pages(struct inode *inode, * else has to be done for private mappings here */ if (!vma || vma->vm_flags & VM_MAYSHARE) - region_add(&inode->i_mapping->private_list, from, to); + region_add(&hinfo->rg_root, from, to, rg_node); return 0; +err_out: + kfree(rg_node); + return ret; } void hugetlb_unreserve_pages(struct inode *inode, long offset, long freed) { struct hstate *h = hstate_inode(inode); - long chg = region_truncate(&inode->i_mapping->private_list, offset); + struct hugetlbfs_inode_info *hinfo = HUGETLBFS_I(inode); + + long chg = region_truncate(&hinfo->rg_root, offset); spin_lock(&inode->i_lock); inode->i_blocks -= (blocks_per_huge_page(h) * freed); ^ permalink raw reply related [flat|nested] 31+ messages in thread
* [PATCH 2/2] [RFC] fadvise: Add _VOLATILE,_ISVOLATILE, and _NONVOLATILE flags
  2012-03-16 22:51 [PATCH 0/2] [RFC] Volatile ranges (v4) John Stultz
  2012-03-16 22:51 ` [PATCH 1/2] [RFC] Range tree implementation John Stultz
@ 2012-03-16 22:51 ` John Stultz
  2012-03-17  0:47   ` [PATCH] fadvise volatile fixes from Dmitry John Stultz
  2012-03-17 16:21   ` [PATCH 2/2] [RFC] fadvise: Add _VOLATILE,_ISVOLATILE, and _NONVOLATILE flags Dmitry Adamushko
  2012-07-19 10:13 ` [PATCH 0/2] [RFC] Volatile ranges (v4) Dmitry Vyukov
  2 siblings, 2 replies; 31+ messages in thread

From: John Stultz @ 2012-03-16 22:51 UTC (permalink / raw)
To: linux-kernel
Cc: John Stultz, Andrew Morton, Android Kernel Team, Robert Love,
    Mel Gorman, Hugh Dickins, Dave Hansen, Rik van Riel,
    Dmitry Adamushko, Dave Chinner, Neil Brown, Andrea Righi,
    Aneesh Kumar K.V

This patch provides new fadvise flags that can be used to mark file pages as volatile, which will allow them to be discarded if the kernel wants to reclaim memory.

This is useful for userspace to allocate things like caches, and lets the kernel destructively (but safely) reclaim them when there's memory pressure. It's different from FADV_DONTNEED since the pages are not immediately discarded; they are only discarded under pressure.

This is very much influenced by the Android Ashmem interface by Robert Love, so credits to him and the Android developers. In many cases the code & logic come directly from the ashmem patch. The intent of this patch is to allow for ashmem-like behavior, but embeds the idea a little deeper into the VM code, instead of isolating it into a specific driver.

I'm very much a newbie at the VM code, so at this point I just want to get some input on the patch. If you have another idea for using something other than fadvise, or other thoughts on how the volatile ranges are stored, I'd be really interested in hearing them. So let me know if you have any comments or feedback!

Also many thanks to Dave Hansen, who helped design and develop the initial version of this patch, and has provided continued review and mentoring for me in the VM code.

v2:
* After the valid critique that just dropping pages would poke holes in volatile ranges, and instead we should zap an entire range if we drop any of it, I changed the code to more closely mimic the ashmem implementation, which zaps entire ranges via a shrinker using an lru list that tracks which range has been marked volatile the longest.

v3:
* Reworked to use range tree implementation.

v4:
* Renamed functions to avoid confusion.
* More consistent PAGE_CACHE_SHIFT usage, suggested by Dmitry Adamushko.
* Fixed exit-without-unlocking issue found by Dmitry Adamushko.
* Migrated to rbtree-based rangetree implementation.
* Simplified locking to use a global lock (we were grabbing the global lru lock every time anyway).
* Avoid ENOMEM issues by allocating before we get into complicated code.
* Added some documentation to the volatile.c file, from Neil Brown.

Known issues:
* Lockdep doesn't like calling vmtruncate_range() from a shrinker. Any help here on how to address this would be appreciated. I've tried switching to invalidate_inode_pages2_range(), but that always returns EBUSY in my testing, and I don't really want to launder dirty pages; instead I want to zap them.
* Concern over bloating the address_space struct CC: Andrew Morton <akpm@linux-foundation.org> CC: Android Kernel Team <kernel-team@android.com> CC: Robert Love <rlove@google.com> CC: Mel Gorman <mel@csn.ul.ie> CC: Hugh Dickins <hughd@google.com> CC: Dave Hansen <dave@linux.vnet.ibm.com> CC: Rik van Riel <riel@redhat.com> CC: Dmitry Adamushko <dmitry.adamushko@gmail.com> CC: Dave Chinner <david@fromorbit.com> CC: Neil Brown <neilb@suse.de> CC: Andrea Righi <andrea@betterlinux.com> CC: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com> Signed-off-by: John Stultz <john.stultz@linaro.org> --- fs/inode.c | 4 + include/linux/fadvise.h | 5 + include/linux/fs.h | 2 + include/linux/volatile.h | 14 ++ mm/Makefile | 2 +- mm/fadvise.c | 16 +++- mm/volatile.c | 313 ++++++++++++++++++++++++++++++++++++++++++++++ 7 files changed, 354 insertions(+), 2 deletions(-) create mode 100644 include/linux/volatile.h create mode 100644 mm/volatile.c diff --git a/fs/inode.c b/fs/inode.c index d3ebdbe..f602dc2 100644 --- a/fs/inode.c +++ b/fs/inode.c @@ -27,6 +27,7 @@ #include <linux/cred.h> #include <linux/buffer_head.h> /* for inode_has_buffers */ #include <linux/ratelimit.h> +#include <linux/volatile.h> #include "internal.h" /* @@ -254,6 +255,7 @@ void __destroy_inode(struct inode *inode) if (inode->i_default_acl && inode->i_default_acl != ACL_NOT_CACHED) posix_acl_release(inode->i_default_acl); #endif + mapping_clear_volatile_ranges(&inode->i_data); this_cpu_dec(nr_inodes); } EXPORT_SYMBOL(__destroy_inode); @@ -360,6 +362,8 @@ void address_space_init_once(struct address_space *mapping) spin_lock_init(&mapping->private_lock); INIT_RAW_PRIO_TREE_ROOT(&mapping->i_mmap); INIT_LIST_HEAD(&mapping->i_mmap_nonlinear); + range_tree_init(&mapping->volatile_root); + } EXPORT_SYMBOL(address_space_init_once); diff --git a/include/linux/fadvise.h b/include/linux/fadvise.h index e8e7471..443951c 100644 --- a/include/linux/fadvise.h +++ b/include/linux/fadvise.h @@ -18,4 +18,9 @@ #define POSIX_FADV_NOREUSE 5 /* Data will be accessed once. 
*/ #endif +#define POSIX_FADV_VOLATILE 8 /* _can_ toss, but don't toss now */ +#define POSIX_FADV_NONVOLATILE 9 /* Remove VOLATILE flag */ + + + #endif /* FADVISE_H_INCLUDED */ diff --git a/include/linux/fs.h b/include/linux/fs.h index 69cd5bb..2e20be1 100644 --- a/include/linux/fs.h +++ b/include/linux/fs.h @@ -10,6 +10,7 @@ #include <linux/ioctl.h> #include <linux/blk_types.h> #include <linux/types.h> +#include <linux/rangetree.h> /* * It's silly to have NR_OPEN bigger than NR_FILE, but you can change @@ -655,6 +656,7 @@ struct address_space { spinlock_t private_lock; /* for use by the address_space */ struct list_head private_list; /* ditto */ struct address_space *assoc_mapping; /* ditto */ + struct range_tree_root volatile_root; /* volatile range list */ } __attribute__((aligned(sizeof(long)))); /* * On most architectures that alignment is already the case; but diff --git a/include/linux/volatile.h b/include/linux/volatile.h new file mode 100644 index 0000000..5460d7b --- /dev/null +++ b/include/linux/volatile.h @@ -0,0 +1,14 @@ +#ifndef _LINUX_VOLATILE_H +#define _LINUX_VOLATILE_H + +#include <linux/fs.h> + +extern long mapping_range_volatile(struct address_space *mapping, + pgoff_t start_index, pgoff_t end_index); +extern long mapping_range_nonvolatile(struct address_space *mapping, + pgoff_t start_index, pgoff_t end_index); +extern long mapping_range_isvolatile(struct address_space *mapping, + pgoff_t start_index, pgoff_t end_index); +extern void mapping_clear_volatile_ranges(struct address_space *mapping); + +#endif /* _LINUX_VOLATILE_H */ diff --git a/mm/Makefile b/mm/Makefile index 50ec00e..7b6c7a8 100644 --- a/mm/Makefile +++ b/mm/Makefile @@ -13,7 +13,7 @@ obj-y := filemap.o mempool.o oom_kill.o fadvise.o \ readahead.o swap.o truncate.o vmscan.o shmem.o \ prio_tree.o util.o mmzone.o vmstat.o backing-dev.o \ page_isolation.o mm_init.o mmu_context.o percpu.o \ - $(mmu-y) + volatile.o $(mmu-y) obj-y += init-mm.o ifdef CONFIG_NO_BOOTMEM diff --git a/mm/fadvise.c b/mm/fadvise.c index 469491e0..3e33845 100644 --- a/mm/fadvise.c +++ b/mm/fadvise.c @@ -17,6 +17,7 @@ #include <linux/fadvise.h> #include <linux/writeback.h> #include <linux/syscalls.h> +#include <linux/volatile.h> #include <asm/unistd.h> @@ -106,7 +107,7 @@ SYSCALL_DEFINE(fadvise64_64)(int fd, loff_t offset, loff_t len, int advice) nrpages = end_index - start_index + 1; if (!nrpages) nrpages = ~0UL; - + ret = force_page_cache_readahead(mapping, file, start_index, nrpages); @@ -128,6 +129,19 @@ SYSCALL_DEFINE(fadvise64_64)(int fd, loff_t offset, loff_t len, int advice) invalidate_mapping_pages(mapping, start_index, end_index); break; + case POSIX_FADV_VOLATILE: + /* First and last PARTIAL page! */ + start_index = offset >> PAGE_CACHE_SHIFT; + end_index = endbyte >> PAGE_CACHE_SHIFT; + ret = mapping_range_volatile(mapping, start_index, end_index); + break; + case POSIX_FADV_NONVOLATILE: + /* First and last PARTIAL page! */ + start_index = offset >> PAGE_CACHE_SHIFT; + end_index = endbyte >> PAGE_CACHE_SHIFT; + ret = mapping_range_nonvolatile(mapping, start_index, + end_index); + break; default: ret = -EINVAL; } diff --git a/mm/volatile.c b/mm/volatile.c new file mode 100644 index 0000000..e412a8b --- /dev/null +++ b/mm/volatile.c @@ -0,0 +1,313 @@ +/* mm/volatile.c + * + * Volatile page range managment. + * Copyright 2011 Linaro + * + * Based on mm/ashmem.c + * by Robert Love <rlove@google.com> + * Copyright (C) 2008 Google, Inc. 
+ * + * + * This software is licensed under the terms of the GNU General Public + * License version 2, as published by the Free Software Foundation, and + * may be copied, distributed, and modified under those terms. + * + * This program is distributed in the hope that it will be useful, + * but WITHOUT ANY WARRANTY; without even the implied warranty of + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the + * GNU General Public License for more details. + * + * + * The goal behind volatile ranges is to allow applications to interact + * with the kernel's cache management infrastructure. In particular an + * application can say "this memory contains data that might be useful in + * the future, but can be reconstructed if necessary, so if the kernel + * needs, it can zap and reclaim this memory without having to swap it out. + * + * The proposed mechanism - at a high level - is for user-space to be able + * to say "This memory is volatile" and then later "this memory is no longer + * volatile". If the content of the memory is still available the second + * request succeeds. If not, the memory is marked non-volatile and an + * error is returned to denote that the contents have been lost. + * + * Credits to Neil Brown for the above description. + * + */ + +#include <linux/kernel.h> +#include <linux/fs.h> +#include <linux/mm.h> +#include <linux/slab.h> +#include <linux/pagemap.h> +#include <linux/volatile.h> + +static DEFINE_MUTEX(volatile_mutex); + +struct volatile_range { + struct list_head lru; + struct range_tree_node range_node; + + unsigned int purged; + struct address_space *mapping; +}; + +/* LRU list of volatile page ranges */ +static LIST_HEAD(volatile_lru_list); + +/* Count of pages on our LRU list */ +static u64 lru_count; + + +static inline u64 range_size(struct volatile_range *range) +{ + return range->range_node.end - range->range_node.start + 1; +} + +static inline void lru_add(struct volatile_range *range) +{ + list_add_tail(&range->lru, &volatile_lru_list); + lru_count += range_size(range); +} + +static inline void lru_del(struct volatile_range *range) +{ + list_del(&range->lru); + lru_count -= range_size(range); +} + +#define range_on_lru(range) (!(range)->purged) + + +static inline void volatile_range_resize(struct volatile_range *range, + pgoff_t start_index, pgoff_t end_index) +{ + size_t pre = range_size(range); + + range->range_node.start = start_index; + range->range_node.end = end_index; + + if (range_on_lru(range)) + lru_count -= pre - range_size(range); +} + +static struct volatile_range *vrange_alloc(void) +{ + struct volatile_range *new; + + new = kzalloc(sizeof(struct volatile_range), GFP_KERNEL); + if (!new) + return 0; + range_tree_node_init(&new->range_node); + return new; +} + +static void vrange_del(struct volatile_range *vrange) +{ + struct address_space *mapping; + mapping = vrange->mapping; + + if (range_on_lru(vrange)) + lru_del(vrange); + range_tree_remove(&mapping->volatile_root, &vrange->range_node); + kfree(vrange); +} + + + +/* + * Mark a region as volatile, allowing dirty pages to be purged + * under memory pressure + */ +long mapping_range_volatile(struct address_space *mapping, + pgoff_t start_index, pgoff_t end_index) +{ + struct volatile_range *new; + struct range_tree_node *node; + + u64 start, end; + int purged = 0; + start = (u64)start_index; + end = (u64)end_index; + + new = vrange_alloc(); + if (!new) + return -ENOMEM; + + mutex_lock(&volatile_mutex); + + node = range_tree_in_range_adjacent(&mapping->volatile_root, + start, 
end); + while (node) { + struct volatile_range *vrange; + + /* Already entirely marked volatile, so we're done */ + if (node->start < start && node->end > end) { + /* don't need the allocated value */ + kfree(new); + goto out; + } + + /* Grab containing volatile range */ + vrange = container_of(node, struct volatile_range, range_node); + + /* resize range */ + start = min_t(u64, start, node->start); + end = max_t(u64, end, node->end); + purged |= vrange->purged; + + + vrange_del(vrange); + + /* get the next possible overlap */ + node = range_tree_in_range(&mapping->volatile_root, start, end); + } + + new->mapping = mapping; + new->range_node.start = start; + new->range_node.end = end; + new->purged = purged; + + range_tree_add(&mapping->volatile_root, &new->range_node); + if (range_on_lru(new)) + lru_add(new); + +out: + mutex_unlock(&volatile_mutex); + + return 0; +} + +/* + * Mark a region as nonvolatile, returns 1 if any pages in the region + * were purged. + */ +long mapping_range_nonvolatile(struct address_space *mapping, + pgoff_t start_index, pgoff_t end_index) +{ + struct volatile_range *new; + struct range_tree_node *node; + int ret = 0; + u64 start, end; + int used_new = 0; + + + start = (u64)start_index; + end = (u64)end_index; + + /* create new node */ + new = vrange_alloc(); + if (!new) + return -ENOMEM; + + mutex_lock(&volatile_mutex); + node = range_tree_in_range(&mapping->volatile_root, start, end); + while (node) { + struct volatile_range *vrange; + vrange = container_of(node, struct volatile_range, range_node); + + + ret |= vrange->purged; + + if (start <= node->start && end >= node->end) { + vrange_del(vrange); + } else if (node->start >= start) { + volatile_range_resize(vrange, end+1, node->end); + } else if (node->end <= end) { + volatile_range_resize(vrange, node->start, start-1); + } else { + /* we only do this once */ + used_new = 1; + new->mapping = mapping; + new->range_node.start = end + 1; + new->range_node.end = node->end; + volatile_range_resize(vrange, node->start, start-1); + range_tree_add(&mapping->volatile_root, + &new->range_node); + if (range_on_lru(new)) + lru_add(new); + + break; + } + node = range_tree_in_range(&mapping->volatile_root, start, end); + } + mutex_unlock(&volatile_mutex); + + if (!used_new) + kfree(new); + + return ret; +} + + +/* + * Cleans up any volatile ranges. + */ +void mapping_clear_volatile_ranges(struct address_space *mapping) +{ + struct volatile_range *tozap; + + mutex_lock(&volatile_mutex); + while (!range_tree_empty(&mapping->volatile_root)) { + struct range_tree_node *tmp; + tmp = range_tree_root_node(&mapping->volatile_root); + tozap = container_of(tmp, struct volatile_range, range_node); + vrange_del(tozap); + + } + mutex_unlock(&volatile_mutex); +} + +/* + * Purges volatile ranges when under memory pressure + */ +static int volatile_shrink(struct shrinker *ignored, struct shrink_control *sc) +{ + struct volatile_range *range, *next; + unsigned long nr_to_scan = sc->nr_to_scan; + const gfp_t gfp_mask = sc->gfp_mask; + + if (nr_to_scan && !(gfp_mask & __GFP_FS)) + return -1; + if (!nr_to_scan) + return lru_count; + + mutex_lock(&volatile_mutex); + list_for_each_entry_safe(range, next, &volatile_lru_list, lru) { + struct inode *inode; + loff_t start, end; + + inode = range->mapping->host; + + start = range->range_node.start << PAGE_CACHE_SHIFT; + end = ((range->range_node.end + 1) << PAGE_CACHE_SHIFT) - 1; + + /* + * XXX - calling vmtruncate_range from a shrinker causes + * lockdep warnings. Revisit this! 
+ */ + if (!vmtruncate_range(inode, start, end)) { + lru_del(range); + range->purged = 1; + nr_to_scan -= range_size(range); + } + + if (nr_to_scan <= 0) + break; + } + mutex_unlock(&volatile_mutex); + + return lru_count; +} + +static struct shrinker volatile_shrinker = { + .shrink = volatile_shrink, + .seeks = DEFAULT_SEEKS, +}; + +static int __init volatile_init(void) +{ + register_shrinker(&volatile_shrinker); + return 0; +} + +arch_initcall(volatile_init); -- 1.7.3.2.146.gca209 ^ permalink raw reply related [flat|nested] 31+ messages in thread
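For context, here is a userspace sketch of how an application cache might drive the advice values added by this patch. The POSIX_FADV_VOLATILE/NONVOLATILE defines are repeated because libc headers know nothing about them, the raw syscall is used on the assumption that a stock posix_fadvise() wrapper may not pass a positive "purged" return value back to the caller, and the argument order shown is the x86_64 fadvise64() one; all of this presumes a kernel with the patch applied.

#define _GNU_SOURCE
#include <stdio.h>
#include <unistd.h>
#include <sys/syscall.h>

#define POSIX_FADV_VOLATILE     8       /* _can_ toss, but don't toss now */
#define POSIX_FADV_NONVOLATILE  9       /* Remove VOLATILE flag */

/* x86_64 fadvise64(fd, offset, len, advice); other arches differ. */
static long fadvise_volatile(int fd, long off, long len, int advice)
{
        return syscall(SYS_fadvise64, fd, off, len, advice);
}

/* Done with this chunk of the cache for now: volunteer it for reclaim. */
static long cache_chunk_release(int fd, long off, long len)
{
        return fadvise_volatile(fd, off, len, POSIX_FADV_VOLATILE);
}

/* About to reuse the chunk: pin it and check whether it survived. */
static long cache_chunk_reclaim(int fd, long off, long len)
{
        long ret = fadvise_volatile(fd, off, len, POSIX_FADV_NONVOLATILE);

        if (ret > 0)    /* per the patch, >0 means pages were purged */
                fprintf(stderr, "cache contents lost, must regenerate\n");
        return ret;
}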
* [PATCH] fadvise volatile fixes from Dmitry 2012-03-16 22:51 ` [PATCH 2/2] [RFC] fadvise: Add _VOLATILE,_ISVOLATILE, and _NONVOLATILE flags John Stultz @ 2012-03-17 0:47 ` John Stultz 2012-03-17 16:21 ` [PATCH 2/2] [RFC] fadvise: Add _VOLATILE,_ISVOLATILE, and _NONVOLATILE flags Dmitry Adamushko 1 sibling, 0 replies; 31+ messages in thread From: John Stultz @ 2012-03-17 0:47 UTC (permalink / raw) To: John Stultz Cc: linux-kernel, Andrew Morton, Android Kernel Team, Robert Love, Mel Gorman, Hugh Dickins, Dave Hansen, Rik van Riel, Dmitry Adamushko, Dave Chinner, Neil Brown, Andrea Righi, Aneesh Kumar K.V On 03/16/2012 03:51 PM, John Stultz wrote: > This patch provides new fadvise flags that can be used to mark > file pages as volatile, which will allow it to be discarded if the > kernel wants to reclaim memory. Right after sending this I realized I had forgotten to include some fixes for issues Dmitry pointed out. So I've included them here. Signed-off-by: John Stultz<john.stultz@linaro.org> --- mm/volatile.c | 5 +++-- 1 files changed, 3 insertions(+), 2 deletions(-) diff --git a/mm/volatile.c b/mm/volatile.c index e412a8b..f40c02e 100644 --- a/mm/volatile.c +++ b/mm/volatile.c @@ -220,11 +220,12 @@ long mapping_range_nonvolatile(struct address_space *mapping, new->mapping = mapping; new->range_node.start = end + 1; new->range_node.end = node->end; - volatile_range_resize(vrange, node->start, start-1); + new->purged = vrange->purged; range_tree_add(&mapping->volatile_root, &new->range_node); if (range_on_lru(new)) lru_add(new); + volatile_range_resize(vrange, node->start, start-1); break; } @@ -263,7 +264,7 @@ void mapping_clear_volatile_ranges(struct address_space *mapping) static int volatile_shrink(struct shrinker *ignored, struct shrink_control *sc) { struct volatile_range *range, *next; - unsigned long nr_to_scan = sc->nr_to_scan; + s64 nr_to_scan = sc->nr_to_scan; const gfp_t gfp_mask = sc->gfp_mask; if (nr_to_scan&& !(gfp_mask& __GFP_FS)) -- 1.7.3.2.146.gca209 ^ permalink raw reply related [flat|nested] 31+ messages in thread
* Re: [PATCH 2/2] [RFC] fadvise: Add _VOLATILE,_ISVOLATILE, and _NONVOLATILE flags 2012-03-16 22:51 ` [PATCH 2/2] [RFC] fadvise: Add _VOLATILE,_ISVOLATILE, and _NONVOLATILE flags John Stultz 2012-03-17 0:47 ` [PATCH] fadvise volatile fixes from Dmitry John Stultz @ 2012-03-17 16:21 ` Dmitry Adamushko 2012-03-18 9:13 ` Dmitry Adamushko 2012-03-20 0:18 ` John Stultz 1 sibling, 2 replies; 31+ messages in thread From: Dmitry Adamushko @ 2012-03-17 16:21 UTC (permalink / raw) To: John Stultz Cc: linux-kernel, Andrew Morton, Android Kernel Team, Robert Love, Mel Gorman, Hugh Dickins, Dave Hansen, Rik van Riel, Dave Chinner, Neil Brown, Andrea Righi, Aneesh Kumar K.V Hi John, [ ... ] > +/* > + * Mark a region as volatile, allowing dirty pages to be purged > + * under memory pressure > + */ > +long mapping_range_volatile(struct address_space *mapping, > + pgoff_t start_index, pgoff_t end_index) > +{ > + struct volatile_range *new; > + struct range_tree_node *node; > + > + u64 start, end; > + int purged = 0; > + start = (u64)start_index; > + end = (u64)end_index; > + > + new = vrange_alloc(); > + if (!new) > + return -ENOMEM; > + > + mutex_lock(&volatile_mutex); > + > + node = range_tree_in_range_adjacent(&mapping->volatile_root, > + start, end); > + while (node) { > + struct volatile_range *vrange; > + > + /* Already entirely marked volatile, so we're done */ > + if (node->start < start && node->end > end) { > + /* don't need the allocated value */ > + kfree(new); > + goto out; > + } > + > + /* Grab containing volatile range */ > + vrange = container_of(node, struct volatile_range, range_node); > + > + /* resize range */ > + start = min_t(u64, start, node->start); > + end = max_t(u64, end, node->end); > + purged |= vrange->purged; > + > + > + vrange_del(vrange); > + > + /* get the next possible overlap */ > + node = range_tree_in_range(&mapping->volatile_root, start, end); I guess range_tree_in_range_adjacent() should be used here again. There can be 2 adjacent regions (left and right), and we'll miss one of them with range_tree_in_range(). Also (as I had already mentioned before), I think that new ranges must not be merged with the existing "vrange->purged == 1" ranges. Otherwise, for some use cases, the whole idea of 'volatility' gets broken. For example, when an application is processing a big buffer in small consequent chunks (marking a chunk as volatile when done with it), and the range gets 'purged' by the kernel early in this process (when it's still small). -- Dmitry ^ permalink raw reply [flat|nested] 31+ messages in thread
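Below is a sketch of the merge loop in mapping_range_volatile() with the first point above applied, i.e. re-running the adjacent lookup on every pass so a neighbour on the far side of the grown range is not missed. This only illustrates the reviewer's suggestion; it is not a patch that was posted, and the second point about not merging with already-purged ranges is not addressed here.

        /* Fragment of mapping_range_volatile(), with the adjacency fix:
         * the lookup at the bottom of the loop also uses the _adjacent
         * variant, so both left and right neighbours are found. */
        node = range_tree_in_range_adjacent(&mapping->volatile_root,
                                            start, end);
        while (node) {
                struct volatile_range *vrange;

                /* Already entirely marked volatile, so we're done */
                if (node->start < start && node->end > end) {
                        kfree(new);
                        goto out;
                }

                vrange = container_of(node, struct volatile_range, range_node);

                /* Grow the new range over the old one */
                start = min_t(u64, start, node->start);
                end = max_t(u64, end, node->end);
                purged |= vrange->purged;

                vrange_del(vrange);

                /* Look for further overlapping *or adjacent* ranges */
                node = range_tree_in_range_adjacent(&mapping->volatile_root,
                                                    start, end);
        }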
* Re: [PATCH 2/2] [RFC] fadvise: Add _VOLATILE,_ISVOLATILE, and _NONVOLATILE flags 2012-03-17 16:21 ` [PATCH 2/2] [RFC] fadvise: Add _VOLATILE,_ISVOLATILE, and _NONVOLATILE flags Dmitry Adamushko @ 2012-03-18 9:13 ` Dmitry Adamushko 2012-03-20 0:18 ` John Stultz 1 sibling, 0 replies; 31+ messages in thread From: Dmitry Adamushko @ 2012-03-18 9:13 UTC (permalink / raw) To: John Stultz Cc: linux-kernel, Andrew Morton, Android Kernel Team, Robert Love, Mel Gorman, Hugh Dickins, Dave Hansen, Rik van Riel, Dave Chinner, Neil Brown, Andrea Righi, Aneesh Kumar K.V On 17 March 2012 17:21, Dmitry Adamushko <dmitry.adamushko@gmail.com> wrote: > Hi John, > > [ ... ] > >> +/* >> + * Mark a region as volatile, allowing dirty pages to be purged >> + * under memory pressure >> + */ >> +long mapping_range_volatile(struct address_space *mapping, >> + pgoff_t start_index, pgoff_t end_index) >> +{ >> + struct volatile_range *new; >> + struct range_tree_node *node; >> + >> + u64 start, end; >> + int purged = 0; >> + start = (u64)start_index; >> + end = (u64)end_index; >> + >> + new = vrange_alloc(); >> + if (!new) >> + return -ENOMEM; >> + >> + mutex_lock(&volatile_mutex); >> + >> + node = range_tree_in_range_adjacent(&mapping->volatile_root, >> + start, end); >> + while (node) { >> + struct volatile_range *vrange; >> + >> + /* Already entirely marked volatile, so we're done */ >> + if (node->start < start && node->end > end) { >> + /* don't need the allocated value */ >> + kfree(new); >> + goto out; >> + } >> + >> + /* Grab containing volatile range */ >> + vrange = container_of(node, struct volatile_range, range_node); >> + >> + /* resize range */ >> + start = min_t(u64, start, node->start); >> + end = max_t(u64, end, node->end); >> + purged |= vrange->purged; >> + >> + >> + vrange_del(vrange); >> + >> + /* get the next possible overlap */ >> + node = range_tree_in_range(&mapping->volatile_root, start, end); > > I guess range_tree_in_range_adjacent() should be used here again. > There can be 2 adjacent regions (left and right), and we'll miss one > of them with range_tree_in_range(). > > Also (as I had already mentioned before), I think that new ranges must > not be merged with the existing "vrange->purged == 1" ranges. > Otherwise, for some use cases, the whole idea of 'volatility' gets > broken. For example, when an application is processing a big buffer in > small consequent chunks (marking a chunk as volatile when done with > it), and the range gets 'purged' by the kernel early in this process > (when it's still small). Alternatively, we could immediately truncate purged==0 ranges (including the one for which mapping_range_volatile() is called) when merging them with purged==1 ranges. This would result in a more consistent, but perhaps too aggressive behavior. Let's consider the following use case: [1, 10] is an existing purged==1 volatile region, and an application declares [11, 12] as volatile. 1) current approach: [1, 12] a single purged==1 region, where [11, 12] was not really truncated (and it could have been [11, 100]); 2) aggressive purge-it-all approach: a single [1, 12] purged==1 region. The newly added region gets truncated even when there is no shortage of memory at the moment of addition. 3) do-not-merge approach: [1, 10] purged==1 and [11, 12] purged==0 regions; the later is on the lru list. it adds extra complexities though (e.g. the need to merge purged ranges in the shrinker code). 
But what's more important, do we have a model of application behavior that is expected to be observed in, say, 90+% of cases? What patterns are more common? For example, 1) make_volatile [1, 10] ; ... ; make_volatile [5, 15] // overlapping volatile regions 2) make_volatile [1, 10] ; ... ; make_volatile [1, 15] // explicit merge 3) make_volatile [1, 10] ; ... ; make_volatile [11, 15] // adjacent volatile regions I guess (2) and (3) would be more common, and (3) even more so with independently used regions (say, by different threads). For (3), do we really want to merge purged==0 regions when they are adjacent? What if they are used independently? Let's consider this case: (a) make_volatile [1, 100] ; ... ; (z) make_volatile [101, 102] (z) region is used much more frequently by the application [ make_nonvolatile -> do-smth -> make_volatile ], and (a) is not used often - it's volatile most of the time. If we merge both regions when they are still purged==0, the whole [1, 102] will tend to be at the end of the LRU list => - we miss an opportunity to truncate (a); - other regions that are more frequently used than (a) get truncated. In this light, (3) would be better off behaving as if (a) and (z) were not adjacent, i.e. no merge... -- Dmitry ^ permalink raw reply [flat|nested] 31+ messages in thread
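To make the [first, last] page notation above concrete, the three patterns correspond roughly to the fadvise calls below. POSIX_FADV_VOLATILE and its value 8 are taken from the RFC patch in this thread (not an upstream flag), mark_volatile() is just an illustrative helper, and the sketch relies on glibc's posix_fadvise() passing the advice value straight through to the kernel:

#include <fcntl.h>
#include <unistd.h>

#define POSIX_FADV_VOLATILE 8	/* value from the RFC patch, not upstream */

/* Illustrative helper: mark file pages [first, last] of fd volatile. */
static int mark_volatile(int fd, off_t first, off_t last)
{
	long page = sysconf(_SC_PAGESIZE);

	return posix_fadvise(fd, first * page, (last - first + 1) * page,
			     POSIX_FADV_VOLATILE);
}

/* (1) overlapping regions */
/*	mark_volatile(fd, 1, 10);  ...  mark_volatile(fd, 5, 15);	*/

/* (2) explicit merge of an existing region into a larger one */
/*	mark_volatile(fd, 1, 10);  ...  mark_volatile(fd, 1, 15);	*/

/* (3) adjacent regions, possibly used by independent threads */
/*	mark_volatile(fd, 1, 10);  ...  mark_volatile(fd, 11, 15);	*/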
* Re: [PATCH 2/2] [RFC] fadvise: Add _VOLATILE,_ISVOLATILE, and _NONVOLATILE flags 2012-03-17 16:21 ` [PATCH 2/2] [RFC] fadvise: Add _VOLATILE,_ISVOLATILE, and _NONVOLATILE flags Dmitry Adamushko 2012-03-18 9:13 ` Dmitry Adamushko @ 2012-03-20 0:18 ` John Stultz 1 sibling, 0 replies; 31+ messages in thread From: John Stultz @ 2012-03-20 0:18 UTC (permalink / raw) To: Dmitry Adamushko Cc: linux-kernel, Andrew Morton, Android Kernel Team, Robert Love, Mel Gorman, Hugh Dickins, Dave Hansen, Rik van Riel, Dave Chinner, Neil Brown, Andrea Righi, Aneesh Kumar K.V On 03/17/2012 09:21 AM, Dmitry Adamushko wrote: > Hi John, > > [ ... ] > >> +/* >> + * Mark a region as volatile, allowing dirty pages to be purged >> + * under memory pressure >> + */ >> +long mapping_range_volatile(struct address_space *mapping, >> + pgoff_t start_index, pgoff_t end_index) >> +{ >> + struct volatile_range *new; >> + struct range_tree_node *node; >> + >> + u64 start, end; >> + int purged = 0; >> + start = (u64)start_index; >> + end = (u64)end_index; >> + >> + new = vrange_alloc(); >> + if (!new) >> + return -ENOMEM; >> + >> + mutex_lock(&volatile_mutex); >> + >> + node = range_tree_in_range_adjacent(&mapping->volatile_root, >> + start, end); >> + while (node) { >> + struct volatile_range *vrange; >> + >> + /* Already entirely marked volatile, so we're done */ >> + if (node->start< start&& node->end> end) { >> + /* don't need the allocated value */ >> + kfree(new); >> + goto out; >> + } >> + >> + /* Grab containing volatile range */ >> + vrange = container_of(node, struct volatile_range, range_node); >> + >> + /* resize range */ >> + start = min_t(u64, start, node->start); >> + end = max_t(u64, end, node->end); >> + purged |= vrange->purged; >> + >> + >> + vrange_del(vrange); >> + >> + /* get the next possible overlap */ >> + node = range_tree_in_range(&mapping->volatile_root, start, end); > I guess range_tree_in_range_adjacent() should be used here again. > There can be 2 adjacent regions (left and right), and we'll miss one > of them with range_tree_in_range(). Good catch, thank you! > Also (as I had already mentioned before), I think that new ranges must > not be merged with the existing "vrange->purged == 1" ranges. > Otherwise, for some use cases, the whole idea of 'volatility' gets > broken. For example, when an application is processing a big buffer in > small consequent chunks (marking a chunk as volatile when done with > it), and the range gets 'purged' by the kernel early in this process > (when it's still small). > I agree that this seems like a much more intelligent way coalesce regions. I hadn't yet implemented it, as I was hoping for some comment from the Android folks if there was a specific use for the design they selected for ashmem, but I suspect there isn't. I'll go ahead and integrate this for the next revision. Thanks again for the feedback! -john ^ permalink raw reply [flat|nested] 31+ messages in thread
* Re: [PATCH 0/2] [RFC] Volatile ranges (v4) 2012-03-16 22:51 [PATCH 0/2] [RFC] Volatile ranges (v4) John Stultz 2012-03-16 22:51 ` [PATCH 1/2] [RFC] Range tree implementation John Stultz 2012-03-16 22:51 ` [PATCH 2/2] [RFC] fadvise: Add _VOLATILE,_ISVOLATILE, and _NONVOLATILE flags John Stultz @ 2012-07-19 10:13 ` Dmitry Vyukov [not found] ` <CAO6Zf6BSpq53UqYjCkq0b3pTPW=WDTnCorQ59tONnV7U-U6EOg@mail.gmail.com> 2 siblings, 1 reply; 31+ messages in thread From: Dmitry Vyukov @ 2012-07-19 10:13 UTC (permalink / raw) To: linux-kernel John Stultz <john.stultz <at> linaro.org> writes: > Ok. So here's another iteration of the fadvise volatile range code. > > I realize this is still a way off from being ready, but I wanted to post > what I have to share with folks working on the various range/interval > management ideas as well as update folks who've provided feedback on the > volatile range code. Hi John, May you please confirm whether the fadvise(FADV_VOLATILE) will work for me in the following case? I am developing a dynamic data race detector (ThreadSanitizer). It needs to associate some meta-data (shadow) with each byte of application memory (4 bytes of shadow for 1 byte of app memory). We mmap(MAP_ANONYMOUS|MAP_NORESERVE) 40TB of virtual address space for shadow, and then just access it as needed. It works. But for some apps that consume too much memory, it eventually leads to swapping/OOM kills. There is a property of shadow memory that I would like to exploit - any region of shadow memory can be reset to zero at any point w/o any bad consequences (it can lead to missed data races, but it's better than OOM kill). I've tried to execute madvise(MADV_DONTNEED) every X seconds for whole shadow memory. It almost works. The problem is that madvise() seems to be not atomic, occasionally I see missed writes, that's not acceptable, I need either zero pages or consistent pages. Your FADV_VOLATILE looks like it may be the solution. To summarize: I have a huge region of memory that I would like to mark as "volatile" at program startup. The region is anonymous (not backed by any file). The kernel must be able to take away any pages in the range, and then return zero pages later. I do have concurrent writes to the range, and missed writes are unacceptable. Ideally, pages are revoked on LRU basis TIA ^ permalink raw reply [flat|nested] 31+ messages in thread
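For context, the shadow-memory scheme described above amounts to something like the following sketch (simplified, not the actual ThreadSanitizer code; the 40TB reservation and the periodic flush are taken from the description above):

#include <stdio.h>
#include <stdlib.h>
#include <sys/mman.h>
#include <unistd.h>

#define SHADOW_SIZE (40ULL << 40)	/* 40TB of virtual address space */

static void *shadow;

/* Reserve the shadow region: no backing store is committed up front. */
static void shadow_init(void)
{
	shadow = mmap(NULL, SHADOW_SIZE, PROT_READ | PROT_WRITE,
		      MAP_PRIVATE | MAP_ANONYMOUS | MAP_NORESERVE, -1, 0);
	if (shadow == MAP_FAILED) {
		perror("mmap");
		exit(1);
	}
}

/*
 * Periodic flush: give the pages back to the kernel.  After this the
 * region reads back as zeroes, which the tool treats as "no data here".
 * The problem described above is that concurrent writers occasionally
 * observe stale (non-zero, non-current) values around this call.
 */
static void shadow_flush(void)
{
	if (madvise(shadow, SHADOW_SIZE, MADV_DONTNEED))
		perror("madvise");
}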
* Re: [PATCH 0/2] [RFC] Volatile ranges (v4) [not found] ` <CACT4Y+ZgBo9=HX5MHhmWBiQcdiGMss9RSS_reF4gJimivJx7sQ@mail.gmail.com> @ 2012-07-21 11:17 ` Dmitry Adamushko 0 siblings, 0 replies; 31+ messages in thread From: Dmitry Adamushko @ 2012-07-21 11:17 UTC (permalink / raw) To: Dmitry Vyukov; +Cc: John Stultz, LKML [ cc: lkml ] >> > There is a property of shadow memory that I would like to exploit >> > - any region of shadow memory can be reset to zero at any point >> > w/o any bad consequences (it can lead to missed data >> > races, but it's better than OOM kill). >> > I've tried to execute madvise(MADV_DONTNEED) every X >> > seconds for whole shadow memory. It almost works. >> > The problem is that madvise() seems to be >> > not atomic, occasionally I see missed writes, that's not acceptable, >> >> Just to be sure, you mean that if you do, say >> >> *ptr = 1; // [1] >> ... >> value = *ptr; // [2] value is 1 here >> ... >> *ptr = 2; // possibly from another thread, but we can be certain that >> it's after [1], perhaps because we checked the content with [2] >> ... >> // madvise(..., MADV_DONTNEED); _might_ have been called >> ... >> value = *ptr; >> >> so here we expect 'value' to be either 2 or 0 (zero page iff madvise() >> did take place), but you get '1' occasionally? >> Is that what you mean or something else? > > > > Yes, that's what I mean. > Basically I observed inconsistent state of memory that must never happen. I > had no other explanations except that the madvise() call works as a time > machine. I executed madvise() every 3 seconds, and the inconsistencies > happened exactly at the same times. When I turned off madvise(), the > problems disappear. Did you try disabling swapping? The "time machine" (if it's not a problem somewhere else) should take old stuff from somewhere. mmap's man indicates "zero-fill-on-demand pages for mappings without an underlying file" and the comment in madvise_dontneed() says "Be sure to free swap resources too", but then looking at the code in zap_pte_ranges(), there are a few corner cases where swap entries seem to be left over intact. In any case, given that you can reproduce it easily, it'll be a quick check. Also, it's a MAP_PRIVATE mapping, isn't it? > > I am not sure whether it's a bug or not, because the man says "For the time > being, the application is finished with the given range", and we are > obviously do not since we have concurrent accesses. However I would be great > if it is "fixed". > > > >> > I need either zero pages or >> > consistent pages. >> > Your FADV_VOLATILE looks like it may be the solution. >> > To summarize: I have a huge region of memory that >> > I would like to mark as "volatile" at program >> > startup. The region is anonymous (not backed by any file). >> > The kernel must be able to take away >> > any pages in the range, and then return zero pages later. >> >> I guess that for the use-cases that people have considered so far, >> users are supposed to mark regions NONVOLATILE before accessing them >> again. If I understand correctly, that's not what you want to do... > > > No, it's not what I want to do. I can't do any tracking during accesses. > Ideally I just mark the range during startup, it's also possible to do some > work on a periodic basis. > > >> >> does it mean that your 'transactions' are always write-a-single-word? >> i.e. you don't need to make sure that, say, in >> >> ptr[0] = val_a; >> ptr[1] = val_b; >> ... 
// no accesses to ptr [0] and [1] >> c = ptr[0]; >> d = ptr[1]; >> >> either c == val_a and d == val_b or both are 0? > > > Exactly. Any transaction first issues N independent 8-byte atomic reads, and > then optionally 1 atomic 8-byte write. Value of 0 especially means "no data > here", because, well, I do not want to setup 40TB of memory to some other > value :) > > >> >> >> Also, the current implementation of volatile-ranges will try to 'zap' >> the whole volatile range... not just some of its pages. Perhaps, it's >> not something you need too. Of course, this can be addressed one way >> or another. > > > Well, yes, it's not ideal (but we had lived with MADV_DONTNEED for the whole > range for some time). Ideally, kernel just takes pages away as it needs > them. > > > >> Basically, in your specific case, the pages of your region should be, >> kind of, swapped out to /dev/zero :-)) meaning that once the system >> decides to swap such a page out, no actual swap is required and, once >> the area is being accessed again, a zero-page is mapped into it. > > > Yes, I believe that's how MADV_DONTNEED currently works (... or > zero-fill-on-demand pages for mappings without an underlying file). > Also the pages must be "swapped out" with higher prio. > I think what I want is somewhat similar to page cache. Kind of best effort > LRU caching. > -- Dmitry ^ permalink raw reply [flat|nested] 31+ messages in thread
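Turned into a compilable skeleton, the access pattern being discussed looks roughly like this. It is only a sketch: whether this particular loop ever hits the window depends on timing, swap configuration and the kernel in question, as discussed above.

#include <pthread.h>
#include <stdio.h>
#include <sys/mman.h>
#include <unistd.h>

#define REGION_SIZE (1UL << 28)	/* 256MB of private anonymous memory */

static volatile long *region;

/* Periodically gives the whole region back, as the tool above does. */
static void *flusher(void *arg)
{
	(void)arg;
	for (;;) {
		usleep(3000);
		madvise((void *)region, REGION_SIZE, MADV_DONTNEED);
	}
	return NULL;
}

int main(void)
{
	unsigned long i, nwords = REGION_SIZE / sizeof(long);
	pthread_t thr;

	region = mmap(NULL, REGION_SIZE, PROT_READ | PROT_WRITE,
		      MAP_PRIVATE | MAP_ANONYMOUS | MAP_NORESERVE, -1, 0);
	if (region == MAP_FAILED)
		return 1;

	pthread_create(&thr, NULL, flusher, NULL);

	for (;;) {
		for (i = 0; i < nwords; i++) {
			long v;

			region[i] = 1;		/* [1] */
			(void)region[i];	/* [2] reads back 1 */
			region[i] = 2;		/* ordered after [1] */
			v = region[i];
			/* 2 (kept) or 0 (zapped) is fine; 1 is the anomaly */
			if (v != 2 && v != 0)
				printf("stale value %ld at word %lu\n", v, i);
		}
	}
}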
* [PATCH 0/2][RFC] Volatile Ranges (v7) @ 2012-04-14 1:07 John Stultz 2012-04-14 1:08 ` [PATCH 2/2] [RFC] fadvise: Add _VOLATILE,_ISVOLATILE, and _NONVOLATILE flags John Stultz 0 siblings, 1 reply; 31+ messages in thread From: John Stultz @ 2012-04-14 1:07 UTC (permalink / raw) To: linux-kernel Cc: John Stultz, Andrew Morton, Android Kernel Team, Robert Love, Mel Gorman, Hugh Dickins, Dave Hansen, Rik van Riel, Dmitry Adamushko, Dave Chinner, Neil Brown, Andrea Righi, Aneesh Kumar K.V Another week, another volatile range patch iteration. So I think this is starting to shape up, and given the muted response to the last few iterations, next time I may need to drop the RFC to scare folks into taking a serious look at this. This round tries to address the outstanding lockdep issue of calling vmtruncate_range form a shrinker. My solution here is to call shmem_truncate_range directly, which results in this functionality being a tmpfs only feature for now. I know there was some concern over using a generic fadvise interface for a tmpfs only feature, and while I'd like this to be more generic, it may really only make sense for tmpfs files. Also the MADV_REMOVE interface provides similar effective tmpfs only (well, nilfs2 supports it too) precedent. Thoughts here about what would be the most appropriate interface would be appreciated (does madvise make more sense for tmpfs only?). Also I reworked the code so the volatile ranges won't persist if all the fds have been closed. I think this avoids possible surprising effects of volatile pages if they were allowed to persist across multiple non-concurrent opens. Finally Dmitry Adamushko pointed out a race and some other minor fixes that I corrected. As always, your feedback is greatly appreciated. thanks -john CC: Andrew Morton <akpm@linux-foundation.org> CC: Android Kernel Team <kernel-team@android.com> CC: Robert Love <rlove@google.com> CC: Mel Gorman <mel@csn.ul.ie> CC: Hugh Dickins <hughd@google.com> CC: Dave Hansen <dave@linux.vnet.ibm.com> CC: Rik van Riel <riel@redhat.com> CC: Dmitry Adamushko <dmitry.adamushko@gmail.com> CC: Dave Chinner <david@fromorbit.com> CC: Neil Brown <neilb@suse.de> CC: Andrea Righi <andrea@betterlinux.com> CC: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com> John Stultz (2): [RFC] Range tree implementation [RFC] fadvise: Add _VOLATILE,_ISVOLATILE, and _NONVOLATILE flags fs/file_table.c | 4 + include/linux/fadvise.h | 5 + include/linux/rangetree.h | 56 ++++++ include/linux/volatile.h | 12 ++ lib/Makefile | 2 +- lib/rangetree.c | 128 +++++++++++++ mm/Makefile | 2 +- mm/fadvise.c | 16 ++- mm/volatile.c | 457 +++++++++++++++++++++++++++++++++++++++++++++ 9 files changed, 679 insertions(+), 3 deletions(-) create mode 100644 include/linux/rangetree.h create mode 100644 include/linux/volatile.h create mode 100644 lib/rangetree.c create mode 100644 mm/volatile.c -- 1.7.3.2.146.gca209 ^ permalink raw reply [flat|nested] 31+ messages in thread
* [PATCH 2/2] [RFC] fadvise: Add _VOLATILE,_ISVOLATILE, and _NONVOLATILE flags 2012-04-14 1:07 [PATCH 0/2][RFC] Volatile Ranges (v7) John Stultz @ 2012-04-14 1:08 ` John Stultz 0 siblings, 0 replies; 31+ messages in thread From: John Stultz @ 2012-04-14 1:08 UTC (permalink / raw) To: linux-kernel Cc: John Stultz, Andrew Morton, Android Kernel Team, Robert Love, Mel Gorman, Hugh Dickins, Dave Hansen, Rik van Riel, Dmitry Adamushko, Dave Chinner, Neil Brown, Andrea Righi, Aneesh Kumar K.V This patch provides new fadvise flags that can be used to mark file pages as volatile, which will allow it to be discarded if the kernel wants to reclaim memory. This is useful for userspace to allocate things like caches, and lets the kernel destructively (but safely) reclaim them when there's memory pressure. It's different from FADV_DONTNEED since the pages are not immediately discarded; they are only discarded under pressure. This is very much influenced by the Android Ashmem interface by Robert Love so credits to him and the Android developers. In many cases the code & logic come directly from the ashmem patch. The intent of this patch is to allow for ashmem-like behavior, but embeds the idea a little deeper into the VM code, instead of isolating it into a specific driver. I'm very much a newbie at the VM code, so At this point, I just want to try to get some input on the patch, so if you have another idea for using something other then fadvise, or other thoughts on how the volatile ranges are stored, I'd be really interested in hearing them. So let me know if you have any comments for feedback! Also many thanks to Dave Hansen who helped design and develop the initial version of this patch, and has provided continued review and mentoring for me in the VM code. v2: * After the valid critique that just dropping pages would poke holes in volatile ranges, and instead we should zap an entire range if we drop any of it, I changed the code to more closely mimic the ashmem implementation, which zaps entire ranges via a shrinker using an lru list that tracks which range has been marked volatile the longest. v3: * Reworked to use range tree implementation. v4: * Renamed functions to avoid confusion. * More consistant PAGE_CACHE_SHIFT usage, suggested by Dmitry Adamushko * Fixes exit without unlocking issue found by Dmitry Adamushko * Migrate to rbtree based rangetree implementation * Simplified locking to use global lock (we were grabbing global lru lock every time anyway). * Avoid ENOMEM isses by allocating before we get into complicated code. * Add some documentation to the volatile.c file from Neil Brown v5: * More fixes suggested by Dmitry Adamushko * Improve range colescing so that we don't coalesce neighboring purged ranges. * Utilize range_tree_next_in_range to avoid doing every lookup from the tree's root. v6: * Immediately zap range if we coalesce overlapping purged range. * Use hash table to do mapping->rangetree lookup instead of bloating the address_space structure v7: * Race fix noted by Dmitry * Clear volatile ranges on fput, so volatile ranges don't persist if no one has the file open * Made it tmpfs only, using shmem_truncate_range() instead of vmtruncate_range(). This avoids the lockdep warnings caused by calling vmtruncate_range() from the shrinker. Seems to work ok, but I'd not be surprised if this isn't correct. Extra eyes would be appreciated here. Known issues: * None? I think this is getting close to dropping the RFC, and taking a stab at actually submitting this for inclusion. 
CC: Andrew Morton <akpm@linux-foundation.org> CC: Android Kernel Team <kernel-team@android.com> CC: Robert Love <rlove@google.com> CC: Mel Gorman <mel@csn.ul.ie> CC: Hugh Dickins <hughd@google.com> CC: Dave Hansen <dave@linux.vnet.ibm.com> CC: Rik van Riel <riel@redhat.com> CC: Dmitry Adamushko <dmitry.adamushko@gmail.com> CC: Dave Chinner <david@fromorbit.com> CC: Neil Brown <neilb@suse.de> CC: Andrea Righi <andrea@betterlinux.com> CC: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com> Signed-off-by: John Stultz <john.stultz@linaro.org> --- fs/file_table.c | 4 + include/linux/fadvise.h | 5 + include/linux/volatile.h | 12 ++ mm/Makefile | 2 +- mm/fadvise.c | 16 ++- mm/volatile.c | 457 ++++++++++++++++++++++++++++++++++++++++++++++ 6 files changed, 494 insertions(+), 2 deletions(-) create mode 100644 include/linux/volatile.h create mode 100644 mm/volatile.c diff --git a/fs/file_table.c b/fs/file_table.c index 70f2a0f..ada2c88 100644 --- a/fs/file_table.c +++ b/fs/file_table.c @@ -24,6 +24,7 @@ #include <linux/percpu_counter.h> #include <linux/percpu.h> #include <linux/ima.h> +#include <linux/volatile.h> #include <linux/atomic.h> @@ -238,6 +239,9 @@ static void __fput(struct file *file) eventpoll_release(file); locks_remove_flock(file); + /* Volatile ranges should not persist after all fds are closed */ + mapping_clear_volatile_ranges(&inode->i_data); + if (unlikely(file->f_flags & FASYNC)) { if (file->f_op && file->f_op->fasync) file->f_op->fasync(-1, file, 0); diff --git a/include/linux/fadvise.h b/include/linux/fadvise.h index e8e7471..443951c 100644 --- a/include/linux/fadvise.h +++ b/include/linux/fadvise.h @@ -18,4 +18,9 @@ #define POSIX_FADV_NOREUSE 5 /* Data will be accessed once. */ #endif +#define POSIX_FADV_VOLATILE 8 /* _can_ toss, but don't toss now */ +#define POSIX_FADV_NONVOLATILE 9 /* Remove VOLATILE flag */ + + + #endif /* FADVISE_H_INCLUDED */ diff --git a/include/linux/volatile.h b/include/linux/volatile.h new file mode 100644 index 0000000..85a9249 --- /dev/null +++ b/include/linux/volatile.h @@ -0,0 +1,12 @@ +#ifndef _LINUX_VOLATILE_H +#define _LINUX_VOLATILE_H + +#include <linux/fs.h> + +extern long mapping_range_volatile(struct address_space *mapping, + pgoff_t start_index, pgoff_t end_index); +extern long mapping_range_nonvolatile(struct address_space *mapping, + pgoff_t start_index, pgoff_t end_index); +extern void mapping_clear_volatile_ranges(struct address_space *mapping); + +#endif /* _LINUX_VOLATILE_H */ diff --git a/mm/Makefile b/mm/Makefile index 50ec00e..7b6c7a8 100644 --- a/mm/Makefile +++ b/mm/Makefile @@ -13,7 +13,7 @@ obj-y := filemap.o mempool.o oom_kill.o fadvise.o \ readahead.o swap.o truncate.o vmscan.o shmem.o \ prio_tree.o util.o mmzone.o vmstat.o backing-dev.o \ page_isolation.o mm_init.o mmu_context.o percpu.o \ - $(mmu-y) + volatile.o $(mmu-y) obj-y += init-mm.o ifdef CONFIG_NO_BOOTMEM diff --git a/mm/fadvise.c b/mm/fadvise.c index 469491e0..3e33845 100644 --- a/mm/fadvise.c +++ b/mm/fadvise.c @@ -17,6 +17,7 @@ #include <linux/fadvise.h> #include <linux/writeback.h> #include <linux/syscalls.h> +#include <linux/volatile.h> #include <asm/unistd.h> @@ -106,7 +107,7 @@ SYSCALL_DEFINE(fadvise64_64)(int fd, loff_t offset, loff_t len, int advice) nrpages = end_index - start_index + 1; if (!nrpages) nrpages = ~0UL; - + ret = force_page_cache_readahead(mapping, file, start_index, nrpages); @@ -128,6 +129,19 @@ SYSCALL_DEFINE(fadvise64_64)(int fd, loff_t offset, loff_t len, int advice) invalidate_mapping_pages(mapping, start_index, end_index); break; + 
case POSIX_FADV_VOLATILE: + /* First and last PARTIAL page! */ + start_index = offset >> PAGE_CACHE_SHIFT; + end_index = endbyte >> PAGE_CACHE_SHIFT; + ret = mapping_range_volatile(mapping, start_index, end_index); + break; + case POSIX_FADV_NONVOLATILE: + /* First and last PARTIAL page! */ + start_index = offset >> PAGE_CACHE_SHIFT; + end_index = endbyte >> PAGE_CACHE_SHIFT; + ret = mapping_range_nonvolatile(mapping, start_index, + end_index); + break; default: ret = -EINVAL; } diff --git a/mm/volatile.c b/mm/volatile.c new file mode 100644 index 0000000..e94e980 --- /dev/null +++ b/mm/volatile.c @@ -0,0 +1,457 @@ +/* mm/volatile.c + * + * Volatile page range managment. + * Copyright 2011 Linaro + * + * Based on mm/ashmem.c + * by Robert Love <rlove@google.com> + * Copyright (C) 2008 Google, Inc. + * + * + * This software is licensed under the terms of the GNU General Public + * License version 2, as published by the Free Software Foundation, and + * may be copied, distributed, and modified under those terms. + * + * This program is distributed in the hope that it will be useful, + * but WITHOUT ANY WARRANTY; without even the implied warranty of + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the + * GNU General Public License for more details. + * + * + * The goal behind volatile ranges is to allow applications to interact + * with the kernel's cache management infrastructure. In particular an + * application can say "this memory contains data that might be useful in + * the future, but can be reconstructed if necessary, so if the kernel + * needs, it can zap and reclaim this memory without having to swap it out. + * + * The proposed mechanism - at a high level - is for user-space to be able + * to say "This memory is volatile" and then later "this memory is no longer + * volatile". If the content of the memory is still available the second + * request succeeds. If not, the memory is marked non-volatile and an + * error is returned to denote that the contents have been lost. + * + * Credits to Neil Brown for the above description. 
+ * + */ + +#include <linux/kernel.h> +#include <linux/fs.h> +#include <linux/mm.h> +#include <linux/slab.h> +#include <linux/pagemap.h> +#include <linux/volatile.h> +#include <linux/rangetree.h> +#include <linux/hash.h> +#include <linux/shmem_fs.h> + +static DEFINE_MUTEX(volatile_mutex); + +struct volatile_range { + struct list_head lru; + struct range_tree_node range_node; + unsigned int purged; + struct address_space *mapping; +}; + +/* LRU list of volatile page ranges */ +static LIST_HEAD(volatile_lru_list); + +/* Count of pages on our LRU list */ +static u64 lru_count; + + +/* + * To avoid bloating the address_space structure, we use + * a hash structure to map from address_space mappings to + * the range_tree root that stores volatile ranges + */ +static struct hlist_head *mapping_hash; +static long mapping_hash_shift = 8; + +struct mapping_hash_entry { + struct range_tree_root root; + struct address_space *mapping; + struct hlist_node hnode; +}; + +static inline +struct range_tree_root *mapping_to_root(struct address_space *mapping) +{ + struct hlist_node *elem; + struct mapping_hash_entry *entry; + + hlist_for_each_entry_rcu(entry, elem, + &mapping_hash[hash_ptr(mapping, mapping_hash_shift)], + hnode) + if (entry->mapping == mapping) + return &entry->root; + + return NULL; +} + +static inline +struct range_tree_root *mapping_allocate_root(struct address_space *mapping) +{ + struct mapping_hash_entry *entry; + struct range_tree_root *dblchk; + + /* Drop the volatile_mutex to avoid lockdep deadlock warnings */ + mutex_unlock(&volatile_mutex); + entry = kzalloc(sizeof(*entry), GFP_KERNEL); + mutex_lock(&volatile_mutex); + + /* Since we dropped the lock, double check that no one has + * created the same hash entry. + */ + dblchk = mapping_to_root(mapping); + if (dblchk) { + kfree(entry); + return dblchk; + } + + INIT_HLIST_NODE(&entry->hnode); + entry->mapping = mapping; + range_tree_init(&entry->root); + + hlist_add_head_rcu(&entry->hnode, + &mapping_hash[hash_ptr(mapping, mapping_hash_shift)]); + + return &entry->root; +} + +static inline void mapping_free_root(struct range_tree_root *root) +{ + struct mapping_hash_entry *entry; + + entry = container_of(root, struct mapping_hash_entry, root); + + hlist_del_rcu(&entry->hnode); + kfree(entry); +} + + +/* Range tree helpers */ +static inline u64 range_size(struct volatile_range *range) +{ + return range->range_node.end - range->range_node.start + 1; +} + +static inline void lru_add(struct volatile_range *range) +{ + list_add_tail(&range->lru, &volatile_lru_list); + lru_count += range_size(range); +} + +static inline void lru_del(struct volatile_range *range) +{ + list_del(&range->lru); + lru_count -= range_size(range); +} + +#define range_on_lru(range) (!(range)->purged) + + +static inline void volatile_range_resize(struct volatile_range *range, + pgoff_t start_index, pgoff_t end_index) +{ + size_t pre = range_size(range); + + range->range_node.start = start_index; + range->range_node.end = end_index; + + if (range_on_lru(range)) + lru_count -= pre - range_size(range); +} + +static struct volatile_range *vrange_alloc(void) +{ + struct volatile_range *new; + + new = kzalloc(sizeof(struct volatile_range), GFP_KERNEL); + if (!new) + return 0; + range_tree_node_init(&new->range_node); + return new; +} + +static void vrange_del(struct range_tree_root *root, + struct volatile_range *vrange) +{ + if (range_on_lru(vrange)) + lru_del(vrange); + range_tree_remove(root, &vrange->range_node); + kfree(vrange); +} + + + +/* + * Mark a region as 
volatile, allowing dirty pages to be purged + * under memory pressure + */ +long mapping_range_volatile(struct address_space *mapping, + pgoff_t start_index, pgoff_t end_index) +{ + struct volatile_range *new; + struct range_tree_node *node; + struct volatile_range *vrange; + struct range_tree_root *root; + u64 start, end; + int purged = 0; + start = (u64)start_index; + end = (u64)end_index; + + if (strncmp(mapping->host->i_sb->s_type->name, "tmpfs", + strlen("tmpfs"))) + return -EINVAL; + + new = vrange_alloc(); + if (!new) + return -ENOMEM; + + mutex_lock(&volatile_mutex); + + + root = mapping_to_root(mapping); + if (!root) + root = mapping_allocate_root(mapping); + + /* Find any existing ranges that overlap */ + node = range_tree_in_range(root, start, end); + while (node) { + /* Already entirely marked volatile, so we're done */ + if (node->start < start && node->end > end) { + /* don't need the allocated value */ + kfree(new); + goto out; + } + + /* Grab containing volatile range */ + vrange = container_of(node, struct volatile_range, range_node); + + /* resize range */ + start = min_t(u64, start, node->start); + end = max_t(u64, end, node->end); + purged |= vrange->purged; + + node = range_tree_next_in_range(&vrange->range_node, + start, end); + vrange_del(root, vrange); + } + + /* Coalesce left-adjacent ranges */ + node = range_tree_in_range(root, start-1, start); + if (node) { + vrange = container_of(node, struct volatile_range, range_node); + /* Only coalesce if both are either purged or unpurged */ + if (vrange->purged == purged) { + /* resize range */ + start = min_t(u64, start, node->start); + end = max_t(u64, end, node->end); + vrange_del(root, vrange); + } + } + + /* Coalesce right-adjacent ranges */ + node = range_tree_in_range(root, end, end+1); + if (node) { + vrange = container_of(node, struct volatile_range, range_node); + /* Only coalesce if both are either purged or unpurged */ + if (vrange->purged == purged) { + /* resize range */ + start = min_t(u64, start, node->start); + end = max_t(u64, end, node->end); + vrange_del(root, vrange); + } + } + + new->mapping = mapping; + new->range_node.start = start; + new->range_node.end = end; + new->purged = purged; + + if (purged) { + struct inode *inode; + loff_t pstart, pend; + + inode = mapping->host; + pstart = start << PAGE_CACHE_SHIFT; + pend = ((end + 1) << PAGE_CACHE_SHIFT) - 1; + vmtruncate_range(inode, pstart, pend); + } + range_tree_add(root, &new->range_node); + if (range_on_lru(new)) + lru_add(new); + +out: + mutex_unlock(&volatile_mutex); + + return 0; +} + +/* + * Mark a region as nonvolatile, returns 1 if any pages in the region + * were purged. 
+ */ +long mapping_range_nonvolatile(struct address_space *mapping, + pgoff_t start_index, pgoff_t end_index) +{ + struct volatile_range *new; + struct range_tree_node *node; + struct range_tree_root *root; + int ret = 0; + u64 start, end; + int used_new = 0; + + start = (u64)start_index; + end = (u64)end_index; + + if (strncmp(mapping->host->i_sb->s_type->name, "tmpfs", + strlen("tmpfs"))) + return -EINVAL; + + /* create new node */ + new = vrange_alloc(); + if (!new) + return -ENOMEM; + + mutex_lock(&volatile_mutex); + root = mapping_to_root(mapping); + if (!root) + goto out; /* if no range tree root, there's nothing to unmark */ + + node = range_tree_in_range(root, start, end); + while (node) { + struct volatile_range *vrange; + vrange = container_of(node, struct volatile_range, range_node); + + ret |= vrange->purged; + + if (start <= node->start && end >= node->end) { + /* delete: volatile range is totally within range */ + node = range_tree_next_in_range(&vrange->range_node, + start, end); + vrange_del(root, vrange); + } else if (node->start >= start) { + /* resize: volatile range right-overlaps range */ + volatile_range_resize(vrange, end+1, node->end); + node = range_tree_next_in_range(&vrange->range_node, + start, end); + + } else if (node->end <= end) { + /* resize: volatile range left-overlaps range */ + volatile_range_resize(vrange, node->start, start-1); + node = range_tree_next_in_range(&vrange->range_node, + start, end); + } else { + /* split: range is totally within a volatile range */ + used_new = 1; /* we only do this once */ + new->mapping = mapping; + new->range_node.start = end + 1; + new->range_node.end = node->end; + new->purged = vrange->purged; + range_tree_add(root, &new->range_node); + if (range_on_lru(new)) + lru_add(new); + volatile_range_resize(vrange, node->start, start-1); + + break; + } + } + +out: + mutex_unlock(&volatile_mutex); + + if (!used_new) + kfree(new); + + return ret; +} + + +/* + * Cleans up any volatile ranges. 
+ */ +void mapping_clear_volatile_ranges(struct address_space *mapping) +{ + struct volatile_range *tozap; + struct range_tree_root *root; + + mutex_lock(&volatile_mutex); + + root = mapping_to_root(mapping); + if (!root) + goto out; + + while (!range_tree_empty(root)) { + struct range_tree_node *tmp; + tmp = range_tree_root_node(root); + tozap = container_of(tmp, struct volatile_range, range_node); + vrange_del(root, tozap); + } + mapping_free_root(root); +out: + mutex_unlock(&volatile_mutex); +} + +/* + * Purges volatile ranges when under memory pressure + */ +static int volatile_shrink(struct shrinker *ignored, struct shrink_control *sc) +{ + struct volatile_range *range, *next; + s64 nr_to_scan = sc->nr_to_scan; + const gfp_t gfp_mask = sc->gfp_mask; + + if (nr_to_scan && !(gfp_mask & __GFP_FS)) + return -1; + if (!nr_to_scan) + return lru_count; + + mutex_lock(&volatile_mutex); + list_for_each_entry_safe(range, next, &volatile_lru_list, lru) { + struct inode *inode; + loff_t start, end; + + inode = range->mapping->host; + + start = range->range_node.start << PAGE_CACHE_SHIFT; + end = ((range->range_node.end + 1) << PAGE_CACHE_SHIFT) - 1; + + shmem_truncate_range(inode, start, end); + + lru_del(range); + range->purged = 1; + nr_to_scan -= range_size(range); + + if (nr_to_scan <= 0) + break; + } + mutex_unlock(&volatile_mutex); + + return lru_count; +} + +static struct shrinker volatile_shrinker = { + .shrink = volatile_shrink, + .seeks = DEFAULT_SEEKS, +}; + +static int __init volatile_init(void) +{ + int i, size; + + size = 1U << mapping_hash_shift; + + mapping_hash = kzalloc(sizeof(mapping_hash)*size, GFP_KERNEL); + + for (i = 0; i < size; i++) + INIT_HLIST_HEAD(&mapping_hash[i]); + + register_shrinker(&volatile_shrinker); + + + return 0; +} + +arch_initcall(volatile_init); -- 1.7.3.2.146.gca209 ^ permalink raw reply related [flat|nested] 31+ messages in thread
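From the application side, the v7 interface would be used roughly as follows. This is a sketch against the RFC, not a documented API: the advice values 8 and 9 are the ones defined in the patch, the file has to live on tmpfs (here assumed to be /dev/shm), and the raw fadvise64 syscall is used (x86_64 argument order) because the glibc posix_fadvise() wrapper would hide the positive "purged" return of POSIX_FADV_NONVOLATILE:

#include <fcntl.h>
#include <string.h>
#include <sys/mman.h>
#include <sys/syscall.h>
#include <unistd.h>

#define POSIX_FADV_VOLATILE	8	/* values from the RFC patch */
#define POSIX_FADV_NONVOLATILE	9

#define CACHE_SIZE (16UL << 20)		/* 16MB cache object */

int main(void)
{
	int fd = open("/dev/shm/mycache", O_RDWR | O_CREAT, 0600); /* tmpfs */
	char *cache;
	long purged;

	if (fd < 0 || ftruncate(fd, CACHE_SIZE))
		return 1;

	cache = mmap(NULL, CACHE_SIZE, PROT_READ | PROT_WRITE,
		     MAP_SHARED, fd, 0);
	if (cache == MAP_FAILED)
		return 1;

	memset(cache, 0xaa, CACHE_SIZE);	/* "expensive" cached data */

	/* Done with it for now: the kernel may reclaim it under pressure. */
	syscall(SYS_fadvise64, fd, 0L, (long)CACHE_SIZE, POSIX_FADV_VOLATILE);

	/* ... later, before touching the data again ... */
	purged = syscall(SYS_fadvise64, fd, 0L, (long)CACHE_SIZE,
			 POSIX_FADV_NONVOLATILE);
	if (purged > 0) {
		/* contents were zapped while volatile: rebuild the cache */
		memset(cache, 0xaa, CACHE_SIZE);
	}
	return 0;
}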
* [PATCH 0/2] [RFC] Volatile Ranges (v6) @ 2012-04-07 0:08 John Stultz 2012-04-07 0:08 ` [PATCH 2/2] [RFC] fadvise: Add _VOLATILE,_ISVOLATILE, and _NONVOLATILE flags John Stultz 0 siblings, 1 reply; 31+ messages in thread From: John Stultz @ 2012-04-07 0:08 UTC (permalink / raw) To: linux-kernel Cc: John Stultz, Andrew Morton, Android Kernel Team, Robert Love, Mel Gorman, Hugh Dickins, Dave Hansen, Rik van Riel, Dmitry Adamushko, Dave Chinner, Neil Brown, Andrea Righi, Aneesh Kumar K.V Just wanted to send out another iteration of the volatile range code for review and comment. This revision handles some improved coalescing logic, fixes some bugs in the range tree management, and also utilizes a hash tabe to avoid bloating the address_space structure with range_tree pointers. Still looking for guidence on what a better interface for this might be as well as thoughts on how to best drop the ranges under memory pressure (vmtruncate_region is pretty close to what I want, but lockdep makes it clear that its not really safe to call from a shrinker). Another detail is that by hanging the volatile ranges off of the address_space, the volatility for tmpfs files persists even when no one has an open fd on the file. This could cause some surprises if application A marked some pages volatile and died, then application B opened the file and had pages dropped out underneith it while it was being used. I suspect I need to clean up the volatility when all fds are dropped. But any extra insight would be useful here. Thanks for the continued advice and feedback! -john CC: Andrew Morton <akpm@linux-foundation.org> CC: Android Kernel Team <kernel-team@android.com> CC: Robert Love <rlove@google.com> CC: Mel Gorman <mel@csn.ul.ie> CC: Hugh Dickins <hughd@google.com> CC: Dave Hansen <dave@linux.vnet.ibm.com> CC: Rik van Riel <riel@redhat.com> CC: Dmitry Adamushko <dmitry.adamushko@gmail.com> CC: Dave Chinner <david@fromorbit.com> CC: Neil Brown <neilb@suse.de> CC: Andrea Righi <andrea@betterlinux.com> CC: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com> John Stultz (2): [RFC] Range tree implementation [RFC] fadvise: Add _VOLATILE,_ISVOLATILE, and _NONVOLATILE flags fs/inode.c | 2 + include/linux/fadvise.h | 5 + include/linux/rangetree.h | 56 ++++++ include/linux/volatile.h | 14 ++ lib/Makefile | 2 +- lib/rangetree.c | 128 +++++++++++++ mm/Makefile | 2 +- mm/fadvise.c | 16 ++- mm/volatile.c | 440 +++++++++++++++++++++++++++++++++++++++++++++ 9 files changed, 662 insertions(+), 3 deletions(-) create mode 100644 include/linux/rangetree.h create mode 100644 include/linux/volatile.h create mode 100644 lib/rangetree.c create mode 100644 mm/volatile.c -- 1.7.3.2.146.gca209 ^ permalink raw reply [flat|nested] 31+ messages in thread
* [PATCH 2/2] [RFC] fadvise: Add _VOLATILE,_ISVOLATILE, and _NONVOLATILE flags 2012-04-07 0:08 [PATCH 0/2] [RFC] Volatile Ranges (v6) John Stultz @ 2012-04-07 0:08 ` John Stultz 0 siblings, 0 replies; 31+ messages in thread From: John Stultz @ 2012-04-07 0:08 UTC (permalink / raw) To: linux-kernel Cc: John Stultz, Andrew Morton, Android Kernel Team, Robert Love, Mel Gorman, Hugh Dickins, Dave Hansen, Rik van Riel, Dmitry Adamushko, Dave Chinner, Neil Brown, Andrea Righi, Aneesh Kumar K.V This patch provides new fadvise flags that can be used to mark file pages as volatile, which will allow it to be discarded if the kernel wants to reclaim memory. This is useful for userspace to allocate things like caches, and lets the kernel destructively (but safely) reclaim them when there's memory pressure. It's different from FADV_DONTNEED since the pages are not immediately discarded; they are only discarded under pressure. This is very much influenced by the Android Ashmem interface by Robert Love so credits to him and the Android developers. In many cases the code & logic come directly from the ashmem patch. The intent of this patch is to allow for ashmem-like behavior, but embeds the idea a little deeper into the VM code, instead of isolating it into a specific driver. I'm very much a newbie at the VM code, so At this point, I just want to try to get some input on the patch, so if you have another idea for using something other then fadvise, or other thoughts on how the volatile ranges are stored, I'd be really interested in hearing them. So let me know if you have any comments for feedback! Also many thanks to Dave Hansen who helped design and develop the initial version of this patch, and has provided continued review and mentoring for me in the VM code. v2: * After the valid critique that just dropping pages would poke holes in volatile ranges, and instead we should zap an entire range if we drop any of it, I changed the code to more closely mimic the ashmem implementation, which zaps entire ranges via a shrinker using an lru list that tracks which range has been marked volatile the longest. v3: * Reworked to use range tree implementation. v4: * Renamed functions to avoid confusion. * More consistant PAGE_CACHE_SHIFT usage, suggested by Dmitry Adamushko * Fixes exit without unlocking issue found by Dmitry Adamushko * Migrate to rbtree based rangetree implementation * Simplified locking to use global lock (we were grabbing global lru lock every time anyway). * Avoid ENOMEM isses by allocating before we get into complicated code. * Add some documentation to the volatile.c file from Neil Brown v5: * More fixes suggested by Dmitry Adamushko * Improve range colescing so that we don't coalesce neighboring purged ranges. * Utilize range_tree_next_in_range to avoid doing every lookup from the tree's root. v6: * Immediately zap range if we coalesce overlapping purged range. * Use hash table to do mapping->rangetree lookup instead of bloating the address_space structure Known issues: * Lockdep doesn't like calling vmtruncate_range() from a shrinker. Any help here on how to address this would be appreciated. I've tried switching to invalidate_inode_pages2_range, but that always returns EBUSY in my testing, and I don't really want to launder dirty pages, instead I want to zap them. * Volatile range persistence needs to be though through. Currently the volatility follows the inode in memory, which for tmpfs sticks around. This means application A could open a file, mark it volatile, close it. 
Then application B opens the file, and as its using it finds the pages earlier marked volatile disappearing under it. I think it probably makes more sense to drop all volatile ranges after all the fds to the file have been closed. This may also be something that changes if we switch up to a different interface. Suggestions here would be great. CC: Andrew Morton <akpm@linux-foundation.org> CC: Android Kernel Team <kernel-team@android.com> CC: Robert Love <rlove@google.com> CC: Mel Gorman <mel@csn.ul.ie> CC: Hugh Dickins <hughd@google.com> CC: Dave Hansen <dave@linux.vnet.ibm.com> CC: Rik van Riel <riel@redhat.com> CC: Dmitry Adamushko <dmitry.adamushko@gmail.com> CC: Dave Chinner <david@fromorbit.com> CC: Neil Brown <neilb@suse.de> CC: Andrea Righi <andrea@betterlinux.com> CC: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com> Signed-off-by: John Stultz <john.stultz@linaro.org> --- fs/inode.c | 2 + include/linux/fadvise.h | 5 + include/linux/volatile.h | 14 ++ mm/Makefile | 2 +- mm/fadvise.c | 16 ++- mm/volatile.c | 440 ++++++++++++++++++++++++++++++++++++++++++++++ 6 files changed, 477 insertions(+), 2 deletions(-) create mode 100644 include/linux/volatile.h create mode 100644 mm/volatile.c diff --git a/fs/inode.c b/fs/inode.c index 9f4f5fe..590d314 100644 --- a/fs/inode.c +++ b/fs/inode.c @@ -17,6 +17,7 @@ #include <linux/prefetch.h> #include <linux/buffer_head.h> /* for inode_has_buffers */ #include <linux/ratelimit.h> +#include <linux/volatile.h> #include "internal.h" /* @@ -244,6 +245,7 @@ void __destroy_inode(struct inode *inode) if (inode->i_default_acl && inode->i_default_acl != ACL_NOT_CACHED) posix_acl_release(inode->i_default_acl); #endif + mapping_clear_volatile_ranges(&inode->i_data); this_cpu_dec(nr_inodes); } EXPORT_SYMBOL(__destroy_inode); diff --git a/include/linux/fadvise.h b/include/linux/fadvise.h index e8e7471..443951c 100644 --- a/include/linux/fadvise.h +++ b/include/linux/fadvise.h @@ -18,4 +18,9 @@ #define POSIX_FADV_NOREUSE 5 /* Data will be accessed once. 
*/ #endif +#define POSIX_FADV_VOLATILE 8 /* _can_ toss, but don't toss now */ +#define POSIX_FADV_NONVOLATILE 9 /* Remove VOLATILE flag */ + + + #endif /* FADVISE_H_INCLUDED */ diff --git a/include/linux/volatile.h b/include/linux/volatile.h new file mode 100644 index 0000000..5460d7b --- /dev/null +++ b/include/linux/volatile.h @@ -0,0 +1,14 @@ +#ifndef _LINUX_VOLATILE_H +#define _LINUX_VOLATILE_H + +#include <linux/fs.h> + +extern long mapping_range_volatile(struct address_space *mapping, + pgoff_t start_index, pgoff_t end_index); +extern long mapping_range_nonvolatile(struct address_space *mapping, + pgoff_t start_index, pgoff_t end_index); +extern long mapping_range_isvolatile(struct address_space *mapping, + pgoff_t start_index, pgoff_t end_index); +extern void mapping_clear_volatile_ranges(struct address_space *mapping); + +#endif /* _LINUX_VOLATILE_H */ diff --git a/mm/Makefile b/mm/Makefile index 50ec00e..7b6c7a8 100644 --- a/mm/Makefile +++ b/mm/Makefile @@ -13,7 +13,7 @@ obj-y := filemap.o mempool.o oom_kill.o fadvise.o \ readahead.o swap.o truncate.o vmscan.o shmem.o \ prio_tree.o util.o mmzone.o vmstat.o backing-dev.o \ page_isolation.o mm_init.o mmu_context.o percpu.o \ - $(mmu-y) + volatile.o $(mmu-y) obj-y += init-mm.o ifdef CONFIG_NO_BOOTMEM diff --git a/mm/fadvise.c b/mm/fadvise.c index 469491e0..3e33845 100644 --- a/mm/fadvise.c +++ b/mm/fadvise.c @@ -17,6 +17,7 @@ #include <linux/fadvise.h> #include <linux/writeback.h> #include <linux/syscalls.h> +#include <linux/volatile.h> #include <asm/unistd.h> @@ -106,7 +107,7 @@ SYSCALL_DEFINE(fadvise64_64)(int fd, loff_t offset, loff_t len, int advice) nrpages = end_index - start_index + 1; if (!nrpages) nrpages = ~0UL; - + ret = force_page_cache_readahead(mapping, file, start_index, nrpages); @@ -128,6 +129,19 @@ SYSCALL_DEFINE(fadvise64_64)(int fd, loff_t offset, loff_t len, int advice) invalidate_mapping_pages(mapping, start_index, end_index); break; + case POSIX_FADV_VOLATILE: + /* First and last PARTIAL page! */ + start_index = offset >> PAGE_CACHE_SHIFT; + end_index = endbyte >> PAGE_CACHE_SHIFT; + ret = mapping_range_volatile(mapping, start_index, end_index); + break; + case POSIX_FADV_NONVOLATILE: + /* First and last PARTIAL page! */ + start_index = offset >> PAGE_CACHE_SHIFT; + end_index = endbyte >> PAGE_CACHE_SHIFT; + ret = mapping_range_nonvolatile(mapping, start_index, + end_index); + break; default: ret = -EINVAL; } diff --git a/mm/volatile.c b/mm/volatile.c new file mode 100644 index 0000000..12450be --- /dev/null +++ b/mm/volatile.c @@ -0,0 +1,440 @@ +/* mm/volatile.c + * + * Volatile page range managment. + * Copyright 2011 Linaro + * + * Based on mm/ashmem.c + * by Robert Love <rlove@google.com> + * Copyright (C) 2008 Google, Inc. + * + * + * This software is licensed under the terms of the GNU General Public + * License version 2, as published by the Free Software Foundation, and + * may be copied, distributed, and modified under those terms. + * + * This program is distributed in the hope that it will be useful, + * but WITHOUT ANY WARRANTY; without even the implied warranty of + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the + * GNU General Public License for more details. + * + * + * The goal behind volatile ranges is to allow applications to interact + * with the kernel's cache management infrastructure. 
In particular an + * application can say "this memory contains data that might be useful in + * the future, but can be reconstructed if necessary, so if the kernel + * needs, it can zap and reclaim this memory without having to swap it out. + * + * The proposed mechanism - at a high level - is for user-space to be able + * to say "This memory is volatile" and then later "this memory is no longer + * volatile". If the content of the memory is still available the second + * request succeeds. If not, the memory is marked non-volatile and an + * error is returned to denote that the contents have been lost. + * + * Credits to Neil Brown for the above description. + * + */ + +#include <linux/kernel.h> +#include <linux/fs.h> +#include <linux/mm.h> +#include <linux/slab.h> +#include <linux/pagemap.h> +#include <linux/volatile.h> +#include <linux/rangetree.h> +#include <linux/hash.h> + +static DEFINE_MUTEX(volatile_mutex); + +struct volatile_range { + struct list_head lru; + struct range_tree_node range_node; + unsigned int purged; + struct address_space *mapping; +}; + +/* LRU list of volatile page ranges */ +static LIST_HEAD(volatile_lru_list); + +/* Count of pages on our LRU list */ +static u64 lru_count; + + +/* + * To avoid bloating the address_space structure, we use + * a hash structure to map from address_space mappings to + * the range_tree root that stores volatile ranges + */ +static struct hlist_head *mapping_hash; +static long mapping_hash_shift = 8; + +struct mapping_hash_entry { + struct range_tree_root root; + struct address_space *mapping; + struct hlist_node hnode; +}; + +static inline +struct range_tree_root *mapping_allocate_root(struct address_space *mapping) +{ + struct mapping_hash_entry *entry; + + /* Drop the volatile_mutex to avoid lockdep deadlock warnings */ + mutex_unlock(&volatile_mutex); + entry = kzalloc(sizeof(*entry), GFP_KERNEL); + mutex_lock(&volatile_mutex); + + INIT_HLIST_NODE(&entry->hnode); + entry->mapping = mapping; + range_tree_init(&entry->root); + + hlist_add_head_rcu(&entry->hnode, + &mapping_hash[hash_ptr(mapping, mapping_hash_shift)]); + + return &entry->root; +} + +static inline +struct range_tree_root *mapping_to_root(struct address_space *mapping) +{ + struct hlist_node *elem; + struct mapping_hash_entry *entry; + + hlist_for_each_entry_rcu(entry, elem, + &mapping_hash[hash_ptr(mapping, mapping_hash_shift)], + hnode) + if (entry->mapping == mapping) + return &entry->root; + + return NULL; +} + +static inline void mapping_free_root(struct range_tree_root *root) +{ + struct mapping_hash_entry *entry; + + entry = container_of(root, struct mapping_hash_entry, root); + + hlist_del_rcu(&entry->hnode); + kfree(entry); +} + + +/* Range tree helpers */ +static inline u64 range_size(struct volatile_range *range) +{ + return range->range_node.end - range->range_node.start + 1; +} + +static inline void lru_add(struct volatile_range *range) +{ + list_add_tail(&range->lru, &volatile_lru_list); + lru_count += range_size(range); +} + +static inline void lru_del(struct volatile_range *range) +{ + list_del(&range->lru); + lru_count -= range_size(range); +} + +#define range_on_lru(range) (!(range)->purged) + + +static inline void volatile_range_resize(struct volatile_range *range, + pgoff_t start_index, pgoff_t end_index) +{ + size_t pre = range_size(range); + + range->range_node.start = start_index; + range->range_node.end = end_index; + + if (range_on_lru(range)) + lru_count -= pre - range_size(range); +} + +static struct volatile_range *vrange_alloc(void) +{ + 
struct volatile_range *new; + + new = kzalloc(sizeof(struct volatile_range), GFP_KERNEL); + if (!new) + return 0; + range_tree_node_init(&new->range_node); + return new; +} + +static void vrange_del(struct range_tree_root *root, + struct volatile_range *vrange) +{ + if (range_on_lru(vrange)) + lru_del(vrange); + range_tree_remove(root, &vrange->range_node); + kfree(vrange); +} + + + +/* + * Mark a region as volatile, allowing dirty pages to be purged + * under memory pressure + */ +long mapping_range_volatile(struct address_space *mapping, + pgoff_t start_index, pgoff_t end_index) +{ + struct volatile_range *new; + struct range_tree_node *node; + struct volatile_range *vrange; + struct range_tree_root *root; + u64 start, end; + int purged = 0; + start = (u64)start_index; + end = (u64)end_index; + + new = vrange_alloc(); + if (!new) + return -ENOMEM; + + mutex_lock(&volatile_mutex); + + + root = mapping_to_root(mapping); + if (!root) + root = mapping_allocate_root(mapping); + + /* Find any existing ranges that overlap */ + node = range_tree_in_range(root, start, end); + while (node) { + /* Already entirely marked volatile, so we're done */ + if (node->start < start && node->end > end) { + /* don't need the allocated value */ + kfree(new); + goto out; + } + + /* Grab containing volatile range */ + vrange = container_of(node, struct volatile_range, range_node); + + /* resize range */ + start = min_t(u64, start, node->start); + end = max_t(u64, end, node->end); + purged |= vrange->purged; + + node = range_tree_next_in_range(&vrange->range_node, + start, end); + vrange_del(root, vrange); + } + + /* Coalesce left-adjacent ranges */ + node = range_tree_in_range(root, start-1, start); + if (node) { + vrange = container_of(node, struct volatile_range, range_node); + /* Only coalesce if both are either purged or unpurged */ + if (vrange->purged == purged) { + /* resize range */ + start = min_t(u64, start, node->start); + end = max_t(u64, end, node->end); + vrange_del(root, vrange); + } + } + + /* Coalesce right-adjacent ranges */ + node = range_tree_in_range(root, end, end+1); + if (node) { + vrange = container_of(node, struct volatile_range, range_node); + /* Only coalesce if both are either purged or unpurged */ + if (vrange->purged == purged) { + /* resize range */ + start = min_t(u64, start, node->start); + end = max_t(u64, end, node->end); + vrange_del(root, vrange); + } + } + + new->mapping = mapping; + new->range_node.start = start; + new->range_node.end = end; + new->purged = purged; + + if (purged) { + struct inode *inode; + loff_t pstart, pend; + + inode = mapping->host; + pstart = start << PAGE_CACHE_SHIFT; + pend = ((end + 1) << PAGE_CACHE_SHIFT) - 1; + vmtruncate_range(inode, pstart, pend); + } + range_tree_add(root, &new->range_node); + if (range_on_lru(new)) + lru_add(new); + +out: + mutex_unlock(&volatile_mutex); + + return 0; +} + +/* + * Mark a region as nonvolatile, returns 1 if any pages in the region + * were purged. 
+ */ +long mapping_range_nonvolatile(struct address_space *mapping, + pgoff_t start_index, pgoff_t end_index) +{ + struct volatile_range *new; + struct range_tree_node *node; + struct range_tree_root *root; + int ret = 0; + u64 start, end; + int used_new = 0; + + start = (u64)start_index; + end = (u64)end_index; + + /* create new node */ + new = vrange_alloc(); + if (!new) + return -ENOMEM; + + mutex_lock(&volatile_mutex); + root = mapping_to_root(mapping); + if (!root) + root = mapping_allocate_root(mapping); + + node = range_tree_in_range(root, start, end); + while (node) { + struct volatile_range *vrange; + vrange = container_of(node, struct volatile_range, range_node); + + ret |= vrange->purged; + + if (start <= node->start && end >= node->end) { + /* delete: volatile range is totally within range */ + node = range_tree_next_in_range(&vrange->range_node, + start, end); + vrange_del(root, vrange); + } else if (node->start >= start) { + /* resize: volatile range right-overlaps range */ + volatile_range_resize(vrange, end+1, node->end); + node = range_tree_next_in_range(&vrange->range_node, + start, end); + + } else if (node->end <= end) { + /* resize: volatile range left-overlaps range */ + volatile_range_resize(vrange, node->start, start-1); + node = range_tree_next_in_range(&vrange->range_node, + start, end); + } else { + /* split: range is totally within a volatile range */ + used_new = 1; /* we only do this once */ + new->mapping = mapping; + new->range_node.start = end + 1; + new->range_node.end = node->end; + new->purged = vrange->purged; + range_tree_add(root, &new->range_node); + if (range_on_lru(new)) + lru_add(new); + volatile_range_resize(vrange, node->start, start-1); + + break; + } + } + mutex_unlock(&volatile_mutex); + + if (!used_new) + kfree(new); + + return ret; +} + + +/* + * Cleans up any volatile ranges. + */ +void mapping_clear_volatile_ranges(struct address_space *mapping) +{ + struct volatile_range *tozap; + struct range_tree_root *root; + + mutex_lock(&volatile_mutex); + + root = mapping_to_root(mapping); + if (!root) + goto out; + + while (!range_tree_empty(root)) { + struct range_tree_node *tmp; + tmp = range_tree_root_node(root); + tozap = container_of(tmp, struct volatile_range, range_node); + vrange_del(root, tozap); + } + mapping_free_root(root); +out: + mutex_unlock(&volatile_mutex); +} + +/* + * Purges volatile ranges when under memory pressure + */ +static int volatile_shrink(struct shrinker *ignored, struct shrink_control *sc) +{ + struct volatile_range *range, *next; + s64 nr_to_scan = sc->nr_to_scan; + const gfp_t gfp_mask = sc->gfp_mask; + + if (nr_to_scan && !(gfp_mask & __GFP_FS)) + return -1; + if (!nr_to_scan) + return lru_count; + + mutex_lock(&volatile_mutex); + list_for_each_entry_safe(range, next, &volatile_lru_list, lru) { + struct inode *inode; + loff_t start, end; + + inode = range->mapping->host; + + start = range->range_node.start << PAGE_CACHE_SHIFT; + end = ((range->range_node.end + 1) << PAGE_CACHE_SHIFT) - 1; + + /* + * XXX - calling vmtruncate_range from a shrinker causes + * lockdep warnings. Revisit this! 
+ */ + if (!vmtruncate_range(inode, start, end)) { + lru_del(range); + range->purged = 1; + nr_to_scan -= range_size(range); + } + + if (nr_to_scan <= 0) + break; + } + mutex_unlock(&volatile_mutex); + + return lru_count; +} + +static struct shrinker volatile_shrinker = { + .shrink = volatile_shrink, + .seeks = DEFAULT_SEEKS, +}; + +static int __init volatile_init(void) +{ + int i, size; + + size = 1U << mapping_hash_shift; + + mapping_hash = kzalloc(sizeof(mapping_hash)*size, GFP_KERNEL); + + for (i = 0; i < size; i++) + INIT_HLIST_HEAD(&mapping_hash[i]); + + register_shrinker(&volatile_shrinker); + + + return 0; +} + +arch_initcall(volatile_init); -- 1.7.3.2.146.gca209 ^ permalink raw reply related [flat|nested] 31+ messages in thread
* [PATCH 0/2] [RFC] fadivse volatile & range tree (v5) @ 2012-03-21 4:15 John Stultz 2012-03-21 4:15 ` [PATCH 2/2] [RFC] fadvise: Add _VOLATILE,_ISVOLATILE, and _NONVOLATILE flags John Stultz 0 siblings, 1 reply; 31+ messages in thread From: John Stultz @ 2012-03-21 4:15 UTC (permalink / raw) To: linux-kernel Cc: John Stultz, Andrew Morton, Android Kernel Team, Robert Love, Mel Gorman, Hugh Dickins, Dave Hansen, Rik van Riel, Dmitry Adamushko, Dave Chinner, Neil Brown, Andrea Righi, Aneesh Kumar K.V Just another quick iteration here trying to address some of the comments made on my last patchset. I've merged most of the easy fixes and feedback in. Also I've improved the volatile range coalescing/removing code to avoid re-starting lookups every time from the root node. We also avoid coalescing with neighboring ranges that have been purged. Although since we coalesce overlapping purged ranges, there are still some semantics to work out there (ie: do we immediately zap the newly volatile range if we coalesce w/ a purged range? or break it into smaller un-purged sub-ranges?). I haven't yet been able to really digest the prio_tree code, that Dmitry suggested, but it seem like it might be applicable here. One issue there is that the start,last tuples are longs, so changes there might be necessary to handle 64bit file ranges. We'll see. Also, I still want to try implementing DaveC's suggested radix tree tag method. It would be significantly different from this, so I wanted to get all the feedback for this "branch" of investigation merged and published, so I can come back to it if the radix tree tag idea doesn't pan out. Thanks again for all the great feedback! -john CC: Andrew Morton <akpm@linux-foundation.org> CC: Android Kernel Team <kernel-team@android.com> CC: Robert Love <rlove@google.com> CC: Mel Gorman <mel@csn.ul.ie> CC: Hugh Dickins <hughd@google.com> CC: Dave Hansen <dave@linux.vnet.ibm.com> CC: Rik van Riel <riel@redhat.com> CC: Dmitry Adamushko <dmitry.adamushko@gmail.com> CC: Dave Chinner <david@fromorbit.com> CC: Neil Brown <neilb@suse.de> CC: Andrea Righi <andrea@betterlinux.com> CC: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com> John Stultz (2): [RFC] Range tree implementation [RFC] fadvise: Add _VOLATILE,_ISVOLATILE, and _NONVOLATILE flags fs/inode.c | 4 + include/linux/fadvise.h | 5 + include/linux/fs.h | 2 + include/linux/rangetree.h | 56 ++++++++ include/linux/volatile.h | 14 ++ lib/Makefile | 2 +- lib/rangetree.c | 124 ++++++++++++++++ mm/Makefile | 2 +- mm/fadvise.c | 16 ++- mm/volatile.c | 342 +++++++++++++++++++++++++++++++++++++++++++++ 10 files changed, 564 insertions(+), 3 deletions(-) create mode 100644 include/linux/rangetree.h create mode 100644 include/linux/volatile.h create mode 100644 lib/rangetree.c create mode 100644 mm/volatile.c -- 1.7.3.2.146.gca209 ^ permalink raw reply [flat|nested] 31+ messages in thread
* [PATCH 2/2] [RFC] fadvise: Add _VOLATILE,_ISVOLATILE, and _NONVOLATILE flags 2012-03-21 4:15 [PATCH 0/2] [RFC] fadvise volatile & range tree (v5) John Stultz @ 2012-03-21 4:15 ` John Stultz 0 siblings, 0 replies; 31+ messages in thread From: John Stultz @ 2012-03-21 4:15 UTC (permalink / raw) To: linux-kernel Cc: John Stultz, Andrew Morton, Android Kernel Team, Robert Love, Mel Gorman, Hugh Dickins, Dave Hansen, Rik van Riel, Dmitry Adamushko, Dave Chinner, Neil Brown, Andrea Righi, Aneesh Kumar K.V This patch provides new fadvise flags that can be used to mark file pages as volatile, which will allow it to be discarded if the kernel wants to reclaim memory. This is useful for userspace to allocate things like caches, and lets the kernel destructively (but safely) reclaim them when there's memory pressure. It's different from FADV_DONTNEED since the pages are not immediately discarded; they are only discarded under pressure. This is very much influenced by the Android Ashmem interface by Robert Love, so credits to him and the Android developers. In many cases the code & logic come directly from the ashmem patch. The intent of this patch is to allow for ashmem-like behavior, but to embed the idea a little deeper into the VM code, instead of isolating it into a specific driver. I'm very much a newbie at the VM code, so at this point I just want to try to get some input on the patch. If you have another idea for using something other than fadvise, or other thoughts on how the volatile ranges are stored, I'd be really interested in hearing them. So let me know if you have any comments or feedback! Also many thanks to Dave Hansen who helped design and develop the initial version of this patch, and has provided continued review and mentoring for me in the VM code. v2: * After the valid critique that just dropping pages would poke holes in volatile ranges, and instead we should zap an entire range if we drop any of it, I changed the code to more closely mimic the ashmem implementation, which zaps entire ranges via a shrinker using an lru list that tracks which range has been marked volatile the longest. v3: * Reworked to use range tree implementation. v4: * Renamed functions to avoid confusion. * More consistent PAGE_CACHE_SHIFT usage, suggested by Dmitry Adamushko * Fix an exit-without-unlocking issue found by Dmitry Adamushko * Migrate to rbtree based rangetree implementation * Simplified locking to use global lock (we were grabbing global lru lock every time anyway). * Avoid ENOMEM issues by allocating before we get into complicated code. * Add some documentation to the volatile.c file from Neil Brown v5: * More fixes suggested by Dmitry Adamushko * Improve range coalescing so that we don't coalesce neighboring purged ranges. * Utilize range_tree_next_in_range to avoid doing every lookup from the tree's root. Known issues: * We still coalesce with overlapping purged ranges, so we need to figure out the best way to handle the purged state. * Lockdep doesn't like calling vmtruncate_range() from a shrinker. Any help here on how to address this would be appreciated. I've tried switching to invalidate_inode_pages2_range, but that always returns EBUSY in my testing; also, I don't really want to launder dirty pages, I want to zap them. 
* Concern over bloating the address_space struct CC: Andrew Morton <akpm@linux-foundation.org> CC: Android Kernel Team <kernel-team@android.com> CC: Robert Love <rlove@google.com> CC: Mel Gorman <mel@csn.ul.ie> CC: Hugh Dickins <hughd@google.com> CC: Dave Hansen <dave@linux.vnet.ibm.com> CC: Rik van Riel <riel@redhat.com> CC: Dmitry Adamushko <dmitry.adamushko@gmail.com> CC: Dave Chinner <david@fromorbit.com> CC: Neil Brown <neilb@suse.de> CC: Andrea Righi <andrea@betterlinux.com> CC: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com> Signed-off-by: John Stultz <john.stultz@linaro.org> fadvise volatile fixes from Dmitry Right after sending this I realized I had forgotten to include some fixes for issues Dmitry pointed out. So I've included them here. Signed-off-by: John Stultz <john.stultz@linaro.org> another fix from demitry Handle coalescing in a slightly smarter way use range_tree_nexT_in_range --- fs/inode.c | 4 + include/linux/fadvise.h | 5 + include/linux/fs.h | 2 + include/linux/volatile.h | 14 ++ mm/Makefile | 2 +- mm/fadvise.c | 16 ++- mm/volatile.c | 342 ++++++++++++++++++++++++++++++++++++++++++++++ 7 files changed, 383 insertions(+), 2 deletions(-) create mode 100644 include/linux/volatile.h create mode 100644 mm/volatile.c diff --git a/fs/inode.c b/fs/inode.c index d3ebdbe..f602dc2 100644 --- a/fs/inode.c +++ b/fs/inode.c @@ -27,6 +27,7 @@ #include <linux/cred.h> #include <linux/buffer_head.h> /* for inode_has_buffers */ #include <linux/ratelimit.h> +#include <linux/volatile.h> #include "internal.h" /* @@ -254,6 +255,7 @@ void __destroy_inode(struct inode *inode) if (inode->i_default_acl && inode->i_default_acl != ACL_NOT_CACHED) posix_acl_release(inode->i_default_acl); #endif + mapping_clear_volatile_ranges(&inode->i_data); this_cpu_dec(nr_inodes); } EXPORT_SYMBOL(__destroy_inode); @@ -360,6 +362,8 @@ void address_space_init_once(struct address_space *mapping) spin_lock_init(&mapping->private_lock); INIT_RAW_PRIO_TREE_ROOT(&mapping->i_mmap); INIT_LIST_HEAD(&mapping->i_mmap_nonlinear); + range_tree_init(&mapping->volatile_root); + } EXPORT_SYMBOL(address_space_init_once); diff --git a/include/linux/fadvise.h b/include/linux/fadvise.h index e8e7471..443951c 100644 --- a/include/linux/fadvise.h +++ b/include/linux/fadvise.h @@ -18,4 +18,9 @@ #define POSIX_FADV_NOREUSE 5 /* Data will be accessed once. 
*/ #endif +#define POSIX_FADV_VOLATILE 8 /* _can_ toss, but don't toss now */ +#define POSIX_FADV_NONVOLATILE 9 /* Remove VOLATILE flag */ + + + #endif /* FADVISE_H_INCLUDED */ diff --git a/include/linux/fs.h b/include/linux/fs.h index 69cd5bb..2e20be1 100644 --- a/include/linux/fs.h +++ b/include/linux/fs.h @@ -10,6 +10,7 @@ #include <linux/ioctl.h> #include <linux/blk_types.h> #include <linux/types.h> +#include <linux/rangetree.h> /* * It's silly to have NR_OPEN bigger than NR_FILE, but you can change @@ -655,6 +656,7 @@ struct address_space { spinlock_t private_lock; /* for use by the address_space */ struct list_head private_list; /* ditto */ struct address_space *assoc_mapping; /* ditto */ + struct range_tree_root volatile_root; /* volatile range list */ } __attribute__((aligned(sizeof(long)))); /* * On most architectures that alignment is already the case; but diff --git a/include/linux/volatile.h b/include/linux/volatile.h new file mode 100644 index 0000000..5460d7b --- /dev/null +++ b/include/linux/volatile.h @@ -0,0 +1,14 @@ +#ifndef _LINUX_VOLATILE_H +#define _LINUX_VOLATILE_H + +#include <linux/fs.h> + +extern long mapping_range_volatile(struct address_space *mapping, + pgoff_t start_index, pgoff_t end_index); +extern long mapping_range_nonvolatile(struct address_space *mapping, + pgoff_t start_index, pgoff_t end_index); +extern long mapping_range_isvolatile(struct address_space *mapping, + pgoff_t start_index, pgoff_t end_index); +extern void mapping_clear_volatile_ranges(struct address_space *mapping); + +#endif /* _LINUX_VOLATILE_H */ diff --git a/mm/Makefile b/mm/Makefile index 50ec00e..7b6c7a8 100644 --- a/mm/Makefile +++ b/mm/Makefile @@ -13,7 +13,7 @@ obj-y := filemap.o mempool.o oom_kill.o fadvise.o \ readahead.o swap.o truncate.o vmscan.o shmem.o \ prio_tree.o util.o mmzone.o vmstat.o backing-dev.o \ page_isolation.o mm_init.o mmu_context.o percpu.o \ - $(mmu-y) + volatile.o $(mmu-y) obj-y += init-mm.o ifdef CONFIG_NO_BOOTMEM diff --git a/mm/fadvise.c b/mm/fadvise.c index 469491e0..3e33845 100644 --- a/mm/fadvise.c +++ b/mm/fadvise.c @@ -17,6 +17,7 @@ #include <linux/fadvise.h> #include <linux/writeback.h> #include <linux/syscalls.h> +#include <linux/volatile.h> #include <asm/unistd.h> @@ -106,7 +107,7 @@ SYSCALL_DEFINE(fadvise64_64)(int fd, loff_t offset, loff_t len, int advice) nrpages = end_index - start_index + 1; if (!nrpages) nrpages = ~0UL; - + ret = force_page_cache_readahead(mapping, file, start_index, nrpages); @@ -128,6 +129,19 @@ SYSCALL_DEFINE(fadvise64_64)(int fd, loff_t offset, loff_t len, int advice) invalidate_mapping_pages(mapping, start_index, end_index); break; + case POSIX_FADV_VOLATILE: + /* First and last PARTIAL page! */ + start_index = offset >> PAGE_CACHE_SHIFT; + end_index = endbyte >> PAGE_CACHE_SHIFT; + ret = mapping_range_volatile(mapping, start_index, end_index); + break; + case POSIX_FADV_NONVOLATILE: + /* First and last PARTIAL page! */ + start_index = offset >> PAGE_CACHE_SHIFT; + end_index = endbyte >> PAGE_CACHE_SHIFT; + ret = mapping_range_nonvolatile(mapping, start_index, + end_index); + break; default: ret = -EINVAL; } diff --git a/mm/volatile.c b/mm/volatile.c new file mode 100644 index 0000000..85d5ece --- /dev/null +++ b/mm/volatile.c @@ -0,0 +1,342 @@ +/* mm/volatile.c + * + * Volatile page range managment. + * Copyright 2011 Linaro + * + * Based on mm/ashmem.c + * by Robert Love <rlove@google.com> + * Copyright (C) 2008 Google, Inc. 
+ * + * + * This software is licensed under the terms of the GNU General Public + * License version 2, as published by the Free Software Foundation, and + * may be copied, distributed, and modified under those terms. + * + * This program is distributed in the hope that it will be useful, + * but WITHOUT ANY WARRANTY; without even the implied warranty of + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the + * GNU General Public License for more details. + * + * + * The goal behind volatile ranges is to allow applications to interact + * with the kernel's cache management infrastructure. In particular an + * application can say "this memory contains data that might be useful in + * the future, but can be reconstructed if necessary, so if the kernel + * needs, it can zap and reclaim this memory without having to swap it out. + * + * The proposed mechanism - at a high level - is for user-space to be able + * to say "This memory is volatile" and then later "this memory is no longer + * volatile". If the content of the memory is still available the second + * request succeeds. If not, the memory is marked non-volatile and an + * error is returned to denote that the contents have been lost. + * + * Credits to Neil Brown for the above description. + * + */ + +#include <linux/kernel.h> +#include <linux/fs.h> +#include <linux/mm.h> +#include <linux/slab.h> +#include <linux/pagemap.h> +#include <linux/volatile.h> + +static DEFINE_MUTEX(volatile_mutex); + +struct volatile_range { + struct list_head lru; + struct range_tree_node range_node; + + unsigned int purged; + struct address_space *mapping; +}; + +/* LRU list of volatile page ranges */ +static LIST_HEAD(volatile_lru_list); + +/* Count of pages on our LRU list */ +static u64 lru_count; + + +static inline u64 range_size(struct volatile_range *range) +{ + return range->range_node.end - range->range_node.start + 1; +} + +static inline void lru_add(struct volatile_range *range) +{ + list_add_tail(&range->lru, &volatile_lru_list); + lru_count += range_size(range); +} + +static inline void lru_del(struct volatile_range *range) +{ + list_del(&range->lru); + lru_count -= range_size(range); +} + +#define range_on_lru(range) (!(range)->purged) + + +static inline void volatile_range_resize(struct volatile_range *range, + pgoff_t start_index, pgoff_t end_index) +{ + size_t pre = range_size(range); + + range->range_node.start = start_index; + range->range_node.end = end_index; + + if (range_on_lru(range)) + lru_count -= pre - range_size(range); +} + +static struct volatile_range *vrange_alloc(void) +{ + struct volatile_range *new; + + new = kzalloc(sizeof(struct volatile_range), GFP_KERNEL); + if (!new) + return 0; + range_tree_node_init(&new->range_node); + return new; +} + +static void vrange_del(struct volatile_range *vrange) +{ + struct address_space *mapping; + mapping = vrange->mapping; + + if (range_on_lru(vrange)) + lru_del(vrange); + range_tree_remove(&mapping->volatile_root, &vrange->range_node); + kfree(vrange); +} + + + +/* + * Mark a region as volatile, allowing dirty pages to be purged + * under memory pressure + */ +long mapping_range_volatile(struct address_space *mapping, + pgoff_t start_index, pgoff_t end_index) +{ + struct volatile_range *new; + struct range_tree_node *node; + struct volatile_range *vrange; + + u64 start, end; + int purged = 0; + start = (u64)start_index; + end = (u64)end_index; + + new = vrange_alloc(); + if (!new) + return -ENOMEM; + + mutex_lock(&volatile_mutex); + + /* Find any existing ranges that overlap 
*/ + node = range_tree_in_range(&mapping->volatile_root, start, end); + while (node) { + /* Already entirely marked volatile, so we're done */ + if (node->start < start && node->end > end) { + /* don't need the allocated value */ + kfree(new); + goto out; + } + + /* Grab containing volatile range */ + vrange = container_of(node, struct volatile_range, range_node); + + /* resize range */ + start = min_t(u64, start, node->start); + end = max_t(u64, end, node->end); + purged |= vrange->purged; + + node = range_tree_next_in_range(&vrange->range_node, + start, end); + vrange_del(vrange); + } + + /* Coalesce unpurged left-adjacent ranges */ + node = range_tree_in_range(&mapping->volatile_root, start-1, start); + if (node) { + vrange = container_of(node, struct volatile_range, range_node); + if (!vrange->purged) { + /* resize range */ + start = min_t(u64, start, node->start); + end = max_t(u64, end, node->end); + vrange_del(vrange); + } + } + + /* Coalesce unpurged right-adjacent ranges */ + node = range_tree_in_range(&mapping->volatile_root, end, end+1); + if (node) { + vrange = container_of(node, struct volatile_range, range_node); + if (!vrange->purged) { + /* resize range */ + start = min_t(u64, start, node->start); + end = max_t(u64, end, node->end); + vrange_del(vrange); + } + } + + new->mapping = mapping; + new->range_node.start = start; + new->range_node.end = end; + new->purged = purged; + + range_tree_add(&mapping->volatile_root, &new->range_node); + if (range_on_lru(new)) + lru_add(new); + +out: + mutex_unlock(&volatile_mutex); + + return 0; +} + +/* + * Mark a region as nonvolatile, returns 1 if any pages in the region + * were purged. + */ +long mapping_range_nonvolatile(struct address_space *mapping, + pgoff_t start_index, pgoff_t end_index) +{ + struct volatile_range *new; + struct range_tree_node *node; + int ret = 0; + u64 start, end; + int used_new = 0; + + start = (u64)start_index; + end = (u64)end_index; + + /* create new node */ + new = vrange_alloc(); + if (!new) + return -ENOMEM; + + mutex_lock(&volatile_mutex); + node = range_tree_in_range(&mapping->volatile_root, start, end); + while (node) { + struct volatile_range *vrange; + vrange = container_of(node, struct volatile_range, range_node); + + ret |= vrange->purged; + + if (start <= node->start && end >= node->end) { + /* delete: volatile range is totally within range */ + node = range_tree_next_in_range(&vrange->range_node, + start, end); + vrange_del(vrange); + } else if (node->start >= start) { + /* resize: volatile range right-overlaps range */ + volatile_range_resize(vrange, end+1, node->end); + node = range_tree_next_in_range(&vrange->range_node, + start, end); + + } else if (node->end <= end) { + /* resize: volatile range left-overlaps range */ + volatile_range_resize(vrange, node->start, start-1); + node = range_tree_next_in_range(&vrange->range_node, + start, end); + } else { + /* split: range is totally within a volatile range */ + used_new = 1; /* we only do this once */ + new->mapping = mapping; + new->range_node.start = end + 1; + new->range_node.end = node->end; + new->purged = vrange->purged; + range_tree_add(&mapping->volatile_root, + &new->range_node); + if (range_on_lru(new)) + lru_add(new); + volatile_range_resize(vrange, node->start, start-1); + + break; + } + } + mutex_unlock(&volatile_mutex); + + if (!used_new) + kfree(new); + + return ret; +} + + +/* + * Cleans up any volatile ranges. 
+ */ +void mapping_clear_volatile_ranges(struct address_space *mapping) +{ + struct volatile_range *tozap; + + mutex_lock(&volatile_mutex); + while (!range_tree_empty(&mapping->volatile_root)) { + struct range_tree_node *tmp; + tmp = range_tree_root_node(&mapping->volatile_root); + tozap = container_of(tmp, struct volatile_range, range_node); + vrange_del(tozap); + + } + mutex_unlock(&volatile_mutex); +} + +/* + * Purges volatile ranges when under memory pressure + */ +static int volatile_shrink(struct shrinker *ignored, struct shrink_control *sc) +{ + struct volatile_range *range, *next; + s64 nr_to_scan = sc->nr_to_scan; + const gfp_t gfp_mask = sc->gfp_mask; + + if (nr_to_scan && !(gfp_mask & __GFP_FS)) + return -1; + if (!nr_to_scan) + return lru_count; + + mutex_lock(&volatile_mutex); + list_for_each_entry_safe(range, next, &volatile_lru_list, lru) { + struct inode *inode; + loff_t start, end; + + inode = range->mapping->host; + + start = range->range_node.start << PAGE_CACHE_SHIFT; + end = ((range->range_node.end + 1) << PAGE_CACHE_SHIFT) - 1; + + /* + * XXX - calling vmtruncate_range from a shrinker causes + * lockdep warnings. Revisit this! + */ + if (!vmtruncate_range(inode, start, end)) { + lru_del(range); + range->purged = 1; + nr_to_scan -= range_size(range); + } + + if (nr_to_scan <= 0) + break; + } + mutex_unlock(&volatile_mutex); + + return lru_count; +} + +static struct shrinker volatile_shrinker = { + .shrink = volatile_shrink, + .seeks = DEFAULT_SEEKS, +}; + +static int __init volatile_init(void) +{ + register_shrinker(&volatile_shrinker); + return 0; +} + +arch_initcall(volatile_init); -- 1.7.3.2.146.gca209 ^ permalink raw reply related [flat|nested] 31+ messages in thread
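For completeness, here is roughly what a user-space caller of the interface above might look like. It is a sketch, not part of the patch set: the advice values are copied from the fadvise.h hunk above, the raw syscall is used on the assumption that the libc posix_fadvise() wrapper may not hand a positive "was purged" return value back to the caller, SYS_fadvise64 as on x86-64 is assumed, and the file is placed on tmpfs to match the testing described elsewhere in the thread.

/* Build: cc -o vtest vtest.c   (x86-64 assumed for SYS_fadvise64) */
#include <fcntl.h>
#include <stdio.h>
#include <sys/syscall.h>
#include <unistd.h>

#define POSIX_FADV_VOLATILE	8	/* values from the patch above */
#define POSIX_FADV_NONVOLATILE	9

int main(void)
{
	int fd = open("/dev/shm/cache-file", O_RDWR | O_CREAT, 0600);
	long purged;

	if (fd < 0)
		return 1;
	if (ftruncate(fd, 1 << 20))
		return 1;

	/* Mark the whole file volatile: the kernel may now purge the
	 * range under memory pressure. */
	syscall(SYS_fadvise64, fd, 0L, (long)(1 << 20), POSIX_FADV_VOLATILE);

	/* ... later, before touching the cached data again ... */
	purged = syscall(SYS_fadvise64, fd, 0L, (long)(1 << 20),
			 POSIX_FADV_NONVOLATILE);
	if (purged > 0)
		printf("contents were purged; rebuild the cache\n");

	return 0;
}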
* [PATCH 1/2] [RFC] Range tree implementation @ 2012-02-10 0:16 John Stultz 2012-02-10 0:16 ` [PATCH 2/2] [RFC] fadvise: Add _VOLATILE,_ISVOLATILE, and _NONVOLATILE flags John Stultz 0 siblings, 1 reply; 31+ messages in thread From: John Stultz @ 2012-02-10 0:16 UTC (permalink / raw) To: linux-kernel Cc: John Stultz, Andrew Morton, Android Kernel Team, Robert Love, Mel Gorman, Hugh Dickins, Dave Hansen, Rik van Riel After Andrew suggested something like his mumbletree idea to better store a list of ranges, I worked on a few different approaches, and this is what I've finally managed to get working. I suspect range-tree isn't a totally accurate name, but I couldn't quite make out the difference between range trees and interval trees, so I just picked one to call it. Do let me know if you have a better name. The idea of storing ranges in a tree is nice, but has a number of complications. When adding a range, it's possible that a large range will consume and merge a number of smaller ranges. When removing a range, it's possible you may end up splitting an existing range, causing one range to become two. This makes it very difficult to provide generic list_head-like behavior, as the parent structures would need to be duplicated and removed, and that has lots of memory ownership issues. So, this is a much simplified and more list_head-like implementation. You can add a node to a tree, or remove a node from a tree, but the generic implementation doesn't do the merging or splitting for you. But it does provide helpers to find overlapping and adjacent ranges. I made the tree self-balancing via splaying, as that seemed easier to handle with the merging/splitting cases I originally tried to make the generic code handle; since I've dropped that, I suspect it can be reworked to use an rbtree. I just wanted to get this out for initial review. Andrew also really wanted this range-tree implementation to be reusable so we don't duplicate the file locking logic. I'm not totally convinced that the requirements between the volatile ranges and file locking are really equivalent, but this reduced implementation may make it possible. Do let me know what you think or if you have other ideas for better ways to do the same. 
thanks -john CC: Andrew Morton <akpm@linux-foundation.org> CC: Android Kernel Team <kernel-team@android.com> CC: Robert Love <rlove@google.com> CC: Mel Gorman <mel@csn.ul.ie> CC: Hugh Dickins <hughd@google.com> CC: Dave Hansen <dave@linux.vnet.ibm.com> CC: Rik van Riel <riel@redhat.com> Signed-off-by: John Stultz <john.stultz@linaro.org> --- include/linux/rangetree.h | 35 +++++ lib/Makefile | 2 +- lib/rangetree.c | 325 +++++++++++++++++++++++++++++++++++++++++++++ 3 files changed, 361 insertions(+), 1 deletions(-) create mode 100644 include/linux/rangetree.h create mode 100644 lib/rangetree.c diff --git a/include/linux/rangetree.h b/include/linux/rangetree.h new file mode 100644 index 0000000..998ebcc --- /dev/null +++ b/include/linux/rangetree.h @@ -0,0 +1,35 @@ +#ifndef _LINUX_RANGETREE_H +#define _LINUX_RANGETREE_H + +#include <linux/types.h> +#include <linux/fs.h> + +struct range_tree_node { + struct range_tree_node *parent; + struct range_tree_node *left; + struct range_tree_node *right; + u64 start; + u64 end; +}; + +static inline void range_tree_node_init(struct range_tree_node *node) +{ + node->parent = NULL; + node->left = NULL; + node->right = NULL; + node->start = 0; + node->end = 0; +} + +extern struct range_tree_node *range_tree_in_range(struct range_tree_node *root, + u64 start, u64 end); +extern struct range_tree_node *range_tree_in_range_adjacent( + struct range_tree_node *root, + u64 start, u64 end); +extern struct range_tree_node *range_tree_add(struct range_tree_node *root, + struct range_tree_node *node); +extern struct range_tree_node *range_tree_remove(struct range_tree_node *root, + struct range_tree_node *node); +#endif + + diff --git a/lib/Makefile b/lib/Makefile index 18515f0..f43ef0d 100644 --- a/lib/Makefile +++ b/lib/Makefile @@ -12,7 +12,7 @@ lib-y := ctype.o string.o vsprintf.o cmdline.o \ idr.o int_sqrt.o extable.o prio_tree.o \ sha1.o md5.o irq_regs.o reciprocal_div.o argv_split.o \ proportions.o prio_heap.o ratelimit.o show_mem.o \ - is_single_threaded.o plist.o decompress.o + is_single_threaded.o plist.o decompress.o rangetree.o lib-$(CONFIG_MMU) += ioremap.o lib-$(CONFIG_SMP) += cpumask.o diff --git a/lib/rangetree.c b/lib/rangetree.c new file mode 100644 index 0000000..db20665 --- /dev/null +++ b/lib/rangetree.c @@ -0,0 +1,325 @@ +#include <linux/rangetree.h> +#include <linux/kernel.h> +#include <linux/slab.h> + +/** + * rotate_right - Splay tree helper + * @node: node to be rotated right + * @root: tree root + * + * Returns the tree root after rotating the node right + */ +static struct range_tree_node *rotate_right(struct range_tree_node *node, + struct range_tree_node *root) +{ + struct range_tree_node *new_root, *new_right; + + if (!node->left) + return root; + + new_root = node->left; + new_right = node; + + if (root == node) + root = new_root; + + new_right->left = new_root->right; + new_root->parent = new_right->parent; + if (new_root->parent) { + if (new_root->parent->right == new_right) + new_root->parent->right = new_root; + else + new_root->parent->left = new_root; + } + new_right->parent = new_root; + + new_root->right = new_right; + + if (new_right->left) + new_right->left->parent = new_right; + + + /* Paranoid sanity checking */ + if (new_root->left) + BUG_ON(new_root->left->parent != new_root); + if (new_root->right) + BUG_ON(new_root->right->parent != new_root); + if (new_right->left) + BUG_ON(new_right->left->parent != new_right); + if (new_right->right) + BUG_ON(new_right->right->parent != new_right); + + + return root; + +} + +/** + * 
rotate_left - Splay tree helper + * @node: node to be rotated left + * @root: tree root + * + * Returns the tree root after rotating the node left + */ +static struct range_tree_node *rotate_left(struct range_tree_node *node, + struct range_tree_node *root) +{ + struct range_tree_node *new_root, *new_left; + + if (!node->right) + return root; + + new_root = node->right; + new_left = node; + + if (root == node) + root = new_root; + + new_left->right = new_root->left; + if (new_left->right) + new_left->right->parent = new_left; + new_root->parent = new_left->parent; + if (new_root->parent) { + if (new_root->parent->left == new_left) + new_root->parent->left = new_root; + else + new_root->parent->right = new_root; + } + new_left->parent = new_root; + new_root->left = new_left; + + + /* Paranoid sanity checking */ + if (new_root->left) + BUG_ON(new_root->left->parent != new_root); + if (new_root->right) + BUG_ON(new_root->right->parent != new_root); + if (new_left->left) + BUG_ON(new_left->left->parent != new_left); + if (new_left->right) + BUG_ON(new_left->right->parent != new_left); + + return root; +} + +/** + * splay_tree Splays a node to the top of a tree + * @root: root of the splay tree + * @node: node to be splayed to the root + * + * Returns the root of a tree after splaying + */ +static struct range_tree_node *splay_tree(struct range_tree_node *root, + struct range_tree_node *node) +{ +restart: + if (root == node) + return root; + + if (node->parent == root) { + if (root->left == node) + root = rotate_right(root, root); + else + root = rotate_left(root, root); + return root; + } else { + struct range_tree_node *parent, *grandparent; + + parent = node->parent; + grandparent = parent->parent; + + if ((node == parent->left) && (parent == grandparent->left)) { + root = rotate_right(grandparent, root); + root = rotate_right(parent, root); + } else if ((node == parent->right) && + (parent == grandparent->right)) { + root = rotate_left(grandparent, root); + root = rotate_left(parent, root); + } else if ((node == parent->right) && + (parent == grandparent->left)) { + root = rotate_left(parent, root); + root = rotate_right(grandparent, root); + } else if ((node == parent->left) && + (parent == grandparent->right)) { + root = rotate_right(parent, root); + root = rotate_left(grandparent, root); + } else { + BUG_ON(1); /* Something is odd */ + } + goto restart; + } +} + + +/** + * range_tree_in_range - Returns the first node that overlaps with the + * given range + * @root: range_tree root + * @start: range start + * @end: range end + * + */ +struct range_tree_node *range_tree_in_range(struct range_tree_node *root, + u64 start, u64 end) +{ + struct range_tree_node *candidate = root; + + if (!candidate) + return 0; +restart: + if (end < candidate->start) { + if (candidate->left) { + candidate = candidate->left; + goto restart; + } + } else if (start > candidate->end) { + if (candidate->right) { + candidate = candidate->right; + goto restart; + } + } else + return candidate; + return 0; +} + + +/** + * range_tree_in_range - Returns the first node that overlaps or is adjacent + * with the given range + * @root: range_tree root + * @start: range start + * @end: range end + * + */ +struct range_tree_node *range_tree_in_range_adjacent( + struct range_tree_node *root, + u64 start, u64 end) +{ + struct range_tree_node *candidate = root; + + if (!candidate) + return 0; +restart: + if (end + 1 < candidate->start) { + if (candidate->left) { + candidate = candidate->left; + goto restart; + } + } else if 
(start > candidate->end + 1) { + if (candidate->right) { + candidate = candidate->right; + goto restart; + } + } else + return candidate; + return 0; +} + +/** + * range_tree_add - Add a node to a range tree + * @root: range tree to be added to + * @node: range_tree_node to be added + * + * Adds a node to the range tree. + */ +struct range_tree_node *range_tree_add(struct range_tree_node *root, + struct range_tree_node *node) +{ + struct range_tree_node *candidate; + /* make sure its not connected */ + BUG_ON(node->parent || node->left || node->right); + + if (!root) + return node; + + candidate = root; +restart: + if (node->start < candidate->start) { + if (candidate->left) { + candidate = candidate->left; + goto restart; + } + candidate->left = node; + node->parent = candidate; + } else if (node->start > candidate->start) { + if (candidate->right) { + candidate = candidate->right; + goto restart; + } + candidate->right = node; + node->parent = candidate; + } + + root = splay_tree(root, node); + return root; +} + +/** + * range_tree_merge - Helper to merge two range sub-trees + * @left: left subtree to be merged + * @right: right subtree to be merged + * + * Returns a merged range tree of two subtrees. left subtree + * must be all less then the right subtree. + */ +static struct range_tree_node *range_tree_merge(struct range_tree_node *left, + struct range_tree_node *right) +{ + struct range_tree_node *merge; + + if (!left) + return right; + if (!right) + return left; + + merge = left; + /* grab the right-most node on the left side */ + while (merge->right) + merge = merge->right; + merge->right = right; + if (right) + right->parent = merge; + + return left; +} + +/** + * null_node: Helper that clears node data + * @node: Node to be cleared + */ +static void null_node(struct range_tree_node *node) +{ + node->left = node->right = node->parent = NULL; + node->start = node->end = 0; +} + +/** + * range_tree_remove: Removes a given node from the tree + * @root: root of tree + * @node: Node to be removed + * + * Removes a node and splays the tree + */ +struct range_tree_node *range_tree_remove(struct range_tree_node *root, + struct range_tree_node *node) +{ + struct range_tree_node *subtree; + + subtree = range_tree_merge(node->left, node->right); + + if (subtree) + subtree->parent = node->parent; + + if (node->parent && node->parent->left == node) + node->parent->left = subtree; + if (node->parent && node->parent->right == node) + node->parent->right = subtree; + + null_node(node); + if (node == root) + return subtree; + + if (subtree) + root = splay_tree(root, subtree); + return root; +} -- 1.7.3.2.146.gca209 ^ permalink raw reply related [flat|nested] 31+ messages in thread
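Because the library above deliberately leaves merging and splitting to the caller, a user that wants "insert and coalesce" behaviour has to combine the helpers itself, much as mm/volatile.c does in the companion patch. A minimal sketch of that pattern follows; insert_merged and struct my_range are illustrative names, and only the API declared in rangetree.h above is used.

struct my_range {
	struct range_tree_node node;
	/* caller-private data would go here */
};

/* Insert [start, end], absorbing any overlapping or adjacent ranges.
 * Returns the (possibly new) tree root.  Freeing of absorbed entries
 * is the caller's job, since the library does no allocation itself. */
static struct range_tree_node *insert_merged(struct range_tree_node *root,
					     struct my_range *new,
					     u64 start, u64 end)
{
	struct range_tree_node *hit;

	while ((hit = range_tree_in_range_adjacent(root, start, end))) {
		struct my_range *old = container_of(hit, struct my_range, node);

		/* Grow the new range to cover the absorbed one. */
		start = min(start, hit->start);
		end = max(end, hit->end);
		root = range_tree_remove(root, hit);
		kfree(old);
	}

	range_tree_node_init(&new->node);
	new->node.start = start;
	new->node.end = end;
	return range_tree_add(root, &new->node);
}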
* [PATCH 2/2] [RFC] fadvise: Add _VOLATILE,_ISVOLATILE, and _NONVOLATILE flags 2012-02-10 0:16 [PATCH 1/2] [RFC] Range tree implementation John Stultz @ 2012-02-10 0:16 ` John Stultz 2012-02-12 14:08 ` Dmitry Adamushko ` (2 more replies) 0 siblings, 3 replies; 31+ messages in thread From: John Stultz @ 2012-02-10 0:16 UTC (permalink / raw) To: linux-kernel Cc: John Stultz, Andrew Morton, Android Kernel Team, Robert Love, Mel Gorman, Hugh Dickins, Dave Hansen, Rik van Riel This patch provides new fadvise flags that can be used to mark file pages as volatile, which will allow it to be discarded if the kernel wants to reclaim memory. This is useful for userspace to allocate things like caches, and lets the kernel destructively (but safely) reclaim them when there's memory pressure. It's different from FADV_DONTNEED since the pages are not immediately discarded; they are only discarded under pressure. This is very much influenced by the Android Ashmem interface by Robert Love, so credits to him and the Android developers. In many cases the code & logic come directly from the ashmem patch. The intent of this patch is to allow for ashmem-like behavior, but to embed the idea a little deeper into the VM code, instead of isolating it into a specific driver. I'm very much a newbie at the VM code, so at this point I just want to try to get some input on the patch. If you have another idea for using something other than fadvise, or other thoughts on how the volatile ranges are stored, I'd be really interested in hearing them. So let me know if you have any comments or feedback! Also many thanks to Dave Hansen who helped design and develop the initial version of this patch, and has provided continued review and mentoring for me in the VM code. v2: After the valid critique that just dropping pages would poke holes in volatile ranges, and instead we should zap an entire range if we drop any of it, I changed the code to more closely mimic the ashmem implementation, which zaps entire ranges via a shrinker using an lru list that tracks which range has been marked volatile the longest. v3: Reworked to use range tree implementation. Known issues: * Not sure how to nicely error out if we get ENOMEM while splitting a range. * Lockdep doesn't like calling vmtruncate_range() from a shrinker. Any help here on how to address this would be appreciated. 
CC: Andrew Morton <akpm@linux-foundation.org> CC: Android Kernel Team <kernel-team@android.com> CC: Robert Love <rlove@google.com> CC: Mel Gorman <mel@csn.ul.ie> CC: Hugh Dickins <hughd@google.com> CC: Dave Hansen <dave@linux.vnet.ibm.com> CC: Rik van Riel <riel@redhat.com> Signed-off-by: John Stultz <john.stultz@linaro.org> --- fs/inode.c | 5 + include/linux/fadvise.h | 6 + include/linux/fs.h | 3 + include/linux/volatile.h | 14 ++ mm/Makefile | 2 +- mm/fadvise.c | 22 +++- mm/volatile.c | 314 ++++++++++++++++++++++++++++++++++++++++++++++ 7 files changed, 364 insertions(+), 2 deletions(-) create mode 100644 include/linux/volatile.h create mode 100644 mm/volatile.c diff --git a/fs/inode.c b/fs/inode.c index fb10d86..0675962 100644 --- a/fs/inode.c +++ b/fs/inode.c @@ -27,6 +27,7 @@ #include <linux/cred.h> #include <linux/buffer_head.h> /* for inode_has_buffers */ #include <linux/ratelimit.h> +#include <linux/volatile.h> #include "internal.h" /* @@ -254,6 +255,7 @@ void __destroy_inode(struct inode *inode) if (inode->i_default_acl && inode->i_default_acl != ACL_NOT_CACHED) posix_acl_release(inode->i_default_acl); #endif + mapping_clear_volatile_ranges(&inode->i_data); this_cpu_dec(nr_inodes); } EXPORT_SYMBOL(__destroy_inode); @@ -360,6 +362,9 @@ void address_space_init_once(struct address_space *mapping) spin_lock_init(&mapping->private_lock); INIT_RAW_PRIO_TREE_ROOT(&mapping->i_mmap); INIT_LIST_HEAD(&mapping->i_mmap_nonlinear); + mapping->volatile_root = NULL; + mutex_init(&mapping->vlist_mutex); + } EXPORT_SYMBOL(address_space_init_once); diff --git a/include/linux/fadvise.h b/include/linux/fadvise.h index e8e7471..988fb00 100644 --- a/include/linux/fadvise.h +++ b/include/linux/fadvise.h @@ -18,4 +18,10 @@ #define POSIX_FADV_NOREUSE 5 /* Data will be accessed once. 
*/ #endif +#define POSIX_FADV_VOLATILE 8 /* _can_ toss, but don't toss now */ +#define POSIX_FADV_NONVOLATILE 9 /* Remove VOLATILE flag */ +#define POSIX_FADV_ISVOLATILE 10 /* Returns volatile flag for region */ + + + #endif /* FADVISE_H_INCLUDED */ diff --git a/include/linux/fs.h b/include/linux/fs.h index 386da09..a784a9b 100644 --- a/include/linux/fs.h +++ b/include/linux/fs.h @@ -10,6 +10,7 @@ #include <linux/ioctl.h> #include <linux/blk_types.h> #include <linux/types.h> +#include <linux/rangetree.h> /* * It's silly to have NR_OPEN bigger than NR_FILE, but you can change @@ -655,6 +656,8 @@ struct address_space { spinlock_t private_lock; /* for use by the address_space */ struct list_head private_list; /* ditto */ struct address_space *assoc_mapping; /* ditto */ + struct range_tree_node *volatile_root; /* volatile range list */ + struct mutex vlist_mutex; /* protect volatile_list */ } __attribute__((aligned(sizeof(long)))); /* * On most architectures that alignment is already the case; but diff --git a/include/linux/volatile.h b/include/linux/volatile.h new file mode 100644 index 0000000..5460d7b --- /dev/null +++ b/include/linux/volatile.h @@ -0,0 +1,14 @@ +#ifndef _LINUX_VOLATILE_H +#define _LINUX_VOLATILE_H + +#include <linux/fs.h> + +extern long mapping_range_volatile(struct address_space *mapping, + pgoff_t start_index, pgoff_t end_index); +extern long mapping_range_nonvolatile(struct address_space *mapping, + pgoff_t start_index, pgoff_t end_index); +extern long mapping_range_isvolatile(struct address_space *mapping, + pgoff_t start_index, pgoff_t end_index); +extern void mapping_clear_volatile_ranges(struct address_space *mapping); + +#endif /* _LINUX_VOLATILE_H */ diff --git a/mm/Makefile b/mm/Makefile index 50ec00e..7b6c7a8 100644 --- a/mm/Makefile +++ b/mm/Makefile @@ -13,7 +13,7 @@ obj-y := filemap.o mempool.o oom_kill.o fadvise.o \ readahead.o swap.o truncate.o vmscan.o shmem.o \ prio_tree.o util.o mmzone.o vmstat.o backing-dev.o \ page_isolation.o mm_init.o mmu_context.o percpu.o \ - $(mmu-y) + volatile.o $(mmu-y) obj-y += init-mm.o ifdef CONFIG_NO_BOOTMEM diff --git a/mm/fadvise.c b/mm/fadvise.c index 469491e0..732258b 100644 --- a/mm/fadvise.c +++ b/mm/fadvise.c @@ -17,6 +17,7 @@ #include <linux/fadvise.h> #include <linux/writeback.h> #include <linux/syscalls.h> +#include <linux/volatile.h> #include <asm/unistd.h> @@ -106,7 +107,7 @@ SYSCALL_DEFINE(fadvise64_64)(int fd, loff_t offset, loff_t len, int advice) nrpages = end_index - start_index + 1; if (!nrpages) nrpages = ~0UL; - + ret = force_page_cache_readahead(mapping, file, start_index, nrpages); @@ -128,6 +129,25 @@ SYSCALL_DEFINE(fadvise64_64)(int fd, loff_t offset, loff_t len, int advice) invalidate_mapping_pages(mapping, start_index, end_index); break; + case POSIX_FADV_VOLATILE: + /* First and last PARTIAL page! */ + start_index = offset >> PAGE_CACHE_SHIFT; + end_index = endbyte >> PAGE_CACHE_SHIFT; + ret = mapping_range_volatile(mapping, start_index, end_index); + break; + case POSIX_FADV_NONVOLATILE: + /* First and last PARTIAL page! */ + start_index = offset >> PAGE_CACHE_SHIFT; + end_index = endbyte >> PAGE_CACHE_SHIFT; + ret = mapping_range_nonvolatile(mapping, start_index, + end_index); + break; + case POSIX_FADV_ISVOLATILE: + /* First and last PARTIAL page! 
*/ + start_index = offset >> PAGE_CACHE_SHIFT; + end_index = endbyte >> PAGE_CACHE_SHIFT; + ret = mapping_range_isvolatile(mapping, start_index, end_index); + break; default: ret = -EINVAL; } diff --git a/mm/volatile.c b/mm/volatile.c new file mode 100644 index 0000000..7ac1afd --- /dev/null +++ b/mm/volatile.c @@ -0,0 +1,314 @@ +/* mm/volatile.c + * + * Volatile page range managment. + * Copyright 2011 Linaro + * + * Based on mm/ashmem.c + * by Robert Love <rlove@google.com> + * Copyright (C) 2008 Google, Inc. + * + * + * This software is licensed under the terms of the GNU General Public + * License version 2, as published by the Free Software Foundation, and + * may be copied, distributed, and modified under those terms. + * + * This program is distributed in the hope that it will be useful, + * but WITHOUT ANY WARRANTY; without even the implied warranty of + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the + * GNU General Public License for more details. + */ + +#include <linux/kernel.h> +#include <linux/fs.h> +#include <linux/mm.h> +#include <linux/slab.h> +#include <linux/volatile.h> + + +struct volatile_range { + struct list_head lru; + struct range_tree_node range_node; + + unsigned int purged; + struct address_space *mapping; +}; + + +/* LRU list of volatile page ranges */ +static LIST_HEAD(volatile_lru_list); +static DEFINE_MUTEX(volatile_lru_mutex); + +/* Count of pages on our LRU list */ +static u64 lru_count; + + +/* range helpers */ + +static inline u64 range_size(struct volatile_range *range) +{ + return range->range_node.end - range->range_node.start + 1; +} + + +static inline void lru_add(struct volatile_range *range) +{ + mutex_lock(&volatile_lru_mutex); + list_add_tail(&range->lru, &volatile_lru_list); + lru_count += range_size(range); + mutex_unlock(&volatile_lru_mutex); +} + +static inline void __lru_del(struct volatile_range *range) +{ + list_del(&range->lru); + lru_count -= range_size(range); +} + +static inline void lru_del(struct volatile_range *range) +{ + mutex_lock(&volatile_lru_mutex); + __lru_del(range); + mutex_unlock(&volatile_lru_mutex); +} + +#define range_on_lru(range) (!(range)->purged) + + +static inline void volatile_range_shrink(struct volatile_range *range, + pgoff_t start_index, pgoff_t end_index) +{ + size_t pre = range_size(range); + + range->range_node.start = start_index; + range->range_node.end = end_index; + + if (range_on_lru(range)) { + mutex_lock(&volatile_lru_mutex); + lru_count -= pre - range_size(range); + mutex_unlock(&volatile_lru_mutex); + } +} + +static struct volatile_range *vrange_alloc(void) +{ + struct volatile_range *new; + + new = kzalloc(sizeof(struct volatile_range), GFP_KERNEL); + if (!new) + return 0; + + range_tree_node_init(&new->range_node); + return new; +} + +static void vrange_del(struct volatile_range *vrange) +{ + struct address_space *mapping; + mapping = vrange->mapping; + + mapping->volatile_root = + range_tree_remove(mapping->volatile_root, &vrange->range_node); + if (range_on_lru(vrange)) + lru_del(vrange); + kfree(vrange); +} + + + +/* + * Mark a region as volatile, allowing dirty pages to be purged + * under memory pressure + */ +long mapping_range_volatile(struct address_space *mapping, + pgoff_t start_index, pgoff_t end_index) +{ + struct volatile_range *new; + struct range_tree_node *node; + + u64 start, end; + int purged = 0; + start = (u64)start_index; + end = (u64)end_index; + + new = vrange_alloc(); + if (!new) + return -ENOMEM; + + mutex_lock(&mapping->vlist_mutex); + + node = 
range_tree_in_range_adjacent(mapping->volatile_root, start, end); + while (node) { + struct volatile_range *vrange; + + /* Already entirely marked volatile, so we're done */ + if (node->start < start && node->end > end) { + /* don't need the allocated value */ + kfree(new); + return 0; + } + + /* Grab containing volatile range */ + vrange = container_of(node, struct volatile_range, range_node); + + /* resize range */ + start = min_t(u64, start, node->start); + end = max_t(u64, end, node->end); + purged |= vrange->purged; + + vrange_del(vrange); + + /* get the next possible overlap */ + node = range_tree_in_range(mapping->volatile_root, start, end); + } + + new->mapping = mapping; + new->range_node.start = start; + new->range_node.end = end; + new->purged = purged; + mapping->volatile_root = range_tree_add(mapping->volatile_root, + &new->range_node); + if (range_on_lru(new)) + lru_add(new); + mutex_unlock(&mapping->vlist_mutex); + + return 0; +} + +/* + * Mark a region as nonvolatile, returns 1 if any pages in the region + * were purged. + */ +long mapping_range_nonvolatile(struct address_space *mapping, + pgoff_t start_index, pgoff_t end_index) +{ + struct volatile_range *new; + struct range_tree_node *node; + int ret = 0; + u64 start, end; + start = (u64)start_index; + end = (u64)end_index; + + mutex_lock(&mapping->vlist_mutex); + node = range_tree_in_range(mapping->volatile_root, start, end); + while (node) { + struct volatile_range *vrange; + vrange = container_of(node, struct volatile_range, range_node); + + ret |= vrange->purged; + + if (start <= node->start && end >= node->end) { + vrange_del(vrange); + } else if (node->start >= start) { + volatile_range_shrink(vrange, end+1, node->end); + } else if (node->end <= end) { + volatile_range_shrink(vrange, node->start, start-1); + } else { + /* create new node */ + new = vrange_alloc(); /* XXX ENOMEM HERE? */ + + new->mapping = mapping; + new->range_node.start = end + 1; + new->range_node.end = node->end; + volatile_range_shrink(vrange, node->start, start-1); + mapping->volatile_root = + range_tree_add(mapping->volatile_root, + &new->range_node); + if (range_on_lru(new)) + lru_add(new); + break; + } + node = range_tree_in_range(mapping->volatile_root, start, end); + } + mutex_unlock(&mapping->vlist_mutex); + + return ret; +} + +/* + * Returns if a region has been marked volatile or not. + * Does not return if the region has been purged. + */ +long mapping_range_isvolatile(struct address_space *mapping, + pgoff_t start_index, pgoff_t end_index) +{ + long ret = 0; + u64 start, end; + start = (u64)start_index; + end = (u64)end_index; + + mutex_lock(&mapping->vlist_mutex); + if (range_tree_in_range(mapping->volatile_root, start, end)) + ret = 1; + mutex_unlock(&mapping->vlist_mutex); + return ret; +} + + +/* + * Cleans up any volatile ranges. 
+ */ +void mapping_clear_volatile_ranges(struct address_space *mapping) +{ + struct volatile_range *tozap; + + mutex_lock(&mapping->vlist_mutex); + while (mapping->volatile_root) { + tozap = container_of(mapping->volatile_root, + struct volatile_range, range_node); + vrange_del(tozap); + } + mutex_unlock(&mapping->vlist_mutex); +} + + +static int volatile_shrink(struct shrinker *ignored, struct shrink_control *sc) +{ + struct volatile_range *range, *next; + unsigned long nr_to_scan = sc->nr_to_scan; + const gfp_t gfp_mask = sc->gfp_mask; + + /* We might recurse into filesystem code, so bail out if necessary */ + if (nr_to_scan && !(gfp_mask & __GFP_FS)) + return -1; + if (!nr_to_scan) + return lru_count; + + mutex_lock(&volatile_lru_mutex); + list_for_each_entry_safe(range, next, &volatile_lru_list, lru) { + struct inode *inode = range->mapping->host; + loff_t start, end; + + + start = range->range_node.start * PAGE_SIZE; + end = (range->range_node.end + 1) * PAGE_SIZE - 1; + + /* + * XXX - calling vmtruncate_range from a shrinker causes + * lockdep warnings. Revisit this! + */ + vmtruncate_range(inode, start, end); + range->purged = 1; + __lru_del(range); + + nr_to_scan -= range_size(range); + if (nr_to_scan <= 0) + break; + } + mutex_unlock(&volatile_lru_mutex); + + return lru_count; +} + +static struct shrinker volatile_shrinker = { + .shrink = volatile_shrink, + .seeks = DEFAULT_SEEKS * 4, +}; + + +static int __init volatile_init(void) +{ + register_shrinker(&volatile_shrinker); + return 0; +} + +arch_initcall(volatile_init); -- 1.7.3.2.146.gca209 ^ permalink raw reply related [flat|nested] 31+ messages in thread
* Re: [PATCH 2/2] [RFC] fadvise: Add _VOLATILE,_ISVOLATILE, and _NONVOLATILE flags 2012-02-10 0:16 ` [PATCH 2/2] [RFC] fadvise: Add _VOLATILE,_ISVOLATILE, and _NONVOLATILE flags John Stultz @ 2012-02-12 14:08 ` Dmitry Adamushko 2012-02-17 3:49 ` John Stultz 2012-02-14 5:16 ` Dave Chinner [not found] ` <CAO6Zf6B6nGqsz5zpT3ixbO-+JWxMsScABasnwo-CVHuMKPqpLQ@mail.gmail.com> 2 siblings, 1 reply; 31+ messages in thread From: Dmitry Adamushko @ 2012-02-12 14:08 UTC (permalink / raw) To: John Stultz Cc: linux-kernel, Andrew Morton, Android Kernel Team, Robert Love, Mel Gorman, Hugh Dickins, Dave Hansen, Rik van Riel On 10 February 2012 01:16, John Stultz <john.stultz@linaro.org> wrote: [ ... ] > +/* > + * Mark a region as nonvolatile, returns 1 if any pages in the region > + * were purged. > + */ > +long mapping_range_nonvolatile(struct address_space *mapping, > + pgoff_t start_index, pgoff_t end_index) > +{ > + struct volatile_range *new; > + struct range_tree_node *node; > + int ret = 0; > + u64 start, end; > + start = (u64)start_index; > + end = (u64)end_index; > + > + mutex_lock(&mapping->vlist_mutex); > + node = range_tree_in_range(mapping->volatile_root, start, end); > + while (node) { > + struct volatile_range *vrange; > + vrange = container_of(node, struct volatile_range, range_node); > + > + ret |= vrange->purged; again, racing with volatile_shrink() here, so we can return a stale state. > + > + if (start <= node->start && end >= node->end) { > + vrange_del(vrange); > + } else if (node->start >= start) { > + volatile_range_shrink(vrange, end+1, node->end); > + } else if (node->end <= end) { > + volatile_range_shrink(vrange, node->start, start-1); > + } else { > + /* create new node */ > + new = vrange_alloc(); /* XXX ENOMEM HERE? */ > + > + new->mapping = mapping; > + new->range_node.start = end + 1; > + new->range_node.end = node->end; new->purged = vrange->purged ? > + volatile_range_shrink(vrange, node->start, start-1); > + mapping->volatile_root = > + range_tree_add(mapping->volatile_root, > + &new->range_node); > + if (range_on_lru(new)) > + lru_add(new); > + break; > + } > + node = range_tree_in_range(mapping->volatile_root, start, end); > + } > + mutex_unlock(&mapping->vlist_mutex); > + > + return ret; > +} > + Also, I have a question about mapping_range_volatile(). +long mapping_range_volatile(struct address_space *mapping, + pgoff_t start_index, pgoff_t end_index) +{ + struct volatile_range *new; + struct range_tree_node *node; + + u64 start, end; + int purged = 0; + start = (u64)start_index; + end = (u64)end_index; + + new = vrange_alloc(); + if (!new) + return -ENOMEM; + + mutex_lock(&mapping->vlist_mutex); + + node = range_tree_in_range_adjacent(mapping->volatile_root, start, end); + while (node) { + struct volatile_range *vrange; + + /* Already entirely marked volatile, so we're done */ + if (node->start < start && node->end > end) { + /* don't need the allocated value */ + kfree(new); + return 0; + } + + /* Grab containing volatile range */ + vrange = container_of(node, struct volatile_range, range_node); + + /* resize range */ + start = min_t(u64, start, node->start); + end = max_t(u64, end, node->end); + purged |= vrange->purged; + + vrange_del(vrange); + + /* get the next possible overlap */ + node = range_tree_in_range(mapping->volatile_root, start, end); + } + + new->mapping = mapping; + new->range_node.start = start; + new->range_node.end = end; + new->purged = purged; I'm wondering whether this 'inheritance' is always desirable. 
Say, mapping_range_volatile(mapping, X, X + 1); ... time goes by and volatile_shrink() has been called for this region. now, a user does the following (is it considered bad user-behavior?) mapping_range_volatile(mapping, Y = X - big_value, Z = X + big_value); This new range will 'inherit' purged=1 from the old one and won't be on the lru_list. Yet, it's much bigger than the old one and so many pages are not really 'volatile'. -- Dmitry ^ permalink raw reply [flat|nested] 31+ messages in thread
* Re: [PATCH 2/2] [RFC] fadvise: Add _VOLATILE,_ISVOLATILE, and _NONVOLATILE flags 2012-02-12 14:08 ` Dmitry Adamushko @ 2012-02-17 3:49 ` John Stultz 0 siblings, 0 replies; 31+ messages in thread From: John Stultz @ 2012-02-17 3:49 UTC (permalink / raw) To: Dmitry Adamushko Cc: linux-kernel, Andrew Morton, Android Kernel Team, Robert Love, Mel Gorman, Hugh Dickins, Dave Hansen, Rik van Riel On Sun, 2012-02-12 at 15:08 +0100, Dmitry Adamushko wrote: > On 10 February 2012 01:16, John Stultz <john.stultz@linaro.org> wrote: > Also, I have a question about mapping_range_volatile(). [snip] > + new->mapping = mapping; > + new->range_node.start = start; > + new->range_node.end = end; > + new->purged = purged; > > I'm wondering whether this 'inheritance' is always desirable. > > Say, > > mapping_range_volatile(mapping, X, X + 1); > ... > time goes by and volatile_shrink() has been called for this region. > > now, a user does the following (is it considered bad user-behavior?) > > mapping_range_volatile(mapping, Y = X - big_value, Z = X + big_value); > > This new range will 'inherit' purged=1 from the old one and won't be > on the lru_list. Yet, it's much bigger than the old one and so many > pages are not really 'volatile'. Yea, I think this is an interesting point, and we can probably be a little smarter than what is done here. We could only coalesce ranges that haven't been purged, for instance. Although the coalescing of neighboring ranges is in and of itself sort of questionable: if they were marked volatile independently, they may be marked nonvolatile independently as well, so merging them together mucks up the lru ordering. Robert/Brian: Is there strong rationale for the coalescing of neighboring ranges in ashmem? thanks -john ^ permalink raw reply [flat|nested] 31+ messages in thread
* Re: [PATCH 2/2] [RFC] fadvise: Add _VOLATILE,_ISVOLATILE, and _NONVOLATILE flags 2012-02-10 0:16 ` [PATCH 2/2] [RFC] fadvise: Add _VOLATILE,_ISVOLATILE, and _NONVOLATILE flags John Stultz 2012-02-12 14:08 ` Dmitry Adamushko @ 2012-02-14 5:16 ` Dave Chinner 2012-02-14 5:55 ` John Stultz [not found] ` <CAO6Zf6B6nGqsz5zpT3ixbO-+JWxMsScABasnwo-CVHuMKPqpLQ@mail.gmail.com> 2 siblings, 1 reply; 31+ messages in thread From: Dave Chinner @ 2012-02-14 5:16 UTC (permalink / raw) To: John Stultz Cc: linux-kernel, Andrew Morton, Android Kernel Team, Robert Love, Mel Gorman, Hugh Dickins, Dave Hansen, Rik van Riel On Thu, Feb 09, 2012 at 04:16:33PM -0800, John Stultz wrote: > This patch provides new fadvise flags that can be used to mark > file pages as volatile, which will allow it to be discarded if the > kernel wants to reclaim memory. > > This is useful for userspace to allocate things like caches, and lets > the kernel destructively (but safely) reclaim them when there's memory > pressure. ..... > @@ -655,6 +656,8 @@ struct address_space { > spinlock_t private_lock; /* for use by the address_space */ > struct list_head private_list; /* ditto */ > struct address_space *assoc_mapping; /* ditto */ > + struct range_tree_node *volatile_root; /* volatile range list */ > + struct mutex vlist_mutex; /* protect volatile_list */ > } __attribute__((aligned(sizeof(long)))); So you're adding roughly 32 bytes to every cached inode in the system? This will increase the memory footprint of the inode cache by 2-5% (depending on the filesystem). Almost no-one will be using this functionality on most inodes that are cached in the system, so that seems like a pretty bad trade-off to me... > +static int volatile_shrink(struct shrinker *ignored, struct shrink_control *sc) > +{ > + struct volatile_range *range, *next; > + unsigned long nr_to_scan = sc->nr_to_scan; > + const gfp_t gfp_mask = sc->gfp_mask; > + > + /* We might recurse into filesystem code, so bail out if necessary */ > + if (nr_to_scan && !(gfp_mask & __GFP_FS)) > + return -1; > + if (!nr_to_scan) > + return lru_count; > + > + mutex_lock(&volatile_lru_mutex); > + list_for_each_entry_safe(range, next, &volatile_lru_list, lru) { > + struct inode *inode = range->mapping->host; > + loff_t start, end; > + > + > + start = range->range_node.start * PAGE_SIZE; > + end = (range->range_node.end + 1) * PAGE_SIZE - 1; > + > + /* > + * XXX - calling vmtruncate_range from a shrinker causes > + * lockdep warnings. Revisit this! > + */ > + vmtruncate_range(inode, start, end); That function vmtruncate_range, I don't think it does what you think it does. Firstly, it's only implemented for shmfs/tmpfs, so this can't have been tested for caching files on any real filesystem. If it's only for shm/tmpfs, then the applications can just as easily use their own memory for caching their volatile data... Secondly, vmtruncate_range() is actually a hole punching function, not a page cache invalidation function. You should be using invalidate_inode_pages2_range() to invalidate and tear down the page cache. If you really want to punch holes in files, then you should be using the fallocate syscall with direct application control, not trying to hide it until memory pressure occurs via fadvise, because hole punching requires memory for the transactions necessary to run extent freeing operations. Cheers, Dave. -- Dave Chinner david@fromorbit.com ^ permalink raw reply [flat|nested] 31+ messages in thread
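For reference, the explicit, application-controlled hole punching Dave mentions looks like this from user space. This is a sketch of the existing fallocate(2) interface, not something introduced by the patch set, and it assumes a filesystem that implements FALLOC_FL_PUNCH_HOLE; punch_range is an illustrative helper name.

#define _GNU_SOURCE
#include <fcntl.h>
#include <linux/falloc.h>

/* Give back [offset, offset+len) of an open file: the backing blocks are
 * freed and later reads of the range return zeroes.  KEEP_SIZE is required
 * together with PUNCH_HOLE so the file length is unchanged. */
static int punch_range(int fd, off_t offset, off_t len)
{
	return fallocate(fd, FALLOC_FL_PUNCH_HOLE | FALLOC_FL_KEEP_SIZE,
			 offset, len);
}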
* Re: [PATCH 2/2] [RFC] fadvise: Add _VOLATILE,_ISVOLATILE, and _NONVOLATILE flags 2012-02-14 5:16 ` Dave Chinner @ 2012-02-14 5:55 ` John Stultz 2012-02-14 23:51 ` Dave Chinner 0 siblings, 1 reply; 31+ messages in thread From: John Stultz @ 2012-02-14 5:55 UTC (permalink / raw) To: Dave Chinner Cc: linux-kernel, Andrew Morton, Android Kernel Team, Robert Love, Mel Gorman, Hugh Dickins, Dave Hansen, Rik van Riel On Tue, 2012-02-14 at 16:16 +1100, Dave Chinner wrote: > On Thu, Feb 09, 2012 at 04:16:33PM -0800, John Stultz wrote: > > This patch provides new fadvise flags that can be used to mark > > file pages as volatile, which will allow it to be discarded if the > > kernel wants to reclaim memory. > > > > This is useful for userspace to allocate things like caches, and lets > > the kernel destructively (but safely) reclaim them when there's memory > > pressure. > ..... > > @@ -655,6 +656,8 @@ struct address_space { > > spinlock_t private_lock; /* for use by the address_space */ > > struct list_head private_list; /* ditto */ > > struct address_space *assoc_mapping; /* ditto */ > > + struct range_tree_node *volatile_root; /* volatile range list */ > > + struct mutex vlist_mutex; /* protect volatile_list */ > > } __attribute__((aligned(sizeof(long)))); > > So you're adding roughly 32 bytes to every cached inode in the > system? This will increasing the memory footprint of the inode cache > by 2-5% (depending on the filesystem). Almost no-one will be using > this functionality on most inodes that are cached in the system, so > that seems like a pretty bad trade-off to me... Yea. Bloating the address_space is a concern I'm aware of, but for the initial passes I left it to see where folks would rather I keep it. Pushing the mutex into a range_tree_root structure or something could cut this down, but I still suspect it won't be loved. Another idea would be to manage the mapping -> range tree separately via something like a hash. Do you have any preferences or suggestions here? > > +static int volatile_shrink(struct shrinker *ignored, struct shrink_control *sc) > > +{ > > + struct volatile_range *range, *next; > > + unsigned long nr_to_scan = sc->nr_to_scan; > > + const gfp_t gfp_mask = sc->gfp_mask; > > + > > + /* We might recurse into filesystem code, so bail out if necessary */ > > + if (nr_to_scan && !(gfp_mask & __GFP_FS)) > > + return -1; > > + if (!nr_to_scan) > > + return lru_count; > > + > > + mutex_lock(&volatile_lru_mutex); > > + list_for_each_entry_safe(range, next, &volatile_lru_list, lru) { > > + struct inode *inode = range->mapping->host; > > + loff_t start, end; > > + > > + > > + start = range->range_node.start * PAGE_SIZE; > > + end = (range->range_node.end + 1) * PAGE_SIZE - 1; > > + > > + /* > > + * XXX - calling vmtruncate_range from a shrinker causes > > + * lockdep warnings. Revisit this! > > + */ > > + vmtruncate_range(inode, start, end); > > That function vmtruncate_range, I don't think it does what you think > it does. > > Firstly, it's only implemented for shmfs/tmpfs, so this can't have > been tested for caching files on any real filesystem. If it's only > for shm/tmpfs, then the applications cwcan just as easily use their > own memory for caching their volatile data... Yep you're right, this started as being shm only, and has only been tested on tmpfs mounts. In this verison, I had left the shm checks off so that it could be possibly more generic, but I admittedly haven't thought that through enough. 
> Secondly, vmtruncate_range() is actually a hole punching function, > not a page cache invalidation function. You should be using > invalidate_inode_pages2_range() to invalidate and tear down the page > cache. If you really want to punch holes in files, then you should > be using the fallocate syscall with direct application control, not > trying to hide it until memory pressure occurs via fadvise because > hole punching requires memory for the transactions necessary to run > extent freeing operations. Thanks for the tip on invalidate_inode_pages2_range()! I'll look it over and rework the patch using that. Thanks so much for the review! -john ^ permalink raw reply [flat|nested] 31+ messages in thread
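As an aside, the "manage the mapping -> range tree separately" idea mentioned above could look roughly like the sketch below: a small global hash keyed by the address_space pointer, so struct address_space itself does not grow. The names are illustrative, and range_tree_node is the structure introduced by patch 1/2.

#include <linux/fs.h>
#include <linux/hash.h>
#include <linux/list.h>
#include <linux/mutex.h>
#include <linux/rangetree.h>	/* from patch 1/2 */

#define VOLATILE_HASH_BITS	6

struct volatile_mapping {
	struct hlist_node	node;
	struct address_space	*mapping;
	struct range_tree_node	*volatile_root;
};

static struct hlist_head volatile_hash[1 << VOLATILE_HASH_BITS];
static DEFINE_MUTEX(volatile_hash_lock);	/* protects the table */

/* Caller holds volatile_hash_lock */
static struct volatile_mapping *volatile_lookup(struct address_space *mapping)
{
	struct hlist_head *head;
	struct volatile_mapping *vm;
	struct hlist_node *pos;

	head = &volatile_hash[hash_ptr(mapping, VOLATILE_HASH_BITS)];
	hlist_for_each_entry(vm, pos, head, node)
		if (vm->mapping == mapping)
			return vm;
	return NULL;
}

The cost is an extra lookup per fadvise call plus a hook at inode eviction to drop any leftover entry, in exchange for keeping the common case (inodes with no volatile ranges) free of per-inode overhead.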
* Re: [PATCH 2/2] [RFC] fadvise: Add _VOLATILE,_ISVOLATILE, and _NONVOLATILE flags 2012-02-14 5:55 ` John Stultz @ 2012-02-14 23:51 ` Dave Chinner 2012-02-15 0:29 ` John Stultz 0 siblings, 1 reply; 31+ messages in thread From: Dave Chinner @ 2012-02-14 23:51 UTC (permalink / raw) To: John Stultz Cc: linux-kernel, Andrew Morton, Android Kernel Team, Robert Love, Mel Gorman, Hugh Dickins, Dave Hansen, Rik van Riel On Mon, Feb 13, 2012 at 09:55:32PM -0800, John Stultz wrote: > On Tue, 2012-02-14 at 16:16 +1100, Dave Chinner wrote: > > On Thu, Feb 09, 2012 at 04:16:33PM -0800, John Stultz wrote: > > > This patch provides new fadvise flags that can be used to mark > > > file pages as volatile, which will allow it to be discarded if the > > > kernel wants to reclaim memory. > > > > > > This is useful for userspace to allocate things like caches, and lets > > > the kernel destructively (but safely) reclaim them when there's memory > > > pressure. > > ..... > > > @@ -655,6 +656,8 @@ struct address_space { > > > spinlock_t private_lock; /* for use by the address_space */ > > > struct list_head private_list; /* ditto */ > > > struct address_space *assoc_mapping; /* ditto */ > > > + struct range_tree_node *volatile_root; /* volatile range list */ > > > + struct mutex vlist_mutex; /* protect volatile_list */ > > > } __attribute__((aligned(sizeof(long)))); > > > > So you're adding roughly 32 bytes to every cached inode in the > > system? This will increasing the memory footprint of the inode cache > > by 2-5% (depending on the filesystem). Almost no-one will be using > > this functionality on most inodes that are cached in the system, so > > that seems like a pretty bad trade-off to me... > > Yea. Bloating the address_space is a concern I'm aware of, but for the > initial passes I left it to see where folks would rather I keep it. > Pushing the mutex into a range_tree_root structure or something could > cut this down, but I still suspect it won't be loved. Another idea would > be to manage the mapping -> range tree separately via something like a > hash. Do you have any preferences or suggestions here? Given that it is a single state bit per page (volatile/non volatile) you could just use a radix tree tag for keeping the state. Changing the state isn't a performance critical operation, and tagging large ranges isn't that expensive (e.g. we do that in the writeback code), so I'm not sure the overhead of a separate tree is necessary here.... That doesn't help with the reclaim side of things, but I would have thought that such functioanlity would be better integrated into the VM page cache/lru scanning code than adding a shrinker to shrink the page cache additionally on top of what the VM has already done before calling the shrinkers. I'm not sure what is best here, though... Cheers, Dave. -- Dave Chinner david@fromorbit.com ^ permalink raw reply [flat|nested] 31+ messages in thread
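For illustration, a rough sketch of the radix-tree-tag approach suggested here. It assumes a spare page cache tag slot exists (the radix tree currently defines only three tags, so one would have to be added) and it only marks pages that are actually resident.

#include <linux/fs.h>
#include <linux/pagemap.h>
#include <linux/pagevec.h>
#include <linux/radix-tree.h>

#define PAGECACHE_TAG_VOLATILE	3	/* hypothetical extra tag */

static void tag_range_volatile(struct address_space *mapping,
			       pgoff_t start, pgoff_t end)
{
	struct pagevec pvec;
	pgoff_t index = start;
	int i;

	pagevec_init(&pvec, 0);
	while (index <= end) {
		int nr = pagevec_lookup(&pvec, mapping, index, PAGEVEC_SIZE);

		if (!nr)
			break;
		spin_lock_irq(&mapping->tree_lock);
		for (i = 0; i < nr; i++) {
			struct page *page = pvec.pages[i];

			if (page->index > end)
				break;
			index = page->index + 1;
			radix_tree_tag_set(&mapping->page_tree, page->index,
					   PAGECACHE_TAG_VOLATILE);
		}
		spin_unlock_irq(&mapping->tree_lock);
		pagevec_release(&pvec);
		if (i < nr)	/* walked past the end of the range */
			break;
	}
}

The batching and tree_lock usage mirror what the writeback code does when it tags ranges, which is what makes tagging large ranges reasonably cheap.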
* Re: [PATCH 2/2] [RFC] fadvise: Add _VOLATILE,_ISVOLATILE, and _NONVOLATILE flags 2012-02-14 23:51 ` Dave Chinner @ 2012-02-15 0:29 ` John Stultz 2012-02-15 1:37 ` NeilBrown 0 siblings, 1 reply; 31+ messages in thread From: John Stultz @ 2012-02-15 0:29 UTC (permalink / raw) To: Dave Chinner Cc: linux-kernel, Andrew Morton, Android Kernel Team, Robert Love, Mel Gorman, Hugh Dickins, Dave Hansen, Rik van Riel On Wed, 2012-02-15 at 10:51 +1100, Dave Chinner wrote: > Given that it is a single state bit per page (volatile/non volatile) > you could just use a radix tree tag for keeping the state. Changing > the state isn't a performance critical operation, and tagging large > ranges isn't that expensive (e.g. we do that in the writeback code), > so I'm not sure the overhead of a separate tree is necessary here.... Hrm. I'll look into this. > That doesn't help with the reclaim side of things, but I would have > thought that such functioanlity would be better integrated into the > VM page cache/lru scanning code than adding a shrinker to shrink the > page cache additionally on top of what the VM has already done > before calling the shrinkers. I'm not sure what is best here, > though... Yea. My previous version did eviction from shmem_writepage(), I believe much as you suggest here, but the concern with that is that you could have a larger volatile range that was for the most part recently in use, but one idle page causes the entire thing to be evicted first. Using least-recently-marked-volatile order seems more natural for the use case (although Dimitry and others have already pointed out that the inheritance from the coalescing of neighboring ranges results in a similar issue). But I'm open to other ideas and arguments. Thanks again for the feedback! -john ^ permalink raw reply [flat|nested] 31+ messages in thread
* Re: [PATCH 2/2] [RFC] fadvise: Add _VOLATILE,_ISVOLATILE, and _NONVOLATILE flags 2012-02-15 0:29 ` John Stultz @ 2012-02-15 1:37 ` NeilBrown 2012-02-17 4:45 ` Dave Chinner 2012-02-17 5:21 ` John Stultz 0 siblings, 2 replies; 31+ messages in thread From: NeilBrown @ 2012-02-15 1:37 UTC (permalink / raw) To: John Stultz Cc: Dave Chinner, linux-kernel, Andrew Morton, Android Kernel Team, Robert Love, Mel Gorman, Hugh Dickins, Dave Hansen, Rik van Riel [-- Attachment #1: Type: text/plain, Size: 3336 bytes --] On Tue, 14 Feb 2012 16:29:10 -0800 John Stultz <john.stultz@linaro.org> wrote: > But I'm open to other ideas and arguments. I didn't notice the original patch, but found it at https://lwn.net/Articles/468837/ and had a look. My first comment is -ENODOC. A bit background always helps, so let me try to construct that: The goal is to allow applications to interact with the kernel's cache management infrastructure. In particular an application can say "this memory contains data that might be useful in the future, but can be reconstructed if necessary, and it is cheaper to reconstruct it than to read it back from disk, so don't bother writing it out". The proposed mechanism - at a high level - is for user-space to be able to say "This memory is volatile" and then later "this memory is no longer volatile". If the content of the memory is still available the second request succeeds. If not, it fails.. Well, actually it succeeds but reports that some content has been lost. (not sure what happens then - can the app do a binary search to find which pages it still has or something). (technically we should probably include the cost to reconstruct the page, which the kernel measures as 'seeks' but maybe that isn't necessary). This is implemented by using files in a 'tmpfs' filesystem. These file support three new flags to fadvise: POSIX_FADV_VOLATILE - this marks a range of pages as 'volatile'. They may be removed from the page cache as needed, even if they are not 'clean'. POSIX_FADV_NONVOLATILE - this marks a range of pages as non-volatile. If any pages in the range were previously volatile but have since been removed, then a status is returned reporting this. POSIX_FADV_ISVOLATILE - this does not actually give any advice to the kernel but rather asks a question: Are any of these pages volatile? Is this an accurate description? My first thoughts are: 1/ is page granularity really needed? Would file granularity be sufficient? 2/ POSIX_FADV_ISVOLATILE is a warning sign to me - it doesn't actually provide advice. Is this really needed? What for? Because it feels like a wrong interface. 3/ Given that this is specific to one filesystem, is fadvise really an appropriate interface? (fleshing out the above documentation might be an excellent way to answer these questions). As a counter-point, this is my first thought of an implementation approach (-ENOPATCH, sorry) - new mount option for tmpfs e.g. 'volatile'. Any file in a filesystem mounted with that option and which is not currently open by any process can have blocks removed at any time. The file name must remain, and the file size must not change. - lseek can be used to determine if anything has been purged with 'SEEK_DATA' and 'SEEK_HOLE'. So you can only mark volatility on a whole-file granularity (hence the question above). 'open' says "NONVOLATILE". 'close' says "VOLATILE". 'lseek' is used to check if anything was discarded. This approach would allow multiple processes to share a cache (might this be valuable?) 
as it doesn't become truly volatile until all processes close their handles. Just a few thoughts ... take or discard as you choose. NeilBrown [-- Attachment #2: signature.asc --] [-- Type: application/pgp-signature, Size: 828 bytes --] ^ permalink raw reply [flat|nested] 31+ messages in thread
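To flesh out the lseek part of this proposal, the application-side check could be as small as the sketch below, assuming SEEK_HOLE/SEEK_DATA support and a cache file that was fully written before it was closed (so any hole before EOF means something was purged).

#define _GNU_SOURCE
#include <stdbool.h>
#include <sys/stat.h>
#include <unistd.h>

static bool cache_file_intact(int fd)
{
	struct stat st;
	off_t hole;

	if (fstat(fd, &st))
		return false;
	hole = lseek(fd, 0, SEEK_HOLE);
	if (hole < 0)
		return false;		/* can't tell; assume purged */
	return hole >= st.st_size;	/* first hole is at EOF: intact */
}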
* Re: [PATCH 2/2] [RFC] fadvise: Add _VOLATILE,_ISVOLATILE, and _NONVOLATILE flags 2012-02-15 1:37 ` NeilBrown @ 2012-02-17 4:45 ` Dave Chinner 2012-02-17 5:27 ` NeilBrown 2012-02-17 5:38 ` John Stultz 2012-02-17 5:21 ` John Stultz 1 sibling, 2 replies; 31+ messages in thread From: Dave Chinner @ 2012-02-17 4:45 UTC (permalink / raw) To: NeilBrown Cc: John Stultz, linux-kernel, Andrew Morton, Android Kernel Team, Robert Love, Mel Gorman, Hugh Dickins, Dave Hansen, Rik van Riel On Wed, Feb 15, 2012 at 12:37:50PM +1100, NeilBrown wrote: > On Tue, 14 Feb 2012 16:29:10 -0800 John Stultz <john.stultz@linaro.org> wrote: > > > But I'm open to other ideas and arguments. > > I didn't notice the original patch, but found it at > https://lwn.net/Articles/468837/ > and had a look. > > My first comment is -ENODOC. A bit background always helps, so let me try to > construct that: > > The goal is to allow applications to interact with the kernel's cache > management infrastructure. In particular an application can say "this > memory contains data that might be useful in the future, but can be > reconstructed if necessary, and it is cheaper to reconstruct it than to read > it back from disk, so don't bother writing it out". > > The proposed mechanism - at a high level - is for user-space to be able to > say "This memory is volatile" and then later "this memory is no longer > volatile". If the content of the memory is still available the second > request succeeds. If not, it fails.. Well, actually it succeeds but reports > that some content has been lost. (not sure what happens then - can the app do > a binary search to find which pages it still has or something). > > (technically we should probably include the cost to reconstruct the page, > which the kernel measures as 'seeks' but maybe that isn't necessary). > > This is implemented by using files in a 'tmpfs' filesystem. These file > support three new flags to fadvise: > > POSIX_FADV_VOLATILE - this marks a range of pages as 'volatile'. They may be > removed from the page cache as needed, even if they are not 'clean'. > POSIX_FADV_NONVOLATILE - this marks a range of pages as non-volatile. > If any pages in the range were previously volatile but have since been > removed, then a status is returned reporting this. > POSIX_FADV_ISVOLATILE - this does not actually give any advice to the kernel > but rather asks a question: Are any of these pages volatile? What about for files that aren't on tmpfs? the fadvise() interface is not tmpfs specific, and given that everyone is talking about volatility of page cache pages, I fail to see what is tmpfs specific about this proposal. So what are the semantics that are supposed to apply to a file that is on a filesystem with stable storage that is cached in the page cache? If this is tmpfs specific behaviour that is required, then IMO fadvise is not the correct interface to use here because fadvise is supposed to be a generic interface to controlling the page cache behaviour on any given file.... > As a counter-point, this is my first thought of an implementation approach > (-ENOPATCH, sorry) > > - new mount option for tmpfs e.g. 'volatile'. Any file in a filesystem > mounted with that option and which is not currently open by any process can > have blocks removed at any time. The file name must remain, and the file > size must not change. > - lseek can be used to determine if anything has been purged with 'SEEK_DATA' > and 'SEEK_HOLE'. 
> > So you can only mark volatility on a whole-file granularity (hence the > question above). > 'open' says "NONVOLATILE". > 'close' says "VOLATILE". > 'lseek' is used to check if anything was discarded. > > This approach would allow multiple processes to share a cache (might this be > valuable?) as it doesn't become truly volatile until all processes close > their handles. If this functionality is only useful for tmpfs, then I'd much prefer a tmpfs specific approach like this.... Cheers, Dave. -- Dave Chinner david@fromorbit.com ^ permalink raw reply [flat|nested] 31+ messages in thread
* Re: [PATCH 2/2] [RFC] fadvise: Add _VOLATILE,_ISVOLATILE, and _NONVOLATILE flags 2012-02-17 4:45 ` Dave Chinner @ 2012-02-17 5:27 ` NeilBrown 2012-02-17 5:38 ` John Stultz 1 sibling, 0 replies; 31+ messages in thread From: NeilBrown @ 2012-02-17 5:27 UTC (permalink / raw) To: Dave Chinner Cc: John Stultz, linux-kernel, Andrew Morton, Android Kernel Team, Robert Love, Mel Gorman, Hugh Dickins, Dave Hansen, Rik van Riel [-- Attachment #1: Type: text/plain, Size: 3465 bytes --] On Fri, 17 Feb 2012 15:45:57 +1100 Dave Chinner <david@fromorbit.com> wrote: > On Wed, Feb 15, 2012 at 12:37:50PM +1100, NeilBrown wrote: > > On Tue, 14 Feb 2012 16:29:10 -0800 John Stultz <john.stultz@linaro.org> wrote: > > > > > But I'm open to other ideas and arguments. > > > > I didn't notice the original patch, but found it at > > https://lwn.net/Articles/468837/ > > and had a look. > > > > My first comment is -ENODOC. A bit background always helps, so let me try to > > construct that: > > > > The goal is to allow applications to interact with the kernel's cache > > management infrastructure. In particular an application can say "this > > memory contains data that might be useful in the future, but can be > > reconstructed if necessary, and it is cheaper to reconstruct it than to read > > it back from disk, so don't bother writing it out". > > > > The proposed mechanism - at a high level - is for user-space to be able to > > say "This memory is volatile" and then later "this memory is no longer > > volatile". If the content of the memory is still available the second > > request succeeds. If not, it fails.. Well, actually it succeeds but reports > > that some content has been lost. (not sure what happens then - can the app do > > a binary search to find which pages it still has or something). > > > > (technically we should probably include the cost to reconstruct the page, > > which the kernel measures as 'seeks' but maybe that isn't necessary). > > > > This is implemented by using files in a 'tmpfs' filesystem. These file > > support three new flags to fadvise: > > > > POSIX_FADV_VOLATILE - this marks a range of pages as 'volatile'. They may be > > removed from the page cache as needed, even if they are not 'clean'. > > POSIX_FADV_NONVOLATILE - this marks a range of pages as non-volatile. > > If any pages in the range were previously volatile but have since been > > removed, then a status is returned reporting this. > > POSIX_FADV_ISVOLATILE - this does not actually give any advice to the kernel > > but rather asks a question: Are any of these pages volatile? > > What about for files that aren't on tmpfs? the fadvise() interface > is not tmpfs specific, and given that everyone is talking about > volatility of page cache pages, I fail to see what is tmpfs specific > about this proposal. It seems I was looking at an earlier version of the patch which only seemed to affect tmpfs file. I see now that the latest version can affect all filesystems. > > So what are the semantics that are supposed to apply to a file that > is on a filesystem with stable storage that is cached in the page > cache? This is my question too. Does this make any sense at all for a storage-backed filesystem? If I understand the current code (which is by no means certain), then there is nothing concrete that stops volatile pages from being written back to storage. Whether they are or not would be the result of a race between the 'volatile_shrinker' purging them, and the VM cleaning them. 
Given that the volatile_shrinker sets 'seeks = DEFAULT_SEEKS * 4', I would guess that the VM would get to the pages before the shrinker, but that is mostly just a guess. Is this really what we want? Certainly having this clarified in Documentation/volatile.txt would help a lot :-) Thanks, NeilBrown [-- Attachment #2: signature.asc --] [-- Type: application/pgp-signature, Size: 828 bytes --] ^ permalink raw reply [flat|nested] 31+ messages in thread

* Re: [PATCH 2/2] [RFC] fadvise: Add _VOLATILE,_ISVOLATILE, and _NONVOLATILE flags 2012-02-17 4:45 ` Dave Chinner 2012-02-17 5:27 ` NeilBrown @ 2012-02-17 5:38 ` John Stultz 1 sibling, 0 replies; 31+ messages in thread From: John Stultz @ 2012-02-17 5:38 UTC (permalink / raw) To: Dave Chinner Cc: NeilBrown, linux-kernel, Andrew Morton, Android Kernel Team, Robert Love, Mel Gorman, Hugh Dickins, Dave Hansen, Rik van Riel On Fri, 2012-02-17 at 15:45 +1100, Dave Chinner wrote: > On Wed, Feb 15, 2012 at 12:37:50PM +1100, NeilBrown wrote: > > On Tue, 14 Feb 2012 16:29:10 -0800 John Stultz <john.stultz@linaro.org> wrote: > > > > > But I'm open to other ideas and arguments. > > > > I didn't notice the original patch, but found it at > > https://lwn.net/Articles/468837/ > > and had a look. > > > > My first comment is -ENODOC. A bit background always helps, so let me try to > > construct that: > > > > The goal is to allow applications to interact with the kernel's cache > > management infrastructure. In particular an application can say "this > > memory contains data that might be useful in the future, but can be > > reconstructed if necessary, and it is cheaper to reconstruct it than to read > > it back from disk, so don't bother writing it out". > > > > The proposed mechanism - at a high level - is for user-space to be able to > > say "This memory is volatile" and then later "this memory is no longer > > volatile". If the content of the memory is still available the second > > request succeeds. If not, it fails.. Well, actually it succeeds but reports > > that some content has been lost. (not sure what happens then - can the app do > > a binary search to find which pages it still has or something). > > > > (technically we should probably include the cost to reconstruct the page, > > which the kernel measures as 'seeks' but maybe that isn't necessary). > > > > This is implemented by using files in a 'tmpfs' filesystem. These file > > support three new flags to fadvise: > > > > POSIX_FADV_VOLATILE - this marks a range of pages as 'volatile'. They may be > > removed from the page cache as needed, even if they are not 'clean'. > > POSIX_FADV_NONVOLATILE - this marks a range of pages as non-volatile. > > If any pages in the range were previously volatile but have since been > > removed, then a status is returned reporting this. > > POSIX_FADV_ISVOLATILE - this does not actually give any advice to the kernel > > but rather asks a question: Are any of these pages volatile? > > What about for files that aren't on tmpfs? the fadvise() interface > is not tmpfs specific, and given that everyone is talking about > volatility of page cache pages, I fail to see what is tmpfs specific > about this proposal. > > So what are the semantics that are supposed to apply to a file that > is on a filesystem with stable storage that is cached in the page > cache? Indeed, this is probably the most awkward case. So currently, we use vmtruncate_range, which should punch a hole in the file. If I switch to invalidate_inode_pages2_range(), then I think dirty data is dropped and the backed page remains (I'm currently reading over that now). > If this is tmpfs specific behaviour that is required, then IMO > fadvise is not the correct interface to use here because fadvise is > supposed to be a generic interface to controlling the page cache > behaviour on any given file.... > > > As a counter-point, this is my first thought of an implementation approach > > (-ENOPATCH, sorry) > > > > - new mount option for tmpfs e.g. 'volatile'. 
Any file in a filesystem > > mounted with that option and which is not currently open by any process can > > have blocks removed at any time. The file name must remain, and the file > > size must not change. > > - lseek can be used to determine if anything has been purged with 'SEEK_DATA' > > and 'SEEK_HOLE'. > > > > So you can only mark volatility on a whole-file granularity (hence the > > question above). > > 'open' says "NONVOLATILE". > > 'close' says "VOLATILE". > > 'lseek' is used to check if anything was discarded. > > > > This approach would allow multiple processes to share a cache (might this be > > valuable?) as it doesn't become truly volatile until all processes close > > their handles. > > If this functionality is only useful for tmpfs, then I'd much prefer > a tmpfs specific approach like this.... Since, as I think more on this, this seems to map closer to file hole punching, would fallocate be the right interface? FALLOC_FL_PUNCH_HOLE isn't supported by all filesystems, after all. Maybe FALLOC_FL_VOLATILE and FALLOC_FL_NONVOLATILE? thanks -john ^ permalink raw reply [flat|nested] 31+ messages in thread
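Purely to illustrate what that interface might look like from userspace: the sketch below is hypothetical, since FALLOC_FL_VOLATILE and FALLOC_FL_NONVOLATILE do not exist in the kernel; the flag values and the return convention are placeholders.

#define _GNU_SOURCE
#include <fcntl.h>

#ifndef FALLOC_FL_VOLATILE
#define FALLOC_FL_VOLATILE	0x100	/* hypothetical */
#define FALLOC_FL_NONVOLATILE	0x200	/* hypothetical */
#endif

static int mark_volatile(int fd, off_t off, off_t len)
{
	return fallocate(fd, FALLOC_FL_VOLATILE, off, len);
}

static int mark_nonvolatile(int fd, off_t off, off_t len)
{
	/* could return 1 if part of the range had been purged, 0 if intact */
	return fallocate(fd, FALLOC_FL_NONVOLATILE, off, len);
}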
* Re: [PATCH 2/2] [RFC] fadvise: Add _VOLATILE,_ISVOLATILE, and _NONVOLATILE flags 2012-02-15 1:37 ` NeilBrown 2012-02-17 4:45 ` Dave Chinner @ 2012-02-17 5:21 ` John Stultz 2012-02-20 7:34 ` NeilBrown 1 sibling, 1 reply; 31+ messages in thread From: John Stultz @ 2012-02-17 5:21 UTC (permalink / raw) To: NeilBrown Cc: Dave Chinner, linux-kernel, Andrew Morton, Android Kernel Team, Robert Love, Mel Gorman, Hugh Dickins, Dave Hansen, Rik van Riel On Wed, 2012-02-15 at 12:37 +1100, NeilBrown wrote: > On Tue, 14 Feb 2012 16:29:10 -0800 John Stultz <john.stultz@linaro.org> wrote: > > > But I'm open to other ideas and arguments. > > I didn't notice the original patch, but found it at > https://lwn.net/Articles/468837/ > and had a look. > > My first comment is -ENODOC. A bit background always helps, so let me try to > construct that: Apologies for not providing better documentation, and thanks for your first pass below. > The goal is to allow applications to interact with the kernel's cache > management infrastructure. In particular an application can say "this > memory contains data that might be useful in the future, but can be > reconstructed if necessary, and it is cheaper to reconstruct it than to read > it back from disk, so don't bother writing it out". Or alternatively for tmpfs/ramfs, "this data can be reconstructed, so purge it and free up memory". But as it currently stands, being fs agnostic, for disk backed filesystems "don't bother writing it out" is correct as well. > The proposed mechanism - at a high level - is for user-space to be able to > say "This memory is volatile" and then later "this memory is no longer > volatile". If the content of the memory is still available the second > request succeeds. If not, it fails.. Well, actually it succeeds but reports > that some content has been lost. (not sure what happens then - can the app do > a binary search to find which pages it still has or something). The app should expect all was lost in that range. > (technically we should probably include the cost to reconstruct the page, > which the kernel measures as 'seeks' but maybe that isn't necessary). Not sure I'm following this. > This is implemented by using files in a 'tmpfs' filesystem. These file > support three new flags to fadvise: > > POSIX_FADV_VOLATILE - this marks a range of pages as 'volatile'. They may be > removed from the page cache as needed, even if they are not 'clean'. > POSIX_FADV_NONVOLATILE - this marks a range of pages as non-volatile. > If any pages in the range were previously volatile but have since been > removed, then a status is returned reporting this. > POSIX_FADV_ISVOLATILE - this does not actually give any advice to the kernel > but rather asks a question: Are any of these pages volatile? > > > Is this an accurate description? Right now its not tmpfs specific, but otherwise this is pretty spot on. > My first thoughts are: > 1/ is page granularity really needed? Would file granularity be sufficient? The current users of similar functionality via ashmem do seem to find page granularity useful. You can share basically an unlinked tmpfs fd between two applications and mark and unmark ranges of pages "volatile" (unpinned in ashmem terms) as needed. > 2/ POSIX_FADV_ISVOLATILE is a warning sign to me - it doesn't actually > provide advice. Is this really needed? What for? Because it feels like > a wrong interface. It is more awkward, I agree. 
And the more I think about it, it seems like its something we can drop, as it is likely only useful as a probe before using a page, and using the POSIX_FADV_NONVOLAILE on the range to be used would also provide the same behavior. So I'll drop it in the next revision. > 3/ Given that this is specific to one filesystem, is fadvise really an > appropriate interface? > > (fleshing out the above documentation might be an excellent way to answer > these questions). So, the ashmem implementation is really tmpfs specific, but there's also the expectation on android devices that there isn't swap, so its more like ramfs. I'd like to think that this behavior makes some sense on other filesystems, providing a way to cheaply throw out dirty data without the cost of hitting the disk. However, the next time the file is opened, that could cause some really strange inconsistent results, with some recent pages written out and some stale pages. The vmtruncate would punch a hole instead of leaving stale data, but that still would have to hit the disk so its not free. So I'm not really sure if it makes sense in a totally generic way. That said, it would be easy for now to return errors if the fs isn't shmem based. Really, I'm not married to any specific interface here. fadvise just seemed the most logical to me. Given page granularity is needed, what would be a filesystem specific interface that makes sense here? > As a counter-point, this is my first thought of an implementation approach > (-ENOPATCH, sorry) > > - new mount option for tmpfs e.g. 'volatile'. Any file in a filesystem > mounted with that option and which is not currently open by any process can > have blocks removed at any time. The file name must remain, and the file > size must not change. > - lseek can be used to determine if anything has been purged with 'SEEK_DATA' > and 'SEEK_HOLE'. > > So you can only mark volatility on a whole-file granularity (hence the > question above). > 'open' says "NONVOLATILE". > 'close' says "VOLATILE". > 'lseek' is used to check if anything was discarded. > > This approach would allow multiple processes to share a cache (might this be > valuable?) as it doesn't become truly volatile until all processes close > their handles. I do really appreciate the feedback, but I don't think the full file semantics described here would work for what are essentially existing users of ashmem. thanks -john ^ permalink raw reply [flat|nested] 31+ messages in thread
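For background, the ashmem-style usage model described above looks roughly like the userspace sketch below: an unlinked tmpfs object whose fd can be shared with another process, with ranges unpinned while idle and re-pinned before use. The flag values are placeholders for whatever the patch's fadvise.h defines, and the "positive return means purged" convention is an assumption (a libc wrapper may hide the raw return value, in which case the syscall would be needed directly).

#define _GNU_SOURCE
#include <fcntl.h>
#include <sys/mman.h>
#include <unistd.h>

#ifndef POSIX_FADV_VOLATILE
#define POSIX_FADV_VOLATILE	8	/* placeholder value */
#define POSIX_FADV_NONVOLATILE	9	/* placeholder value */
#endif

#define OBJ_SIZE	(64 * 4096)

int main(void)
{
	/* unlinked tmpfs object, shareable with another process via the fd */
	int fd = shm_open("/cache-obj", O_CREAT | O_RDWR, 0600);

	shm_unlink("/cache-obj");
	ftruncate(fd, OBJ_SIZE);

	/* ... generate the cached data (write() it or mmap() it in) ... */

	/* idle: volunteer the range for reclaim */
	posix_fadvise(fd, 0, OBJ_SIZE, POSIX_FADV_VOLATILE);

	/* before reuse: pin the range again and check whether it survived;
	 * a positive return meaning "was purged" is assumed here */
	if (posix_fadvise(fd, 0, OBJ_SIZE, POSIX_FADV_NONVOLATILE) > 0) {
		/* contents were discarded: regenerate the object */
	}

	return 0;
}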
* Re: [PATCH 2/2] [RFC] fadvise: Add _VOLATILE,_ISVOLATILE, and _NONVOLATILE flags 2012-02-17 5:21 ` John Stultz @ 2012-02-20 7:34 ` NeilBrown 2012-02-20 23:25 ` Dave Hansen 0 siblings, 1 reply; 31+ messages in thread From: NeilBrown @ 2012-02-20 7:34 UTC (permalink / raw) To: John Stultz Cc: Dave Chinner, linux-kernel, Andrew Morton, Android Kernel Team, Robert Love, Mel Gorman, Hugh Dickins, Dave Hansen, Rik van Riel [-- Attachment #1: Type: text/plain, Size: 9279 bytes --] Hi John, thanks for your answers.... > > > The proposed mechanism - at a high level - is for user-space to be able to > > say "This memory is volatile" and then later "this memory is no longer > > volatile". If the content of the memory is still available the second > > request succeeds. If not, it fails.. Well, actually it succeeds but reports > > that some content has been lost. (not sure what happens then - can the app do > > a binary search to find which pages it still has or something). > > The app should expect all was lost in that range. So... the app has some idea of the real granularity of the cache, which is several objects in one file, and marks them volatile as a whole - then marks them non-volatile as a whole and if that fails it assumes that the whole object is gone. However the kernel doesn't really have any idea of real granularity and so just removes individual pages until it has freed up enough. It could have just corrupted a much bigger object and so the rest of the object is of no value and may as well be freed, but it has no way to know this, so frees something else instead. Is this a problem? If the typical granularity is a page or two then it is unlikely to hurt. If it is hundreds of pages I think it would mean that we don't make as good use of memory as we could (but it is all heuristics anyway and we probably waste lots of opportunities already so maybe it doesn't matter). My gut feeling is that seeing the app has concrete knowledge about granularity it should give it to the kernel somehow. > > > (technically we should probably include the cost to reconstruct the page, > > which the kernel measures as 'seeks' but maybe that isn't necessary). > > Not sure I'm following this. The shrinker in your code (and the original ashmem) contains: .seeks = DEFAULT_SEEKS * 4, This means that objects in this cache are 4 times as expensive to replace as most other caches. (the cost of replacing an entry in the cache is measured in 'seeks' and the default is to assume that it takes 2 seeks to reload and object). I don't really know what the practical importance of 'seeks' is. Maybe it is close to meaningless, in which case you should probably use DEFAULT_SEEKS like (almost) everyone else. Maybe it is quite relevant, in which case maybe you should expose that setting to user-space somehow. Or maybe 'DEFAULT_SEEKS * 4' is perfect for all possible users of this caching mechanism. I guess my point is that any non-default value should be justified. > > > This is implemented by using files in a 'tmpfs' filesystem. These file > > support three new flags to fadvise: > > > > POSIX_FADV_VOLATILE - this marks a range of pages as 'volatile'. They may be > > removed from the page cache as needed, even if they are not 'clean'. > > POSIX_FADV_NONVOLATILE - this marks a range of pages as non-volatile. > > If any pages in the range were previously volatile but have since been > > removed, then a status is returned reporting this. 
> > POSIX_FADV_ISVOLATILE - this does not actually give any advice to the kernel > > but rather asks a question: Are any of these pages volatile? > > > > > > Is this an accurate description? > > Right now its not tmpfs specific, but otherwise this is pretty spot on. > > > My first thoughts are: > > 1/ is page granularity really needed? Would file granularity be sufficient? > > The current users of similar functionality via ashmem do seem to find > page granularity useful. You can share basically an unlinked tmpfs fd > between two applications and mark and unmark ranges of pages > "volatile" (unpinned in ashmem terms) as needed. Sharing an unlinked cache between processes certainly seems like a valid case that my model doesn't cover. I feel uncomfortable about different processes being able to unpin each other's pages. It means they need to negotiate with each other to ensure one doesn't unpin a page that the other is using. If this was a common use case, it would make a lot of sense for the kernel to refcount the pinning so that a range only becomes really unpinned when no-one has it pinned any more. Do you know any more about these apps that share a cache file? Do they need extra inter-locking (or are they completely hypothetical?). > > > 2/ POSIX_FADV_ISVOLATILE is a warning sign to me - it doesn't actually > > provide advice. Is this really needed? What for? Because it feels like > > a wrong interface. > > It is more awkward, I agree. And the more I think about it, it seems > like its something we can drop, as it is likely only useful as a probe > before using a page, and using the POSIX_FADV_NONVOLAILE on the range to > be used would also provide the same behavior. So I'll drop it in the > next revision. Good. That makes me feel happier. > > > 3/ Given that this is specific to one filesystem, is fadvise really an > > appropriate interface? > > > > (fleshing out the above documentation might be an excellent way to answer > > these questions). > > So, the ashmem implementation is really tmpfs specific, but there's also > the expectation on android devices that there isn't swap, so its more > like ramfs. I'd like to think that this behavior makes some sense on > other filesystems, providing a way to cheaply throw out dirty data > without the cost of hitting the disk. However, the next time the file is > opened, that could cause some really strange inconsistent results, with > some recent pages written out and some stale pages. The vmtruncate would > punch a hole instead of leaving stale data, but that still would have to > hit the disk so its not free. So I'm not really sure if it makes sense > in a totally generic way. That said, it would be easy for now to return > errors if the fs isn't shmem based. As I think I said somewhere, I cannot see how the functionality makes any sense at all on a storage-backed filesystem - and what you have said about inconsistent on-disk images only re-enforces that. I think it should definitely be ramfs only (maybe tmpfs as well??). > > Really, I'm not married to any specific interface here. fadvise just > seemed the most logical to me. Given page granularity is needed, what > would be a filesystem specific interface that makes sense here? OK, let me try again. This looks to me a bit like byte-range locking. locking can already have a filesystem-specific implementation so this could be implemented as a ramfs-specific locking protocol. This would be activated by some mount option (or it could even be a different filesystem type - ramcachefs). 
1- a shared lock (F_RDLCK) pins the range in memory and prevents an exclusive lock, or any purge of pages. 2- an exclusive lock (F_WRLCK) is used to create or re-create an object in the cache. 3- when pages are purged a lock-range is created which marks the range as purged and prevents any read lock from succeeding. This lock-range is removed when a write-lock is taken out. So initially all pages are marked by an internal 'purged' lock indicating that they contain nothing. Objects can be created by creating a write lock and writing data. Then unlocking (or down grading to a read lock) allows them to be accessed by other processes. Any process that wants to read an object first asks for a shared lock. If this succeeds they are sure that the pages are still available (and that no-one has an exclusive lock). If the shared lock fails then at least one page doesn't exist - probably all are gone. They can then optionally try to get a write lock. Once they get that they can revalidate somehow, or refill the object. When the last lock is removed, the locking code could keep the range information but mark it as unlocked and put it on an lru list. So 4 sorts of ranges are defined and they cover the entire file: shared locks - these might overlap exclusive locks - these don't overlap purge locks - mark ranges that have been purged or never written pending locks - mark all remaining ranges. When a shared or exclusive lock is released it becomes a pending lock. When the shrinker fires it converts some number of pending locks to purge locks and discards the pages wholly in them. A shared lock can only be taken when there is a shared or pending lock there. An exclusive lock can be taken when a purge or pending lock is present. For the most part this doesn't conflict with the more normal usage of byte range locks. However it does mean that a process cannot place a range in a state where some other process is allowed to write, but the kernel is not allowed to purge the pages. I cannot tell if this might be a problem. (it could probably be managed by some convention where locking the first byte in an object gives read/write permission and locking the rest keeps it in cache. One byte by itself will never be purged). I'm not sure what should happen if you write without first getting a write lock. I guess it should turn a purge lock to a pending lock, but leave an other range unchanged. NeilBrown [-- Attachment #2: signature.asc --] [-- Type: application/pgp-signature, Size: 828 bytes --] ^ permalink raw reply [flat|nested] 31+ messages in thread
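To make the client side of this concrete, a hedged sketch against such a hypothetical 'ramcachefs' is below; no such filesystem exists, and F_RDLCK/F_WRLCK here simply stand in for the 'pin' and '(re)create' operations described above.

#include <fcntl.h>
#include <stdbool.h>
#include <unistd.h>

/* Pin an object: succeeds only if the range is still resident and not
 * being rewritten; failure means the caller must regenerate it. */
static bool cache_pin(int fd, off_t start, off_t len)
{
	struct flock fl = {
		.l_type   = F_RDLCK,
		.l_whence = SEEK_SET,
		.l_start  = start,
		.l_len    = len,
	};

	return fcntl(fd, F_SETLK, &fl) == 0;
}

/* (Re)create an object: take the exclusive lock, fill the range, then
 * downgrade to F_RDLCK or unlock to let it become purgeable again. */
static int cache_begin_fill(int fd, off_t start, off_t len)
{
	struct flock fl = {
		.l_type   = F_WRLCK,
		.l_whence = SEEK_SET,
		.l_start  = start,
		.l_len    = len,
	};

	return fcntl(fd, F_SETLKW, &fl);
}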
* Re: [PATCH 2/2] [RFC] fadvise: Add _VOLATILE,_ISVOLATILE, and _NONVOLATILE flags 2012-02-20 7:34 ` NeilBrown @ 2012-02-20 23:25 ` Dave Hansen 0 siblings, 0 replies; 31+ messages in thread From: Dave Hansen @ 2012-02-20 23:25 UTC (permalink / raw) To: NeilBrown Cc: John Stultz, Dave Chinner, linux-kernel, Andrew Morton, Android Kernel Team, Robert Love, Mel Gorman, Hugh Dickins, Rik van Riel On 02/19/2012 11:34 PM, NeilBrown wrote: > Is this a problem? If the typical granularity is a page or two then it is > unlikely to hurt. If it is hundreds of pages I think it would mean that we > don't make as good use of memory as we could (but it is all heuristics anyway > and we probably waste lots of opportunities already so maybe it doesn't > matter). > > My gut feeling is that seeing the app has concrete knowledge about > granularity it should give it to the kernel somehow. I dunno... The kernel works well with an awful lot of applications without ever having any concept of the underlying objects. I guess the worst-case scenario would be if we have a very large object, most of its pages being accessed very often, but one never accessed. The one page makes its way to the end of the LRU and gets discarded, now the whole object is worthless, and most of it is at the beginning of the LRU. If it is truly accessed often, then userspace should notice quickly and discard the entire object quickly. If it isn't accessed often, then the majority of the object is moving down the LRU and won't be wasted for long. What else can we do? I guess whenever a range is set NONVOLATILE, we could SetPageReferenced() on every page in the range to keep them bunched up on the LRU. ^ permalink raw reply [flat|nested] 31+ messages in thread
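A minimal sketch of that SetPageReferenced() idea, for illustration only (the function name is made up, and a pagevec walk would be cheaper for large ranges):

#include <linux/mm.h>
#include <linux/page-flags.h>
#include <linux/pagemap.h>

static void mark_range_referenced(struct address_space *mapping,
				  pgoff_t start, pgoff_t end)
{
	pgoff_t index;

	for (index = start; index <= end; index++) {
		struct page *page = find_get_page(mapping, index);

		if (!page)
			continue;
		SetPageReferenced(page);	/* so reclaim treats it as recently used */
		page_cache_release(page);
	}
}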
* Re: [PATCH 2/2] [RFC] fadvise: Add _VOLATILE,_ISVOLATILE, and _NONVOLATILE flags [not found] ` <CAO6Zf6B6nGqsz5zpT3ixbO-+JWxMsScABasnwo-CVHuMKPqpLQ@mail.gmail.com> @ 2012-02-17 3:43 ` John Stultz 2012-02-17 5:24 ` John Stultz 0 siblings, 1 reply; 31+ messages in thread From: John Stultz @ 2012-02-17 3:43 UTC (permalink / raw) To: Dmitry Adamushko Cc: linux-kernel, Andrew Morton, Android Kernel Team, Robert Love, Mel Gorman, Hugh Dickins, Dave Hansen, Rik van Riel On Sun, 2012-02-12 at 13:48 +0100, Dmitry Adamushko wrote: > > On 10 February 2012 01:16, John Stultz <john.stultz@linaro.org> wrote: > +static inline void volatile_range_shrink(struct > volatile_range *range, > + pgoff_t start_index, pgoff_t > end_index) > +{ > + size_t pre = range_size(range); > + > + range->range_node.start = start_index; > + range->range_node.end = end_index; > + > > I guess, here we get a whole range of races with volatile_shrink(), > which may see inconsistent (in-the-middle-of-update) ranges > (e.g. .start and .end). We should be holding the vlist_mutex to avoid any such races. But you also make clear that volatile_range_shrink() should really be called volatile_range_resize(), since having two _shrink calls is terrible. My apologies. > + unsigned long nr_to_scan = sc->nr_to_scan; > + const gfp_t gfp_mask = sc->gfp_mask; > + > + /* We might recurse into filesystem code, so bail out > if necessary */ > + if (nr_to_scan && !(gfp_mask & __GFP_FS)) > + return -1; > + if (!nr_to_scan) > + return lru_count; > > So it's u64 -> int here, which is possibly 32 bits and signed. Can't > it lead to inconsistent results on 32bit platforms? Good point. Thanks for pointing that out. > + start = range->range_node.start * PAGE_SIZE; > + end = (range->range_node.end + 1) * PAGE_SIZE > - 1; > > PAGE_CACHE_SHIFT was used in fadvise() to calculate .start and .end > indexes, and here we use PAGE_SIZE to get back to 'normal' addresses. > Isn't it inconsistent at the very least? Fair enough. > > + nr_to_scan -= range_size(range); > > hmm, unsigned long -= u64 > > + if (nr_to_scan <= 0) > > nr_to_scan is "unsigned long" :-)) Good catch. Thanks for the feedback! -john ^ permalink raw reply [flat|nested] 31+ messages in thread
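The index-to-offset conversion in the shrinker would then presumably become something like the following, using PAGE_CACHE_SHIFT in both directions to match what mm/fadvise.c used when the range was recorded:

	start = (loff_t)range->range_node.start << PAGE_CACHE_SHIFT;
	end = ((loff_t)(range->range_node.end + 1) << PAGE_CACHE_SHIFT) - 1;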
* Re: [PATCH 2/2] [RFC] fadvise: Add _VOLATILE,_ISVOLATILE, and _NONVOLATILE flags 2012-02-17 3:43 ` John Stultz @ 2012-02-17 5:24 ` John Stultz 0 siblings, 0 replies; 31+ messages in thread From: John Stultz @ 2012-02-17 5:24 UTC (permalink / raw) To: Dmitry Adamushko Cc: linux-kernel, Andrew Morton, Android Kernel Team, Robert Love, Mel Gorman, Hugh Dickins, Dave Hansen, Rik van Riel On Thu, 2012-02-16 at 19:43 -0800, John Stultz wrote: > On Sun, 2012-02-12 at 13:48 +0100, Dmitry Adamushko wrote: > > > > On 10 February 2012 01:16, John Stultz <john.stultz@linaro.org> wrote: > > > +static inline void volatile_range_shrink(struct > > volatile_range *range, > > + pgoff_t start_index, pgoff_t > > end_index) > > +{ > > + size_t pre = range_size(range); > > + > > + range->range_node.start = start_index; > > + range->range_node.end = end_index; > > + > > > > I guess, here we get a whole range of races with volatile_shrink(), > > which may see inconsistent (in-the-middle-of-update) ranges > > (e.g. .start and .end). > > We should be holding the vlist_mutex to avoid any such races. But you > also make clear that volatile_range_shrink() should really be called > volatile_range_resize(), since having two _shrink calls is terrible. My > apologies. And sure enough in the shrinker we're not holding the vlist_mutex. Thanks for pointing that out. -john ^ permalink raw reply [flat|nested] 31+ messages in thread