From: John Stultz <john.stultz@linaro.org>
To: Dave Chinner <david@fromorbit.com>
Cc: LKML <linux-kernel@vger.kernel.org>,
	Andrew Morton <akpm@linux-foundation.org>,
	Android Kernel Team <kernel-team@android.com>,
	Robert Love <rlove@google.com>, Mel Gorman <mel@csn.ul.ie>,
	Hugh Dickins <hughd@google.com>,
	Dave Hansen <dave@linux.vnet.ibm.com>,
	Rik van Riel <riel@redhat.com>,
	Dmitry Adamushko <dmitry.adamushko@gmail.com>,
	Neil Brown <neilb@suse.de>, Andrea Righi <andrea@betterlinux.com>,
	"Aneesh Kumar K.V" <aneesh.kumar@linux.vnet.ibm.com>
Subject: Re: [PATCH 2/3] fadvise: Add _VOLATILE,_ISVOLATILE, and _NONVOLATILE flags
Date: Mon, 30 Apr 2012 18:15:44 -0700	[thread overview]
Message-ID: <4F9F3940.9000509@linaro.org> (raw)
In-Reply-To: <20120501002842.GR7015@dastard>
On 04/30/2012 05:28 PM, Dave Chinner wrote:
> On Mon, Apr 30, 2012 at 12:40:13PM -0700, John Stultz wrote:
>> On 04/27/2012 07:04 PM, Dave Chinner wrote:
>>> On Fri, Apr 27, 2012 at 12:14:18PM -0700, John Stultz wrote:
>>>> On 04/26/2012 05:39 PM, Dave Chinner wrote:
>>>>> This probably won't perform wonderfully, which is where the range
>>>>> tracking and delayed punching (and the implied memory freeing)
>>>>> optimiation comes into play. Sure, for tmpfs this can be implemented
>>>>> as a shrinker, but for real filesystems that have to punch blocks a
>>>>> shrinker is really the wrong context to be running such
>>>>> transactions. However, using the fallocate() interface allows each
>>>>> filesytsem to optimise the delayed hole punching as they see best,
>>>>> something that cannot be done with this fadvise() interface.
>>>> So if a shrinker isn't the right context, what would be a good
>>>> context for delayed hole punching?
>>> Like we in XFs for inode reclaim. We have a background workqueue
>>> that frees aged inodes periodically in the fastest manner possible
>>> (i.e. all async, no blocking on locks, etc), and the shrinker, when
>>> run kicks that background thread first, and then enters into
>>> synchronous reclaim. By the time a single sync reclaim cycle is run
>>> and throttled reclaim sufficiently, the background thread has done a
>>> great deal more work.
>>>
>>> A similar mechanism can be used for this functionality within XFS.
>>> Indeed, we could efficiently track which inodes have volatile ranges
>>> on them via a bit in the radix trees than index the inode cache,
>>> just like we do for reclaimable inodes. If we then used a bit in the
>>> page cache radix tree index to indicate volatile pages, we could
>>> then easily find the ranges we need to punch out without requiring
>>> some new tree and more per-inode memory.
>>>
>>> That's a very filesystem specific implementation - it's vastly
>>> different to you tmpfs implementation - but this is exactly what I
>>> mean about using fallocate to allow filesystems to optimise the
>>> implementation in the most suitable manner for them....
>>>
>> So, just to make sure I'm folloiwng you, you're suggesting that
>> there would be a filesystem specific implementation at the top
>> level. Something like a  mark_volatile(struct inode *, bool, loff_t,
>> loff_t) inode operation? And the filesystem would then be
>> responsible for managing the ranges and appropriately purging them?
> Not quite. I'm suggesting that you use the .fallocate() file
> operation to call into the filesystem specific code, and from there
> the filesystem code either calls a generic helper function to mark
> ranges as volatile and provides a callback for implementing the
> shrinker functionailty, or it implements it all itself.
>
> i.e. userspace would do:
>
> 	err = fallocate(fd, FALLOC_FL_MARK_VOLATILE, off, len);
> 	err = fallocate(fd, FALLOC_FL_CLEAR_VOLATILE, off, len);
>
> and that will get passed to the filesystem implementation of
> .fallocate (from do_fallocate()). The filesystem callout for this:
>
> 0 btrfs/file.c   1898 .fallocate = btrfs_fallocate,
> 1 ext4/file.c     247 .fallocate = ext4_fallocate,
> 2 gfs2/file.c    1015 .fallocate = gfs2_fallocate,
> 3 gfs2/file.c    1045 .fallocate = gfs2_fallocate,
> 4 ocfs2/file.c   2727 .fallocate = ocfs2_fallocate,
> 5 ocfs2/file.c   2774 .fallocate = ocfs2_fallocate,
> 6 xfs/xfs_file.c 1026 .fallocate = xfs_file_fallocate,
Ok.
Although noting that tmpfs doesn't implement fallocate, this isn't just 
a ruse to get me to implement fallocate for tmpfs, right? ;)
Out of curiosity: While for my uses, a tmpfs exclusive implementation 
isn't a problem for me,  I *really* appreciate your feedback and help 
focusing this into a more generic fs solution. But to get more context 
for your insights, I'm curious if you have any use cases in your mind 
that is directing this work to be more generic?
Again, you've been helpfully skeptical of the work, and also at the same 
time pushing it in a certain direction, and I want to make sure I better 
understand where you're coming from.  You've clarified the file security 
concern well enough, but is that all?  I just want to make sure that 
going the generic fallocate route is really worth it, as opposed to 
moving to madvise() and just failing file-data backed memory.
> can then call a generic helper like, say:
>
> 	filemap_mark_volatile_range(inode, off, len);
> 	filemap_clear_volatile_range(inode, off, len);
>
> to be able to use the range tree tracking you have written for this
> purpose. The filesystem is also free to track ranges however it
> pleases.
>
> The filesystem will need to be able to store a tree/list root for
> tracking all it's inodes that have volatile ranges, and register a
> shrinker to walk that list and do the work necessary when memory
> becomes low, but that is simple to do for a basic implementation.
>
Hrmm..  Currently I'm using a per-mapping range-tree along with a global 
LRU list that ties all the ranges together.
We could go for a per-filesystem filesystem controlled LRU as you 
suggest.  Although that along with the filesystem managed range-tree 
roots  would make the generic helpers a little complex. But it could 
probably be worked out.
I'll let you know when I have a first pass implementation or if I hit 
any walls.
Once again, thanks so much for all the feedback and guidance!
-john
next prev parent reply	other threads:[~2012-05-01  1:17 UTC|newest]
Thread overview: 28+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2012-04-24 17:49 [PATCH 0/3] Volatile Ranges John Stultz
2012-04-24 17:49 ` [PATCH 1/3] Range tree implementation John Stultz
2012-04-24 19:14   ` Peter Zijlstra
2012-04-24 19:25     ` John Stultz
2012-04-24 19:33       ` Peter Zijlstra
2012-04-25 12:16   ` Dmitry Adamushko
2012-04-25 16:19     ` John Stultz
2012-04-26 10:00       ` Dmitry Adamushko
2012-04-27 19:34         ` John Stultz
2012-04-24 17:49 ` [PATCH 2/3] fadvise: Add _VOLATILE,_ISVOLATILE, and _NONVOLATILE flags John Stultz
2012-04-24 19:20   ` Peter Zijlstra
2012-04-24 19:50     ` John Stultz
2012-04-27  0:39   ` Dave Chinner
2012-04-27 15:25     ` Dave Hansen
2012-04-28  1:36       ` Dave Chinner
2012-04-30 21:07         ` John Stultz
2012-05-01  0:08           ` Dave Chinner
2012-05-01  0:46             ` John Stultz
2012-05-01  1:28               ` Dave Chinner
2012-04-27 19:14     ` John Stultz
2012-04-28  2:04       ` Dave Chinner
2012-04-30 19:40         ` John Stultz
2012-05-01  0:28           ` Dave Chinner
2012-05-01  1:15             ` John Stultz [this message]
2012-05-01  1:51               ` Dave Chinner
2012-04-24 17:49 ` [PATCH 3/3] [RFC] ashmem: Convert ashmem to use volatile ranges John Stultz
2012-04-24 19:21   ` Peter Zijlstra
2012-04-24 19:42     ` John Stultz
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox
  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):
  git send-email \
    --in-reply-to=4F9F3940.9000509@linaro.org \
    --to=john.stultz@linaro.org \
    --cc=akpm@linux-foundation.org \
    --cc=andrea@betterlinux.com \
    --cc=aneesh.kumar@linux.vnet.ibm.com \
    --cc=dave@linux.vnet.ibm.com \
    --cc=david@fromorbit.com \
    --cc=dmitry.adamushko@gmail.com \
    --cc=hughd@google.com \
    --cc=kernel-team@android.com \
    --cc=linux-kernel@vger.kernel.org \
    --cc=mel@csn.ul.ie \
    --cc=neilb@suse.de \
    --cc=riel@redhat.com \
    --cc=rlove@google.com \
    /path/to/YOUR_REPLY
  https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
  Be sure your reply has a Subject: header at the top and a blank line
  before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).