From: John Stultz <john.stultz@linaro.org>
To: Dave Chinner <david@fromorbit.com>
Cc: LKML <linux-kernel@vger.kernel.org>,
Andrew Morton <akpm@linux-foundation.org>,
Android Kernel Team <kernel-team@android.com>,
Robert Love <rlove@google.com>, Mel Gorman <mel@csn.ul.ie>,
Hugh Dickins <hughd@google.com>,
Dave Hansen <dave@linux.vnet.ibm.com>,
Rik van Riel <riel@redhat.com>,
Dmitry Adamushko <dmitry.adamushko@gmail.com>,
Neil Brown <neilb@suse.de>, Andrea Righi <andrea@betterlinux.com>,
"Aneesh Kumar K.V" <aneesh.kumar@linux.vnet.ibm.com>
Subject: Re: [PATCH 2/3] fadvise: Add _VOLATILE,_ISVOLATILE, and _NONVOLATILE flags
Date: Mon, 30 Apr 2012 12:40:13 -0700 [thread overview]
Message-ID: <4F9EEA9D.8020909@linaro.org> (raw)
In-Reply-To: <20120428020444.GK9541@dastard>
On 04/27/2012 07:04 PM, Dave Chinner wrote:
> On Fri, Apr 27, 2012 at 12:14:18PM -0700, John Stultz wrote:
>> On 04/26/2012 05:39 PM, Dave Chinner wrote:
>>> On Tue, Apr 24, 2012 at 10:49:46AM -0700, John Stultz wrote:
>>>> @@ -128,6 +129,19 @@ SYSCALL_DEFINE(fadvise64_64)(int fd, loff_t offset, loff_t len, int advice)
>>>> invalidate_mapping_pages(mapping, start_index,
>>>> end_index);
>>>> break;
>>>> + case POSIX_FADV_VOLATILE:
>>>> + /* First and last PARTIAL page! */
>>>> + start_index = offset>> PAGE_CACHE_SHIFT;
>>>> + end_index = endbyte>> PAGE_CACHE_SHIFT;
>>>> + ret = mapping_range_volatile(mapping, start_index, end_index);
>>>> + break;
>>>> + case POSIX_FADV_NONVOLATILE:
>>>> + /* First and last PARTIAL page! */
>>>> + start_index = offset>> PAGE_CACHE_SHIFT;
>>>> + end_index = endbyte>> PAGE_CACHE_SHIFT;
>>>> + ret = mapping_range_nonvolatile(mapping, start_index,
>>>> + end_index);
>>> As it is, I'm still not sold on these being an fadvise() interface
>>> because all it really is a delayed hole punching interface whose
>>> functionailty is currently specific to tmpfs. The behaviour cannot
>>> be implemented sanely by anything else at this point.
>> Yea. So I spent some time looking at the various hole punching
>> mechanisms and they aren't all together consistent across
>> filesystems. For instance, on some filesystems (ext4 and mostly disk
>> backed fs) you have to use fallocate(fd,
>> |FALLOC_FL_PUNCH_HOLE,...)|, while on tmpfs, its
>> madvise(...,MADV_REMOVE). So in a way, currently, the
>> FADVISE_VOLATILE is closer to a delayed MADVISE_REMOVE.
> The MADVISE_REMOVE functionality for hole punching works *only* for
> tmpfs - no other filesystem implements the .truncate_range() method.
> In fact, several filesystems *can't* implement .truncate_range()
> because there is no callout from the page cache truncation code to
> allow filesystems to punch out the underlying blocks. The
> vmtruncate() code is deprecated for this reason (and various others
> like a lack of error handling), and .truncate_range() is just as
> nasty. .truncate_range() needs to die, IMO.
>
> So, rather than building more infrastructure on a nasty, filesystem
> specific mmap() hack, implement .fallocate() on tmpfs and use the
> same interface that every other filesystem uses for punching holes.
Ah. Ok. I wasn't aware that vmtruncate was deprecated. Thanks for
cluing me in here!
>>> This probably won't perform wonderfully, which is where the range
>>> tracking and delayed punching (and the implied memory freeing)
>>> optimiation comes into play. Sure, for tmpfs this can be implemented
>>> as a shrinker, but for real filesystems that have to punch blocks a
>>> shrinker is really the wrong context to be running such
>>> transactions. However, using the fallocate() interface allows each
>>> filesytsem to optimise the delayed hole punching as they see best,
>>> something that cannot be done with this fadvise() interface.
>> So if a shrinker isn't the right context, what would be a good
>> context for delayed hole punching?
> Like we in XFs for inode reclaim. We have a background workqueue
> that frees aged inodes periodically in the fastest manner possible
> (i.e. all async, no blocking on locks, etc), and the shrinker, when
> run kicks that background thread first, and then enters into
> synchronous reclaim. By the time a single sync reclaim cycle is run
> and throttled reclaim sufficiently, the background thread has done a
> great deal more work.
>
> A similar mechanism can be used for this functionality within XFS.
> Indeed, we could efficiently track which inodes have volatile ranges
> on them via a bit in the radix trees than index the inode cache,
> just like we do for reclaimable inodes. If we then used a bit in the
> page cache radix tree index to indicate volatile pages, we could
> then easily find the ranges we need to punch out without requiring
> some new tree and more per-inode memory.
>
> That's a very filesystem specific implementation - it's vastly
> different to you tmpfs implementation - but this is exactly what I
> mean about using fallocate to allow filesystems to optimise the
> implementation in the most suitable manner for them....
>
So, just to make sure I'm folloiwng you, you're suggesting that there
would be a filesystem specific implementation at the top level.
Something like a mark_volatile(struct inode *, bool, loff_t, loff_t)
inode operation? And the filesystem would then be responsible for
managing the ranges and appropriately purging them?
Thanks again for the feedback, I'll continue looking into this.
thanks
-john
next prev parent reply other threads:[~2012-04-30 19:41 UTC|newest]
Thread overview: 28+ messages / expand[flat|nested] mbox.gz Atom feed top
2012-04-24 17:49 [PATCH 0/3] Volatile Ranges John Stultz
2012-04-24 17:49 ` [PATCH 1/3] Range tree implementation John Stultz
2012-04-24 19:14 ` Peter Zijlstra
2012-04-24 19:25 ` John Stultz
2012-04-24 19:33 ` Peter Zijlstra
2012-04-25 12:16 ` Dmitry Adamushko
2012-04-25 16:19 ` John Stultz
2012-04-26 10:00 ` Dmitry Adamushko
2012-04-27 19:34 ` John Stultz
2012-04-24 17:49 ` [PATCH 2/3] fadvise: Add _VOLATILE,_ISVOLATILE, and _NONVOLATILE flags John Stultz
2012-04-24 19:20 ` Peter Zijlstra
2012-04-24 19:50 ` John Stultz
2012-04-27 0:39 ` Dave Chinner
2012-04-27 15:25 ` Dave Hansen
2012-04-28 1:36 ` Dave Chinner
2012-04-30 21:07 ` John Stultz
2012-05-01 0:08 ` Dave Chinner
2012-05-01 0:46 ` John Stultz
2012-05-01 1:28 ` Dave Chinner
2012-04-27 19:14 ` John Stultz
2012-04-28 2:04 ` Dave Chinner
2012-04-30 19:40 ` John Stultz [this message]
2012-05-01 0:28 ` Dave Chinner
2012-05-01 1:15 ` John Stultz
2012-05-01 1:51 ` Dave Chinner
2012-04-24 17:49 ` [PATCH 3/3] [RFC] ashmem: Convert ashmem to use volatile ranges John Stultz
2012-04-24 19:21 ` Peter Zijlstra
2012-04-24 19:42 ` John Stultz
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=4F9EEA9D.8020909@linaro.org \
--to=john.stultz@linaro.org \
--cc=akpm@linux-foundation.org \
--cc=andrea@betterlinux.com \
--cc=aneesh.kumar@linux.vnet.ibm.com \
--cc=dave@linux.vnet.ibm.com \
--cc=david@fromorbit.com \
--cc=dmitry.adamushko@gmail.com \
--cc=hughd@google.com \
--cc=kernel-team@android.com \
--cc=linux-kernel@vger.kernel.org \
--cc=mel@csn.ul.ie \
--cc=neilb@suse.de \
--cc=riel@redhat.com \
--cc=rlove@google.com \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).