linux-kernel.vger.kernel.org archive mirror
From: John Stultz <john.stultz@linaro.org>
To: Dave Chinner <david@fromorbit.com>
Cc: LKML <linux-kernel@vger.kernel.org>,
	Andrew Morton <akpm@linux-foundation.org>,
	Android Kernel Team <kernel-team@android.com>,
	Robert Love <rlove@google.com>, Mel Gorman <mel@csn.ul.ie>,
	Hugh Dickins <hughd@google.com>,
	Dave Hansen <dave@linux.vnet.ibm.com>,
	Rik van Riel <riel@redhat.com>,
	Dmitry Adamushko <dmitry.adamushko@gmail.com>,
	Neil Brown <neilb@suse.de>, Andrea Righi <andrea@betterlinux.com>,
	"Aneesh Kumar K.V" <aneesh.kumar@linux.vnet.ibm.com>
Subject: Re: [PATCH 2/3] fadvise: Add _VOLATILE,_ISVOLATILE, and _NONVOLATILE flags
Date: Mon, 30 Apr 2012 12:40:13 -0700
Message-ID: <4F9EEA9D.8020909@linaro.org>
In-Reply-To: <20120428020444.GK9541@dastard>

On 04/27/2012 07:04 PM, Dave Chinner wrote:
> On Fri, Apr 27, 2012 at 12:14:18PM -0700, John Stultz wrote:
>> On 04/26/2012 05:39 PM, Dave Chinner wrote:
>>> On Tue, Apr 24, 2012 at 10:49:46AM -0700, John Stultz wrote:
>>>> @@ -128,6 +129,19 @@ SYSCALL_DEFINE(fadvise64_64)(int fd, loff_t offset, loff_t len, int advice)
>>>>   			invalidate_mapping_pages(mapping, start_index,
>>>>   						end_index);
>>>>   		break;
>>>> +	case POSIX_FADV_VOLATILE:
>>>> +		/* First and last PARTIAL page! */
>>>> +		start_index = offset >> PAGE_CACHE_SHIFT;
>>>> +		end_index = endbyte >> PAGE_CACHE_SHIFT;
>>>> +		ret = mapping_range_volatile(mapping, start_index, end_index);
>>>> +		break;
>>>> +	case POSIX_FADV_NONVOLATILE:
>>>> +		/* First and last PARTIAL page! */
>>>> +		start_index = offset >> PAGE_CACHE_SHIFT;
>>>> +		end_index = endbyte >> PAGE_CACHE_SHIFT;
>>>> +		ret = mapping_range_nonvolatile(mapping, start_index,
>>>> +								end_index);
>>> As it is, I'm still not sold on these being an fadvise() interface
>>> because it is really just a delayed hole punching interface whose
>>> functionality is currently specific to tmpfs. The behaviour cannot
>>> be implemented sanely by anything else at this point.
>> Yea. So I spent some time looking at the various hole punching
>> mechanisms and they aren't altogether consistent across
>> filesystems. For instance, on some filesystems (ext4 and most
>> disk-backed filesystems) you have to use
>> fallocate(fd, FALLOC_FL_PUNCH_HOLE, ...), while on tmpfs, it's
>> madvise(..., MADV_REMOVE). So in a way, currently,
>> POSIX_FADV_VOLATILE is closer to a delayed MADV_REMOVE.
> The MADV_REMOVE functionality for hole punching works *only* for
> tmpfs - no other filesystem implements the .truncate_range() method.
> In fact, several filesystems *can't* implement .truncate_range()
> because there is no callout from the page cache truncation code to
> allow filesystems to punch out the underlying blocks. The
> vmtruncate() code is deprecated for this reason (and various others
> like a lack of error handling), and .truncate_range() is just as
> nasty. .truncate_range() needs to die, IMO.
>
> So, rather than building more infrastructure on a nasty, filesystem
> specific mmap() hack, implement .fallocate() on tmpfs and use the
> same interface that every other filesystem uses for punching holes.

Ah. Ok.  I wasn't aware that vmtruncate was deprecated.  Thanks for 
cluing me in here!
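
For anyone following along, here is a minimal userspace sketch of the
fallocate()-based hole punch Dave is pointing at. The helper name
punch_hole() is made up for illustration; the fallocate() call and the
FALLOC_FL_* flags are the existing Linux interface. Note that
FALLOC_FL_PUNCH_HOLE has to be combined with FALLOC_FL_KEEP_SIZE, and
that whether the call succeeds depends on the filesystem backing fd.

```c
#define _GNU_SOURCE
#include <errno.h>
#include <fcntl.h>
#include <linux/falloc.h>
#include <sys/types.h>

/*
 * Hypothetical helper (not from the patch set): deallocate the byte
 * range [offset, offset + len) of an open file without changing its
 * size. Returns 0 on success or a negative errno value, for example
 * -EOPNOTSUPP when the filesystem has no hole punching support.
 */
static int punch_hole(int fd, off_t offset, off_t len)
{
	int mode = FALLOC_FL_PUNCH_HOLE | FALLOC_FL_KEEP_SIZE;

	if (fallocate(fd, mode, offset, len) < 0)
		return -errno;
	return 0;
}
```

A caller would use something like punch_hole(fd, 0, 4096) and treat
-EOPNOTSUPP as "this filesystem cannot punch holes".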

>>> This probably won't perform wonderfully, which is where the range
>>> tracking and delayed punching (and the implied memory freeing)
>>> optimisation comes into play. Sure, for tmpfs this can be implemented
>>> as a shrinker, but for real filesystems that have to punch blocks a
>>> shrinker is really the wrong context to be running such
>>> transactions. However, using the fallocate() interface allows each
>>> filesystem to optimise the delayed hole punching as they see best,
>>> something that cannot be done with this fadvise() interface.
>> So if a shrinker isn't the right context, what would be a good
>> context for delayed hole punching?
> Like we do in XFS for inode reclaim. We have a background workqueue
> that frees aged inodes periodically in the fastest manner possible
> (i.e. all async, no blocking on locks, etc), and the shrinker, when
> run, kicks that background thread first, and then enters into
> synchronous reclaim. By the time a single sync reclaim cycle is run
> and throttled reclaim sufficiently, the background thread has done a
> great deal more work.
>
> A similar mechanism can be used for this functionality within XFS.
> Indeed, we could efficiently track which inodes have volatile ranges
> on them via a bit in the radix trees that index the inode cache,
> just like we do for reclaimable inodes. If we then used a bit in the
> page cache radix tree index to indicate volatile pages, we could
> then easily find the ranges we need to punch out without requiring
> some new tree and more per-inode memory.
>
> That's a very filesystem specific implementation - it's vastly
> different to your tmpfs implementation - but this is exactly what I
> mean about using fallocate to allow filesystems to optimise the
> implementation in the most suitable manner for them....
>

So, just to make sure I'm following you, you're suggesting that there
would be a filesystem specific implementation at the top level.
Something like a mark_volatile(struct inode *, bool, loff_t, loff_t)
inode operation? And the filesystem would then be responsible for
managing the ranges and appropriately purging them?
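
To make the question concrete, here is a rough userspace mock of that
shape. Everything in it is hypothetical: the toy_* types, the
vfs_mark_volatile() dispatcher and the fixed-size range array are
stand-ins, not kernel structures. It only illustrates a generic layer
calling a per-filesystem hook that owns the range tracking.

```c
#include <errno.h>
#include <stdbool.h>
#include <stdint.h>

#define TOY_MAX_RANGES 16	/* arbitrary limit for this sketch */

struct toy_inode;

/* Hypothetical per-filesystem hook, modeled on mark_volatile() above. */
struct toy_inode_operations {
	int (*mark_volatile)(struct toy_inode *inode, bool is_volatile,
			     int64_t start, int64_t end);
};

struct toy_range {
	int64_t start, end;	/* inclusive byte offsets */
};

struct toy_inode {
	const struct toy_inode_operations *i_op;
	struct toy_range ranges[TOY_MAX_RANGES];
	int nr_ranges;
};

/*
 * Toy filesystem implementation: record a range when it is marked
 * volatile, drop an exactly matching range when it is marked
 * non-volatile. A real filesystem would merge and split ranges and
 * decide for itself when to purge them.
 */
static int toyfs_mark_volatile(struct toy_inode *inode, bool is_volatile,
			       int64_t start, int64_t end)
{
	int i;

	if (start < 0 || end < start)
		return -EINVAL;

	if (is_volatile) {
		if (inode->nr_ranges == TOY_MAX_RANGES)
			return -ENOMEM;
		inode->ranges[inode->nr_ranges++] =
			(struct toy_range){ start, end };
		return 0;
	}

	for (i = 0; i < inode->nr_ranges; i++) {
		if (inode->ranges[i].start == start &&
		    inode->ranges[i].end == end) {
			/* Fill the gap with the last entry. */
			inode->ranges[i] = inode->ranges[--inode->nr_ranges];
			return 0;
		}
	}
	return -ENOENT;
}

static const struct toy_inode_operations toyfs_iops = {
	.mark_volatile = toyfs_mark_volatile,
};

/* Generic entry point: dispatch to the filesystem if it has the hook. */
static int vfs_mark_volatile(struct toy_inode *inode, bool is_volatile,
			     int64_t start, int64_t end)
{
	if (!inode->i_op || !inode->i_op->mark_volatile)
		return -EOPNOTSUPP;
	return inode->i_op->mark_volatile(inode, is_volatile, start, end);
}
```

The point of the indirection is that the generic layer never sees how
ranges are stored, so tmpfs could keep a range tree while XFS could use
radix tree tag bits.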

Thanks again for the feedback, I'll continue looking into this.

thanks
-john





