All of lore.kernel.org
 help / color / mirror / Atom feed
From: John Stultz <john.stultz@linaro.org>
To: Dave Chinner <david@fromorbit.com>
Cc: LKML <linux-kernel@vger.kernel.org>,
	Andrew Morton <akpm@linux-foundation.org>,
	Android Kernel Team <kernel-team@android.com>,
	Robert Love <rlove@google.com>, Mel Gorman <mel@csn.ul.ie>,
	Hugh Dickins <hughd@google.com>,
	Dave Hansen <dave@linux.vnet.ibm.com>,
	Rik van Riel <riel@redhat.com>,
	Dmitry Adamushko <dmitry.adamushko@gmail.com>,
	Neil Brown <neilb@suse.de>, Andrea Righi <andrea@betterlinux.com>,
	"Aneesh Kumar K.V" <aneesh.kumar@linux.vnet.ibm.com>
Subject: Re: [PATCH 2/3] fadvise: Add _VOLATILE,_ISVOLATILE, and _NONVOLATILE flags
Date: Mon, 30 Apr 2012 12:40:13 -0700	[thread overview]
Message-ID: <4F9EEA9D.8020909@linaro.org> (raw)
In-Reply-To: <20120428020444.GK9541@dastard>

On 04/27/2012 07:04 PM, Dave Chinner wrote:
> On Fri, Apr 27, 2012 at 12:14:18PM -0700, John Stultz wrote:
>> On 04/26/2012 05:39 PM, Dave Chinner wrote:
>>> On Tue, Apr 24, 2012 at 10:49:46AM -0700, John Stultz wrote:
>>>> @@ -128,6 +129,19 @@ SYSCALL_DEFINE(fadvise64_64)(int fd, loff_t offset, loff_t len, int advice)
>>>>   			invalidate_mapping_pages(mapping, start_index,
>>>>   						end_index);
>>>>   		break;
>>>> +	case POSIX_FADV_VOLATILE:
>>>> +		/* First and last PARTIAL page! */
>>>> +		start_index = offset>>   PAGE_CACHE_SHIFT;
>>>> +		end_index = endbyte>>   PAGE_CACHE_SHIFT;
>>>> +		ret = mapping_range_volatile(mapping, start_index, end_index);
>>>> +		break;
>>>> +	case POSIX_FADV_NONVOLATILE:
>>>> +		/* First and last PARTIAL page! */
>>>> +		start_index = offset>>   PAGE_CACHE_SHIFT;
>>>> +		end_index = endbyte>>   PAGE_CACHE_SHIFT;
>>>> +		ret = mapping_range_nonvolatile(mapping, start_index,
>>>> +								end_index);
>>> As it is, I'm still not sold on these being an fadvise() interface
>>> because all it really is a delayed hole punching interface whose
>>> functionailty is currently specific to tmpfs. The behaviour cannot
>>> be implemented sanely by anything else at this point.
>> Yea. So I spent some time looking at the various hole punching
>> mechanisms and they aren't all together consistent across
>> filesystems. For instance, on some filesystems (ext4 and mostly disk
>> backed fs) you have to use fallocate(fd,
>> |FALLOC_FL_PUNCH_HOLE,...)|, while on tmpfs, its
>> madvise(...,MADV_REMOVE).   So in a way, currently, the
>> FADVISE_VOLATILE is closer to a delayed MADVISE_REMOVE.
> The MADVISE_REMOVE functionality for hole punching works *only* for
> tmpfs - no other filesystem implements the .truncate_range() method.
> In fact, several filesystems *can't* implement .truncate_range()
> because there is no callout from the page cache truncation code to
> allow filesystems to punch out the underlying blocks. The
> vmtruncate() code is deprecated for this reason (and various others
> like a lack of error handling), and .truncate_range() is just as
> nasty. .truncate_range() needs to die, IMO.
>
> So, rather than building more infrastructure on a nasty, filesystem
> specific mmap() hack, implement .fallocate() on tmpfs and use the
> same interface that every other filesystem uses for punching holes.

Ah. Ok.  I wasn't aware that vmtruncate was deprecated.  Thanks for 
cluing me in here!

>>> This probably won't perform wonderfully, which is where the range
>>> tracking and delayed punching (and the implied memory freeing)
>>> optimiation comes into play. Sure, for tmpfs this can be implemented
>>> as a shrinker, but for real filesystems that have to punch blocks a
>>> shrinker is really the wrong context to be running such
>>> transactions. However, using the fallocate() interface allows each
>>> filesytsem to optimise the delayed hole punching as they see best,
>>> something that cannot be done with this fadvise() interface.
>> So if a shrinker isn't the right context, what would be a good
>> context for delayed hole punching?
> Like we in XFs for inode reclaim. We have a background workqueue
> that frees aged inodes periodically in the fastest manner possible
> (i.e. all async, no blocking on locks, etc), and the shrinker, when
> run kicks that background thread first, and then enters into
> synchronous reclaim. By the time a single sync reclaim cycle is run
> and throttled reclaim sufficiently, the background thread has done a
> great deal more work.
>
> A similar mechanism can be used for this functionality within XFS.
> Indeed, we could efficiently track which inodes have volatile ranges
> on them via a bit in the radix trees than index the inode cache,
> just like we do for reclaimable inodes. If we then used a bit in the
> page cache radix tree index to indicate volatile pages, we could
> then easily find the ranges we need to punch out without requiring
> some new tree and more per-inode memory.
>
> That's a very filesystem specific implementation - it's vastly
> different to you tmpfs implementation - but this is exactly what I
> mean about using fallocate to allow filesystems to optimise the
> implementation in the most suitable manner for them....
>

So, just to make sure I'm folloiwng you, you're suggesting that there 
would be a filesystem specific implementation at the top level. 
Something like a  mark_volatile(struct inode *, bool, loff_t, loff_t) 
inode operation? And the filesystem would then be responsible for 
managing the ranges and appropriately purging them?

Thanks again for the feedback, I'll continue looking into this.

thanks
-john





  reply	other threads:[~2012-04-30 19:41 UTC|newest]

Thread overview: 28+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2012-04-24 17:49 [PATCH 0/3] Volatile Ranges John Stultz
2012-04-24 17:49 ` [PATCH 1/3] Range tree implementation John Stultz
2012-04-24 19:14   ` Peter Zijlstra
2012-04-24 19:25     ` John Stultz
2012-04-24 19:33       ` Peter Zijlstra
2012-04-25 12:16   ` Dmitry Adamushko
2012-04-25 16:19     ` John Stultz
2012-04-26 10:00       ` Dmitry Adamushko
2012-04-27 19:34         ` John Stultz
2012-04-24 17:49 ` [PATCH 2/3] fadvise: Add _VOLATILE,_ISVOLATILE, and _NONVOLATILE flags John Stultz
2012-04-24 19:20   ` Peter Zijlstra
2012-04-24 19:50     ` John Stultz
2012-04-27  0:39   ` Dave Chinner
2012-04-27 15:25     ` Dave Hansen
2012-04-28  1:36       ` Dave Chinner
2012-04-30 21:07         ` John Stultz
2012-05-01  0:08           ` Dave Chinner
2012-05-01  0:46             ` John Stultz
2012-05-01  1:28               ` Dave Chinner
2012-04-27 19:14     ` John Stultz
2012-04-28  2:04       ` Dave Chinner
2012-04-30 19:40         ` John Stultz [this message]
2012-05-01  0:28           ` Dave Chinner
2012-05-01  1:15             ` John Stultz
2012-05-01  1:51               ` Dave Chinner
2012-04-24 17:49 ` [PATCH 3/3] [RFC] ashmem: Convert ashmem to use volatile ranges John Stultz
2012-04-24 19:21   ` Peter Zijlstra
2012-04-24 19:42     ` John Stultz

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=4F9EEA9D.8020909@linaro.org \
    --to=john.stultz@linaro.org \
    --cc=akpm@linux-foundation.org \
    --cc=andrea@betterlinux.com \
    --cc=aneesh.kumar@linux.vnet.ibm.com \
    --cc=dave@linux.vnet.ibm.com \
    --cc=david@fromorbit.com \
    --cc=dmitry.adamushko@gmail.com \
    --cc=hughd@google.com \
    --cc=kernel-team@android.com \
    --cc=linux-kernel@vger.kernel.org \
    --cc=mel@csn.ul.ie \
    --cc=neilb@suse.de \
    --cc=riel@redhat.com \
    --cc=rlove@google.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.