From: Dave Chinner <david@fromorbit.com>
To: Johannes Weiner <hannes@cmpxchg.org>
Cc: Andrew Morton <akpm@linux-foundation.org>,
Rik van Riel <riel@redhat.com>, Jan Kara <jack@suse.cz>,
Vlastimil Babka <vbabka@suse.cz>,
Peter Zijlstra <peterz@infradead.org>, Tejun Heo <tj@kernel.org>,
Andi Kleen <andi@firstfloor.org>,
Andrea Arcangeli <aarcange@redhat.com>,
Greg Thelen <gthelen@google.com>,
Christoph Hellwig <hch@infradead.org>,
Hugh Dickins <hughd@google.com>,
KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>,
Mel Gorman <mgorman@suse.de>, Minchan Kim <minchan.kim@gmail.com>,
Michel Lespinasse <walken@google.com>,
Seth Jennings <sjenning@linux.vnet.ibm.com>,
Roman Gushchin <klamm@yandex-team.ru>,
Ozgun Erdogan <ozgun@citusdata.com>,
Metin Doslu <metin@citusdata.com>,
linux-mm@kvack.org, linux-fsdevel@vger.kernel.org,
linux-kernel@vger.kernel.org
Subject: Re: [patch 6/9] mm + fs: store shadow entries in page cache
Date: Thu, 28 Nov 2013 10:32:49 +1100
Message-ID: <20131127233249.GK10988@dastard>
In-Reply-To: <20131127170804.GD3556@cmpxchg.org>
On Wed, Nov 27, 2013 at 12:08:04PM -0500, Johannes Weiner wrote:
> On Tue, Nov 26, 2013 at 10:17:16AM +1100, Dave Chinner wrote:
> > On Sun, Nov 24, 2013 at 06:38:25PM -0500, Johannes Weiner wrote:
> > > Reclaim will be leaving shadow entries in the page cache radix tree
> > > upon evicting the real page. As those pages are found from the LRU,
> > > an iput() can lead to the inode being freed concurrently. At this
> > > point, reclaim must no longer install shadow pages because the inode
> > > freeing code needs to ensure the page tree is really empty.
> > >
> > > Add an address_space flag, AS_EXITING, that the inode freeing code
> > > sets under the tree lock before doing the final truncate. Reclaim
> > > will check for this flag before installing shadow pages.
> > >
> > > Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
> > ....
> > > @@ -545,10 +546,25 @@ static void evict(struct inode *inode)
> > >  	 */
> > >  	inode_wait_for_writeback(inode);
> > >
> > > +	/*
> > > +	 * Page reclaim can not do iput() and thus can race with the
> > > +	 * inode teardown.  Tell it when the address space is exiting,
> > > +	 * so that it does not install eviction information after the
> > > +	 * final truncate has begun.
> > > +	 *
> > > +	 * As truncation uses a lockless tree lookup, acquire the
> > > +	 * spinlock to make sure any ongoing tree modification that
> > > +	 * does not see AS_EXITING is completed before starting the
> > > +	 * final truncate.
> > > +	 */
> > > +	spin_lock_irq(&inode->i_data.tree_lock);
> > > +	mapping_set_exiting(&inode->i_data);
> > > +	spin_unlock_irq(&inode->i_data.tree_lock);
> > > +
> > >  	if (op->evict_inode) {
> > >  		op->evict_inode(inode);
> > >  	} else {
> > > -		if (inode->i_data.nrpages)
> > > +		if (inode->i_data.nrpages || inode->i_data.nrshadows)
> > >  			truncate_inode_pages(&inode->i_data, 0);
> > >  		clear_inode(inode);
> > >  	}
> >
> > Ok, so what I see here is that we need a wrapper function that
> > handles setting the AS_EXITING flag and doing the "final"
> > truncate_inode_pages() call, and the locking for the AS_EXITING flag
> > moved into mapping_set_exiting()
> >
> > That is, because this AS_EXITING flag and its locking constraints
> > are directly related to the upcoming truncate_inode_pages() call,
> > I'd prefer to see a helper that captures that relationship used
> > in all the filesystem code. e.g:
> >
> > void truncate_inode_pages_final(struct address_space *mapping)
> > {
> > 	spin_lock_irq(&mapping->tree_lock);
> > 	mapping_set_exiting(mapping);
> > 	spin_unlock_irq(&mapping->tree_lock);
> > 	if (mapping->nrpages || mapping->nrshadows)
> > 		truncate_inode_pages_range(mapping, 0, (loff_t)-1);
> > }
> >
> > And document it in Documentation/filesystems/porting as a mandatory
> > function to be called from ->evict_inode() implementations before
> > calling clear_inode(). You can then replace all the direct calls to
> > truncate_inode_pages() in the evict_inode() path with a call to
> > truncate_inode_pages_final().
>
> Ok, fair enough. I'll add a BUG_ON(!mapping_exiting(&inode->i_data))
> to the inode sanity checks on final teardown to make sure filesystems
> don't miss the change to truncate_inode_pages_final().
Good idea. :)
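
As a rough illustration of the contract being discussed, here is a
single-threaded userspace sketch. It is not kernel code: the struct and
function names ending in _sketch are hypothetical, locking is omitted,
and the boolean return stands in for the proposed
BUG_ON(!mapping_exiting(&inode->i_data)) check.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-in for the address_space fields under discussion. */
struct mapping_sketch {
	bool exiting;             /* models the AS_EXITING flag */
	unsigned long nrpages;    /* real pages in the tree */
	unsigned long nrshadows;  /* shadow entries in the tree */
};

/*
 * Models truncate_inode_pages_final(): mark the mapping as exiting,
 * then empty it. Locking is deliberately left out of this sketch.
 */
static void truncate_final_sketch(struct mapping_sketch *m)
{
	m->exiting = true;
	m->nrpages = 0;
	m->nrshadows = 0;
}

/*
 * Models the sanity check on final teardown. The kernel would BUG_ON()
 * a violation; here the function just reports whether the invariant holds.
 */
static bool teardown_ok_sketch(const struct mapping_sketch *m)
{
	return m->exiting && m->nrpages == 0 && m->nrshadows == 0;
}
```

Note that in this sketch an empty mapping that never went through the
helper still fails the check, which is how a filesystem that kept calling
truncate_inode_pages() directly would be caught.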
> > As it is, I'd really like to see that unconditional irq disable go
> > away from this code - disabling and enabling interrupts for every
> > single inode we reclaim is going to add significant overhead to this
> > hot code path. And given that:
> >
> > > +static inline void mapping_set_exiting(struct address_space *mapping)
> > > +{
> > > +	set_bit(AS_EXITING, &mapping->flags);
> > > +}
> > > +
> > > +static inline int mapping_exiting(struct address_space *mapping)
> > > +{
> > > +	return test_bit(AS_EXITING, &mapping->flags);
> > > +}
> >
> > these atomic bit ops, why do we need to take the tree_lock and
> > disable irqs in evict() to set this bit if there's nothing to
> > truncate on the inode? i.e. something like this:
> >
> > void truncate_inode_pages_final(struct address_space *mapping)
> > {
> > 	mapping_set_exiting(mapping);
> > 	if (mapping->nrpages || mapping->nrshadows) {
> > 		/*
> > 		 * spinlock barrier to ensure all modifications are
> > 		 * complete before we do the final truncate
> > 		 */
> > 		spin_lock_irq(&mapping->tree_lock);
> > 		spin_unlock_irq(&mapping->tree_lock);
> > 		truncate_inode_pages_range(mapping, 0, (loff_t)-1);
> > 	}
> > }
>
> That would almost work, but we need to enforce ordering of the counter
> reads and updates or truncation might read 0 on both while racing with
> reclaim.
>
> Reclaim would have to do:
>
> spin_lock_irq(&mapping->tree_lock)
> if !mapping_exiting():
> swap shadow entry
> mapping->nrshadows++
> smp_wmb()
> mapping->nrpages--
> spin_unlock_irq(&mapping->tree_lock)
>
> and the final truncate side would have to do
>
> mapping_set_exiting()
> nrpages = mapping->nrpages
> smp_rmb()
> nrshadows = mapping->nrshadows
> if (nrpages || nrshadows)
> spin_lock_irq(&mapping->tree_lock)
> spin_unlock_irq(&mapping->tree_lock)
> truncate
I don't see a problem with doing that as long as the memory barriers
are properly documented. One of the advantages of pulling this code
together is that we can use more complex synchronisation techniques
in this way without messing up code all over the place. ;)
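
For reference, the barrier pairing in the quoted pseudo-code can be
modelled in userspace with C11 atomics. This is only a sketch under
stated assumptions, not the kernel implementation: struct mapping_model
and the model_*/reclaim_one/final_truncate names are hypothetical, an
atomic_flag spinlock stands in for tree_lock, release and acquire fences
stand in for smp_wmb() and smp_rmb(), and the page count is assumed to
drop whether or not a shadow entry is stored.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical userspace model of the address_space fields involved. */
struct mapping_model {
	atomic_flag tree_lock;   /* stands in for mapping->tree_lock */
	atomic_bool exiting;     /* stands in for the AS_EXITING bit */
	atomic_ulong nrpages;
	atomic_ulong nrshadows;
};

static void model_init(struct mapping_model *m, unsigned long pages)
{
	atomic_flag_clear(&m->tree_lock);
	atomic_init(&m->exiting, false);
	atomic_init(&m->nrpages, pages);
	atomic_init(&m->nrshadows, 0UL);
}

static void model_lock(struct mapping_model *m)
{
	while (atomic_flag_test_and_set_explicit(&m->tree_lock,
						 memory_order_acquire))
		; /* spin until the lock is free */
}

static void model_unlock(struct mapping_model *m)
{
	atomic_flag_clear_explicit(&m->tree_lock, memory_order_release);
}

/* Reclaim side: evict one page; returns true if a shadow was stored. */
static bool reclaim_one(struct mapping_model *m)
{
	bool shadow = false;

	model_lock(m);
	if (!atomic_load_explicit(&m->exiting, memory_order_relaxed)) {
		atomic_fetch_add_explicit(&m->nrshadows, 1UL,
					  memory_order_relaxed);
		/* Role of smp_wmb(): order nrshadows++ before nrpages--. */
		atomic_thread_fence(memory_order_release);
		shadow = true;
	}
	atomic_fetch_sub_explicit(&m->nrpages, 1UL, memory_order_relaxed);
	model_unlock(m);
	return shadow;
}

/* Final truncate side: returns true if there was anything to truncate. */
static bool final_truncate(struct mapping_model *m)
{
	unsigned long nrpages, nrshadows;

	/* seq_cst store; stronger than the kernel's set_bit(). */
	atomic_store(&m->exiting, true);
	nrpages = atomic_load_explicit(&m->nrpages, memory_order_relaxed);
	/* Role of smp_rmb(): read nrpages before nrshadows. */
	atomic_thread_fence(memory_order_acquire);
	nrshadows = atomic_load_explicit(&m->nrshadows, memory_order_relaxed);
	if (!nrpages && !nrshadows)
		return false;

	/* Lock/unlock pair: wait for any in-flight tree modification. */
	model_lock(m);
	model_unlock(m);

	/* Stand-in for the actual truncation of the tree. */
	atomic_store(&m->nrpages, 0UL);
	atomic_store(&m->nrshadows, 0UL);
	return true;
}
```

The point of the pairing is that if the truncate side reads nrpages as
zero because of a reclaim decrement, the fences guarantee it also
observes the matching nrshadows increment, so it cannot see both
counters as zero while a shadow entry is still in the tree.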
Cheers,
Dave.
--
Dave Chinner
david@fromorbit.com