From: Trond Myklebust <trond.myklebust@primarydata.com>
To: Jeff Layton <jlayton@redhat.com>
Cc: Christoph Hellwig <hch@infradead.org>,
linuxnfs <linux-nfs@vger.kernel.org>
Subject: Re: [PATCH 7/7] nfs: page cache invalidation for dio
Date: Fri, 24 Jan 2014 11:46:41 -0700 [thread overview]
Message-ID: <1390589201.2927.46.camel@leira.trondhjem.org> (raw)
In-Reply-To: <20140124130013.22f706de@tlielax.poochiereds.net>
On Fri, 2014-01-24 at 13:00 -0500, Jeff Layton wrote:
> On Fri, 24 Jan 2014 10:40:06 -0700
> Trond Myklebust <trond.myklebust@primarydata.com> wrote:
>
> > On Fri, 2014-01-24 at 12:29 -0500, Jeff Layton wrote:
> > > On Fri, 24 Jan 2014 10:11:11 -0700
> > > Trond Myklebust <trond.myklebust@primarydata.com> wrote:
> > >
> > > >
> > > > On Jan 24, 2014, at 8:52, Jeff Layton <jlayton@redhat.com> wrote:
> > > >
> > > > > On Wed, 22 Jan 2014 07:04:09 -0500
> > > > > Jeff Layton <jlayton@redhat.com> wrote:
> > > > >
> > > > >> On Wed, 22 Jan 2014 00:24:14 -0800
> > > > >> Christoph Hellwig <hch@infradead.org> wrote:
> > > > >>
> > > > >>> On Tue, Jan 21, 2014 at 02:21:59PM -0500, Jeff Layton wrote:
> > > > >>>> In any case, this helps but it's a little odd. With this patch, you add
> > > > >>>> an invalidate_inode_pages2 call prior to doing the DIO. But, you've
> > > > >>>> also left in the call to nfs_zap_mapping in the completion codepath.
> > > > >>>>
> > > > >>>> So now, we shoot down the mapping prior to doing a DIO write, and then
> > > > >>>> mark the mapping for invalidation again when the write completes. Was
> > > > >>>> that intentional?
> > > > >>>>
> > > > >>>> It seems a little excessive and might hurt performance in some cases.
> > > > >>>> OTOH, if you mix buffered and DIO you're asking for trouble anyway and
> > > > >>>> this approach seems to give better cache coherency.
> > > > >>>
> > > > >>> Thile follows the model implemented and documented in
> > > > >>> generic_file_direct_write().
> > > > >>>
> > > > >>
> > > > >> Ok, thanks. That makes sense, and the problem described in those
> > > > >> comments is almost exactly the one I've seen in practice.
> > > > >>
> > > > >> I'm still not 100% thrilled with the way that the NFS_INO_INVALID_DATA
> > > > >> flag is handled, but that really has nothing to do with this patchset.
> > > > >>
> > > > >> You can add my Tested-by to the set if you like...
> > > > >>
> > > > >
> > > > > (re-sending with Trond's address fixed)
> > > > >
> > > > > I may have spoken too soon...
> > > > >
> > > > > This patchset didn't fix the problem once I cranked up the concurrency
> > > > > from 100 child tasks to 1000. I think that HCH's patchset makes sense
> > > > > and helps narrow the race window some, but the way that
> > > > > nfs_revalidate_mapping/nfs_invalidate_mapping work is just racy.
> > > > >
> > > > > The following patch does seem to fix it however. It's a combination of
> > > > > a test patch that Trond gave me a while back and another change to
> > > > > serialize the nfs_invalidate_mapping ops.
> > > > >
> > > > > I think it's a reasonable approach to deal with the problem, but we
> > > > > likely have some other areas that will need similar treatment since
> > > > > they also check NFS_INO_INVALID_DATA:
> > > > >
> > > > > nfs_write_pageuptodate
> > > > > nfs_readdir_search_for_cookie
> > > > > nfs_update_inode
> > > > >
> > > > > Trond, thoughts? It's not quite ready for merge, but I'd like to get an
> > > > > opinion on the basic approach, or whether you have an idea of how
> > > > > better handle the races here:
> > > >
> > > > I think that it is reasonable for nfs_revalidate_mapping, but I don’t see how it is relevant to nfs_update_inode or nfs_write_pageuptodate.
> > > > Readdir already has its own locking at the VFS level, so we shouldn’t need to care there.
> > > >
> > >
> > >
> > > nfs_write_pageuptodate does this:
> > >
> > > ---------------8<-----------------
> > > if (NFS_I(inode)->cache_validity & (NFS_INO_INVALID_DATA|NFS_INO_REVAL_PAGECACHE))
> > > return false;
> > > out:
> > > return PageUptodate(page) != 0;
> > > ---------------8<-----------------
> > >
> > > With the proposed patch, NFS_INO_INVALID_DATA would be cleared first and
> > > only later would the page be invalidated. So, there's a race window in
> > > there where the bit could be cleared but the page flag is still set,
> > > even though it's on its way out the cache. So, I think we'd need to do
> > > some similar sort of locking in there to make sure that doesn't happen.
> >
> > We _cannot_ lock against nfs_revalidate_mapping() here, because we could
> > end up deadlocking with invalidate_inode_pages2().
> >
> > If you like, we could add a test for NFS_INO_INVALIDATING, to turn off
> > the optimisation in that case, but I'd like to understand what the race
> > would be: don't forget that the page is marked as PageUptodate(), which
> > means that either invalidate_inode_pages2() has not yet reached this
> > page, or that a read of the page succeeded after the invalidation was
> > made.
> >
>
> Right. The first situation seems wrong to me. We've marked the file as
> INVALID and then cleared the bit to start the process of invalidating
> the actual pages. It seems like nfs_write_pageuptodate ought not return
> true even if PageUptodate() is still set at that point.
>
> We could check NFS_INO_INVALIDATING, but we might miss that
> optimization in a lot of cases just because something happens to be
> in nfs_revalidate_mapping. Maybe that means that this bitlock isn't
> sufficient and we need some other mechanism. I'm not sure what that
> should be though.
Convert your patch to use wait_on_bit(), and then to call
wait_on_bit_lock() if and only if you see NFS_INO_INVALID_DATA is set.
> > > nfs_update_inode just does this:
> > >
> > > if (invalid & NFS_INO_INVALID_DATA)
> > > nfs_fscache_invalidate(inode);
> > >
> > > ...again, since we clear the bit first with this patch, I think we have
> > > a potential race window there too. We might not see it set in a
> > > situation where we would have before. That case is a bit more
> > > problematic since we can't sleep to wait on the bitlock there.
> >
> > Umm... That test in nfs_update_inode() is there because we might just
> > have _set_ the NFS_INO_INVALID_DATA bit.
> >
>
> Correct. But do we need to force a fscache invalidation at that point,
> or can it wait until we're going to invalidate the mapping too?
That's a question for David. My assumption is that since invalidation is
handled asynchronously by the fscache layer itself, that we need to let
it start that process as soon as possible, but perhaps these races are
an indication that we should actually do it at the time when we call
invalidate_inode_pages2() (or at the latest, when we're evicting the
inode from the icache)...
--
Trond Myklebust
Linux NFS client maintainer
next prev parent reply other threads:[~2014-01-24 18:46 UTC|newest]
Thread overview: 37+ messages / expand[flat|nested] mbox.gz Atom feed top
2013-11-14 16:50 [PATCH 0/7] direct I/O fixes Christoph Hellwig
2013-11-14 16:50 ` [PATCH 1/7] nfs: fix size updates for aio writes Christoph Hellwig
2013-11-14 16:50 ` [PATCH 2/7] nfs: defer inode_dio_done call until size update is done Christoph Hellwig
2013-11-14 16:50 ` [PATCH 3/7] nfs: increment i_dio_count for reads, too Christoph Hellwig
2013-11-14 16:50 ` [PATCH 4/7] nfs: merge nfs_direct_read into nfs_file_direct_read Christoph Hellwig
2013-11-14 16:50 ` [PATCH 5/7] nfs: merge nfs_direct_write into nfs_file_direct_write Christoph Hellwig
2013-11-14 16:50 ` [PATCH 6/7] nfs: take i_mutex during direct I/O reads Christoph Hellwig
2013-11-14 17:00 ` Chuck Lever
2013-11-15 14:29 ` Christoph Hellwig
2013-11-14 20:43 ` Trond Myklebust
2013-11-15 14:32 ` Christoph Hellwig
2013-11-15 15:23 ` Trond Myklebust
2013-11-15 15:25 ` Christoph Hellwig
2013-11-15 15:34 ` Trond Myklebust
2013-11-15 15:37 ` Christoph Hellwig
2013-11-15 16:00 ` Trond Myklebust
2013-11-14 16:50 ` [PATCH 7/7] nfs: page cache invalidation for dio Christoph Hellwig
2013-11-14 18:35 ` Jeff Layton
2013-11-15 14:28 ` Christoph Hellwig
2013-11-15 14:52 ` Jeff Layton
2013-11-15 15:02 ` Christoph Hellwig
2013-11-15 15:33 ` Jeff Layton
2014-01-21 19:21 ` Jeff Layton
2014-01-22 8:24 ` Christoph Hellwig
2014-01-22 12:04 ` Jeff Layton
2014-01-24 15:50 ` Jeff Layton
2014-01-24 15:52 ` Jeff Layton
2014-01-24 17:11 ` Trond Myklebust
2014-01-24 17:29 ` Jeff Layton
2014-01-24 17:40 ` Trond Myklebust
2014-01-24 18:00 ` Jeff Layton
2014-01-24 18:46 ` Trond Myklebust [this message]
2014-01-24 21:21 ` Jeff Layton
2014-01-25 0:39 ` Trond Myklebust
2014-01-25 0:54 ` Jeff Layton
2014-01-25 1:05 ` Trond Myklebust
2014-01-25 1:11 ` Trond Myklebust
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=1390589201.2927.46.camel@leira.trondhjem.org \
--to=trond.myklebust@primarydata.com \
--cc=hch@infradead.org \
--cc=jlayton@redhat.com \
--cc=linux-nfs@vger.kernel.org \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).