From: "Darrick J. Wong" <darrick.wong@oracle.com>
To: Dave Chinner <david@fromorbit.com>
Cc: Matthew Wilcox <willy@infradead.org>,
xfs <linux-xfs@vger.kernel.org>,
linux-fsdevel <linux-fsdevel@vger.kernel.org>
Subject: Re: [PATCH v3] vfs: fix page locking deadlocks when deduping files
Date: Wed, 14 Aug 2019 08:33:49 -0700 [thread overview]
Message-ID: <20190814153349.GS7138@magnolia> (raw)
In-Reply-To: <20190814095448.GK6129@dread.disaster.area>
On Wed, Aug 14, 2019 at 07:54:48PM +1000, Dave Chinner wrote:
> On Tue, Aug 13, 2019 at 08:40:10AM -0700, Matthew Wilcox wrote:
> > On Tue, Aug 13, 2019 at 08:14:34AM -0700, Darrick J. Wong wrote:
> > > + /*
> > > + * Now that we've locked both pages, make sure they still
> > > + * represent the data we're interested in. If not, someone
> > > + * is invalidating pages on us and we lose.
> > > + */
> > > + if (src_page->mapping != src->i_mapping ||
> > > + src_page->index != srcoff >> PAGE_SHIFT ||
> > > + dest_page->mapping != dest->i_mapping ||
> > > + dest_page->index != destoff >> PAGE_SHIFT) {
> > > + same = false;
> > > + goto unlock;
> > > + }
> >
> > It is my understanding that you don't need to check the ->index here.
> > If I'm wrong about that, I'd really appreciate being corrected, because
> > the page cache locking is subtle.
>
> Ah, when talking to Darrick about this, I didn't notice the code
> took references on the page, so it probably doesn't need the index
> check - the page can't be recycled out from under us here an
> inserted into a new mapping until we drop the reference.
>
> What I was mainly concerned about here is that we only have a shared
> inode lock on the src inode, so this code can be running
> concurrently with both invalidation and insertion into the mapping.
> e.g. direct write io does invalidation, buffered read does
> insertion. Hence we have to be really careful about the data in the
> source page being valid and stable while we run the comparison.
>
> And on further thought, I don't think shared locking is actually
> safe here. A shared lock doesn't stop new direct IO from being
> submitted, so inode_dio_wait() just drains IO at that point in time
> and but doesn't provide any guarantee that there isn't concurrent
> DIO running.
>
> Hence we could do the comparison here, see the data is the same,
> drop the page lock, a DIO write then invalidates the page and writes
> new data while we are comparing the rest of page(s) in the range. By
> the time we've checked the whole range, the data at the start is no
> longer the same, and the comparison is stale.
>
> And then we do the dedupe operation oblivious to the fact the data
> on disk doesn't actually match anymore, and we corrupt the data in
> the destination file as it gets linked to mismatched data in the
> source file....
<urrrrrrk> Yeah, that looks like a bug to me. I didn't realize that
directio writes were IOLOCK_SHARED and therefore reflink has to take
IOLOCK_EXCL to block them.
Related question: do we need to do the same for MMAPLOCK? I think the
answer is yes because xfs_filemap_fault can call page_mkwrite with
MMAPLOCK_SHARED.
--D
> Darrick?
>
> > You call read_mapping_page() which returns the page with an elevated
> > refcount. That means the page can't go back to the page allocator and
> > be allocated again. It can, because it's unlocked, still be truncated,
> > so the check for ->mapping after locking it is needed. But the check
> > for ->index being correct was done by find_get_entry().
> >
> > See pagecache_get_page() -- if we specify FGP_LOCK, then it will lock
> > the page, check the ->mapping but not check ->index. OK, it does check
> > ->index, but in a VM_BUG_ON(), so it's not something that ought to be
> > able to be wrong.
>
> Yeah, we used to have to play tricks in the old XFS writeback
> clustering code to do our own non-blocking page cache lookups adn
> this was one of the things we needed to be careful about until
> the pagevec_lookup* interfaces came along and solved all the
> problems for us. Funny how the brain remembers old gotchas with
> also reminding you that the problems went away almost as long
> ago.....
>
>
> Cheers,
>
> Dave.
> --
> Dave Chinner
> david@fromorbit.com
next prev parent reply other threads:[~2019-08-14 15:34 UTC|newest]
Thread overview: 9+ messages / expand[flat|nested] mbox.gz Atom feed top
2019-08-13 15:14 [PATCH v3] vfs: fix page locking deadlocks when deduping files Darrick J. Wong
2019-08-13 15:40 ` Matthew Wilcox
2019-08-14 7:03 ` Gao Xiang
2019-08-14 7:17 ` Gao Xiang
2019-08-14 9:54 ` Dave Chinner
2019-08-14 15:33 ` Darrick J. Wong [this message]
2019-08-14 21:28 ` Dave Chinner
2019-08-15 0:41 ` Darrick J. Wong
2019-08-13 15:53 ` Filipe Manana
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=20190814153349.GS7138@magnolia \
--to=darrick.wong@oracle.com \
--cc=david@fromorbit.com \
--cc=linux-fsdevel@vger.kernel.org \
--cc=linux-xfs@vger.kernel.org \
--cc=willy@infradead.org \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.