From: Dave Chinner <david@fromorbit.com>
To: Ross Zwisler <ross.zwisler@linux.intel.com>,
Jan Kara <jack@suse.cz>,
linux-fsdevel@vger.kernel.org, linux-nvdimm@lists.01.org,
xfs@oss.sgi.com, linux-ext4@vger.kernel.org,
Dan Williams <dan.j.williams@intel.com>
Subject: Re: Subtle races between DAX mmap fault and write path
Date: Thu, 28 Jul 2016 08:19:49 +1000
Message-ID: <20160727221949.GU16044@dastard>
In-Reply-To: <20160727211039.GA20278@linux.intel.com>
On Wed, Jul 27, 2016 at 03:10:39PM -0600, Ross Zwisler wrote:
> On Wed, Jul 27, 2016 at 02:07:45PM +0200, Jan Kara wrote:
> > Hi,
> >
> > when testing my latest changes to DAX fault handling code I have hit the
> > following interesting race between the fault and write path (I'll show
> > function names for ext4 but xfs has the same issue AFAICT).
The XFS update I just pushed to Linus contains a rework of the XFS
DAX IO path. It no longer shares the XFS direct IO path, so it
doesn't contain any of the direct IO warts anymore.
> > We have a file 'f' which has a hole at offset 0.
> >
> > Process 0                                     Process 1
> >
> > data = mmap('f');
> > read data[0]
> >   -> fault, we map a hole page
> >
> >                                               pwrite('f', buf, len, 0)
> >                                                 -> ext4_file_write_iter
> >                                                   inode_lock(inode);
> >                                                   __generic_file_write_iter()
> >                                                     generic_file_direct_write()
> >                                                       invalidate_inode_pages2_range()
> >                                                         - drops hole page from
> >                                                           the radix tree
> >                                                       ext4_direct_IO()
> >                                                         dax_do_io()
> >                                                           - allocates block for
> >                                                             offset 0
> > data[0] = 1
> >   -> page_mkwrite fault
> >     -> ext4_dax_fault()
> >       down_read(&EXT4_I(inode)->i_mmap_sem);
> >       __dax_fault()
> >         grab_mapping_entry()
> >           - creates locked radix tree entry
> >         - maps block into PTE
> >         put_locked_mapping_entry()
> >
> >                                                       invalidate_inode_pages2_range()
> >                                                         - removes dax entry from
> >                                                           the radix tree
In XFS, we don't call __generic_file_write_iter() or
generic_file_direct_write(), and xfs_file_dax_write() does not have
this trailing call to invalidate_inode_pages2_range() anymore. It's
DAX - there's nothing to invalidate, right?
So I think Christoph just (accidentally) removed this race condition
from XFS....
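To make the ordering in Jan's trace concrete, here is a toy userspace
model of the single radix tree slot for offset 0. It is only a sketch:
toy_mapping, slot_state and replay() are hypothetical names invented
for this illustration, not kernel structures, and the model assumes
(as the trace above describes) that the trailing invalidate drops the
DAX entry while the writable PTE stays in place.

```c
#include <stdbool.h>

/* Hypothetical model of what the radix tree holds for file offset 0. */
enum slot_state { SLOT_EMPTY, SLOT_HOLE_PAGE, SLOT_DAX_DIRTY };

struct toy_mapping {
	enum slot_state slot;	/* radix tree contents for offset 0 */
	bool block_allocated;	/* has the write path allocated block 0? */
	bool pte_writable;	/* does a writable PTE map the block? */
};

/* Process 0: read fault on a hole maps a zeroed hole page. */
void read_fault(struct toy_mapping *m)
{
	if (!m->block_allocated)
		m->slot = SLOT_HOLE_PAGE;
}

/* Process 1: invalidate drops whatever the slot currently holds. */
void invalidate(struct toy_mapping *m)
{
	m->slot = SLOT_EMPTY;
}

/* Process 1: the DAX write allocates the block for offset 0. */
void dax_write(struct toy_mapping *m)
{
	m->block_allocated = true;
}

/* Process 0: write fault creates a dirty entry and a writable PTE. */
void mkwrite_fault(struct toy_mapping *m)
{
	m->slot = SLOT_DAX_DIRTY;
	m->pte_writable = true;
}

/*
 * Replay the interleaving from the trace. Returns true when the dirty
 * tracking is lost: a writable PTE exists but no dirty entry remains.
 */
bool replay(bool trailing_invalidate)
{
	struct toy_mapping m = { SLOT_EMPTY, false, false };

	read_fault(&m);
	invalidate(&m);		/* leading invalidate drops the hole page */
	dax_write(&m);
	mkwrite_fault(&m);	/* races in before the writer finishes */
	if (trailing_invalidate)
		invalidate(&m);	/* drops the dirty DAX entry */

	return m.pte_writable && m.slot != SLOT_DAX_DIRTY;
}
```

In this model replay(true) reports the lost entry and replay(false)
does not, which is the whole of the claim: without the trailing
invalidate there is nothing left to race with the fault.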
> > So we have just lost information that block 0 is mapped and needs flushing
> > caches.
> >
> > Also the fact that the consistency of data as viewed by mmap and
> > dax_do_io() relies on invalidate_inode_pages2_range() is somewhat
> > unexpected to me and we should document it somewhere.
I don't think it does - what, in DAX, is incoherent? If anything,
it's the data in the direct IO buffer, not the view the mmap will
see. i.e. the post-write invalidate is to ensure that applications
that have mmapped the file see the data written by direct IO. That's
not necessary with DAX.
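The property an application cares about can be checked from userspace
with a sketch like the one below. The helper name mmap_sees_pwrite()
and the /tmp template are assumptions made for this example. Note that
it uses a buffered pwrite() on an ordinary file, where the mapping and
the write share the page cache, so it only demonstrates the
user-visible semantics; it does not exercise the direct IO or DAX
paths discussed here.

```c
#define _POSIX_C_SOURCE 200809L
#include <fcntl.h>
#include <stdlib.h>
#include <sys/mman.h>
#include <unistd.h>

/*
 * Returns 1 if a byte written with pwrite() is visible through an
 * already established MAP_SHARED mapping, 0 if it is not, and -1 if
 * the setup fails.
 */
int mmap_sees_pwrite(void)
{
	char path[] = "/tmp/dax-demo-XXXXXX";	/* hypothetical location */
	int result = -1;
	int fd = mkstemp(path);

	if (fd < 0)
		return -1;
	unlink(path);			/* file goes away when fd is closed */

	/* One page of hole at offset 0, like file 'f' in the trace. */
	if (ftruncate(fd, 4096) == 0) {
		volatile char *data = mmap(NULL, 4096, PROT_READ | PROT_WRITE,
					   MAP_SHARED, fd, 0);

		if (data != MAP_FAILED) {
			char before = data[0];	/* read fault on the hole */

			if (pwrite(fd, "A", 1, 0) == 1)
				result = (before == 0 && data[0] == 'A');
			munmap((void *)data, 4096);
		}
	}
	close(fd);
	return result;
}
```

With direct IO it is the post-write invalidate that makes such a check
pass for cached pages; with DAX the mapping and the write touch the
same memory, so there is no cached copy to go stale.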
> > The question is how to best fix this. I see three options:
> >
> > 1) Lock out faults during writes via exclusive i_mmap_sem. That is rather
> > harsh but should work - we call filemap_write_and_wait() in
> > generic_file_direct_write() so we flush out all caches for the relevant
> > area before dropping radix tree entries.
We don't call filemap_write_and_wait() in xfs_file_dax_write()
anymore, either. It's DAX - we don't need to flush anything to read
the correct data, and there's nothing cached that becomes stale when
we overwrite the destination memory.
[snip]
> > Any opinions on this?
>
> Can we just skip the two calls to invalidate_inode_pages2_range() in
> generic_file_direct_write() for DAX I/O?
The first invalidate is still there in XFS. The comment above it
in the new XFS code says:
	/*
	 * Yes, even DAX files can have page cache attached to them: A zeroed
	 * page is inserted into the pagecache when we have to serve a write
	 * fault on a hole. It should never be dirtied and can simply be
	 * dropped from the pagecache once we get real data for the page.
	 */
	if (mapping->nrpages) {
		ret = invalidate_inode_pages2(mapping);
		WARN_ON_ONCE(ret);
	}
> Similarly, for DAX I don't think we actually need to do the
> filemap_write_and_wait_range() call in generic_file_direct_write() either.
> It's a similar scenario - for direct I/O we are trying to make sure that any
> dirty data in the page cache is written out to media before the ->direct_IO()
> call happens. For DAX I don't think we care. If a user does an mmap() write
> which creates a dirty radix tree entry, then does a write(), we should be able
> to happily overwrite the old data with the new without flushing, and just
> leave the dirty radix tree entry in place.
Yup, that's pretty much how XFS treats DAX now - it's not direct IO,
but it's not buffered IO, either...
I don't think there's a problem that needs to be fixed in the DAX
code - the issue is all about the surrounding IO context, i.e.
whether dax_do_io() is automatically coherent with mmap, given
that mmap directly exposes the CPU cache coherent memory that
dax_do_io() modifies.
Cheers,
Dave.
--
Dave Chinner
david@fromorbit.com