Re: Subtle races between DAX mmap fault and write path


From: Dave Chinner <david@fromorbit.com>
To: Ross Zwisler <ross.zwisler@linux.intel.com>,
	Jan Kara <jack@suse.cz>,
	linux-fsdevel@vger.kernel.org, linux-nvdimm@lists.01.org,
	xfs@oss.sgi.com, linux-ext4@vger.kernel.org,
	Dan Williams <dan.j.williams@intel.com>
Subject: Re: Subtle races between DAX mmap fault and write path
Date: Thu, 28 Jul 2016 08:19:49 +1000
Message-ID: <20160727221949.GU16044@dastard>
In-Reply-To: <20160727211039.GA20278@linux.intel.com>

On Wed, Jul 27, 2016 at 03:10:39PM -0600, Ross Zwisler wrote:
> On Wed, Jul 27, 2016 at 02:07:45PM +0200, Jan Kara wrote:
> > Hi,
> > 
> > when testing my latest changes to DAX fault handling code I have hit the
> > following interesting race between the fault and write path (I'll show
> > function names for ext4 but xfs has the same issue AFAICT).

The XFS update I just pushed to Linus contains a rework of the XFS
DAX IO path. It no longer shares the XFS direct IO path, so it
doesn't contain any of the direct IO warts anymore.

> > We have a file 'f' which has a hole at offset 0.
> > 
> > Process 0				Process 1
> > 
> > data = mmap('f');
> > read data[0]
> >   -> fault, we map a hole page
> > 
> > 					pwrite('f', buf, len, 0)
> > 					  -> ext4_file_write_iter
> > 					    inode_lock(inode);
> > 					    __generic_file_write_iter()
> > 					      generic_file_direct_write()
> > 						invalidate_inode_pages2_range()
> > 						  - drops hole page from
> > 						    the radix tree
> > 						ext4_direct_IO()
> > 						  dax_do_io()
> > 						    - allocates block for
> > 						      offset 0
> > data[0] = 1
> >   -> page_mkwrite fault
> >     -> ext4_dax_fault()
> >       down_read(&EXT4_I(inode)->i_mmap_sem);
> >       __dax_fault()
> > 	grab_mapping_entry()
> > 	  - creates locked radix tree entry
> > 	- maps block into PTE
> > 	put_locked_mapping_entry()
> > 
> > 						invalidate_inode_pages2_range()
> > 						  - removes dax entry from
> > 						    the radix tree

In XFS, we don't call __generic_file_write_iter or
generic_file_direct_write(), and xfs_file_dax_write() does not have
this trailing call to invalidate_inode_pages2_range() anymore. It's
DAX - there's nothing to invalidate, right?

So I think Christoph just (accidentally) removed this race condition
from XFS....

> > So we have just lost information that block 0 is mapped and needs flushing
> > caches.
> > 
> > Also the fact that the consistency of data as viewed by mmap and
> > dax_do_io() relies on invalidate_inode_pages2_range() is somewhat
> > unexpected to me and we should document it somewhere.

I don't think it does - what, in DAX, is incoherent? If anything,
it's the data in the direct IO buffer, not the view the mmap will
see. i.e. the post-write invalidate is to ensure that applications
that have mmapped the file see the data written by direct IO. That's
not necessary with DAX.

> > The question is how to best fix this. I see three options:
> > 
> > 1) Lock out faults during writes via exclusive i_mmap_sem. That is rather
> > harsh but should work - we call filemap_write_and_wait() in
> > generic_file_direct_write() so we flush out all caches for the relevant
> > area before dropping radix tree entries.

We don't call filemap_write_and_wait() in xfs_file_dax_write()
anymore, either. It's DAX - we don't need to flush anything to read
the correct data, and there's nothing cached that becomes stale when
we overwrite the destination memory.

[snip]

> > Any opinions on this?
> 
> Can we just skip the two calls to invalidate_inode_pages2_range() in
> generic_file_direct_write() for DAX I/O?

The first invalidate is still there in XFS. The comment above it
in the new XFS code says:

	/*
	 * Yes, even DAX files can have page cache attached to them:  A zeroed
	 * page is inserted into the pagecache when we have to serve a write
	 * fault on a hole.  It should never be dirtied and can simply be
	 * dropped from the pagecache once we get real data for the page.
	 */
	if (mapping->nrpages) {
		ret = invalidate_inode_pages2(mapping);
		WARN_ON_ONCE(ret);
	}


> Similarly, for DAX I don't think we actually need to do the
> filemap_write_and_wait_range() call in generic_file_direct_write() either.
> It's a similar scenario - for direct I/O we are trying to make sure that any
> dirty data in the page cache is written out to media before the ->direct_IO()
> call happens.  For DAX I don't think we care.  If a user does an mmap() write
> which creates a dirty radix tree entry, then does a write(), we should be able
> to happily overwrite the old data with the new without flushing, and just
> leave the dirty radix tree entry in place.

Yup, that's pretty much how XFS treats DAX now - it's not direct IO,
but it's not buffered IO, either...

I don't think there's a problem that needs to be fixed in the DAX
code - the issue is all about the surrounding IO context, i.e.
whether dax_do_io() is automatically coherent with mmap or not
because mmap directly exposes the CPU cache coherent memory
dax_do_io() modifies.
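
For anyone who wants to poke at the access pattern from user space, here
is a minimal sketch of the Process 0 / Process 1 sequence from the
diagram, run back to back in a single process. It only checks that a
shared mapping and the write path agree on the file contents. It does
not reproduce the kernel-side race, which depends on the fault and the
trailing invalidate interleaving, and it does not need a DAX mount to
run. The helper name check_mmap_sees_pwrite() and the path argument are
made up for illustration.

```c
#define _XOPEN_SOURCE 700
#include <fcntl.h>
#include <string.h>
#include <sys/mman.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Hypothetical helper, not kernel code: walks through the syscall
 * sequence from the race diagram in one process and checks that the
 * mmap view and the pwrite()/pread() view of the file stay coherent.
 * Returns 0 if they agree, -1 on any error or mismatch.
 */
int check_mmap_sees_pwrite(const char *path)
{
	const char buf[] = "new data";
	char readback[sizeof(buf)];
	long page = sysconf(_SC_PAGESIZE);
	volatile char *data;
	int ret = -1;
	int fd;

	fd = open(path, O_RDWR | O_CREAT | O_TRUNC, 0600);
	if (fd < 0)
		return -1;

	/* Extending an empty file leaves a hole at offset 0. */
	if (page <= 0 || ftruncate(fd, page) != 0)
		goto out_close;

	data = mmap(NULL, page, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
	if (data == MAP_FAILED)
		goto out_close;

	/* "Process 0": read fault on the hole, which must read as zero. */
	if (data[0] != 0)
		goto out_unmap;

	/* "Process 1": the write path allocates a block for offset 0. */
	if (pwrite(fd, buf, sizeof(buf), 0) != (ssize_t)sizeof(buf))
		goto out_unmap;

	/* The existing mapping must now show the written data. */
	if (memcmp((const void *)data, buf, sizeof(buf)) != 0)
		goto out_unmap;

	/* "Process 0": store through the mapping (write fault). */
	data[0] = 'N';

	/* The read path must see the store made through the mapping. */
	if (pread(fd, readback, sizeof(readback), 0) !=
	    (ssize_t)sizeof(readback))
		goto out_unmap;
	if (readback[0] != 'N' ||
	    memcmp(readback + 1, buf + 1, sizeof(buf) - 1) != 0)
		goto out_unmap;

	ret = 0;
out_unmap:
	munmap((void *)data, page);
out_close:
	close(fd);
	return ret;
}
```

Call it with a path on the filesystem under test, e.g.
check_mmap_sees_pwrite("/mnt/dax/f"), where the mount point is only an
example. A return of 0 means both views agreed at every step of this
sequential run. It says nothing about whether the radix tree entry
survived, which is the part the race actually breaks.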

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com
