public inbox for linux-xfs@vger.kernel.org
 help / color / mirror / Atom feed
From: Jan Kara <jack@suse.cz>
To: Dave Chinner <david@fromorbit.com>
Cc: Jan Kara <jack@suse.cz>,
	linux-nvdimm@lists.01.org, xfs@oss.sgi.com,
	linux-fsdevel@vger.kernel.org,
	Ross Zwisler <ross.zwisler@linux.intel.com>,
	linux-ext4@vger.kernel.org,
	Dan Williams <dan.j.williams@intel.com>
Subject: Re: Subtle races between DAX mmap fault and write path
Date: Thu, 28 Jul 2016 10:10:33 +0200	[thread overview]
Message-ID: <20160728081033.GC4094@quack2.suse.cz> (raw)
In-Reply-To: <20160727221949.GU16044@dastard>

On Thu 28-07-16 08:19:49, Dave Chinner wrote:
> On Wed, Jul 27, 2016 at 03:10:39PM -0600, Ross Zwisler wrote:
> > On Wed, Jul 27, 2016 at 02:07:45PM +0200, Jan Kara wrote:
> > > Hi,
> > > 
> > > when testing my latest changes to DXA fault handling code I have hit the
> > > following interesting race between the fault and write path (I'll show
> > > function names for ext4 but xfs has the same issue AFAICT).
> 
> The XFS update I just pushed to Linus contains a rework of the XFS
> DAX IO path. It no longer shares the XFS direct IO path, so it
> doesn't contain any of the direct IO warts anymore.

Ah, OK. I knew Christoph rewrote it but forgot to check your tree when
looking into this bug.

> > > We have a file 'f' which has a hole at offset 0.
> > > 
> > > Process 0				Process 1
> > > 
> > > data = mmap('f');
> > > read data[0]
> > >   -> fault, we map a hole page
> > > 
> > > 					pwrite('f', buf, len, 0)
> > > 					  -> ext4_file_write_iter
> > > 					    inode_lock(inode);
> > > 					    __generic_file_write_iter()
> > > 					      generic_file_direct_write()
> > > 						invalidate_inode_pages2_range()
> > > 						  - drops hole page from
> > > 						    the radix tree
> > > 						ext4_direct_IO()
> > > 						  dax_do_io()
> > > 						    - allocates block for
> > > 						      offset 0
> > > data[0] = 1
> > >   -> page_mkwrite fault
> > >     -> ext4_dax_fault()
> > >       down_read(&EXT4_I(inode)->i_mmap_sem);
> > >       __dax_fault()
> > > 	grab_mapping_entry()
> > > 	  - creates locked radix tree entry
> > > 	- maps block into PTE
> > > 	put_locked_mapping_entry()
> > > 
> > > 						invalidate_inode_pages2_range()
> > > 						  - removes dax entry from
> > > 						    the radix tree
> 
> In XFS, we don't call __generic_file_write_iter or
> generic_file_direct_write(), and xfs_file_dax_write() does not have
> this trailing call to invalidate_inode_pages2_range() anymore. It's
> DAX - there's nothing to invalidate, right?
> 
> So I think Christoph just (accidentally) removed this race condition
> from XFS....

So with current XFS code what prevents the following:

Process 0				Process 1

data = mmap('f');
					pwrite('f', buf, len, 0)
					  -> xfs_file_dax_write
					    xfs_rw_ilock(ip, XFS_IOLOCK_EXCL);
					    invalidate_inode_pages2()
					      - drops hole page from the
					        radix tree
					    xfs_rw_ilock_demote(ip, XFS_IOLOCK_EXCL);
read data[0]
  -> fault
    -> xfs_filemap_fault
      xfs_ilock(XFS_I(inode), XFS_MMAPLOCK_SHARED);
      dax_fault()
	- installs hole in the radix tree and PTE

					    dax_do_io()
					      - allocates block for offset 0

And now Process 0 doesn't see the data Process 1 has written until fault on
that address is triggered again for some reason.

Heck, looking at the code in xfs_file_dax_write() you call
invalidate_inode_pages2() which clears the *whole* radix tree. So you have
just lost the dirtiness information for the whole file and thus fsync(2)
will not flush any data written via mmap.

> > > So we have just lost information that block 0 is mapped and needs flushing
> > > caches.
> > > 
> > > Also the fact that the consistency of data as viewed by mmap and
> > > dax_do_io() relies on invalidate_inode_pages2_range() is somewhat
> > > unexpected to me and we should document it somewhere.
> 
> I don't think it does - what, in DAX, is incoherent? If anything,
> it's the data in the direct IO buffer, not the view the mmap will
> see. i.e. the post-write invalidate is to ensure that applications
> that have mmapped the file see the data written by direct IO. That's
> not necessary with DAX.

The coherency issues are very similar to direct IO because of hole pages.
Generally we need to maintain coherency between what is pointed to from page
tables and what is really backing file data (as viewed by inode's extent
tree). So whenever block allocation / deallocation happens for the file, we
may need to update page tables which map this offset as well. For
deallocation (truncate, punch hole, ...) we take care of this under
protection of MMAPLOCK so things are fine. For allocation we currently
don't use this serialization mechanism.

> > > The question is how to best fix this. I see three options:
> > > 
> > > 1) Lock out faults during writes via exclusive i_mmap_sem. That is rather
> > > harsh but should work - we call filemap_write_and_wait() in
> > > generic_file_direct_write() so we flush out all caches for the relevant
> > > area before dropping radix tree entries.
> 
> We don't call filemap_write_and_wait() in xfs_file_dax_write()
> anymore, either. It's DAX - we don't need to flush anything to read
> the correct data, and there's nothing cached that becomes stale when
> we overwrite the destination memory.

So DAX doesn't need flushing to maintain consistent view of the data but it
does need flushing to make sure fsync(2) results in data written via mmap
to reach persistent storage. Look at the following example: Assume you
write via mmap to offsets 0-1023 in the page and via write(2) to offsets
1024-2048 in the page. Then after write(2) you still may have data in
volatile caches for the page for offsets 0-1023. So your write(2) either
has to keep the corresponding entry in the radix tree dirty (and currently
XFS's call to invalidate_inode_pages2() in xfs_file_dax_write() doesn't do
this) or you have to flush the page via filemap_write_and_wait().

								Honza
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs

  reply	other threads:[~2016-07-28  8:10 UTC|newest]

Thread overview: 30+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2016-07-27 12:07 Subtle races between DAX mmap fault and write path Jan Kara
2016-07-27 21:10 ` Ross Zwisler
2016-07-27 22:19   ` Dave Chinner
2016-07-28  8:10     ` Jan Kara [this message]
2016-07-29  2:21       ` Dave Chinner
2016-07-29 14:44         ` Dan Williams
2016-07-30  0:12           ` Dave Chinner
2016-07-30  0:53             ` Dan Williams
2016-08-01  1:46               ` Dave Chinner
2016-08-01  3:13                 ` Keith Packard
2016-08-01  4:07                   ` Dave Chinner
2016-08-01  4:39                     ` Dan Williams
2016-08-01  7:39                       ` Dave Chinner
2016-08-01 10:13             ` Boaz Harrosh
2016-08-02  0:21               ` Dave Chinner
2016-08-04 18:40                 ` Kani, Toshimitsu
2016-08-05 11:27                   ` Dave Chinner
2016-08-05 15:18                     ` Kani, Toshimitsu
2016-08-05 19:58                     ` Boylston, Brian
2016-08-08  9:26                       ` Jan Kara
2016-08-08 12:30                         ` Boylston, Brian
2016-08-08 13:11                           ` Christoph Hellwig
2016-08-08 18:28                           ` Jan Kara
2016-08-08 19:32                             ` Kani, Toshimitsu
2016-08-08 23:12                       ` Dave Chinner
2016-08-09  1:00                         ` Kani, Toshimitsu
2016-08-09  5:58                           ` Dave Chinner
2016-08-01 17:47             ` Dan Williams
2016-07-28  8:47   ` Jan Kara
2016-07-27 21:38 ` Dan Williams

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20160728081033.GC4094@quack2.suse.cz \
    --to=jack@suse.cz \
    --cc=dan.j.williams@intel.com \
    --cc=david@fromorbit.com \
    --cc=linux-ext4@vger.kernel.org \
    --cc=linux-fsdevel@vger.kernel.org \
    --cc=linux-nvdimm@lists.01.org \
    --cc=ross.zwisler@linux.intel.com \
    --cc=xfs@oss.sgi.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox