linux-fsdevel.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
From: Jan Kara <jack@suse.cz>
To: Matthew Wilcox <willy@linux.intel.com>
Cc: Dave Chinner <david@fromorbit.com>,
	Ross Zwisler <ross.zwisler@linux.intel.com>,
	linux-kernel@vger.kernel.org, Theodore Ts'o <tytso@mit.edu>,
	Alexander Viro <viro@zeniv.linux.org.uk>,
	Andreas Dilger <adilger.kernel@dilger.ca>,
	Andrew Morton <akpm@linux-foundation.org>,
	Dan Williams <dan.j.williams@intel.com>, Jan Kara <jack@suse.com>,
	linux-ext4@vger.kernel.org, linux-fsdevel@vger.kernel.org,
	linux-nvdimm@lists.01.org, xfs@oss.sgi.com
Subject: Re: [RFC PATCH] dax, ext2, ext4, XFS: fix data corruption race
Date: Tue, 26 Jan 2016 09:46:54 +0100	[thread overview]
Message-ID: <20160126084654.GT24938@quack.suse.cz> (raw)
In-Reply-To: <20160125204636.GI2948@linux.intel.com>

On Mon 25-01-16 15:46:36, Matthew Wilcox wrote:
> On Mon, Jan 25, 2016 at 09:01:07AM +1100, Dave Chinner wrote:
> > On Fri, Jan 22, 2016 at 04:06:11PM -0700, Ross Zwisler wrote:
> > > With the current DAX code the following race exists:
> > > 
> > > Process 1                	Process 2
> > > ---------			---------
> > > 
> > > __dax_fault() - read file f, index 0
> > >   get_block() -> returns hole
> > >                              	__dax_fault() - write file f, index 0
> > >                                   get_block() -> allocates blocks
> > >                                   dax_insert_mapping()
> > >   dax_load_hole()
> > >   *data corruption*
> > > 
> > > An analogous race exists between __dax_fault() loading a hole and
> > > __dax_pmd_fault() allocating a PMD DAX page and trying to insert it, and
> > > that race also ends in data corruption.
> > 
> > Ok, so why doesn't this problem exist for the normal page cache
> > insertion case with concurrent read vs write faults?  It's because
> > the write fault first does a read fault and so always the write
> > fault always has a page in the radix tree for the get_block call
> > that allocates the extents, right?
> 
> No, it's because allocation of blocks is separated from allocation of
> struct page.
> 
> > And DAX has an optimisation in the page fault part where it skips
> > the read fault part of the write fault?  And so essentially the DAX
> > write fault is missing the object (page lock of page in the radix
> > tree) that the non-DAX write fault uses to avoid this problem?
> >
> > What happens if we get rid of that DAX write fault optimisation that
> > skips the initial read fault? The write fault will always run on a
> > mapping that has a hole loaded, right?, so the race between
> > dax_load_hole() and dax_insert_mapping() goes away, because nothing
> > will be calling dax_load_hole() once the write fault is allocating
> > blocks....
> 
> So in your proposal, we'd look in the radix tree, find nothing,
> call get_block(..., 0).  If we get something back, we can insert it.
> If we hit a hole, we allocate a struct page, put it in the radix tree
> and return to user space.  If that was a write fault after all, it'll
> come back to us through the ->page_mkwrite handler where we can take the
> page lock on the allocated struct page, then call down to DAX which calls
> back through get_block to allocate?  Then DAX kicks the struct page out
> of the page cache and frees it.
> 
> That seems to work to me.  And we can get rid of pfn_mkwrite at the same
> time which seems like a win to me.

Getting rid of pfn_mkwrite() would be nice, I agree. But the above scheme
still has issues when PMD pages come into play. Or maybe to start from the
beginning: How would you like PMD faults to work? Because there is no
obvious 'struct page' to protect the allocation of 2MB worth of blocks. You
could resort to similar tricks like transparent huge pages do (compound
pages) but then the cure is IMHO worse than the disease.

There is one option: No allocation of blocks for PMD faults. That would
make code much simpler and solve the races but I'm not sure whether that is
really an acceptable loss of functionality...

								Honza
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR

      reply	other threads:[~2016-01-26  8:46 UTC|newest]

Thread overview: 9+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2016-01-22 23:06 [RFC PATCH] dax, ext2, ext4, XFS: fix data corruption race Ross Zwisler
2016-01-23  2:01 ` Matthew Wilcox
2016-01-24 22:01 ` Dave Chinner
2016-01-25 13:59   ` Jan Kara
2016-01-26 12:48     ` Matthew Wilcox
2016-01-26 13:05       ` Jan Kara
2016-01-26 14:47         ` Matthew Wilcox
2016-01-25 20:46   ` Matthew Wilcox
2016-01-26  8:46     ` Jan Kara [this message]

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20160126084654.GT24938@quack.suse.cz \
    --to=jack@suse.cz \
    --cc=adilger.kernel@dilger.ca \
    --cc=akpm@linux-foundation.org \
    --cc=dan.j.williams@intel.com \
    --cc=david@fromorbit.com \
    --cc=jack@suse.com \
    --cc=linux-ext4@vger.kernel.org \
    --cc=linux-fsdevel@vger.kernel.org \
    --cc=linux-kernel@vger.kernel.org \
    --cc=linux-nvdimm@lists.01.org \
    --cc=ross.zwisler@linux.intel.com \
    --cc=tytso@mit.edu \
    --cc=viro@zeniv.linux.org.uk \
    --cc=willy@linux.intel.com \
    --cc=xfs@oss.sgi.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).