From: Dave Chinner <david@fromorbit.com>
To: Ross Zwisler <ross.zwisler@linux.intel.com>,
	linux-kernel@vger.kernel.org,
	Alexander Viro <viro@zeniv.linux.org.uk>,
	Matthew Wilcox <willy@linux.intel.com>,
	linux-fsdevel@vger.kernel.org,
	Andrew Morton <akpm@linux-foundation.org>,
	Dan Williams <dan.j.williams@intel.com>,
	"Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>,
	linux-nvdimm@lists.01.org, Jan Kara <jack@suse.cz>
Subject: Re: [PATCH] dax: fix deadlock in __dax_fault
Date: Tue, 29 Sep 2015 12:44:58 +1000
Message-ID: <20150929024458.GC27164@dastard>
In-Reply-To: <20150928224001.GA21955@linux.intel.com>

On Mon, Sep 28, 2015 at 04:40:01PM -0600, Ross Zwisler wrote:
> On Mon, Sep 28, 2015 at 10:59:04AM +1000, Dave Chinner wrote:
> > On Fri, Sep 25, 2015 at 09:17:45PM -0600, Ross Zwisler wrote:
> <>
> > In reality, the required DAX page fault vs truncate serialisation is
> > provided for XFS by the XFS_MMAPLOCK_* inode locking that is done in
> > the fault, mkwrite and filesystem extent manipulation paths. There
> > is no way this sort of exclusion can be done in the mm/ subsystem as
> > it simply does not have the context to be able to provide the
> > necessary serialisation.  Every filesystem that wants to use DAX
> > needs to provide this external page fault serialisation, and in
> > doing so will also protect its hole punch/extent swap/shift
> > operations under normal operation against racing mmap access....
> > 
> > IOWs, for DAX this needs to be fixed in ext4, not the mm/ subsystem.
> 
> So is it your belief that XFS already has correct locking in place to ensure
> that we don't hit these races?  I see XFS taking XFS_MMAPLOCK_SHARED before it
> calls __dax_fault() in both xfs_filemap_page_mkwrite() (via __dax_mkwrite) and
> in xfs_filemap_fault().
> 
> XFS takes XFS_MMAPLOCK_EXCL before a truncate in xfs_vn_setattr() - I haven't
> found the generic hole punching code yet, but I assume it takes
> XFS_MMAPLOCK_EXCL as well.

There is no generic hole punching. See xfs_file_fallocate(), where
most fallocate()-based operations are protected; xfs_ioc_space(),
where all the XFS ioctl-based extent manipulations are protected;
and xfs_swap_extents(), where online defrag extent swaps are
protected. And we'll add it to any future operations that directly
manipulate extent mappings.

> Meaning, is the work that we need to do around extent vs page fault locking
> basically adding equivalent locking to ext4 and ext2 and removing the attempts
> at locking from dax.c?

Yup. I'm not game to touch ext4 locking, though.

> 
> > > 4) Test all changes with xfstests using both xfs & ext4, using lockdep.
> > > 
> > > Did I miss any issues, or does this path not solve one of them somehow?
> > > 
> > > Does this sound like a reasonable path forward for v4.3?  Dave, and Jan, can
> > > you guys provide guidance and code reviews for the XFS and ext4 bits?
> > 
> > IMO, it's way too much to get into 4.3. I'd much prefer we revert
> > the bad changes in 4.3, and then work towards fixing this for the
> > 4.4 merge window. If someone needs this for 4.3, then they can
> > backport the 4.4 code to 4.3-stable.
> > 
> > The "fast and loose and fix it later" development model does not
> > work for persistent storage algorithms; DAX is storage - not memory
> > management - and so we need to treat it as such.
> 
> Okay.  To get our locking back to v4.2 levels here are the two commits I think
> we need to look at:
> 
> commit 843172978bb9 ("dax: fix race between simultaneous faults")
> commit 46c043ede471 ("mm: take i_mmap_lock in unmap_mapping_range() for DAX")

Already testing a kernel with those reverted. My current DAX patch
stack is (bottom is first commit in stack):

f672ae4 xfs: add ->pfn_mkwrite support for DAX
6855c23 xfs: remove DAX complete_unwritten callback
e074bdf Revert "dax: fix race between simultaneous faults"
8ba0157 Revert "mm: take i_mmap_lock in unmap_mapping_range() for DAX"
a2ce6a5 xfs: DAX does not use IO completion callbacks
246c52a xfs: update size during allocation for DAX
9d10e7b xfs: Don't use unwritten extents for DAX
eaef807 xfs: factor out sector mapping.
e7f2d50 xfs: introduce per-inode DAX enablement

BTW, another problem that needs fixing is that the pfn_mkwrite
code needs to check that the fault is still within EOF, like
__dax_fault does. i.e. the top patch in the series adds this
to xfs_filemap_pfn_mkwrite() instead of using dax_pfn_mkwrite():

....
+       /* check if the faulting page hasn't raced with truncate */
+       xfs_ilock(ip, XFS_MMAPLOCK_SHARED);
+       size = (i_size_read(inode) + PAGE_SIZE - 1) >> PAGE_SHIFT;
+       if (vmf->pgoff >= size)
+               ret = VM_FAULT_SIGBUS;
+       xfs_iunlock(ip, XFS_MMAPLOCK_SHARED);
+       sb_end_pagefault(inode->i_sb);

i.e. dax_pfn_mkwrite() doesn't work correctly when racing with
truncate (i.e. another way that ext2/ext4 are currently broken).
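
The arithmetic in that check can be tried in isolation. This is a
standalone sketch, not the kernel code: the SKETCH_* constants assume
4 KiB pages, and both function names are hypothetical. It rounds the
file size up to whole pages and treats any page index at or past that
count as beyond EOF.

```c
#include <stdbool.h>
#include <stdint.h>

/* Assumed 4 KiB pages; the kernel gets these from the architecture. */
#define SKETCH_PAGE_SHIFT 12
#define SKETCH_PAGE_SIZE  (UINT64_C(1) << SKETCH_PAGE_SHIFT)

/* Number of pages needed to cover i_size bytes, rounding up. */
static uint64_t sketch_size_in_pages(uint64_t i_size)
{
	return (i_size + SKETCH_PAGE_SIZE - 1) >> SKETCH_PAGE_SHIFT;
}

/* True when the faulting page index is at or beyond EOF. */
static bool sketch_fault_beyond_eof(uint64_t pgoff, uint64_t i_size)
{
	return pgoff >= sketch_size_in_pages(i_size);
}
```

For example, a 4097 byte file covers two pages, so page index 1 is
still valid; after a truncate to 4096 bytes the same index is beyond
EOF and the fault would have to fail.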

> On an unrelated note, while wandering through the XFS code I found the
> following lock ordering documented above xfs_ilock():
> 
>  * Basic locking order:
>  *
>  * i_iolock -> i_mmap_lock -> page_lock -> i_ilock
>  *
>  * mmap_sem locking order:
>  *
>  * i_iolock -> page lock -> mmap_sem
>  * mmap_sem -> i_mmap_lock -> page_lock
> 
> I noticed that page_lock and i_mmap_lock are in different places in the
> ordering depending on the presence or absence of mmap_sem.  Does this not open
> us up to a lock ordering inversion?

Typo, not picked up in review (note the missing "_").

- * i_iolock -> page lock -> mmap_sem
+ * i_iolock -> page fault -> mmap_sem

Thanks for the heads-up on that.

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com
