From: Dave Chinner <david@fromorbit.com>
To: Ross Zwisler <ross.zwisler@linux.intel.com>,
linux-kernel@vger.kernel.org,
Alexander Viro <viro@zeniv.linux.org.uk>,
Matthew Wilcox <willy@linux.intel.com>,
linux-fsdevel@vger.kernel.org,
Andrew Morton <akpm@linux-foundation.org>,
Dan Williams <dan.j.williams@intel.com>,
"Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>,
linux-nvdimm@lists.01.org, Jan Kara <jack@suse.cz>
Subject: Re: [PATCH] dax: fix deadlock in __dax_fault
Date: Tue, 29 Sep 2015 12:44:58 +1000
Message-ID: <20150929024458.GC27164@dastard>
In-Reply-To: <20150928224001.GA21955@linux.intel.com>
On Mon, Sep 28, 2015 at 04:40:01PM -0600, Ross Zwisler wrote:
> On Mon, Sep 28, 2015 at 10:59:04AM +1000, Dave Chinner wrote:
> > On Fri, Sep 25, 2015 at 09:17:45PM -0600, Ross Zwisler wrote:
> <>
> > In reality, the required DAX page fault vs truncate serialisation is
> > provided for XFS by the XFS_MMAPLOCK_* inode locking that is done in
> > the fault, mkwrite and filesystem extent manipulation paths. There
> > is no way this sort of exclusion can be done in the mm/ subsystem as
> > it simply does not have the context to be able to provide the
> > necessary serialisation. Every filesystem that wants to use DAX
> > needs to provide this external page fault serialisation, and in
> > doing so will also protect its hole punch/extent swap/shift
> > operations under normal operation against racing mmap access....
> >
> > IOWs, for DAX this needs to be fixed in ext4, not the mm/ subsystem.
>
> So is it your belief that XFS already has correct locking in place to ensure
> that we don't hit these races? I see XFS taking XFS_MMAPLOCK_SHARED before it
> calls __dax_fault() in both xfs_filemap_page_mkwrite() (via __dax_mkwrite) and
> in xfs_filemap_fault().
>
> XFS takes XFS_MMAPLOCK_EXCL before a truncate in xfs_vn_setattr() - I haven't
> found the generic hole punching code yet, but I assume it takes
> XFS_MMAPLOCK_EXCL as well.
There is no generic hole punching. See xfs_file_fallocate(), where
most fallocate() based operations are protected; xfs_ioc_space(),
where all the XFS ioctl based extent manipulations are protected;
and xfs_swap_extents(), where online defrag extent swaps are
protected. And we'll add the same locking to any future operations
that directly manipulate extent mappings.
> Meaning, is the work that we need to do around extent vs page fault locking
> basically adding equivalent locking to ext4 and ext2 and removing the attempts
> at locking from dax.c?
Yup. I'm not game to touch ext4 locking, though.
>
> > > 4) Test all changes with xfstests using both xfs & ext4, using lockdep.
> > >
> > > Did I miss any issues, or does this path not solve one of them somehow?
> > >
> > > Does this sound like a reasonable path forward for v4.3? Dave and Jan, can
> > > you guys provide guidance and code reviews for the XFS and ext4 bits?
> >
> > IMO, it's way too much to get into 4.3. I'd much prefer we revert
> > the bad changes in 4.3, and then work towards fixing this for the
> > 4.4 merge window. If someone needs this for 4.3, then they can
> > backport the 4.4 code to 4.3-stable.
> >
> > The "fast and loose and fix it later" development model does not
> > work for persistent storage algorithms; DAX is storage - not memory
> > management - and so we need to treat it as such.
>
> Okay. To get our locking back to v4.2 levels here are the two commits I think
> we need to look at:
>
> commit 843172978bb9 ("dax: fix race between simultaneous faults")
> commit 46c043ede471 ("mm: take i_mmap_lock in unmap_mapping_range() for DAX")
Already testing a kernel with those reverted. My current DAX patch
stack is (bottom is first commit in stack):
f672ae4 xfs: add ->pfn_mkwrite support for DAX
6855c23 xfs: remove DAX complete_unwritten callback
e074bdf Revert "dax: fix race between simultaneous faults"
8ba0157 Revert "mm: take i_mmap_lock in unmap_mapping_range() for DAX"
a2ce6a5 xfs: DAX does not use IO completion callbacks
246c52a xfs: update size during allocation for DAX
9d10e7b xfs: Don't use unwritten extents for DAX
eaef807 xfs: factor out sector mapping.
e7f2d50 xfs: introduce per-inode DAX enablement
BTW, another problem that needs fixing is that the pfn_mkwrite
code needs to check that the fault is still within EOF, like
__dax_fault does. i.e. the top patch in the series adds this
to xfs_filemap_pfn_mkwrite() instead of using dax_pfn_mkwrite():
....
+	/* check if the faulting page hasn't raced with truncate */
+	xfs_ilock(ip, XFS_MMAPLOCK_SHARED);
+	size = (i_size_read(inode) + PAGE_SIZE - 1) >> PAGE_SHIFT;
+	if (vmf->pgoff >= size)
+		ret = VM_FAULT_SIGBUS;
+	xfs_iunlock(ip, XFS_MMAPLOCK_SHARED);
+	sb_end_pagefault(inode->i_sb);
In other words, dax_pfn_mkwrite() doesn't work correctly when racing
with truncate, which is another way that ext2/ext4 are currently
broken.
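To make the arithmetic in that EOF check concrete, here is a small
userspace sketch of just the size calculation and comparison. The
sketch_* names and the fixed 4 KiB page size are assumptions for
illustration (the kernel's PAGE_SHIFT is per-architecture), and the
locking around the check is deliberately left out.

```c
#include <stdbool.h>
#include <stdint.h>

/* Assumed 4 KiB pages, for illustration only. */
#define SKETCH_PAGE_SHIFT 12
#define SKETCH_PAGE_SIZE  (UINT64_C(1) << SKETCH_PAGE_SHIFT)

/* Number of pages needed to cover i_size bytes, rounding up. */
static uint64_t sketch_size_in_pages(uint64_t i_size)
{
	return (i_size + SKETCH_PAGE_SIZE - 1) >> SKETCH_PAGE_SHIFT;
}

/*
 * True if page offset pgoff lies entirely beyond EOF, i.e. the case
 * where the fault handler would return VM_FAULT_SIGBUS.
 */
static bool sketch_fault_beyond_eof(uint64_t pgoff, uint64_t i_size)
{
	return pgoff >= sketch_size_in_pages(i_size);
}
```

Note that a page only partially covered by the file still counts as
inside EOF because of the round-up, so a 4097 byte file has two valid
page offsets (0 and 1) under the 4 KiB assumption.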
> On an unrelated note, while wandering through the XFS code I found the
> following lock ordering documented above xfs_ilock():
>
> * Basic locking order:
> *
> * i_iolock -> i_mmap_lock -> page_lock -> i_ilock
> *
> * mmap_sem locking order:
> *
> * i_iolock -> page lock -> mmap_sem
> * mmap_sem -> i_mmap_lock -> page_lock
>
> I noticed that page_lock and i_mmap_lock are in different places in the
> ordering depending on the presence or absence of mmap_sem. Does this not open
> us up to a lock ordering inversion?
Typo, not picked up in review (note the missing "_").
- * i_iolock -> page lock -> mmap_sem
+ * i_iolock -> page fault -> mmap_sem
Thanks for the heads-up on that.
Cheers,
Dave.
--
Dave Chinner
david@fromorbit.com