From: Ross Zwisler <ross.zwisler@linux.intel.com>
To: Dave Chinner <david@fromorbit.com>
Cc: Ross Zwisler <ross.zwisler@linux.intel.com>,
linux-kernel@vger.kernel.org,
Alexander Viro <viro@zeniv.linux.org.uk>,
Matthew Wilcox <willy@linux.intel.com>,
linux-fsdevel@vger.kernel.org,
Andrew Morton <akpm@linux-foundation.org>,
Dan Williams <dan.j.williams@intel.com>,
"Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>,
linux-nvdimm@lists.01.org
Subject: Re: [PATCH] dax: fix deadlock in __dax_fault
Date: Thu, 24 Sep 2015 09:50:29 -0600 [thread overview]
Message-ID: <20150924155029.GA6008@linux.intel.com> (raw)
In-Reply-To: <20150924025225.GT3902@dastard>
On Thu, Sep 24, 2015 at 12:52:25PM +1000, Dave Chinner wrote:
> On Wed, Sep 23, 2015 at 02:40:00PM -0600, Ross Zwisler wrote:
> > Fix the deadlock exposed by xfstests generic/075. Here is the sequence
> > that was causing us to deadlock:
> >
> > 1) enter __dax_fault()
> > 2) page = find_get_page() gives us a page, so skip
> > i_mmap_lock_write(mapping)
> > 3) if (!buffer_mapped(&bh) && !buffer_unwritten(&bh) && !vmf->cow_page)
> > passes, enter this block
> > 4) if (vmf->flags & FAULT_FLAG_WRITE) fails, so do the else case and
> > i_mmap_unlock_write(mapping);
> > return dax_load_hole(mapping, page, vmf);
> >
> > This causes us to up_write() a semaphore that we weren't holding.
> >
> > The up_write() on a semaphore we didn't down_write() happens twice in
> > a row, and then the next time we try and i_mmap_lock_write(), we hang.
> >
> > Signed-off-by: Ross Zwisler <ross.zwisler@linux.intel.com>
> > Reported-by: Dave Chinner <david@fromorbit.com>
> > ---
> > fs/dax.c | 3 ++-
> > 1 file changed, 2 insertions(+), 1 deletion(-)
> >
> > diff --git a/fs/dax.c b/fs/dax.c
> > index 7ae6df7..df1b0ac 100644
> > --- a/fs/dax.c
> > +++ b/fs/dax.c
> > @@ -405,7 +405,8 @@ int __dax_fault(struct vm_area_struct *vma, struct vm_fault *vmf,
> > if (error)
> > goto unlock;
> > } else {
> > - i_mmap_unlock_write(mapping);
> > + if (!page)
> > + i_mmap_unlock_write(mapping);
> > return dax_load_hole(mapping, page, vmf);
> > }
> > }
>
> I can't review this properly because I can't work out how this
> locking is supposed to work. Captain, we have a Charlie Foxtrot
> situation here:
>
> page = find_get_page(mapping, vmf->pgoff)
> if (page) {
> ....
> } else {
> i_mmap_lock_write(mapping);
> }
>
> So if there's no page in the page cache, we lock the i_mmap_lock.
> The we have the case the above patch fixes. Then later:
>
> if (vmf->cow_page) {
> .....
> if (!page) {
> /* can fall through */
> }
> return VM_FAULT_LOCKED;
> }
>
> Which means __dax_fault() can also return here with the
> i_mmap_lock_write() held. There's no documentation to indicate why
> this is valid, and only by looking about 4 function calls higher up
> the stack can I see that there's some attempt to handle this
> *specific return condition* (in do_cow_fault()). That also is
> lacking in documentation explaining the circumstances where we might
> have the i_mmap_lock_write() held and have to release it. (Not to
> mention the beautiful copy-n-waste of the unlock code, either.)
>
> The above code in __dax_fault() is then followed by this gem:
>
> /* Check we didn't race with a read fault installing a new page */
> if (!page && major)
> page = find_lock_page(mapping, vmf->pgoff);
>
> if (page) {
> /* mapping invalidation .... */
> }
> .....
>
> if (!page)
> i_mmap_unlock_write(mapping);
>
> Which means that if we had a race with another read fault, we'll
> remove the page from the page cache, insert the new direct mapped
> pfn into the mapping, and *then fail to unlock the i_mmap lock*.
>
> Is this supposed to work this way? Or is it another bug?
>
> Another difficult question this change of locking raised that I
> can't answer: is it valid to call into the filesystem via getblock()
> or complete_unwritten() while holding the i_mmap_rwsem? This puts
> filesystem transactions and locks inside the scope of i_mmap_rwsem,
> which may have impact on the fact that we already have an inode lock
> order dependency w.r.t. i_mmap_rwsem through truncate (and probably
> other paths, too).
>
> So, please document the locking model, explain the corner cases and
> the intricacies like why *unbalanced, return value conditional
> locking* is necessary, and update the charts of lock order
> dependencies in places like mm/filemap.c, and then we might have
> some idea of how much of a train-wreck this actually is....
Yep, I saw these too, but didn't yet know how to deal with them. We have at
similar issues with __dax_pmd_fault():
i_mmap_lock_write(mapping);
length = get_block(inode, block, &bh, write);
if (length)
return VM_FAULT_SIGBUS;
Whoops, we just took the mmap semaphore, and then we have a return that
doesn't release it. A quick test confirms that this will deadlock the next
fault that tries to take the mmap semaphore.
I agree that we need to give the fault handling code some attention when it
comes to locking, and that we need to have better documentation. I'll work on
this when I get some time.
In the meantime I still think it's worthwhile to take the short term fix for
the obvious generic/075 deadlock, yea?
- Ross
next prev parent reply other threads:[~2015-09-24 15:50 UTC|newest]
Thread overview: 24+ messages / expand[flat|nested] mbox.gz Atom feed top
2015-09-23 20:40 [PATCH] dax: fix deadlock in __dax_fault Ross Zwisler
2015-09-24 2:52 ` Dave Chinner
2015-09-24 9:03 ` Boaz Harrosh
2015-09-24 15:50 ` Ross Zwisler [this message]
2015-09-25 2:53 ` Dave Chinner
2015-09-25 18:23 ` Ross Zwisler
2015-09-25 23:30 ` Dave Chinner
2015-09-26 3:17 ` Ross Zwisler
2015-09-28 0:59 ` Dave Chinner
2015-09-28 10:12 ` Dave Chinner
2015-09-28 10:23 ` kbuild test robot
2015-09-28 10:23 ` kbuild test robot
2015-09-28 12:13 ` Dan Williams
2015-09-28 21:35 ` Dave Chinner
2015-09-28 22:57 ` Dan Williams
2015-09-29 2:18 ` Dave Chinner
2015-09-29 3:08 ` Dan Williams
2015-09-29 4:19 ` Dave Chinner
2015-09-28 22:40 ` Ross Zwisler
2015-09-29 2:44 ` Dave Chinner
2015-09-30 1:57 ` Dave Chinner
2015-09-30 2:04 ` Ross Zwisler
2015-09-30 3:22 ` Dave Chinner
2015-10-02 12:55 ` Jan Kara
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=20150924155029.GA6008@linux.intel.com \
--to=ross.zwisler@linux.intel.com \
--cc=akpm@linux-foundation.org \
--cc=dan.j.williams@intel.com \
--cc=david@fromorbit.com \
--cc=kirill.shutemov@linux.intel.com \
--cc=linux-fsdevel@vger.kernel.org \
--cc=linux-kernel@vger.kernel.org \
--cc=linux-nvdimm@lists.01.org \
--cc=viro@zeniv.linux.org.uk \
--cc=willy@linux.intel.com \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).