Re: [PATCH] dax: fix deadlock in __dax_fault

All of lore.kernel.org
 help / color / mirror / Atom feed

From: Ross Zwisler <ross.zwisler@linux.intel.com>
To: Dave Chinner <david@fromorbit.com>
Cc: Ross Zwisler <ross.zwisler@linux.intel.com>,
	linux-kernel@vger.kernel.org,
	Alexander Viro <viro@zeniv.linux.org.uk>,
	Matthew Wilcox <willy@linux.intel.com>,
	linux-fsdevel@vger.kernel.org,
	Andrew Morton <akpm@linux-foundation.org>,
	Dan Williams <dan.j.williams@intel.com>,
	"Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>,
	linux-nvdimm@lists.01.org
Subject: Re: [PATCH] dax: fix deadlock in __dax_fault
Date: Thu, 24 Sep 2015 09:50:29 -0600	[thread overview]
Message-ID: <20150924155029.GA6008@linux.intel.com> (raw)
In-Reply-To: <20150924025225.GT3902@dastard>

On Thu, Sep 24, 2015 at 12:52:25PM +1000, Dave Chinner wrote:
> On Wed, Sep 23, 2015 at 02:40:00PM -0600, Ross Zwisler wrote:
> > Fix the deadlock exposed by xfstests generic/075.  Here is the sequence
> > that was causing us to deadlock:
> > 
> > 1) enter __dax_fault()
> > 2) page = find_get_page() gives us a page, so skip
> > 	i_mmap_lock_write(mapping)
> > 3) if (!buffer_mapped(&bh) && !buffer_unwritten(&bh) && !vmf->cow_page)
> > 	passes, enter this block
> > 4) if (vmf->flags & FAULT_FLAG_WRITE) fails, so do the else case and
> > 	i_mmap_unlock_write(mapping);
> > 	return dax_load_hole(mapping, page, vmf);
> > 
> > This causes us to up_write() a semaphore that we weren't holding.
> > 
> > The up_write() on a semaphore we didn't down_write() happens twice in
> > a row, and then the next time we try and i_mmap_lock_write(), we hang.
> > 
> > Signed-off-by: Ross Zwisler <ross.zwisler@linux.intel.com>
> > Reported-by: Dave Chinner <david@fromorbit.com>
> > ---
> >  fs/dax.c | 3 ++-
> >  1 file changed, 2 insertions(+), 1 deletion(-)
> > 
> > diff --git a/fs/dax.c b/fs/dax.c
> > index 7ae6df7..df1b0ac 100644
> > --- a/fs/dax.c
> > +++ b/fs/dax.c
> > @@ -405,7 +405,8 @@ int __dax_fault(struct vm_area_struct *vma, struct vm_fault *vmf,
> >  			if (error)
> >  				goto unlock;
> >  		} else {
> > -			i_mmap_unlock_write(mapping);
> > +			if (!page)
> > +				i_mmap_unlock_write(mapping);
> >  			return dax_load_hole(mapping, page, vmf);
> >  		}
> >  	}
> 
> I can't review this properly because I can't work out how this
> locking is supposed to work.  Captain, we have a Charlie Foxtrot
> situation here:
> 
> 	page = find_get_page(mapping, vmf->pgoff)
> 	if (page) {
> 		....
> 	} else {
> 		i_mmap_lock_write(mapping);
> 	}
> 
> So if there's no page in the page cache, we lock the i_mmap_lock.
> The we have the case the above patch fixes. Then later:
> 
> 	if (vmf->cow_page) {
> 		.....
> 		if (!page) {
> 			/* can fall through */
> 		}
> 		return VM_FAULT_LOCKED;
> 	}
> 
> Which means __dax_fault() can also return here with the
> i_mmap_lock_write() held. There's no documentation to indicate why
> this is valid, and only by looking about 4 function calls higher up
> the stack can I see that there's some attempt to handle this
> *specific return condition* (in do_cow_fault()). That also is
> lacking in documentation explaining the circumstances where we might
> have the i_mmap_lock_write() held and have to release it. (Not to
> mention the beautiful copy-n-waste of the unlock code, either.)
> 
> The above code in __dax_fault() is then followed by this gem:
> 
> 	/* Check we didn't race with a read fault installing a new page */
>         if (!page && major)
>                 page = find_lock_page(mapping, vmf->pgoff);
> 
> 	if (page) {
> 		/* mapping invalidation .... */
> 	}
> 	.....
> 
> 	if (!page)
> 		i_mmap_unlock_write(mapping);
> 
> Which means that if we had a race with another read fault, we'll
> remove the page from the page cache, insert the new direct mapped
> pfn into the mapping, and *then fail to unlock the i_mmap lock*.
> 
> Is this supposed to work this way? Or is it another bug?
> 
> Another difficult question this change of locking raised that I
> can't answer: is it valid to call into the filesystem via getblock()
> or complete_unwritten() while holding the i_mmap_rwsem? This puts
> filesystem transactions and locks inside the scope of i_mmap_rwsem,
> which may have impact on the fact that we already have an inode lock
> order dependency w.r.t. i_mmap_rwsem through truncate (and probably
> other paths, too).
> 
> So, please document the locking model, explain the corner cases and
> the intricacies like why *unbalanced, return value conditional
> locking* is necessary, and update the charts of lock order
> dependencies in places like mm/filemap.c, and then we might have
> some idea of how much of a train-wreck this actually is....

Yep, I saw these too, but didn't yet know how to deal with them.  We have at
similar issues with __dax_pmd_fault():

	i_mmap_lock_write(mapping);
	length = get_block(inode, block, &bh, write);
	if (length)
		return VM_FAULT_SIGBUS;

Whoops, we just took the mmap semaphore, and then we have a return that
doesn't release it.  A quick test confirms that this will deadlock the next
fault that tries to take the mmap semaphore.

I agree that we need to give the fault handling code some attention when it
comes to locking, and that we need to have better documentation.  I'll work on
this when I get some time.

In the meantime I still think it's worthwhile to take the short term fix for
the obvious generic/075 deadlock, yea?

- Ross

WARNING: multiple messages have this Message-ID (diff)

From: Ross Zwisler <ross.zwisler@linux.intel.com>
To: Dave Chinner <david@fromorbit.com>
Cc: Ross Zwisler <ross.zwisler@linux.intel.com>,
	linux-kernel@vger.kernel.org,
	Alexander Viro <viro@zeniv.linux.org.uk>,
	Matthew Wilcox <willy@linux.intel.com>,
	linux-fsdevel@vger.kernel.org,
	Andrew Morton <akpm@linux-foundation.org>,
	Dan Williams <dan.j.williams@intel.com>,
	"Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>,
	linux-nvdimm@ml01.01.org
Subject: Re: [PATCH] dax: fix deadlock in __dax_fault
Date: Thu, 24 Sep 2015 09:50:29 -0600	[thread overview]
Message-ID: <20150924155029.GA6008@linux.intel.com> (raw)
In-Reply-To: <20150924025225.GT3902@dastard>

On Thu, Sep 24, 2015 at 12:52:25PM +1000, Dave Chinner wrote:
> On Wed, Sep 23, 2015 at 02:40:00PM -0600, Ross Zwisler wrote:
> > Fix the deadlock exposed by xfstests generic/075.  Here is the sequence
> > that was causing us to deadlock:
> > 
> > 1) enter __dax_fault()
> > 2) page = find_get_page() gives us a page, so skip
> > 	i_mmap_lock_write(mapping)
> > 3) if (!buffer_mapped(&bh) && !buffer_unwritten(&bh) && !vmf->cow_page)
> > 	passes, enter this block
> > 4) if (vmf->flags & FAULT_FLAG_WRITE) fails, so do the else case and
> > 	i_mmap_unlock_write(mapping);
> > 	return dax_load_hole(mapping, page, vmf);
> > 
> > This causes us to up_write() a semaphore that we weren't holding.
> > 
> > The up_write() on a semaphore we didn't down_write() happens twice in
> > a row, and then the next time we try and i_mmap_lock_write(), we hang.
> > 
> > Signed-off-by: Ross Zwisler <ross.zwisler@linux.intel.com>
> > Reported-by: Dave Chinner <david@fromorbit.com>
> > ---
> >  fs/dax.c | 3 ++-
> >  1 file changed, 2 insertions(+), 1 deletion(-)
> > 
> > diff --git a/fs/dax.c b/fs/dax.c
> > index 7ae6df7..df1b0ac 100644
> > --- a/fs/dax.c
> > +++ b/fs/dax.c
> > @@ -405,7 +405,8 @@ int __dax_fault(struct vm_area_struct *vma, struct vm_fault *vmf,
> >  			if (error)
> >  				goto unlock;
> >  		} else {
> > -			i_mmap_unlock_write(mapping);
> > +			if (!page)
> > +				i_mmap_unlock_write(mapping);
> >  			return dax_load_hole(mapping, page, vmf);
> >  		}
> >  	}
> 
> I can't review this properly because I can't work out how this
> locking is supposed to work.  Captain, we have a Charlie Foxtrot
> situation here:
> 
> 	page = find_get_page(mapping, vmf->pgoff)
> 	if (page) {
> 		....
> 	} else {
> 		i_mmap_lock_write(mapping);
> 	}
> 
> So if there's no page in the page cache, we lock the i_mmap_lock.
> The we have the case the above patch fixes. Then later:
> 
> 	if (vmf->cow_page) {
> 		.....
> 		if (!page) {
> 			/* can fall through */
> 		}
> 		return VM_FAULT_LOCKED;
> 	}
> 
> Which means __dax_fault() can also return here with the
> i_mmap_lock_write() held. There's no documentation to indicate why
> this is valid, and only by looking about 4 function calls higher up
> the stack can I see that there's some attempt to handle this
> *specific return condition* (in do_cow_fault()). That also is
> lacking in documentation explaining the circumstances where we might
> have the i_mmap_lock_write() held and have to release it. (Not to
> mention the beautiful copy-n-waste of the unlock code, either.)
> 
> The above code in __dax_fault() is then followed by this gem:
> 
> 	/* Check we didn't race with a read fault installing a new page */
>         if (!page && major)
>                 page = find_lock_page(mapping, vmf->pgoff);
> 
> 	if (page) {
> 		/* mapping invalidation .... */
> 	}
> 	.....
> 
> 	if (!page)
> 		i_mmap_unlock_write(mapping);
> 
> Which means that if we had a race with another read fault, we'll
> remove the page from the page cache, insert the new direct mapped
> pfn into the mapping, and *then fail to unlock the i_mmap lock*.
> 
> Is this supposed to work this way? Or is it another bug?
> 
> Another difficult question this change of locking raised that I
> can't answer: is it valid to call into the filesystem via getblock()
> or complete_unwritten() while holding the i_mmap_rwsem? This puts
> filesystem transactions and locks inside the scope of i_mmap_rwsem,
> which may have impact on the fact that we already have an inode lock
> order dependency w.r.t. i_mmap_rwsem through truncate (and probably
> other paths, too).
> 
> So, please document the locking model, explain the corner cases and
> the intricacies like why *unbalanced, return value conditional
> locking* is necessary, and update the charts of lock order
> dependencies in places like mm/filemap.c, and then we might have
> some idea of how much of a train-wreck this actually is....

Yep, I saw these too, but didn't yet know how to deal with them.  We have at
similar issues with __dax_pmd_fault():

	i_mmap_lock_write(mapping);
	length = get_block(inode, block, &bh, write);
	if (length)
		return VM_FAULT_SIGBUS;

Whoops, we just took the mmap semaphore, and then we have a return that
doesn't release it.  A quick test confirms that this will deadlock the next
fault that tries to take the mmap semaphore.

I agree that we need to give the fault handling code some attention when it
comes to locking, and that we need to have better documentation.  I'll work on
this when I get some time.

In the meantime I still think it's worthwhile to take the short term fix for
the obvious generic/075 deadlock, yea?

- Ross

next prev parent reply	other threads:[~2015-09-24 15:50 UTC|newest]

Thread overview: 49+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2015-09-23 20:40 [PATCH] dax: fix deadlock in __dax_fault Ross Zwisler
2015-09-23 20:40 ` Ross Zwisler
2015-09-24  2:52 ` Dave Chinner
2015-09-24  2:52   ` Dave Chinner
2015-09-24  9:03   ` Boaz Harrosh
2015-09-24  9:03     ` Boaz Harrosh
2015-09-24 15:50   ` Ross Zwisler [this message]
2015-09-24 15:50     ` Ross Zwisler
2015-09-25  2:53     ` Dave Chinner
2015-09-25  2:53       ` Dave Chinner
2015-09-25 18:23       ` Ross Zwisler
2015-09-25 18:23         ` Ross Zwisler
2015-09-25 23:30         ` Dave Chinner
2015-09-25 23:30           ` Dave Chinner
2015-09-26  3:17       ` Ross Zwisler
2015-09-26  3:17         ` Ross Zwisler
2015-09-28  0:59         ` Dave Chinner
2015-09-28  0:59           ` Dave Chinner
2015-09-28 10:12           ` Dave Chinner
2015-09-28 10:12             ` Dave Chinner
2015-09-28 10:23             ` kbuild test robot
2015-09-28 10:23               ` kbuild test robot
2015-09-28 10:23             ` kbuild test robot
2015-09-28 10:23               ` kbuild test robot
2015-09-28 12:13           ` Dan Williams
2015-09-28 12:13             ` Dan Williams
2015-09-28 21:35             ` Dave Chinner
2015-09-28 21:35               ` Dave Chinner
2015-09-28 22:57               ` Dan Williams
2015-09-28 22:57                 ` Dan Williams
2015-09-29  2:18                 ` Dave Chinner
2015-09-29  2:18                   ` Dave Chinner
2015-09-29  3:08                   ` Dan Williams
2015-09-29  3:08                     ` Dan Williams
2015-09-29  4:19                     ` Dave Chinner
2015-09-29  4:19                       ` Dave Chinner
2015-09-28 22:40           ` Ross Zwisler
2015-09-28 22:40             ` Ross Zwisler
2015-09-29  2:44             ` Dave Chinner
2015-09-29  2:44               ` Dave Chinner
2015-09-30  1:57               ` Dave Chinner
2015-09-30  1:57                 ` Dave Chinner
2015-09-30  2:04               ` Ross Zwisler
2015-09-30  2:04                 ` Ross Zwisler
2015-09-30  2:04                 ` Ross Zwisler
2015-09-30  3:22                 ` Dave Chinner
2015-09-30  3:22                   ` Dave Chinner
2015-10-02 12:55                 ` Jan Kara
2015-10-02 12:55                   ` Jan Kara

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20150924155029.GA6008@linux.intel.com \
    --to=ross.zwisler@linux.intel.com \
    --cc=akpm@linux-foundation.org \
    --cc=dan.j.williams@intel.com \
    --cc=david@fromorbit.com \
    --cc=kirill.shutemov@linux.intel.com \
    --cc=linux-fsdevel@vger.kernel.org \
    --cc=linux-kernel@vger.kernel.org \
    --cc=linux-nvdimm@lists.01.org \
    --cc=viro@zeniv.linux.org.uk \
    --cc=willy@linux.intel.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

Be sure your reply has a Subject: header at the top and a blank line before the message body.

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.