Re: [PATCH] dax: fix deadlock in __dax_fault

All of lore.kernel.org
 help / color / mirror / Atom feed

From: Boaz Harrosh <boaz@plexistor.com>
To: Dave Chinner <david@fromorbit.com>,
	Ross Zwisler <ross.zwisler@linux.intel.com>
Cc: linux-nvdimm@lists.01.org, linux-kernel@vger.kernel.org,
	Alexander Viro <viro@zeniv.linux.org.uk>,
	linux-fsdevel@vger.kernel.org,
	Andrew Morton <akpm@linux-foundation.org>,
	"Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
Subject: Re: [PATCH] dax: fix deadlock in __dax_fault
Date: Thu, 24 Sep 2015 12:03:30 +0300	[thread overview]
Message-ID: <5603BC62.1050805@plexistor.com> (raw)
In-Reply-To: <20150924025225.GT3902@dastard>

On 09/24/2015 05:52 AM, Dave Chinner wrote:
> On Wed, Sep 23, 2015 at 02:40:00PM -0600, Ross Zwisler wrote:
>> Fix the deadlock exposed by xfstests generic/075.  Here is the sequence
>> that was causing us to deadlock:
>>
>> 1) enter __dax_fault()
>> 2) page = find_get_page() gives us a page, so skip
>> 	i_mmap_lock_write(mapping)
>> 3) if (!buffer_mapped(&bh) && !buffer_unwritten(&bh) && !vmf->cow_page)
>> 	passes, enter this block
>> 4) if (vmf->flags & FAULT_FLAG_WRITE) fails, so do the else case and
>> 	i_mmap_unlock_write(mapping);
>> 	return dax_load_hole(mapping, page, vmf);
>>
>> This causes us to up_write() a semaphore that we weren't holding.
>>
>> The up_write() on a semaphore we didn't down_write() happens twice in
>> a row, and then the next time we try and i_mmap_lock_write(), we hang.
>>
>> Signed-off-by: Ross Zwisler <ross.zwisler@linux.intel.com>
>> Reported-by: Dave Chinner <david@fromorbit.com>
>> ---
>>  fs/dax.c | 3 ++-
>>  1 file changed, 2 insertions(+), 1 deletion(-)
>>
>> diff --git a/fs/dax.c b/fs/dax.c
>> index 7ae6df7..df1b0ac 100644
>> --- a/fs/dax.c
>> +++ b/fs/dax.c
>> @@ -405,7 +405,8 @@ int __dax_fault(struct vm_area_struct *vma, struct vm_fault *vmf,
>>  			if (error)
>>  				goto unlock;
>>  		} else {
>> -			i_mmap_unlock_write(mapping);
>> +			if (!page)
>> +				i_mmap_unlock_write(mapping);
>>  			return dax_load_hole(mapping, page, vmf);
>>  		}
>>  	}
> 
> I can't review this properly because I can't work out how this
> locking is supposed to work.  Captain, we have a Charlie Foxtrot
> situation here:
> 
> 	page = find_get_page(mapping, vmf->pgoff)
> 	if (page) {
> 		....
> 	} else {
> 		i_mmap_lock_write(mapping);
> 	}
> 
> So if there's no page in the page cache, we lock the i_mmap_lock.
> The we have the case the above patch fixes. Then later:
> 
> 	if (vmf->cow_page) {
> 		.....
> 		if (!page) {
> 			/* can fall through */
> 		}
> 		return VM_FAULT_LOCKED;
> 	}
> 
> Which means __dax_fault() can also return here with the
> i_mmap_lock_write() held. There's no documentation to indicate why
> this is valid, and only by looking about 4 function calls higher up
> the stack can I see that there's some attempt to handle this
> *specific return condition* (in do_cow_fault()). That also is
> lacking in documentation explaining the circumstances where we might
> have the i_mmap_lock_write() held and have to release it. (Not to
> mention the beautiful copy-n-waste of the unlock code, either.)
> 
> The above code in __dax_fault() is then followed by this gem:
> 
> 	/* Check we didn't race with a read fault installing a new page */
>         if (!page && major)
>                 page = find_lock_page(mapping, vmf->pgoff);
> 
> 	if (page) {
> 		/* mapping invalidation .... */
> 	}
> 	.....
> 
> 	if (!page)
> 		i_mmap_unlock_write(mapping);
> 
> Which means that if we had a race with another read fault, we'll
> remove the page from the page cache, insert the new direct mapped
> pfn into the mapping, and *then fail to unlock the i_mmap lock*.
> 
> Is this supposed to work this way? Or is it another bug?
> 
> Another difficult question this change of locking raised that I
> can't answer: is it valid to call into the filesystem via getblock()
> or complete_unwritten() while holding the i_mmap_rwsem? This puts
> filesystem transactions and locks inside the scope of i_mmap_rwsem,
> which may have impact on the fact that we already have an inode lock
> order dependency w.r.t. i_mmap_rwsem through truncate (and probably
> other paths, too).
> 
> So, please document the locking model, explain the corner cases and
> the intricacies like why *unbalanced, return value conditional
> locking* is necessary, and update the charts of lock order
> dependencies in places like mm/filemap.c, and then we might have
> some idea of how much of a train-wreck this actually is....
> 

Hi hi

I hate this VM_FAULT_LOCKED + !page which means i_mmap_lock. I still
think it solves nothing and that we've done a really really bad job.

If we *easily* involve the FS in the locking here (Which btw I think XFS
already does), then this all i_mmap_lock can be avoided.

Please remind me again what race it is suppose to avoid? I get confused.

> Cheers,
> Dave.
> 

Thanks
Boaz

WARNING: multiple messages have this Message-ID (diff)

From: Boaz Harrosh <boaz@plexistor.com>
To: Dave Chinner <david@fromorbit.com>,
	Ross Zwisler <ross.zwisler@linux.intel.com>
Cc: linux-nvdimm@ml01.01.org, linux-kernel@vger.kernel.org,
	Alexander Viro <viro@zeniv.linux.org.uk>,
	linux-fsdevel@vger.kernel.org,
	Andrew Morton <akpm@linux-foundation.org>,
	"Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
Subject: Re: [PATCH] dax: fix deadlock in __dax_fault
Date: Thu, 24 Sep 2015 12:03:30 +0300	[thread overview]
Message-ID: <5603BC62.1050805@plexistor.com> (raw)
In-Reply-To: <20150924025225.GT3902@dastard>

On 09/24/2015 05:52 AM, Dave Chinner wrote:
> On Wed, Sep 23, 2015 at 02:40:00PM -0600, Ross Zwisler wrote:
>> Fix the deadlock exposed by xfstests generic/075.  Here is the sequence
>> that was causing us to deadlock:
>>
>> 1) enter __dax_fault()
>> 2) page = find_get_page() gives us a page, so skip
>> 	i_mmap_lock_write(mapping)
>> 3) if (!buffer_mapped(&bh) && !buffer_unwritten(&bh) && !vmf->cow_page)
>> 	passes, enter this block
>> 4) if (vmf->flags & FAULT_FLAG_WRITE) fails, so do the else case and
>> 	i_mmap_unlock_write(mapping);
>> 	return dax_load_hole(mapping, page, vmf);
>>
>> This causes us to up_write() a semaphore that we weren't holding.
>>
>> The up_write() on a semaphore we didn't down_write() happens twice in
>> a row, and then the next time we try and i_mmap_lock_write(), we hang.
>>
>> Signed-off-by: Ross Zwisler <ross.zwisler@linux.intel.com>
>> Reported-by: Dave Chinner <david@fromorbit.com>
>> ---
>>  fs/dax.c | 3 ++-
>>  1 file changed, 2 insertions(+), 1 deletion(-)
>>
>> diff --git a/fs/dax.c b/fs/dax.c
>> index 7ae6df7..df1b0ac 100644
>> --- a/fs/dax.c
>> +++ b/fs/dax.c
>> @@ -405,7 +405,8 @@ int __dax_fault(struct vm_area_struct *vma, struct vm_fault *vmf,
>>  			if (error)
>>  				goto unlock;
>>  		} else {
>> -			i_mmap_unlock_write(mapping);
>> +			if (!page)
>> +				i_mmap_unlock_write(mapping);
>>  			return dax_load_hole(mapping, page, vmf);
>>  		}
>>  	}
> 
> I can't review this properly because I can't work out how this
> locking is supposed to work.  Captain, we have a Charlie Foxtrot
> situation here:
> 
> 	page = find_get_page(mapping, vmf->pgoff)
> 	if (page) {
> 		....
> 	} else {
> 		i_mmap_lock_write(mapping);
> 	}
> 
> So if there's no page in the page cache, we lock the i_mmap_lock.
> The we have the case the above patch fixes. Then later:
> 
> 	if (vmf->cow_page) {
> 		.....
> 		if (!page) {
> 			/* can fall through */
> 		}
> 		return VM_FAULT_LOCKED;
> 	}
> 
> Which means __dax_fault() can also return here with the
> i_mmap_lock_write() held. There's no documentation to indicate why
> this is valid, and only by looking about 4 function calls higher up
> the stack can I see that there's some attempt to handle this
> *specific return condition* (in do_cow_fault()). That also is
> lacking in documentation explaining the circumstances where we might
> have the i_mmap_lock_write() held and have to release it. (Not to
> mention the beautiful copy-n-waste of the unlock code, either.)
> 
> The above code in __dax_fault() is then followed by this gem:
> 
> 	/* Check we didn't race with a read fault installing a new page */
>         if (!page && major)
>                 page = find_lock_page(mapping, vmf->pgoff);
> 
> 	if (page) {
> 		/* mapping invalidation .... */
> 	}
> 	.....
> 
> 	if (!page)
> 		i_mmap_unlock_write(mapping);
> 
> Which means that if we had a race with another read fault, we'll
> remove the page from the page cache, insert the new direct mapped
> pfn into the mapping, and *then fail to unlock the i_mmap lock*.
> 
> Is this supposed to work this way? Or is it another bug?
> 
> Another difficult question this change of locking raised that I
> can't answer: is it valid to call into the filesystem via getblock()
> or complete_unwritten() while holding the i_mmap_rwsem? This puts
> filesystem transactions and locks inside the scope of i_mmap_rwsem,
> which may have impact on the fact that we already have an inode lock
> order dependency w.r.t. i_mmap_rwsem through truncate (and probably
> other paths, too).
> 
> So, please document the locking model, explain the corner cases and
> the intricacies like why *unbalanced, return value conditional
> locking* is necessary, and update the charts of lock order
> dependencies in places like mm/filemap.c, and then we might have
> some idea of how much of a train-wreck this actually is....
> 

Hi hi

I hate this VM_FAULT_LOCKED + !page which means i_mmap_lock. I still
think it solves nothing and that we've done a really really bad job.

If we *easily* involve the FS in the locking here (Which btw I think XFS
already does), then this all i_mmap_lock can be avoided.

Please remind me again what race it is suppose to avoid? I get confused.

> Cheers,
> Dave.
> 

Thanks
Boaz

next prev parent reply	other threads:[~2015-09-24  9:03 UTC|newest]

Thread overview: 49+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2015-09-23 20:40 [PATCH] dax: fix deadlock in __dax_fault Ross Zwisler
2015-09-23 20:40 ` Ross Zwisler
2015-09-24  2:52 ` Dave Chinner
2015-09-24  2:52   ` Dave Chinner
2015-09-24  9:03   ` Boaz Harrosh [this message]
2015-09-24  9:03     ` Boaz Harrosh
2015-09-24 15:50   ` Ross Zwisler
2015-09-24 15:50     ` Ross Zwisler
2015-09-25  2:53     ` Dave Chinner
2015-09-25  2:53       ` Dave Chinner
2015-09-25 18:23       ` Ross Zwisler
2015-09-25 18:23         ` Ross Zwisler
2015-09-25 23:30         ` Dave Chinner
2015-09-25 23:30           ` Dave Chinner
2015-09-26  3:17       ` Ross Zwisler
2015-09-26  3:17         ` Ross Zwisler
2015-09-28  0:59         ` Dave Chinner
2015-09-28  0:59           ` Dave Chinner
2015-09-28 10:12           ` Dave Chinner
2015-09-28 10:12             ` Dave Chinner
2015-09-28 10:23             ` kbuild test robot
2015-09-28 10:23               ` kbuild test robot
2015-09-28 10:23             ` kbuild test robot
2015-09-28 10:23               ` kbuild test robot
2015-09-28 12:13           ` Dan Williams
2015-09-28 12:13             ` Dan Williams
2015-09-28 21:35             ` Dave Chinner
2015-09-28 21:35               ` Dave Chinner
2015-09-28 22:57               ` Dan Williams
2015-09-28 22:57                 ` Dan Williams
2015-09-29  2:18                 ` Dave Chinner
2015-09-29  2:18                   ` Dave Chinner
2015-09-29  3:08                   ` Dan Williams
2015-09-29  3:08                     ` Dan Williams
2015-09-29  4:19                     ` Dave Chinner
2015-09-29  4:19                       ` Dave Chinner
2015-09-28 22:40           ` Ross Zwisler
2015-09-28 22:40             ` Ross Zwisler
2015-09-29  2:44             ` Dave Chinner
2015-09-29  2:44               ` Dave Chinner
2015-09-30  1:57               ` Dave Chinner
2015-09-30  1:57                 ` Dave Chinner
2015-09-30  2:04               ` Ross Zwisler
2015-09-30  2:04                 ` Ross Zwisler
2015-09-30  2:04                 ` Ross Zwisler
2015-09-30  3:22                 ` Dave Chinner
2015-09-30  3:22                   ` Dave Chinner
2015-10-02 12:55                 ` Jan Kara
2015-10-02 12:55                   ` Jan Kara

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=5603BC62.1050805@plexistor.com \
    --to=boaz@plexistor.com \
    --cc=akpm@linux-foundation.org \
    --cc=david@fromorbit.com \
    --cc=kirill.shutemov@linux.intel.com \
    --cc=linux-fsdevel@vger.kernel.org \
    --cc=linux-kernel@vger.kernel.org \
    --cc=linux-nvdimm@lists.01.org \
    --cc=ross.zwisler@linux.intel.com \
    --cc=viro@zeniv.linux.org.uk \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

Be sure your reply has a Subject: header at the top and a blank line before the message body.

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.