Re: [PATCH, RFC 2/2] dax: use range_lock instead of i_mmap_lock

All of lore.kernel.org
 help / color / mirror / Atom feed

From: "Kirill A. Shutemov" <kirill@shutemov.name>
To: Boaz Harrosh <boaz@plexistor.com>
Cc: Jan Kara <jack@suse.cz>, Dave Chinner <david@fromorbit.com>,
	"Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>,
	Andrew Morton <akpm@linux-foundation.org>,
	Matthew Wilcox <matthew.r.wilcox@intel.com>,
	linux-mm@kvack.org, linux-fsdevel@vger.kernel.org,
	linux-kernel@vger.kernel.org, Davidlohr Bueso <dbueso@suse.de>,
	Theodore Ts'o <tytso@mit.edu>
Subject: Re: [PATCH, RFC 2/2] dax: use range_lock instead of i_mmap_lock
Date: Tue, 11 Aug 2015 23:26:39 +0300	[thread overview]
Message-ID: <20150811202639.GA1408@node.dhcp.inet.fi> (raw)
In-Reply-To: <55CA2008.7070702@plexistor.com>

On Tue, Aug 11, 2015 at 07:17:12PM +0300, Boaz Harrosh wrote:
> On 08/11/2015 06:28 PM, Kirill A. Shutemov wrote:
> > We also used lock_page() to make sure we shoot out all pages as we don't
> > exclude page faults during truncate. Consider this race:
> > 
> > 	<fault>			<truncate>
> > 	get_block
> > 	check i_size
> >     				update i_size
> > 				unmap
> > 	setup pte
> > 
> 
> Please consider this senario then:
> 
>  	<fault>			<truncate>
> 	read_lock(inode)
> 
>  	get_block
>  	check i_size
> 	
> 	read_unlock(inode)
> 
> 				write_lock(inode)
> 
>      				update i_size
> 				* remove allocated blocks
>  				unmap
> 
> 				write_unlock(inode)
> 
>  	setup pte
> 
> IS what you suppose to do in xfs

Do you realize that you describe a race? :-P

Exactly in this scenario pfn your pte point to is not belong to the file
anymore. Have fun.

> > With normal page cache we make sure that all pages beyond i_size is
> > dropped using lock_page() in truncate_inode_pages_range().
> > 
> 
> Yes there is no truncate_inode_pages_range() in DAX again radix tree is
> empty.
> 
> Please do you have a reproducer I would like to see this race and also
> experiment with xfs (I guess you saw it in ext4)

I don't. And I don't see how race like above can be FS-specific. All
critical points in generic code.
 
> > For DAX we need a way to stop all page faults to the pgoff range before
> > doing unmap.
> > 
> 
> Why ?

Because you can end up with ptes pointing to pfns which fs consider not be
part of the file.

	<truncate>		<fault>
	unmap..
				fault in pfn which unmap already unmapped
	..continue unmap

> >> Because with DAX there is no inode->mapping "mapping" at all. You have the call
> >> into the FS with get_block() to replace "holes" (zero pages) with real allocated
> >> blocks, on WRITE faults, but this conversion should be protected inside the FS
> >> already. Then there is the atomic exchange of the PTE which is fine.
> >> (And vis versa with holes mapping and writes)
> > 
> > Having unmap_mapping_range() in PMD fault handling is very unfortunate.
> > Go to rmap just to solve page fault is very wrong.
> > BTW, we need to do it in write path too.
> > 
> 
> Only the write path and only when we exchange a zero-page (hole) with
> a new allocated (written to) page. Both write fault and/or write-path

No. Always on new BH. We don't have anything (except rmap) to find out if
any other process have zero page for the pgoff.
 
> > I'm not convinced that all these "let's avoid backing storage allocation"
> > in DAX code is not layering violation. I think the right place to solve
> > this is filesystem. And we have almost all required handles for this in
> > place.  We only need to change vm_ops->page_mkwrite() interface to be able
> > to return different page than what was given on input.
> > 
> 
> What? there is no page returned for DAX page_mkwrite(), it is all insert_mixed
> with direct pmd.

That was bogus idea, please ignore.

> > Hm. Where does XFS take this read-write lock in fault path?
> > 
> > IIUC, truncation vs. page fault serialization relies on i_size being
> > updated before doing truncate_pagecache() and checking i_size under
> > page_lock() on fault side. We don't have i_size fence for punch hole.
> > 
> 
> again truncate_pagecache() is NONE.
> And yes the read-write locking will protect punch-hole just as truncate
> see my locking senario above.

Do you mean as racy as your truncate scenario? ;-P

-- 
 Kirill A. Shutemov

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

WARNING: multiple messages have this Message-ID (diff)

From: "Kirill A. Shutemov" <kirill@shutemov.name>
To: Boaz Harrosh <boaz@plexistor.com>
Cc: Jan Kara <jack@suse.cz>, Dave Chinner <david@fromorbit.com>,
	"Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>,
	Andrew Morton <akpm@linux-foundation.org>,
	Matthew Wilcox <matthew.r.wilcox@intel.com>,
	linux-mm@kvack.org, linux-fsdevel@vger.kernel.org,
	linux-kernel@vger.kernel.org, Davidlohr Bueso <dbueso@suse.de>,
	"Theodore Ts'o" <tytso@mit.edu>
Subject: Re: [PATCH, RFC 2/2] dax: use range_lock instead of i_mmap_lock
Date: Tue, 11 Aug 2015 23:26:39 +0300	[thread overview]
Message-ID: <20150811202639.GA1408@node.dhcp.inet.fi> (raw)
In-Reply-To: <55CA2008.7070702@plexistor.com>

On Tue, Aug 11, 2015 at 07:17:12PM +0300, Boaz Harrosh wrote:
> On 08/11/2015 06:28 PM, Kirill A. Shutemov wrote:
> > We also used lock_page() to make sure we shoot out all pages as we don't
> > exclude page faults during truncate. Consider this race:
> > 
> > 	<fault>			<truncate>
> > 	get_block
> > 	check i_size
> >     				update i_size
> > 				unmap
> > 	setup pte
> > 
> 
> Please consider this senario then:
> 
>  	<fault>			<truncate>
> 	read_lock(inode)
> 
>  	get_block
>  	check i_size
> 	
> 	read_unlock(inode)
> 
> 				write_lock(inode)
> 
>      				update i_size
> 				* remove allocated blocks
>  				unmap
> 
> 				write_unlock(inode)
> 
>  	setup pte
> 
> IS what you suppose to do in xfs

Do you realize that you describe a race? :-P

Exactly in this scenario pfn your pte point to is not belong to the file
anymore. Have fun.

> > With normal page cache we make sure that all pages beyond i_size is
> > dropped using lock_page() in truncate_inode_pages_range().
> > 
> 
> Yes there is no truncate_inode_pages_range() in DAX again radix tree is
> empty.
> 
> Please do you have a reproducer I would like to see this race and also
> experiment with xfs (I guess you saw it in ext4)

I don't. And I don't see how race like above can be FS-specific. All
critical points in generic code.
 
> > For DAX we need a way to stop all page faults to the pgoff range before
> > doing unmap.
> > 
> 
> Why ?

Because you can end up with ptes pointing to pfns which fs consider not be
part of the file.

	<truncate>		<fault>
	unmap..
				fault in pfn which unmap already unmapped
	..continue unmap

> >> Because with DAX there is no inode->mapping "mapping" at all. You have the call
> >> into the FS with get_block() to replace "holes" (zero pages) with real allocated
> >> blocks, on WRITE faults, but this conversion should be protected inside the FS
> >> already. Then there is the atomic exchange of the PTE which is fine.
> >> (And vis versa with holes mapping and writes)
> > 
> > Having unmap_mapping_range() in PMD fault handling is very unfortunate.
> > Go to rmap just to solve page fault is very wrong.
> > BTW, we need to do it in write path too.
> > 
> 
> Only the write path and only when we exchange a zero-page (hole) with
> a new allocated (written to) page. Both write fault and/or write-path

No. Always on new BH. We don't have anything (except rmap) to find out if
any other process have zero page for the pgoff.
 
> > I'm not convinced that all these "let's avoid backing storage allocation"
> > in DAX code is not layering violation. I think the right place to solve
> > this is filesystem. And we have almost all required handles for this in
> > place.  We only need to change vm_ops->page_mkwrite() interface to be able
> > to return different page than what was given on input.
> > 
> 
> What? there is no page returned for DAX page_mkwrite(), it is all insert_mixed
> with direct pmd.

That was bogus idea, please ignore.

> > Hm. Where does XFS take this read-write lock in fault path?
> > 
> > IIUC, truncation vs. page fault serialization relies on i_size being
> > updated before doing truncate_pagecache() and checking i_size under
> > page_lock() on fault side. We don't have i_size fence for punch hole.
> > 
> 
> again truncate_pagecache() is NONE.
> And yes the read-write locking will protect punch-hole just as truncate
> see my locking senario above.

Do you mean as racy as your truncate scenario? ;-P

-- 
 Kirill A. Shutemov

next prev parent reply	other threads:[~2015-08-11 20:26 UTC|newest]

Thread overview: 36+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2015-08-10 15:14 [PATCH, RFC 0/2] Recover some scalability for DAX Kirill A. Shutemov
2015-08-10 15:14 ` Kirill A. Shutemov
2015-08-10 15:14 ` [PATCH, RFC 1/2] lib: Implement range locks Kirill A. Shutemov
2015-08-10 15:14   ` Kirill A. Shutemov
2015-08-10 15:14 ` [PATCH, RFC 2/2] dax: use range_lock instead of i_mmap_lock Kirill A. Shutemov
2015-08-10 15:14   ` Kirill A. Shutemov
2015-08-11  8:19   ` Jan Kara
2015-08-11  8:19     ` Jan Kara
2015-08-11  9:37     ` Dave Chinner
2015-08-11  9:37       ` Dave Chinner
2015-08-11 11:09       ` Boaz Harrosh
2015-08-11 11:09         ` Boaz Harrosh
2015-08-11 12:03       ` Kirill A. Shutemov
2015-08-11 12:03         ` Kirill A. Shutemov
2015-08-11 13:50       ` Jan Kara
2015-08-11 13:50         ` Jan Kara
2015-08-11 14:31         ` Boaz Harrosh
2015-08-11 14:31           ` Boaz Harrosh
2015-08-11 15:28           ` Kirill A. Shutemov
2015-08-11 15:28             ` Kirill A. Shutemov
2015-08-11 16:17             ` Boaz Harrosh
2015-08-11 16:17               ` Boaz Harrosh
2015-08-11 20:26               ` Kirill A. Shutemov [this message]
2015-08-11 20:26                 ` Kirill A. Shutemov
2015-08-12  7:54                 ` Boaz Harrosh
2015-08-12  7:54                   ` Boaz Harrosh
2015-08-11 16:51           ` Wilcox, Matthew R
2015-08-11 16:51             ` Wilcox, Matthew R
2015-08-11 18:46             ` Boaz Harrosh
2015-08-11 18:46               ` Boaz Harrosh
2015-08-11 21:48             ` Dave Chinner
2015-08-11 21:48               ` Dave Chinner
2015-08-12  8:51               ` Boaz Harrosh
2015-08-12  8:51                 ` Boaz Harrosh
2015-08-13 11:30               ` Jan Kara
2015-08-13 11:30                 ` Jan Kara

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20150811202639.GA1408@node.dhcp.inet.fi \
    --to=kirill@shutemov.name \
    --cc=akpm@linux-foundation.org \
    --cc=boaz@plexistor.com \
    --cc=david@fromorbit.com \
    --cc=dbueso@suse.de \
    --cc=jack@suse.cz \
    --cc=kirill.shutemov@linux.intel.com \
    --cc=linux-fsdevel@vger.kernel.org \
    --cc=linux-kernel@vger.kernel.org \
    --cc=linux-mm@kvack.org \
    --cc=matthew.r.wilcox@intel.com \
    --cc=tytso@mit.edu \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

Be sure your reply has a Subject: header at the top and a blank line before the message body.

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.