hole-punch vs fault

linux-fsdevel.vger.kernel.org archive mirror
 help / color / mirror / Atom feed

* hole-punch vs fault
@ 2013-11-27 13:48 Matthew Wilcox
  2013-11-27 22:19 ` Jan Kara
  0 siblings, 1 reply; 15+ messages in thread
From: Matthew Wilcox @ 2013-11-27 13:48 UTC (permalink / raw)
  To: linux-fsdevel

I'm trying to figure out what protects us in this scenario:

Thread A mmaps a file and reads from it.
Thread B punches holes in that file.

It seems to me that thread A can end up with references to pages that are
in the page cache that point to deallocated blocks (which the filesystem
might subsequently allocate to a different file).

Thread B calls ext4_punch_hole()
Thread B takes ->i_mutex
Thread B calls truncate_pagecache_range()
Thread B calls unmap_mapping_range() which removes the PTE from Thread A's
  page tables.
Thread B calls truncate_inode_pages_range() which removes the pages from the
  page cache.
Thread A calls filemap_fault().  It calls find_get_page(), but it's not there.
  It calls do_sync_mmap_readahead() then find_get_page() again, still not
  there.  So it calls page_cache_read() which calls ext4_readpage which calls
  mpage_readpage(), which calls ext4_get_block() which finds the blocks
  in question, nowhere taking the i_mutex.
Now Thread B gets back from its summer holiday, takes i_data_sem and calls
  ext4_es_remove_extent() and ext4_ext_remove_space() /
  ext4_free_hole_blocks() before releasing i_data_sem and i_mutex.
Thread A happily dirties the page it has.
Thread C extends a file, which gets the victim block allocated.
vmscan causes the page to be written back, since it's dirty and on the LRU lists
Bang, thread C has a corrupted file.

This doesn't happen with the normal truncate path since i_size is written
before calling unmap_mapping_range, and i_size is checked after filemap_fault
gets the page lock.

How should we solve this?  I see two realistic options.  One is an rwsem
in the inode or address_space taken for read by the fault path and for
write by the holepunch path.  It's kind of icky because each filesystem
will need to add support for it.

The second is to put some kind of generation counter in the inode or
address_space that is incremented on every deallocation.  The fault path
would read the generation counter before calling find_get_page() and then
check it hasn't changed after getting the page lock (goto retry_find).  I
like that one a little more since we can fix it all in common code, and I
think it'll be lower overhead in the fault path.

-- 
Matthew Wilcox				Intel Open Source Technology Centre
"Bill, look, we understand that you're interested in selling us this
operating system, but compare it to ours.  We can't possibly take such
a retrograde step."

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: hole-punch vs fault
  2013-11-27 13:48 hole-punch vs fault Matthew Wilcox
@ 2013-11-27 22:19 ` Jan Kara
  2013-11-28  2:33   ` Matthew Wilcox
  0 siblings, 1 reply; 15+ messages in thread
From: Jan Kara @ 2013-11-27 22:19 UTC (permalink / raw)
  To: Matthew Wilcox; +Cc: linux-fsdevel

  Hello,

On Wed 27-11-13 06:48:31, Matthew Wilcox wrote:
> I'm trying to figure out what protects us in this scenario:
> 
> Thread A mmaps a file and reads from it.
> Thread B punches holes in that file.
  Yeah, nothing. We discussed this problem with Dave Chinner here:
http://www.spinics.net/lists/linux-ext4/msg32059.html

> It seems to me that thread A can end up with references to pages that are
> in the page cache that point to deallocated blocks (which the filesystem
> might subsequently allocate to a different file).
> 
> Thread B calls ext4_punch_hole()
> Thread B takes ->i_mutex
> Thread B calls truncate_pagecache_range()
> Thread B calls unmap_mapping_range() which removes the PTE from Thread A's
>   page tables.
> Thread B calls truncate_inode_pages_range() which removes the pages from the
>   page cache.
> Thread A calls filemap_fault().  It calls find_get_page(), but it's not there.
>   It calls do_sync_mmap_readahead() then find_get_page() again, still not
>   there.  So it calls page_cache_read() which calls ext4_readpage which calls
>   mpage_readpage(), which calls ext4_get_block() which finds the blocks
>   in question, nowhere taking the i_mutex.
> Now Thread B gets back from its summer holiday, takes i_data_sem and calls
>   ext4_es_remove_extent() and ext4_ext_remove_space() /
>   ext4_free_hole_blocks() before releasing i_data_sem and i_mutex.
> Thread A happily dirties the page it has.
> Thread C extends a file, which gets the victim block allocated.
> vmscan causes the page to be written back, since it's dirty and on the LRU lists
> Bang, thread C has a corrupted file.
> 
> This doesn't happen with the normal truncate path since i_size is written
> before calling unmap_mapping_range, and i_size is checked after filemap_fault
> gets the page lock.
  Yup.
 
> How should we solve this?  I see two realistic options.  One is an rwsem
> in the inode or address_space taken for read by the fault path and for
> write by the holepunch path.  It's kind of icky because each filesystem
> will need to add support for it.
  This is doable. We actually want to go for range locks to scratch some
other itches we have in this area (see the above mentioned thread). But lock
ordering with mmap_sem gets interesting and requires changes to how
mmap_sem is used - which is likely the hardest-to-deal-with lock we currently
have in the kernel. For example I'm now getting to know video4linux drivers
more than I ever wanted to untangle mmap_sem from that code...

> The second is to put some kind of generation counter in the inode or
> address_space that is incremented on every deallocation.  The fault path
> would read the generation counter before calling find_get_page() and then
> check it hasn't changed after getting the page lock (goto retry_find).  I
> like that one a little more since we can fix it all in common code, and I
> think it'll be lower overhead in the fault path.
  I'm not sure I understand how this should work. After pagecache is
truncated, fault can come and happily instantiate the page again. After
that fault is done, hole punching awakes and removes blocks from under that
page. So checking the generation counter after you get the page lock seems
useless to me.

							Honza

-- 
Jan Kara <jack@suse.cz>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: hole-punch vs fault
  2013-11-27 22:19 ` Jan Kara
@ 2013-11-28  2:33   ` Matthew Wilcox
  2013-11-28  3:30     ` Matthew Wilcox
                       ` (2 more replies)
  0 siblings, 3 replies; 15+ messages in thread
From: Matthew Wilcox @ 2013-11-28  2:33 UTC (permalink / raw)
  To: Jan Kara; +Cc: linux-fsdevel

On Wed, Nov 27, 2013 at 11:19:32PM +0100, Jan Kara wrote:
> > The second is to put some kind of generation counter in the inode or
> > address_space that is incremented on every deallocation.  The fault path
> > would read the generation counter before calling find_get_page() and then
> > check it hasn't changed after getting the page lock (goto retry_find).  I
> > like that one a little more since we can fix it all in common code, and I
> > think it'll be lower overhead in the fault path.
>   I'm not sure I understand how this should work. After pagecache is
> truncated, fault can come and happily instantiate the page again. After
> that fault is done, hole punching awakes and removes blocks from under that
> page. So checking the generation counter after you get the page lock seems
> useless to me.

Yeah, I don't think the page lock is enough (maybe someone can convince
me otherwise).  How does this look?  Checking the generation number
after we get i_mutex ensures that the truncate has finished running.

diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
index 54eed4f..df6278b 100644
--- a/fs/ext4/inode.c
+++ b/fs/ext4/inode.c
@@ -3648,6 +3648,7 @@ int ext4_punch_hole(struct inode *inode, loff_t offset, loff_t length)
 					    stop_block);
 
 	ext4_discard_preallocations(inode);
+	damage_mapping(mapping);
 	up_write(&EXT4_I(inode)->i_data_sem);
 	if (IS_SYNC(inode))
 		ext4_handle_sync(handle);
diff --git a/include/linux/fs.h b/include/linux/fs.h
index 1a04525..190f38c 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -415,6 +415,7 @@ struct address_space {
 	struct radix_tree_root	page_tree;	/* radix tree of all pages */
 	spinlock_t		tree_lock;	/* and lock protecting it */
 	unsigned int		i_mmap_writable;/* count VM_SHARED mappings */
+	unsigned		i_damaged;	/* damage count */
 	struct rb_root		i_mmap;		/* tree of private and shared mappings */
 	struct list_head	i_mmap_nonlinear;/*list VM_NONLINEAR mappings */
 	struct mutex		i_mmap_mutex;	/* protect tree, count, list */
@@ -503,6 +504,31 @@ static inline int mapping_writably_mapped(struct address_space *mapping)
 }
 
 /*
+ * A mapping is damaged when blocks are removed from the filesystem's
+ * data structures.
+ */
+static inline unsigned mapping_damage(struct address_space *mapping)
+{
+	unsigned seq = ACCESS_ONCE(mapping->i_damaged);
+	smp_rmb();	/* Subsequent reads of damagable data structures */
+	return seq;
+}
+
+/* Must be called with i_mutex held */
+static inline bool
+mapping_is_damaged(struct address_space *mapping, unsigned seq)
+{
+	return mapping->i_damaged != seq;
+}
+
+/* Must be called with i_mutex held */
+static inline void damage_mapping(struct address_space *mapping)
+{
+	smp_wmb();	/* Prior writes to damagable data structures */
+	mapping->i_damaged++;
+}
+
+/*
  * Use sequence counter to get consistent i_size on 32-bit processors.
  */
 #if BITS_PER_LONG==32 && defined(CONFIG_SMP)
diff --git a/mm/filemap.c b/mm/filemap.c
index b7749a9..71c936d 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -1617,6 +1617,7 @@ int filemap_fault(struct vm_area_struct *vma, struct vm_fault *vmf)
 	pgoff_t offset = vmf->pgoff;
 	struct page *page;
 	pgoff_t size;
+	unsigned damage = mapping_damage(mapping);
 	int ret = 0;
 
 	size = (i_size_read(inode) + PAGE_CACHE_SIZE - 1) >> PAGE_CACHE_SHIFT;
@@ -1676,6 +1677,18 @@ retry_find:
 		return VM_FAULT_SIGBUS;
 	}
 
+	/*
+	 * Check if we were the unlucky victim of a holepunch
+	 */
+	mutex_lock(&inode->i_mutex);
+	if (unlikely(mapping_is_damaged(mapping, damage))) {
+		mutex_unlock(&inode->i_mutex);
+		unlock_page(page);
+		page_cache_release(page);
+		damage = mapping_damage(mapping);
+		goto retry_find;
+	}
+
 	vmf->page = page;
 	return ret | VM_FAULT_LOCKED;
 

-- 
Matthew Wilcox				Intel Open Source Technology Centre
"Bill, look, we understand that you're interested in selling us this
operating system, but compare it to ours.  We can't possibly take such
a retrograde step."

^ permalink raw reply related	[flat|nested] 15+ messages in thread

* Re: hole-punch vs fault
  2013-11-28  2:33   ` Matthew Wilcox
@ 2013-11-28  3:30     ` Matthew Wilcox
  2013-11-28  4:22     ` Dave Chinner
  2013-11-28  4:44     ` Matthew Wilcox
  2 siblings, 0 replies; 15+ messages in thread
From: Matthew Wilcox @ 2013-11-28  3:30 UTC (permalink / raw)
  To: Jan Kara; +Cc: linux-fsdevel

On Wed, Nov 27, 2013 at 07:33:43PM -0700, Matthew Wilcox wrote:
> Yeah, I don't think the page lock is enough (maybe someone can convince
> me otherwise).  How does this look?  Checking the generation number
> after we get i_mutex ensures that the truncate has finished running.

I should probably confess that my original reason for looking into this
was that I noticed that ext2-xip has a race between truncate() and fault.
Because there's no struct page, there's no page lock, so there's nothing to
ensure that the mapping doesn't go away while we're inserting the PTE.

So I looked to see how this was solved for hole punches, and found out
that it wasn't.  So I tried to come up with a solution that would work
for both.  I did implement the per-address space rwsem idea, but I don't
like it.

Here's a version that includes the changes to ext2 and filemap_xip:

diff --git a/fs/ext2/inode.c b/fs/ext2/inode.c
index 8a33764..33c1d59 100644
--- a/fs/ext2/inode.c
+++ b/fs/ext2/inode.c
@@ -1214,6 +1214,8 @@ static int ext2_setsize(struct inode *inode, loff_t newsize)
 
 	truncate_setsize(inode, newsize);
 	__ext2_truncate_blocks(inode, newsize);
+	if (mapping_is_xip(inode->i_mapping))
+		damage_mapping(inode->i_mapping);
 
 	inode->i_mtime = inode->i_ctime = CURRENT_TIME_SEC;
 	if (inode_needs_sync(inode)) {
diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
index 0757634..6f90da6 100644
--- a/fs/ext4/inode.c
+++ b/fs/ext4/inode.c
@@ -3622,6 +3622,7 @@ int ext4_punch_hole(struct inode *inode, loff_t offset, loff_t length)
 					    stop_block);
 
 	ext4_discard_preallocations(inode);
+	damage_mapping(mapping);
 	up_write(&EXT4_I(inode)->i_data_sem);
 	if (IS_SYNC(inode))
 		ext4_handle_sync(handle);
diff --git a/include/linux/fs.h b/include/linux/fs.h
index 121f11f..1db6032 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -411,6 +411,7 @@ struct address_space {
 	struct radix_tree_root	page_tree;	/* radix tree of all pages */
 	spinlock_t		tree_lock;	/* and lock protecting it */
 	unsigned int		i_mmap_writable;/* count VM_SHARED mappings */
+	unsigned		i_damaged;	/* damage count */
 	struct rb_root		i_mmap;		/* tree of private and shared mappings */
 	struct list_head	i_mmap_nonlinear;/*list VM_NONLINEAR mappings */
 	struct mutex		i_mmap_mutex;	/* protect tree, count, list */
@@ -499,6 +500,31 @@ static inline int mapping_writably_mapped(struct address_space *mapping)
 }
 
 /*
+ * A mapping is damaged when blocks are removed from the filesystem's
+ * data structures.
+ */
+static inline unsigned mapping_damage(struct address_space *mapping)
+{
+	unsigned seq = ACCESS_ONCE(mapping->i_damaged);
+	smp_rmb();	/* Subsequent reads of damagable data structures */
+	return seq;
+}
+
+/* Must be called with i_mutex held */
+static inline bool
+mapping_is_damaged(struct address_space *mapping, unsigned seq)
+{
+	return mapping->i_damaged != seq;
+}
+
+/* Must be called with i_mutex held */
+static inline void damage_mapping(struct address_space *mapping)
+{
+	smp_wmb();	/* Prior writes to damagable data structures */
+	mapping->i_damaged++;
+}
+
+/*
  * Use sequence counter to get consistent i_size on 32-bit processors.
  */
 #if BITS_PER_LONG==32 && defined(CONFIG_SMP)
diff --git a/mm/filemap.c b/mm/filemap.c
index b7749a9..71c936d 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -1617,6 +1617,7 @@ int filemap_fault(struct vm_area_struct *vma, struct vm_fault *vmf)
 	pgoff_t offset = vmf->pgoff;
 	struct page *page;
 	pgoff_t size;
+	unsigned damage = mapping_damage(mapping);
 	int ret = 0;
 
 	size = (i_size_read(inode) + PAGE_CACHE_SIZE - 1) >> PAGE_CACHE_SHIFT;
@@ -1676,6 +1677,18 @@ retry_find:
 		return VM_FAULT_SIGBUS;
 	}
 
+	/*
+	 * Check if we were the unlucky victim of a holepunch
+	 */
+	mutex_lock(&inode->i_mutex);
+	if (unlikely(mapping_is_damaged(mapping, damage))) {
+		mutex_unlock(&inode->i_mutex);
+		unlock_page(page);
+		page_cache_release(page);
+		damage = mapping_damage(mapping);
+		goto retry_find;
+	}
+
 	vmf->page = page;
 	return ret | VM_FAULT_LOCKED;
 
diff --git a/mm/filemap_xip.c b/mm/filemap_xip.c
index d8d9fe3..c8bf231 100644
--- a/mm/filemap_xip.c
+++ b/mm/filemap_xip.c
@@ -224,6 +224,7 @@ static int xip_file_fault(struct vm_area_struct *vma, struct vm_fault *vmf)
 	struct file *file = vma->vm_file;
 	struct address_space *mapping = file->f_mapping;
 	struct inode *inode = mapping->host;
+	unsigned damage;
 	pgoff_t size;
 	void *xip_mem;
 	unsigned long xip_pfn;
@@ -232,6 +233,7 @@ static int xip_file_fault(struct vm_area_struct *vma, struct vm_fault *vmf)
 
 	/* XXX: are VM_FAULT_ codes OK? */
 again:
+	damage = mapping_damage(mapping);
 	size = (i_size_read(inode) + PAGE_CACHE_SIZE - 1) >> PAGE_CACHE_SHIFT;
 	if (vmf->pgoff >= size)
 		return VM_FAULT_SIGBUS;
@@ -260,8 +262,14 @@ again:
 		__xip_unmap(mapping, vmf->pgoff);
 
 found:
+		mutex_lock(&inode->i_mutex);
+		if (unlikely(mapping_is_damaged(mapping, damage))) {
+			mutex_unlock(&inode->i_mutex);
+			goto again;
+		}
 		err = vm_insert_mixed(vma, (unsigned long)vmf->virtual_address,
 							xip_pfn);
+		mutex_unlock(&inode->i_mutex);
 		if (err == -ENOMEM)
 			return VM_FAULT_OOM;
 		/*

-- 
Matthew Wilcox				Intel Open Source Technology Centre
"Bill, look, we understand that you're interested in selling us this
operating system, but compare it to ours.  We can't possibly take such
a retrograde step."

^ permalink raw reply related	[flat|nested] 15+ messages in thread

* Re: hole-punch vs fault
  2013-11-28  2:33   ` Matthew Wilcox
  2013-11-28  3:30     ` Matthew Wilcox
@ 2013-11-28  4:22     ` Dave Chinner
  2013-11-28  4:44     ` Matthew Wilcox
  2 siblings, 0 replies; 15+ messages in thread
From: Dave Chinner @ 2013-11-28  4:22 UTC (permalink / raw)
  To: Matthew Wilcox; +Cc: Jan Kara, linux-fsdevel

On Wed, Nov 27, 2013 at 07:33:43PM -0700, Matthew Wilcox wrote:
> On Wed, Nov 27, 2013 at 11:19:32PM +0100, Jan Kara wrote:
> > > The second is to put some kind of generation counter in the inode or
> > > address_space that is incremented on every deallocation.  The fault path
> > > would read the generation counter before calling find_get_page() and then
> > > check it hasn't changed after getting the page lock (goto retry_find).  I
> > > like that one a little more since we can fix it all in common code, and I
> > > think it'll be lower overhead in the fault path.
> >   I'm not sure I understand how this should work. After pagecache is
> > truncated, fault can come and happily instantiate the page again. After
> > that fault is done, hole punching awakes and removes blocks from under that
> > page. So checking the generation counter after you get the page lock seems
> > useless to me.
> 
> Yeah, I don't think the page lock is enough (maybe someone can convince
> me otherwise).  How does this look?  Checking the generation number
> after we get i_mutex ensures that the truncate has finished running.

But there's no guarantee that whoever is removing the pages from the
page cache is holding the i_mutex (XFS hole punching and most page
cache operations serialise on XFS_IOLOCK_EXCL, not i_mutex), so you
can't use i_mutex in this way for generic code.

Besides...

> @@ -1617,6 +1617,7 @@ int filemap_fault(struct vm_area_struct *vma, struct vm_fault *vmf)
>  	pgoff_t offset = vmf->pgoff;
>  	struct page *page;
>  	pgoff_t size;
> +	unsigned damage = mapping_damage(mapping);
>  	int ret = 0;
>  
>  	size = (i_size_read(inode) + PAGE_CACHE_SIZE - 1) >> PAGE_CACHE_SHIFT;
> @@ -1676,6 +1677,18 @@ retry_find:
>  		return VM_FAULT_SIGBUS;
>  	}
>  
> +	/*
> +	 * Check if we were the unlucky victim of a holepunch
> +	 */
> +	mutex_lock(&inode->i_mutex);
> +	if (unlikely(mapping_is_damaged(mapping, damage))) {
> +		mutex_unlock(&inode->i_mutex);
> +		unlock_page(page);
> +		page_cache_release(page);
> +		damage = mapping_damage(mapping);
> +		goto retry_find;
> +	}

... aren't we holding the mmap_sem here? i.e. taking the i_mutex
here will lead to deadlocks....

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: hole-punch vs fault
  2013-11-28  2:33   ` Matthew Wilcox
  2013-11-28  3:30     ` Matthew Wilcox
  2013-11-28  4:22     ` Dave Chinner
@ 2013-11-28  4:44     ` Matthew Wilcox
  2013-11-28 12:24       ` Matthew Wilcox
  2013-11-28 22:12       ` Dave Chinner
  2 siblings, 2 replies; 15+ messages in thread
From: Matthew Wilcox @ 2013-11-28  4:44 UTC (permalink / raw)
  To: Jan Kara; +Cc: linux-fsdevel

On Wed, Nov 27, 2013 at 07:33:43PM -0700, Matthew Wilcox wrote:
> On Wed, Nov 27, 2013 at 11:19:32PM +0100, Jan Kara wrote:
> > > The second is to put some kind of generation counter in the inode or
> > > address_space that is incremented on every deallocation.  The fault path
> > > would read the generation counter before calling find_get_page() and then
> > > check it hasn't changed after getting the page lock (goto retry_find).  I
> > > like that one a little more since we can fix it all in common code, and I
> > > think it'll be lower overhead in the fault path.
> >   I'm not sure I understand how this should work. After pagecache is
> > truncated, fault can come and happily instantiate the page again. After
> > that fault is done, hole punching awakes and removes blocks from under that
> > page. So checking the generation counter after you get the page lock seems
> > useless to me.
> 
> Yeah, I don't think the page lock is enough (maybe someone can convince
> me otherwise).  How does this look?  Checking the generation number
> after we get i_mutex ensures that the truncate has finished running.

Ohh.  We can't take i_mutex because that's an inversion with mmap_sem.
So option 1 is to convince ourselves that page lock is enough (and that
doesn't solve the problem for ext2-xip).

Option 2 is to do something similar to seqcount_t.  We can't use
seqcount_t as-is because it spins in read_seqcount_begin, and it could
be spinning for a very long time while we read a page in from disc!

How about something like this:

typedef struct seqtex {
	unsigned		sequence;
	spinlock_t		wait_lock;
	struct list_head	wait_list;
#ifdef CONFIG_DEBUG_LOCK_ALLOC
	struct lockdep_map	dep_map;
#endif
}

Take seqtex_read_lock() at the beginning of filemap_fault(), release
it after we have the page locked (since truncate_inode_pages_range will
block on the pagelock before tearing down the data structures).

Take seqtex_write_lock() at the start of truncate & holepunch operations,
drop it at the end of truncate & holepunch operations.

seqtex_read_lock blocks if we're inside a write lock.  seqtex_read_unlock
tells us whether to try again.  write_unlock wakes anybody waiting.
write_lock doesn't block because we're assuming the i_mutex is held above
us, so we can only have one writer at a time (similar to seqcount_t).

I keep hoping for an option 3.

-- 
Matthew Wilcox				Intel Open Source Technology Centre
"Bill, look, we understand that you're interested in selling us this
operating system, but compare it to ours.  We can't possibly take such
a retrograde step."

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: hole-punch vs fault
  2013-11-28  4:44     ` Matthew Wilcox
@ 2013-11-28 12:24       ` Matthew Wilcox
  2013-11-28 22:12       ` Dave Chinner
  1 sibling, 0 replies; 15+ messages in thread
From: Matthew Wilcox @ 2013-11-28 12:24 UTC (permalink / raw)
  To: Jan Kara; +Cc: linux-fsdevel

On Wed, Nov 27, 2013 at 09:44:54PM -0700, Matthew Wilcox wrote:
> Ohh.  We can't take i_mutex because that's an inversion with mmap_sem.

Or the simple option is just to use a different mutex.  That makes life
a bit more interesting because the filesystem now has to take the mutex
at the appropriate point to ensure that the pages found by fault won't
be damagable.  I think this is correct for ext2/4:


diff --git a/fs/ext2/inode.c b/fs/ext2/inode.c
index 8a33764..25ce796 100644
--- a/fs/ext2/inode.c
+++ b/fs/ext2/inode.c
@@ -1212,8 +1212,14 @@ static int ext2_setsize(struct inode *inode, loff_t newsize)
 	if (error)
 		return error;
 
+	if (mapping_is_xip(inode->i_mapping))
+		damage_lock(inode->i_mapping);
 	truncate_setsize(inode, newsize);
 	__ext2_truncate_blocks(inode, newsize);
+	if (mapping_is_xip(inode->i_mapping)) {
+		damage_mapping(inode->i_mapping);
+		damage_unlock(inode->i_mapping);
+	}
 
 	inode->i_mtime = inode->i_ctime = CURRENT_TIME_SEC;
 	if (inode_needs_sync(inode)) {
diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
index 0757634..4d3d533 100644
--- a/fs/ext4/inode.c
+++ b/fs/ext4/inode.c
@@ -3571,6 +3571,7 @@ int ext4_punch_hole(struct inode *inode, loff_t offset, loff_t length)
 	first_block_offset = round_up(offset, sb->s_blocksize);
 	last_block_offset = round_down((offset + length), sb->s_blocksize) - 1;
 
+	damage_lock(mapping);
 	/* Now release the pages and zero block aligned part of pages*/
 	if (last_block_offset > first_block_offset)
 		truncate_pagecache_range(inode, first_block_offset,
@@ -3610,6 +3611,7 @@ int ext4_punch_hole(struct inode *inode, loff_t offset, loff_t length)
 	ret = ext4_es_remove_extent(inode, first_block,
 				    stop_block - first_block);
 	if (ret) {
+		damage_unlock(mapping);
 		up_write(&EXT4_I(inode)->i_data_sem);
 		goto out_stop;
 	}
@@ -3622,6 +3624,8 @@ int ext4_punch_hole(struct inode *inode, loff_t offset, loff_t length)
 					    stop_block);
 
 	ext4_discard_preallocations(inode);
+	damage_mapping(mapping);
+	damage_unlock(mapping);
 	up_write(&EXT4_I(inode)->i_data_sem);
 	if (IS_SYNC(inode))
 		ext4_handle_sync(handle);
diff --git a/include/linux/fs.h b/include/linux/fs.h
index 121f11f..1ed4fa7 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -411,6 +411,8 @@ struct address_space {
 	struct radix_tree_root	page_tree;	/* radix tree of all pages */
 	spinlock_t		tree_lock;	/* and lock protecting it */
 	unsigned int		i_mmap_writable;/* count VM_SHARED mappings */
+	struct mutex		i_damage_lock;	/* fs data structures */
+	unsigned		i_damaged;	/* damage count */
 	struct rb_root		i_mmap;		/* tree of private and shared mappings */
 	struct list_head	i_mmap_nonlinear;/*list VM_NONLINEAR mappings */
 	struct mutex		i_mmap_mutex;	/* protect tree, count, list */
@@ -499,6 +501,34 @@ static inline int mapping_writably_mapped(struct address_space *mapping)
 }
 
 /*
+ * A mapping is damaged when blocks are removed from the filesystem's
+ * data structures.
+ */
+static inline unsigned mapping_damage(struct address_space *mapping)
+{
+	unsigned seq = ACCESS_ONCE(mapping->i_damaged);
+	smp_rmb();	/* Subsequent reads of damagable data structures */
+	return seq;
+}
+
+#define damage_lock(mapping)	mutex_lock(&mapping->i_damage_lock)
+#define damage_unlock(mapping)	mutex_unlock(&mapping->i_damage_lock)
+
+/* Must be called with i_damage_lock held */
+static inline bool
+mapping_is_damaged(struct address_space *mapping, unsigned seq)
+{
+	return mapping->i_damaged != seq;
+}
+
+/* Must be called with i_damage_lock held */
+static inline void damage_mapping(struct address_space *mapping)
+{
+	smp_wmb();	/* Prior writes to damagable data structures */
+	mapping->i_damaged++;
+}
+
+/*
  * Use sequence counter to get consistent i_size on 32-bit processors.
  */
 #if BITS_PER_LONG==32 && defined(CONFIG_SMP)
diff --git a/mm/filemap.c b/mm/filemap.c
index b7749a9..f99e831 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -1617,6 +1617,7 @@ int filemap_fault(struct vm_area_struct *vma, struct vm_fault *vmf)
 	pgoff_t offset = vmf->pgoff;
 	struct page *page;
 	pgoff_t size;
+	unsigned damage = mapping_damage(mapping);
 	int ret = 0;
 
 	size = (i_size_read(inode) + PAGE_CACHE_SIZE - 1) >> PAGE_CACHE_SHIFT;
@@ -1676,6 +1677,19 @@ retry_find:
 		return VM_FAULT_SIGBUS;
 	}
 
+	/*
+	 * Check if we were the unlucky victim of a holepunch
+	 */
+	damage_lock(mapping);
+	if (unlikely(mapping_is_damaged(mapping, damage))) {
+		damage_unlock(mapping);
+		unlock_page(page);
+		page_cache_release(page);
+		damage = mapping_damage(mapping);
+		goto retry_find;
+	}
+	damage_unlock(mapping);
+
 	vmf->page = page;
 	return ret | VM_FAULT_LOCKED;
 
diff --git a/mm/filemap_xip.c b/mm/filemap_xip.c
index d8d9fe3..dc478eb 100644
--- a/mm/filemap_xip.c
+++ b/mm/filemap_xip.c
@@ -224,6 +224,7 @@ static int xip_file_fault(struct vm_area_struct *vma, struct vm_fault *vmf)
 	struct file *file = vma->vm_file;
 	struct address_space *mapping = file->f_mapping;
 	struct inode *inode = mapping->host;
+	unsigned damage;
 	pgoff_t size;
 	void *xip_mem;
 	unsigned long xip_pfn;
@@ -232,6 +233,7 @@ static int xip_file_fault(struct vm_area_struct *vma, struct vm_fault *vmf)
 
 	/* XXX: are VM_FAULT_ codes OK? */
 again:
+	damage = mapping_damage(mapping);
 	size = (i_size_read(inode) + PAGE_CACHE_SIZE - 1) >> PAGE_CACHE_SHIFT;
 	if (vmf->pgoff >= size)
 		return VM_FAULT_SIGBUS;
@@ -260,8 +262,14 @@ again:
 		__xip_unmap(mapping, vmf->pgoff);
 
 found:
+		damage_lock(mapping);
+		if (unlikely(mapping_is_damaged(mapping, damage))) {
+			mutex_unlock(&inode->i_mutex);
+			goto again;
+		}
 		err = vm_insert_mixed(vma, (unsigned long)vmf->virtual_address,
 							xip_pfn);
+		damage_unlock(mapping);
 		if (err == -ENOMEM)
 			return VM_FAULT_OOM;
 		/*

-- 
Matthew Wilcox				Intel Open Source Technology Centre
"Bill, look, we understand that you're interested in selling us this
operating system, but compare it to ours.  We can't possibly take such
a retrograde step."

^ permalink raw reply related	[flat|nested] 15+ messages in thread

* Re: hole-punch vs fault
  2013-11-28  4:44     ` Matthew Wilcox
  2013-11-28 12:24       ` Matthew Wilcox
@ 2013-11-28 22:12       ` Dave Chinner
  2013-11-29 13:11         ` Matthew Wilcox
  1 sibling, 1 reply; 15+ messages in thread
From: Dave Chinner @ 2013-11-28 22:12 UTC (permalink / raw)
  To: Matthew Wilcox; +Cc: Jan Kara, linux-fsdevel

On Wed, Nov 27, 2013 at 09:44:54PM -0700, Matthew Wilcox wrote:
> On Wed, Nov 27, 2013 at 07:33:43PM -0700, Matthew Wilcox wrote:
> > On Wed, Nov 27, 2013 at 11:19:32PM +0100, Jan Kara wrote:
> > > > The second is to put some kind of generation counter in the inode or
> > > > address_space that is incremented on every deallocation.  The fault path
> > > > would read the generation counter before calling find_get_page() and then
> > > > check it hasn't changed after getting the page lock (goto retry_find).  I
> > > > like that one a little more since we can fix it all in common code, and I
> > > > think it'll be lower overhead in the fault path.
> > >   I'm not sure I understand how this should work. After pagecache is
> > > truncated, fault can come and happily instantiate the page again. After
> > > that fault is done, hole punching awakes and removes blocks from under that
> > > page. So checking the generation counter after you get the page lock seems
> > > useless to me.
> > 
> > Yeah, I don't think the page lock is enough (maybe someone can convince
> > me otherwise).  How does this look?  Checking the generation number
> > after we get i_mutex ensures that the truncate has finished running.
> 
> Ohh.  We can't take i_mutex because that's an inversion with mmap_sem.
> So option 1 is to convince ourselves that page lock is enough (and that
> doesn't solve the problem for ext2-xip).
> 
> Option 2 is to do something similar to seqcount_t.  We can't use
> seqcount_t as-is because it spins in read_seqcount_begin, and it could
> be spinning for a very long time while we read a page in from disc!

Option 2 is really any other method of synchronisation. It doesn't
matter is it's a seqlock like construct, or a more complex
shared/exclusive range lock - it's still the "same" solution.

Indeed, we've already discussed this at length - if we are going to
introduce a new method of synchronisation between mmap and
holepunch, then we cannot ignore all the other mmap vs IO
synchronisation problems we have that are caused by the same issue.
Hacking workarounds into the holepunch/mmap paths is not a robust or
maintainable solution.

> How about something like this:
> 
> typedef struct seqtex {
> 	unsigned		sequence;
> 	spinlock_t		wait_lock;
> 	struct list_head	wait_list;
> #ifdef CONFIG_DEBUG_LOCK_ALLOC
> 	struct lockdep_map	dep_map;
> #endif
> }

The biggest problem I see here is that

	a) it is a entire file lock and truncate/holepunch do not
	   affect the entire scope of the file. i.e. there is no
	   reason why holepunch cannot run concurrently with IO if
	   they are to different ranges of a file.

	b) it increases the struct inode by at least 5%. If we are
	   going to increase the inode by that much, we better make
	   sure we solve more than one isolated, relatively rare
	   corner case behaviour.

We need to do better than playing whack-a-mole with a big hammer
here...

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: hole-punch vs fault
  2013-11-28 22:12       ` Dave Chinner
@ 2013-11-29 13:11         ` Matthew Wilcox
  2013-12-01 21:52           ` Dave Chinner
  0 siblings, 1 reply; 15+ messages in thread
From: Matthew Wilcox @ 2013-11-29 13:11 UTC (permalink / raw)
  To: Dave Chinner; +Cc: Jan Kara, linux-fsdevel

On Fri, Nov 29, 2013 at 09:12:46AM +1100, Dave Chinner wrote:
> Indeed, we've already discussed this at length - if we are going to
> introduce a new method of synchronisation between mmap and
> holepunch, then we cannot ignore all the other mmap vs IO
> synchronisation problems we have that are caused by the same issue.
> Hacking workarounds into the holepunch/mmap paths is not a robust or
> maintainable solution.

This isn't hacking in a workaround.  This is a solution to the problem
that assumes holepunches are rare, and it's OK to make the fault path retry
if a holepunch happened.  I agree it needs to be documented and the exact
details of what is protected by i_damaged_mutex needs to be clear.

> 	a) it is a entire file lock and truncate/holepunch do not
> 	   affect the entire scope of the file. i.e. there is no
> 	   reason why holepunch cannot run concurrently with IO if
> 	   they are to different ranges of a file.

Yes, if we care about the scalability of truncate and holepunch.  I really
don't think that's an important workload.

> 	b) it increases the struct inode by at least 5%. If we are
> 	   going to increase the inode by that much, we better make
> 	   sure we solve more than one isolated, relatively rare
> 	   corner case behaviour.

About 7% (568 -> 608 bytes) with my .config (SMP, SPIN_ON_OWNER,
!DEBUG_MUTEXES, !DEBUG_LOCK_ALLOC).  I rearranged the order (thanks,
pahole!) to put i_damage in the hole created by private_lock.

I'm solving two cases here, one is holepunch vs fault for regular files.
The other is truncate vs fault for XIP files, which is probably
a little less rare.  I think the same mechanism can be used when a
filesystem is doing a defragment or otherwise moving blocks around on
an XIP block device.  Look, I'm not attached to the i_damage solution,
but I do need something to protect against truncate vs fault on XIP.
Fixing holepunch vs fault too just felt like the right thing to do.

-- 
Matthew Wilcox				Intel Open Source Technology Centre
"Bill, look, we understand that you're interested in selling us this
operating system, but compare it to ours.  We can't possibly take such
a retrograde step."

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: hole-punch vs fault
  2013-11-29 13:11         ` Matthew Wilcox
@ 2013-12-01 21:52           ` Dave Chinner
  2013-12-02  8:33             ` Jan Kara
  0 siblings, 1 reply; 15+ messages in thread
From: Dave Chinner @ 2013-12-01 21:52 UTC (permalink / raw)
  To: Matthew Wilcox; +Cc: Jan Kara, linux-fsdevel

On Fri, Nov 29, 2013 at 06:11:55AM -0700, Matthew Wilcox wrote:
> On Fri, Nov 29, 2013 at 09:12:46AM +1100, Dave Chinner wrote:
> > Indeed, we've already discussed this at length - if we are going to
> > introduce a new method of synchronisation between mmap and
> > holepunch, then we cannot ignore all the other mmap vs IO
> > synchronisation problems we have that are caused by the same issue.
> > Hacking workarounds into the holepunch/mmap paths is not a robust or
> > maintainable solution.
> 
> This isn't hacking in a workaround.  This is a solution to the problem
> that assumes holepunches are rare, and it's OK to make the fault path retry
> if a holepunch happened. 

Workloads that use hole punches tend to make extensive use of them,
so I suspect that what is OK for you is going to be a fairly
significant problem for others...

> I agree it needs to be documented and the exact
> details of what is protected by i_damaged_mutex needs to be clear.
> 
> > 	a) it is a entire file lock and truncate/holepunch do not
> > 	   affect the entire scope of the file. i.e. there is no
> > 	   reason why holepunch cannot run concurrently with IO if
> > 	   they are to different ranges of a file.
> 
> Yes, if we care about the scalability of truncate and holepunch.  I really
> don't think that's an important workload.

See my comment above about workloads that use holepunch often care
about scalability. i.e. Workloads that use preallocation and hole
punching typically care about IO scalability.

For example, sparse image files on loopback devices where
mounted filesystems use discard to run hole punches to ensure
mimimal space usage. We most certainly care about the scalability of
hole punching w.r.t. other IO going to the same file (probably
direct IO) in these workloads. Right now a hole punch is effectively
an unqueuable operation - it causes a pipeline stall - just like
TRIM on SATA....

> > 	b) it increases the struct inode by at least 5%. If we are
> > 	   going to increase the inode by that much, we better make
> > 	   sure we solve more than one isolated, relatively rare
> > 	   corner case behaviour.
> 
> About 7% (568 -> 608 bytes) with my .config (SMP, SPIN_ON_OWNER,
> !DEBUG_MUTEXES, !DEBUG_LOCK_ALLOC).  I rearranged the order (thanks,
> pahole!) to put i_damage in the hole created by private_lock.

Which is a huge price to pay for solving a problem that the majority
of users will never hit or care about. As I say to everyone that
thinks they can just add stuff to the struct inode without caring
about the size increase: use filesystem developers will kill to save
4 bytes in the struct inode, so you better have a damn good reason
for adding 40+ bytes to the structure....

> I'm solving two cases here, one is holepunch vs fault for regular
> files.  The other is truncate vs fault for XIP files, which is
> probably a little less rare.  I think the same mechanism can be
> used when a filesystem is doing a defragment or otherwise moving
> blocks around on an XIP block device.

Well, you kind of raised that as a secondary issue in passing, not
that it was the problem you actually need to solve.

> Look, I'm not attached to the i_damage solution,
> but I do need something to protect against truncate vs fault on XIP.
> Fixing holepunch vs fault too just felt like the right thing to do.

So, your real problem is that ext2-xip has a truncate vs fault race
because of a lack of a page lock to serialise page faults against
truncate. In fact, it appears to bypass the page cache completely
for both reads and writes (writes go directly to block device
memory), so it would seem that all the assumptions we've made about
serialisation on page locks throughout every filesystem are not
valid for XIP?

That seems like an XIP design fault, yes?  Why should we be
hacking crap into the struct inode to fix a XIP design flaw?

Indeed, XIP could make use of the structures we already
have in the struct inode/address space to behave more like a normal
filesystem. We already use the mapping tree to store
per-page information that is used to serialise truncate vs page
faults, so why not make XIP do exactly the same thing? 

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: hole-punch vs fault
  2013-12-01 21:52           ` Dave Chinner
@ 2013-12-02  8:33             ` Jan Kara
  2013-12-02 15:58               ` Matthew Wilcox
  0 siblings, 1 reply; 15+ messages in thread
From: Jan Kara @ 2013-12-02  8:33 UTC (permalink / raw)
  To: Dave Chinner; +Cc: Matthew Wilcox, Jan Kara, linux-fsdevel

On Mon 02-12-13 08:52:21, Dave Chinner wrote:
<snip>
> > I'm solving two cases here, one is holepunch vs fault for regular
> > files.  The other is truncate vs fault for XIP files, which is
> > probably a little less rare.  I think the same mechanism can be
> > used when a filesystem is doing a defragment or otherwise moving
> > blocks around on an XIP block device.
> 
> Well, you kind of raised that as a secondary issue in passing, not
> that it was the problem you actually need to solve.
> 
> > Look, I'm not attached to the i_damage solution,
> > but I do need something to protect against truncate vs fault on XIP.
> > Fixing holepunch vs fault too just felt like the right thing to do.
> 
> So, your real problem is that ext2-xip has a truncate vs fault race
> because of a lack of a page lock to serialise page faults against
> truncate. In fact, it appears to bypass the page cache completely
> for both reads and writes (writes go directly to block device
> memory), so it would seem that all the assumptions we've made about
> serialisation on page locks throughout every filesystem are not
> valid for XIP?
  Yeah, agreed.

> Indeed, XIP could make use of the structures we already
> have in the struct inode/address space to behave more like a normal
> filesystem. We already use the mapping tree to store
> per-page information that is used to serialise truncate vs page
> faults, so why not make XIP do exactly the same thing? 
  I believe that grabbing mmap_sem for writing during truncate (in case of
ext2 around xip_truncate_page() & truncate_setsize() calls should do the
trick. But I need to verify with lockdep that it doesn't introduce new
locking problems.

								Honza
-- 
Jan Kara <jack@suse.cz>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: hole-punch vs fault
  2013-12-02  8:33             ` Jan Kara
@ 2013-12-02 15:58               ` Matthew Wilcox
  2013-12-02 20:11                 ` Jan Kara
  2013-12-02 20:13                 ` Matthew Wilcox
  0 siblings, 2 replies; 15+ messages in thread
From: Matthew Wilcox @ 2013-12-02 15:58 UTC (permalink / raw)
  To: Jan Kara; +Cc: Dave Chinner, linux-fsdevel

On Mon, Dec 02, 2013 at 09:33:18AM +0100, Jan Kara wrote:
> > Indeed, XIP could make use of the structures we already
> > have in the struct inode/address space to behave more like a normal
> > filesystem. We already use the mapping tree to store
> > per-page information that is used to serialise truncate vs page
> > faults, so why not make XIP do exactly the same thing? 
>   I believe that grabbing mmap_sem for writing during truncate (in case of
> ext2 around xip_truncate_page() & truncate_setsize() calls should do the
> trick. But I need to verify with lockdep that it doesn't introduce new
> locking problems.

Umm ... mmap_sem for write?  On each of the tasks that have a file mmaped?
I know we have double_down that orders by memory address, but I don't
think we have an N_down_write().

-- 
Matthew Wilcox				Intel Open Source Technology Centre
"Bill, look, we understand that you're interested in selling us this
operating system, but compare it to ours.  We can't possibly take such
a retrograde step."

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: hole-punch vs fault
  2013-12-02 15:58               ` Matthew Wilcox
@ 2013-12-02 20:11                 ` Jan Kara
  2013-12-02 20:13                 ` Matthew Wilcox
  1 sibling, 0 replies; 15+ messages in thread
From: Jan Kara @ 2013-12-02 20:11 UTC (permalink / raw)
  To: Matthew Wilcox; +Cc: Jan Kara, Dave Chinner, linux-fsdevel

On Mon 02-12-13 08:58:25, Matthew Wilcox wrote:
> On Mon, Dec 02, 2013 at 09:33:18AM +0100, Jan Kara wrote:
> > > Indeed, XIP could make use of the structures we already
> > > have in the struct inode/address space to behave more like a normal
> > > filesystem. We already use the mapping tree to store
> > > per-page information that is used to serialise truncate vs page
> > > faults, so why not make XIP do exactly the same thing? 
> >   I believe that grabbing mmap_sem for writing during truncate (in case of
> > ext2 around xip_truncate_page() & truncate_setsize() calls should do the
> > trick. But I need to verify with lockdep that it doesn't introduce new
> > locking problems.
> 
> Umm ... mmap_sem for write?  On each of the tasks that have a file mmaped?
> I know we have double_down that orders by memory address, but I don't
> think we have an N_down_write().
  No, sorry, that was my mistake. I was thinking about threads of a single
process racing but the race is there obviously for different processes as
well so mmap_sem won't help us.

								Honza
-- 
Jan Kara <jack@suse.cz>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: hole-punch vs fault
  2013-12-02 15:58               ` Matthew Wilcox
  2013-12-02 20:11                 ` Jan Kara
@ 2013-12-02 20:13                 ` Matthew Wilcox
  2013-12-02 23:13                   ` Jan Kara
  1 sibling, 1 reply; 15+ messages in thread
From: Matthew Wilcox @ 2013-12-02 20:13 UTC (permalink / raw)
  To: Jan Kara; +Cc: Dave Chinner, linux-fsdevel

On Mon, Dec 02, 2013 at 08:58:25AM -0700, Matthew Wilcox wrote:
> On Mon, Dec 02, 2013 at 09:33:18AM +0100, Jan Kara wrote:
> > > Indeed, XIP could make use of the structures we already
> > > have in the struct inode/address space to behave more like a normal
> > > filesystem. We already use the mapping tree to store
> > > per-page information that is used to serialise truncate vs page
> > > faults, so why not make XIP do exactly the same thing? 
> >   I believe that grabbing mmap_sem for writing during truncate (in case of
> > ext2 around xip_truncate_page() & truncate_setsize() calls should do the
> > trick. But I need to verify with lockdep that it doesn't introduce new
> > locking problems.
> 
> Umm ... mmap_sem for write?  On each of the tasks that have a file mmaped?
> I know we have double_down that orders by memory address, but I don't
> think we have an N_down_write().

Here's a simpler solution: Grab i_mmap_sem over vm_insert_mixed().
i_mmap_sem is already taken in unmap_mapping_range(), called from
truncate_pagecache(), so we know that while we hold i_mmap_sem that
i_size has already been updated and the block can't be deallocated
from under us.  We know that i_mmap_sem is safe to take here because it
already is (in __xip_unmap()).

Now we need only worry about holepunch, and it's the same problem for
pagecache and XIP files alike.

diff --git a/mm/filemap_xip.c b/mm/filemap_xip.c
index d8d9fe3..f6de4c3 100644
--- a/mm/filemap_xip.c
+++ b/mm/filemap_xip.c
@@ -260,8 +260,17 @@ again:
 		__xip_unmap(mapping, vmf->pgoff);
 
 found:
+		/* We must recheck i_size under i_mmap_mutex */
+		mutex_lock(&mapping->i_mmap_mutex);
+		size = (i_size_read(inode) + PAGE_CACHE_SIZE - 1) >>
+							PAGE_CACHE_SHIFT;
+		if (unlikely(vmf->pgoff >= size)) {
+			mutex_unlock(&mapping->i_mmap_mutex);
+			return VM_FAULT_SIGBUS;
+		}
 		err = vm_insert_mixed(vma, (unsigned long)vmf->virtual_address,
 							xip_pfn);
+		mutex_unlock(&mapping->i_mmap_mutex);
 		if (err == -ENOMEM)
 			return VM_FAULT_OOM;
 		/*
@@ -285,6 +294,15 @@ found:
 		}
 		if (error != -ENODATA)
 			goto out;
+
+		/* We must recheck i_size under i_mmap_mutex */
+		mutex_lock(&mapping->i_mmap_mutex);
+		size = (i_size_read(inode) + PAGE_CACHE_SIZE - 1) >>
+							PAGE_CACHE_SHIFT;
+		if (unlikely(vmf->pgoff >= size)) {
+			ret = VM_FAULT_SIGBUS;
+			goto out;
+		}
 		/* not shared and writable, use xip_sparse_page() */
 		page = xip_sparse_page();
 		if (!page)
@@ -296,6 +314,7 @@ found:
 
 		ret = VM_FAULT_NOPAGE;
 out:
+		mutex_unlock(&mapping->i_mmap_mutex);
 		write_seqcount_end(&xip_sparse_seq);
 		mutex_unlock(&xip_sparse_mutex);
 

-- 
Matthew Wilcox				Intel Open Source Technology Centre
"Bill, look, we understand that you're interested in selling us this
operating system, but compare it to ours.  We can't possibly take such
a retrograde step."

^ permalink raw reply related	[flat|nested] 15+ messages in thread

* Re: hole-punch vs fault
  2013-12-02 20:13                 ` Matthew Wilcox
@ 2013-12-02 23:13                   ` Jan Kara
  0 siblings, 0 replies; 15+ messages in thread
From: Jan Kara @ 2013-12-02 23:13 UTC (permalink / raw)
  To: Matthew Wilcox; +Cc: Jan Kara, Dave Chinner, linux-fsdevel

On Mon 02-12-13 13:13:15, Matthew Wilcox wrote:
> On Mon, Dec 02, 2013 at 08:58:25AM -0700, Matthew Wilcox wrote:
> > On Mon, Dec 02, 2013 at 09:33:18AM +0100, Jan Kara wrote:
> > > > Indeed, XIP could make use of the structures we already
> > > > have in the struct inode/address space to behave more like a normal
> > > > filesystem. We already use the mapping tree to store
> > > > per-page information that is used to serialise truncate vs page
> > > > faults, so why not make XIP do exactly the same thing? 
> > >   I believe that grabbing mmap_sem for writing during truncate (in case of
> > > ext2 around xip_truncate_page() & truncate_setsize() calls should do the
> > > trick. But I need to verify with lockdep that it doesn't introduce new
> > > locking problems.
> > 
> > Umm ... mmap_sem for write?  On each of the tasks that have a file mmaped?
> > I know we have double_down that orders by memory address, but I don't
> > think we have an N_down_write().
> 
> Here's a simpler solution: Grab i_mmap_sem over vm_insert_mixed().
> i_mmap_sem is already taken in unmap_mapping_range(), called from
> truncate_pagecache(), so we know that while we hold i_mmap_sem that
> i_size has already been updated and the block can't be deallocated
> from under us.  We know that i_mmap_sem is safe to take here because it
> already is (in __xip_unmap()).
  Yeah, that should do the trick.

								Honza

> Now we need only worry about holepunch, and it's the same problem for
> pagecache and XIP files alike.
> 
> diff --git a/mm/filemap_xip.c b/mm/filemap_xip.c
> index d8d9fe3..f6de4c3 100644
> --- a/mm/filemap_xip.c
> +++ b/mm/filemap_xip.c
> @@ -260,8 +260,17 @@ again:
>  		__xip_unmap(mapping, vmf->pgoff);
>  
>  found:
> +		/* We must recheck i_size under i_mmap_mutex */
> +		mutex_lock(&mapping->i_mmap_mutex);
> +		size = (i_size_read(inode) + PAGE_CACHE_SIZE - 1) >>
> +							PAGE_CACHE_SHIFT;
> +		if (unlikely(vmf->pgoff >= size)) {
> +			mutex_unlock(&mapping->i_mmap_mutex);
> +			return VM_FAULT_SIGBUS;
> +		}
>  		err = vm_insert_mixed(vma, (unsigned long)vmf->virtual_address,
>  							xip_pfn);
> +		mutex_unlock(&mapping->i_mmap_mutex);
>  		if (err == -ENOMEM)
>  			return VM_FAULT_OOM;
>  		/*
> @@ -285,6 +294,15 @@ found:
>  		}
>  		if (error != -ENODATA)
>  			goto out;
> +
> +		/* We must recheck i_size under i_mmap_mutex */
> +		mutex_lock(&mapping->i_mmap_mutex);
> +		size = (i_size_read(inode) + PAGE_CACHE_SIZE - 1) >>
> +							PAGE_CACHE_SHIFT;
> +		if (unlikely(vmf->pgoff >= size)) {
> +			ret = VM_FAULT_SIGBUS;
> +			goto out;
> +		}
>  		/* not shared and writable, use xip_sparse_page() */
>  		page = xip_sparse_page();
>  		if (!page)
> @@ -296,6 +314,7 @@ found:
>  
>  		ret = VM_FAULT_NOPAGE;
>  out:
> +		mutex_unlock(&mapping->i_mmap_mutex);
>  		write_seqcount_end(&xip_sparse_seq);
>  		mutex_unlock(&xip_sparse_mutex);
>  
> 
> -- 
> Matthew Wilcox				Intel Open Source Technology Centre
> "Bill, look, we understand that you're interested in selling us this
> operating system, but compare it to ours.  We can't possibly take such
> a retrograde step."
-- 
Jan Kara <jack@suse.cz>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 15+ messages in thread

end of thread, other threads:[~2013-12-02 23:13 UTC | newest]

Thread overview: 15+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2013-11-27 13:48 hole-punch vs fault Matthew Wilcox
2013-11-27 22:19 ` Jan Kara
2013-11-28  2:33   ` Matthew Wilcox
2013-11-28  3:30     ` Matthew Wilcox
2013-11-28  4:22     ` Dave Chinner
2013-11-28  4:44     ` Matthew Wilcox
2013-11-28 12:24       ` Matthew Wilcox
2013-11-28 22:12       ` Dave Chinner
2013-11-29 13:11         ` Matthew Wilcox
2013-12-01 21:52           ` Dave Chinner
2013-12-02  8:33             ` Jan Kara
2013-12-02 15:58               ` Matthew Wilcox
2013-12-02 20:11                 ` Jan Kara
2013-12-02 20:13                 ` Matthew Wilcox
2013-12-02 23:13                   ` Jan Kara

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).