linux-fsdevel.vger.kernel.org archive mirror
* [PATCH 0/2] Fix data corruption when blocksize < pagesize
@ 2014-09-23 15:03 Jan Kara
  2014-09-23 15:03 ` [PATCH 1/2] vfs: Fix data corruption when blocksize < pagesize for mmaped data Jan Kara
  2014-09-23 15:03 ` [PATCH 2/2] ext4: Fix mmap data corruption when blocksize < pagesize Jan Kara
  0 siblings, 2 replies; 10+ messages in thread
From: Jan Kara @ 2014-09-23 15:03 UTC (permalink / raw)
  To: linux-fsdevel; +Cc: linux-mm, Dave Chinner, linux-ext4, Ted Tso


  Hello,

  these two patches fix the data corruption triggered by the xfstests
generic/030 test for ext4. I believe XFS can use the same function to
deal with the problem... Dave, can you verify? If the function is indeed
usable for XFS as well, through which tree are we going to merge it -
ext4 or xfs?

								Honza


* [PATCH 1/2] vfs: Fix data corruption when blocksize < pagesize for mmaped data
  2014-09-23 15:03 [PATCH 0/2] Fix data corruption when blocksize < pagesize Jan Kara
@ 2014-09-23 15:03 ` Jan Kara
  2014-09-25  1:32   ` Dave Chinner
  2014-09-23 15:03 ` [PATCH 2/2] ext4: Fix mmap data corruption when blocksize < pagesize Jan Kara
  1 sibling, 1 reply; 10+ messages in thread
From: Jan Kara @ 2014-09-23 15:03 UTC (permalink / raw)
  To: linux-fsdevel; +Cc: linux-mm, Dave Chinner, linux-ext4, Ted Tso, Jan Kara

->page_mkwrite() is used by filesystems to allocate blocks under a page
which is becoming writeably mmapped in some process' address space. This
allows a filesystem to fail the page fault if there is not enough space
available, the user exceeds quota, or a similar problem happens, rather
than silently discarding data later when writepage is called.

However, the VFS fails to call ->page_mkwrite() in all the cases where
filesystems need it when blocksize < pagesize. For example, when
blocksize = 1024 and pagesize = 4096, the following is problematic:
  ftruncate(fd, 0);
  pwrite(fd, buf, 1024, 0);
  map = mmap(NULL, 1024, PROT_WRITE, MAP_SHARED, fd, 0);
  map[0] = 'a';       ----> page_mkwrite() for index 0 is called
  ftruncate(fd, 10000); /* or even pwrite(fd, buf, 1, 10000) */
  mremap(map, 1024, 10000, 0);
  map[4095] = 'a';    ----> no page_mkwrite() called

At the moment ->page_mkwrite() is called, the filesystem can allocate only
one block for the page because i_size == 1024. Otherwise it would create
blocks beyond i_size, which is generally undesirable. But later, at
->writepage() time, we also need to store data at offset 4095 but we
don't have a block allocated for it.
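
For reference, here is the sequence above as a rough standalone reproducer
(illustrative only - error handling is omitted and the file path and buffer
contents are made up; run it on a filesystem created with 1k blocks and
inspect the tail of the first page after writeback):

  #define _GNU_SOURCE
  #include <fcntl.h>
  #include <string.h>
  #include <sys/mman.h>
  #include <unistd.h>

  int main(void)
  {
          char buf[1024];
          char *map;
          int fd = open("/mnt/test/file", O_RDWR | O_CREAT, 0644);

          memset(buf, 'x', sizeof(buf));
          ftruncate(fd, 0);
          pwrite(fd, buf, 1024, 0);    /* one 1k block at offset 0 */
          map = mmap(NULL, 1024, PROT_WRITE, MAP_SHARED, fd, 0);
          map[0] = 'a';                /* page_mkwrite() runs, i_size == 1024 */
          ftruncate(fd, 10000);        /* extend the file past the first page */
          mremap(map, 1024, 10000, 0); /* try to grow the mapping in place */
          map[4095] = 'a';             /* no page_mkwrite(), no block allocated */
          munmap(map, 10000);
          close(fd);
          return 0;
  }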

This patch introduces a helper function filesystems can use to have
->page_mkwrite() called at all the necessary moments.

Signed-off-by: Jan Kara <jack@suse.cz>
---
 fs/buffer.c                 | 57 +++++++++++++++++++++++++++++++++++++++++++++
 include/linux/buffer_head.h |  7 ++++++
 2 files changed, 64 insertions(+)

diff --git a/fs/buffer.c b/fs/buffer.c
index 8f05111bbb8b..2e3a1190dd0a 100644
--- a/fs/buffer.c
+++ b/fs/buffer.c
@@ -41,6 +41,7 @@
 #include <linux/bitops.h>
 #include <linux/mpage.h>
 #include <linux/bit_spinlock.h>
+#include <linux/rmap.h>
 #include <trace/events/block.h>
 
 static int fsync_buffers_list(spinlock_t *lock, struct list_head *list);
@@ -2010,6 +2011,59 @@ static int __block_commit_write(struct inode *inode, struct page *page,
 	return 0;
 }
 
+#ifdef CONFIG_MMU
+/**
+ * block_create_hole - handle creation of a hole in a file
+ * @inode:	inode where the hole is created
+ * @from:	offset in bytes where the hole starts
+ * @to:		offset in bytes where the hole ends.
+ *
+ * Handle creation of a hole in a file either caused by extending truncate or
+ * by write starting after current i_size. We mark the page straddling @from RO
+ * so that page_mkwrite() is called on the nearest write access to the page.
+ * This way filesystem can be sure that page_mkwrite() is called on the page
+ * before user writes to the page via mmap after the i_size has been changed.
+ *
+ * This function must be called after i_size is updated so that page_mkwrite()
+ * happening immediately after we unlock the page initializes it correctly.
+ * Also the function must be called while we still hold i_mutex - this not only
+ * makes sure i_size is stable but also that userspace cannot observe new
+ * i_size value before we are prepared to store mmap writes at new inode size.
+ */
+void block_create_hole(struct inode *inode, loff_t from, loff_t to)
+{
+	int bsize = 1 << inode->i_blkbits;
+	loff_t rounded_from;
+	struct page *page;
+	pgoff_t index;
+
+	WARN_ON(!mutex_is_locked(&inode->i_mutex));
+	WARN_ON(to > inode->i_size);
+
+	if (from >= to || bsize == PAGE_CACHE_SIZE)
+		return;
+	/* Currently last page will not have any hole block created? */
+	rounded_from = ALIGN(from, bsize);
+	if (to <= rounded_from || !(rounded_from & (PAGE_CACHE_SIZE - 1)))
+		return;
+
+	index = from >> PAGE_CACHE_SHIFT;
+	page = find_lock_page(inode->i_mapping, index);
+	/* Page not cached? Nothing to do */
+	if (!page)
+		return;
+	/*
+	 * See clear_page_dirty_for_io() for details why set_page_dirty()
+	 * is needed.
+	 */
+	if (page_mkclean(page))
+		set_page_dirty(page);
+	unlock_page(page);
+	page_cache_release(page);
+}
+EXPORT_SYMBOL(block_create_hole);
+#endif
+
 /*
  * block_write_begin takes care of the basic task of block allocation and
  * bringing partial write blocks uptodate first.
@@ -2080,6 +2134,7 @@ int generic_write_end(struct file *file, struct address_space *mapping,
 			struct page *page, void *fsdata)
 {
 	struct inode *inode = mapping->host;
+	loff_t old_size = inode->i_size;
 	int i_size_changed = 0;
 
 	copied = block_write_end(file, mapping, pos, len, copied, page, fsdata);
@@ -2099,6 +2154,8 @@ int generic_write_end(struct file *file, struct address_space *mapping,
 	unlock_page(page);
 	page_cache_release(page);
 
+	if (old_size < pos)
+		block_create_hole(inode, old_size, pos);
 	/*
 	 * Don't mark the inode dirty under page lock. First, it unnecessarily
 	 * makes the holding time of page lock longer. Second, it forces lock
diff --git a/include/linux/buffer_head.h b/include/linux/buffer_head.h
index 324329ceea1e..b4f79eeca7c5 100644
--- a/include/linux/buffer_head.h
+++ b/include/linux/buffer_head.h
@@ -244,6 +244,13 @@ static inline int block_page_mkwrite_return(int err)
 	/* -ENOSPC, -EDQUOT, -EIO ... */
 	return VM_FAULT_SIGBUS;
 }
+#ifdef CONFIG_MMU
+void block_create_hole(struct inode *inode, loff_t from, loff_t to);
+#else
+static inline void block_create_hole(struct inode *inode, loff_t from, loff_t to)
+{
+}
+#endif
 sector_t generic_block_bmap(struct address_space *, sector_t, get_block_t *);
 int block_truncate_page(struct address_space *, loff_t, get_block_t *);
 int nobh_write_begin(struct address_space *, loff_t, unsigned, unsigned,
-- 
1.8.1.4



* [PATCH 2/2] ext4: Fix mmap data corruption when blocksize < pagesize
  2014-09-23 15:03 [PATCH 0/2] Fix data corruption when blocksize < pagesize Jan Kara
  2014-09-23 15:03 ` [PATCH 1/2] vfs: Fix data corruption when blocksize < pagesize for mmaped data Jan Kara
@ 2014-09-23 15:03 ` Jan Kara
  2014-09-24  8:45   ` Jan Kara
  1 sibling, 1 reply; 10+ messages in thread
From: Jan Kara @ 2014-09-23 15:03 UTC (permalink / raw)
  To: linux-fsdevel; +Cc: linux-mm, Dave Chinner, linux-ext4, Ted Tso, Jan Kara

Use block_create_hole() when a hole is being created in a file so that
->page_mkwrite() will get called for the partial tail page if it is
mmapped (see the first patch in the series for details).

Signed-off-by: Jan Kara <jack@suse.cz>
---
 fs/ext4/inode.c | 6 +++++-
 1 file changed, 5 insertions(+), 1 deletion(-)

diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
index 3aa26e9117c4..fdcb007c2c9e 100644
--- a/fs/ext4/inode.c
+++ b/fs/ext4/inode.c
@@ -4536,8 +4536,12 @@ int ext4_setattr(struct dentry *dentry, struct iattr *attr)
 				ext4_orphan_del(NULL, inode);
 				goto err_out;
 			}
-		} else
+		} else {
+			loff_t old_size = inode->i_size;
+
 			i_size_write(inode, attr->ia_size);
+			block_create_hole(inode, old_size, inode->i_size);
+		}
 
 		/*
 		 * Blocks are going to be removed from the inode. Wait
-- 
1.8.1.4



* Re: [PATCH 2/2] ext4: Fix mmap data corruption when blocksize < pagesize
  2014-09-23 15:03 ` [PATCH 2/2] ext4: Fix mmap data corruption when blocksize < pagesize Jan Kara
@ 2014-09-24  8:45   ` Jan Kara
  2014-09-24  8:57     ` Jan Kara
  0 siblings, 1 reply; 10+ messages in thread
From: Jan Kara @ 2014-09-24  8:45 UTC (permalink / raw)
  To: linux-fsdevel; +Cc: linux-mm, Dave Chinner, linux-ext4, Ted Tso, Jan Kara, xfs

On Tue 23-09-14 17:03:23, Jan Kara wrote:
> Use block_create_hole() when hole is being created in a file so that
> ->page_mkwrite() will get called for the partial tail page if it is
> mmaped (see the first patch in the series for details).
  Just out of curiosity I made a change to XFS similar to this ext4 one and
indeed it fixed the generic/030 test failures for XFS with 1k blocksize.

								Honza

PS: I forgot to CC the xfs list in the original posting. You can find the VFS
    patch e.g. at http://www.spinics.net/lists/linux-mm/msg78976.html

> Signed-off-by: Jan Kara <jack@suse.cz>
> ---
>  fs/ext4/inode.c | 6 +++++-
>  1 file changed, 5 insertions(+), 1 deletion(-)
> 
> diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
> index 3aa26e9117c4..fdcb007c2c9e 100644
> --- a/fs/ext4/inode.c
> +++ b/fs/ext4/inode.c
> @@ -4536,8 +4536,12 @@ int ext4_setattr(struct dentry *dentry, struct iattr *attr)
>  				ext4_orphan_del(NULL, inode);
>  				goto err_out;
>  			}
> -		} else
> +		} else {
> +			loff_t old_size = inode->i_size;
> +
>  			i_size_write(inode, attr->ia_size);
> +			block_create_hole(inode, old_size, inode->i_size);
> +		}
>  
>  		/*
>  		 * Blocks are going to be removed from the inode. Wait
> -- 
> 1.8.1.4
> 
-- 
Jan Kara <jack@suse.cz>
SUSE Labs, CR



* Re: [PATCH 2/2] ext4: Fix mmap data corruption when blocksize < pagesize
  2014-09-24  8:45   ` Jan Kara
@ 2014-09-24  8:57     ` Jan Kara
  0 siblings, 0 replies; 10+ messages in thread
From: Jan Kara @ 2014-09-24  8:57 UTC (permalink / raw)
  To: linux-fsdevel; +Cc: linux-mm, Dave Chinner, linux-ext4, Ted Tso, Jan Kara, xfs

[-- Attachment #1: Type: text/plain, Size: 1624 bytes --]

On Wed 24-09-14 10:45:19, Jan Kara wrote:
> On Tue 23-09-14 17:03:23, Jan Kara wrote:
> > Use block_create_hole() when hole is being created in a file so that
> > ->page_mkwrite() will get called for the partial tail page if it is
> > mmaped (see the first patch in the series for details).
>   Just out of curiosity I did a change similar to this one for ext4 to XFS
> and indeed it fixed generic/030 test failures for XFS with blocksize 1k.
  Just for reference, attached is the patch I was testing - I can resend it
with a proper changelog etc. if people are fine with this approach.

								Honza

> > Signed-off-by: Jan Kara <jack@suse.cz>
> > ---
> >  fs/ext4/inode.c | 6 +++++-
> >  1 file changed, 5 insertions(+), 1 deletion(-)
> > 
> > diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
> > index 3aa26e9117c4..fdcb007c2c9e 100644
> > --- a/fs/ext4/inode.c
> > +++ b/fs/ext4/inode.c
> > @@ -4536,8 +4536,12 @@ int ext4_setattr(struct dentry *dentry, struct iattr *attr)
> >  				ext4_orphan_del(NULL, inode);
> >  				goto err_out;
> >  			}
> > -		} else
> > +		} else {
> > +			loff_t old_size = inode->i_size;
> > +
> >  			i_size_write(inode, attr->ia_size);
> > +			block_create_hole(inode, old_size, inode->i_size);
> > +		}
> >  
> >  		/*
> >  		 * Blocks are going to be removed from the inode. Wait
> > -- 
> > 1.8.1.4
> > 
> -- 
> Jan Kara <jack@suse.cz>
> SUSE Labs, CR
-- 
Jan Kara <jack@suse.cz>
SUSE Labs, CR

[-- Attachment #2: 0001-xfs-Fix-mmap-data-corruption.patch --]
[-- Type: text/x-patch, Size: 738 bytes --]

From ceedc36b1c99c25ebb55c88f6ab059347b52f4ec Mon Sep 17 00:00:00 2001
From: Jan Kara <jack@suse.cz>
Date: Wed, 24 Sep 2014 10:18:25 +0200
Subject: [PATCH] xfs: Fix mmap data corruption

Signed-off-by: Jan Kara <jack@suse.cz>
---
 fs/xfs/xfs_iops.c | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/fs/xfs/xfs_iops.c b/fs/xfs/xfs_iops.c
index 72129493e9d3..19ce64bfb4f7 100644
--- a/fs/xfs/xfs_iops.c
+++ b/fs/xfs/xfs_iops.c
@@ -848,6 +848,8 @@ xfs_setattr_size(
 	if (error)
 		return error;
 	truncate_setsize(inode, newsize);
+	if (oldsize < newsize)
+		block_create_hole(inode, oldsize, newsize);
 
 	tp = xfs_trans_alloc(mp, XFS_TRANS_SETATTR_SIZE);
 	error = xfs_trans_reserve(tp, &M_RES(mp)->tr_itruncate, 0, 0);
-- 
1.8.1.4



* Re: [PATCH 1/2] vfs: Fix data corruption when blocksize < pagesize for mmaped data
  2014-09-23 15:03 ` [PATCH 1/2] vfs: Fix data corruption when blocksize < pagesize for mmaped data Jan Kara
@ 2014-09-25  1:32   ` Dave Chinner
  2014-09-25  9:34     ` Jan Kara
  0 siblings, 1 reply; 10+ messages in thread
From: Dave Chinner @ 2014-09-25  1:32 UTC (permalink / raw)
  To: Jan Kara; +Cc: linux-fsdevel, linux-mm, linux-ext4, Ted Tso

On Tue, Sep 23, 2014 at 05:03:22PM +0200, Jan Kara wrote:
> ->page_mkwrite() is used by filesystems to allocate blocks under a page
> which is becoming writeably mmapped in some process' address space. This
> allows a filesystem to return a page fault if there is not enough space
> available, user exceeds quota or similar problem happens, rather than
> silently discarding data later when writepage is called.
> 
> However VFS fails to call ->page_mkwrite() in all the cases where
> filesystems need it when blocksize < pagesize. For example when
> blocksize = 1024, pagesize = 4096 the following is problematic:
>   ftruncate(fd, 0);
>   pwrite(fd, buf, 1024, 0);
>   map = mmap(NULL, 1024, PROT_WRITE, MAP_SHARED, fd, 0);
>   map[0] = 'a';       ----> page_mkwrite() for index 0 is called
>   ftruncate(fd, 10000); /* or even pwrite(fd, buf, 1, 10000) */
>   mremap(map, 1024, 10000, 0);
>   map[4095] = 'a';    ----> no page_mkwrite() called
> 
> At the moment ->page_mkwrite() is called, filesystem can allocate only
> one block for the page because i_size == 1024. Otherwise it would create
> blocks beyond i_size which is generally undesirable. But later at
> ->writepage() time, we also need to store data at offset 4095 but we
> don't have block allocated for it.
...
>  
> +#ifdef CONFIG_MMU
> +/**
> + * block_create_hole - handle creation of a hole in a file
> + * @inode:	inode where the hole is created
> + * @from:	offset in bytes where the hole starts
> + * @to:		offset in bytes where the hole ends.

This function doesn't create holes.  It also manipulates page state,
not block state.  Probably could do with a better name, but I'm not
sure what a better name is - something like
pagecache_extend_isize(old_eof, new_eof)?


> +void block_create_hole(struct inode *inode, loff_t from, loff_t to)
> +{
> +	int bsize = 1 << inode->i_blkbits;
> +	loff_t rounded_from;
> +	struct page *page;
> +	pgoff_t index;
> +
> +	WARN_ON(!mutex_is_locked(&inode->i_mutex));
> +	WARN_ON(to > inode->i_size);

We've already changed i_size, so shouldn't that be:

	WARN_ON(to != inode->i_size);

> +
> +	if (from >= to || bsize == PAGE_CACHE_SIZE)
> +		return;
> +	/* Currently last page will not have any hole block created? */
> +	rounded_from = ALIGN(from, bsize);

That rounds down? or up? round_down/round_up are much better than
ALIGN() because they tell you exactly what rounding was intended...

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com



* Re: [PATCH 1/2] vfs: Fix data corruption when blocksize < pagesize for mmaped data
  2014-09-25  1:32   ` Dave Chinner
@ 2014-09-25  9:34     ` Jan Kara
  0 siblings, 0 replies; 10+ messages in thread
From: Jan Kara @ 2014-09-25  9:34 UTC (permalink / raw)
  To: Dave Chinner; +Cc: Jan Kara, linux-fsdevel, linux-mm, linux-ext4, Ted Tso

On Thu 25-09-14 11:32:16, Dave Chinner wrote:
> On Tue, Sep 23, 2014 at 05:03:22PM +0200, Jan Kara wrote:
> > ->page_mkwrite() is used by filesystems to allocate blocks under a page
> > which is becoming writeably mmapped in some process' address space. This
> > allows a filesystem to return a page fault if there is not enough space
> > available, user exceeds quota or similar problem happens, rather than
> > silently discarding data later when writepage is called.
> > 
> > However VFS fails to call ->page_mkwrite() in all the cases where
> > filesystems need it when blocksize < pagesize. For example when
> > blocksize = 1024, pagesize = 4096 the following is problematic:
> >   ftruncate(fd, 0);
> >   pwrite(fd, buf, 1024, 0);
> >   map = mmap(NULL, 1024, PROT_WRITE, MAP_SHARED, fd, 0);
> >   map[0] = 'a';       ----> page_mkwrite() for index 0 is called
> >   ftruncate(fd, 10000); /* or even pwrite(fd, buf, 1, 10000) */
> >   mremap(map, 1024, 10000, 0);
> >   map[4095] = 'a';    ----> no page_mkwrite() called
> > 
> > At the moment ->page_mkwrite() is called, filesystem can allocate only
> > one block for the page because i_size == 1024. Otherwise it would create
> > blocks beyond i_size which is generally undesirable. But later at
> > ->writepage() time, we also need to store data at offset 4095 but we
> > don't have block allocated for it.
> ...
> >  
> > +#ifdef CONFIG_MMU
> > +/**
> > + * block_create_hole - handle creation of a hole in a file
> > + * @inode:	inode where the hole is created
> > + * @from:	offset in bytes where the hole starts
> > + * @to:		offset in bytes where the hole ends.
> 
> This function doesn't create holes.  It also manipulates page state,
> not block state.  Probably could do with a better name, but I'm not
> sure what a better name is - something like
> pagecache_extend_isize(old_eof, new_eof)?
  Yeah, you are right. I'd actually be better off moving that function
to mm/truncate.c. Regarding the name, I agree block_create_hole() isn't
very accurate, but I don't like pagecache_extend_isize() too much either -
see below for the reason.

> > +void block_create_hole(struct inode *inode, loff_t from, loff_t to)
> > +{
> > +	int bsize = 1 << inode->i_blkbits;
> > +	loff_t rounded_from;
> > +	struct page *page;
> > +	pgoff_t index;
> > +
> > +	WARN_ON(!mutex_is_locked(&inode->i_mutex));
> > +	WARN_ON(to > inode->i_size);
> 
> We've already changed i_size, so shouldn't that be:
  Not quite. When you have 1k blocksize, filesize == 512 and you do
pwrite(file, buf, 1024, 8192);
then you want the function to be called for the range 512 - 8192; however,
i_size is already 9216. So the assertion is correct as is. This is also
the reason why I don't like pagecache_extend_isize(): it suggests 'to' is
the final i_size but it's not.
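
To spell out the arithmetic with those (made-up) numbers, 1k blocks and
4k pages:

  from = 512 (old i_size), to = 8192 (start of the write), i_size = 9216
  rounded_from = round_up(512, 1024) = 1024
  rounded_from & (PAGE_CACHE_SIZE - 1) = 1024  -> not page aligned, keep going
  index = 512 >> PAGE_CACHE_SHIFT = 0          -> write-protect page 0

so the partial page at the old EOF gets its writable mappings cleaned and
the next mmap write to it goes through ->page_mkwrite() again.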

Maybe the simplest interface would be to really call the function
pagecache_extend_isize(), have it take just 'to', and let it write the new
i_size and handle the pagecache tricks. Extending writes would then be
handled by first calling pagecache_extend_isize() to extend to 'pos' and
then just i_size_write() the final size...
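
Roughly something like this (a hypothetical sketch just to illustrate the
idea, not an actual patch):

  void pagecache_extend_isize(struct inode *inode, loff_t to)
  {
          loff_t from = inode->i_size;

          i_size_write(inode, to);
          /* ... write-protect the page straddling 'from', as we do now ... */
  }

  /* an extending write would then do: */
  pagecache_extend_isize(inode, pos);        /* extend to start of the write */
  i_size_write(inode, pos + copied);         /* and then set the final size */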

								Honza
> 	WARN_ON(to != inode->i_size);
> 
> > +
> > +	if (from >= to || bsize == PAGE_CACHE_SIZE)
> > +		return;
> > +	/* Currently last page will not have any hole block created? */
> > +	rounded_from = ALIGN(from, bsize);
> 
> That rounds down? or up? round_down/round_up are much better than
> ALIGN() because they tell you exactly what rounding was intended...
  Good point. I'll use round_up().
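
For the record, ALIGN() also rounds up (and bsize is always a power of two
here), so the behavior doesn't change; round_up() just makes the intent
obvious:

  ALIGN(512, 1024)      == 1024   /* rounds up to the next block boundary */
  round_up(512, 1024)   == 1024   /* same result, clearer name */
  round_down(512, 1024) == 0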

								Honza
-- 
Jan Kara <jack@suse.cz>
SUSE Labs, CR



* [PATCH 1/2] vfs: Fix data corruption when blocksize < pagesize for mmaped data
  2014-09-25 12:41 [PATCH 0/2 v2] Fix data corruption when blocksize < pagesize for mmapped data Jan Kara
@ 2014-09-25 12:41 ` Jan Kara
  2014-10-02  2:06   ` Theodore Ts'o
  0 siblings, 1 reply; 10+ messages in thread
From: Jan Kara @ 2014-09-25 12:41 UTC (permalink / raw)
  To: linux-fsdevel; +Cc: linux-ext4, xfs, Dave Chinner, Ted Tso, Jan Kara

->page_mkwrite() is used by filesystems to allocate blocks under a page
which is becoming writeably mmapped in some process' address space. This
allows a filesystem to fail the page fault if there is not enough space
available, the user exceeds quota, or a similar problem happens, rather
than silently discarding data later when writepage is called.

However, the VFS fails to call ->page_mkwrite() in all the cases where
filesystems need it when blocksize < pagesize. For example, when
blocksize = 1024 and pagesize = 4096, the following is problematic:
  ftruncate(fd, 0);
  pwrite(fd, buf, 1024, 0);
  map = mmap(NULL, 1024, PROT_WRITE, MAP_SHARED, fd, 0);
  map[0] = 'a';       ----> page_mkwrite() for index 0 is called
  ftruncate(fd, 10000); /* or even pwrite(fd, buf, 1, 10000) */
  mremap(map, 1024, 10000, 0);
  map[4095] = 'a';    ----> no page_mkwrite() called

At the moment ->page_mkwrite() is called, the filesystem can allocate only
one block for the page because i_size == 1024. Otherwise it would create
blocks beyond i_size, which is generally undesirable. But later, at
->writepage() time, we also need to store data at offset 4095 but we
don't have a block allocated for it.

This patch introduces a helper function filesystems can use to have
->page_mkwrite() called at all the necessary moments.

Signed-off-by: Jan Kara <jack@suse.cz>
---
 fs/buffer.c        |  3 +++
 include/linux/mm.h |  8 ++++++++
 mm/truncate.c      | 57 ++++++++++++++++++++++++++++++++++++++++++++++++++++++
 3 files changed, 68 insertions(+)

diff --git a/fs/buffer.c b/fs/buffer.c
index 8f05111bbb8b..3ba5a6a1bc5f 100644
--- a/fs/buffer.c
+++ b/fs/buffer.c
@@ -2080,6 +2080,7 @@ int generic_write_end(struct file *file, struct address_space *mapping,
 			struct page *page, void *fsdata)
 {
 	struct inode *inode = mapping->host;
+	loff_t old_size = inode->i_size;
 	int i_size_changed = 0;
 
 	copied = block_write_end(file, mapping, pos, len, copied, page, fsdata);
@@ -2099,6 +2100,8 @@ int generic_write_end(struct file *file, struct address_space *mapping,
 	unlock_page(page);
 	page_cache_release(page);
 
+	if (old_size < pos)
+		pagecache_isize_extended(inode, old_size, pos);
 	/*
 	 * Don't mark the inode dirty under page lock. First, it unnecessarily
 	 * makes the holding time of page lock longer. Second, it forces lock
diff --git a/include/linux/mm.h b/include/linux/mm.h
index 8981cc882ed2..f0e53e5a3b17 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -1155,6 +1155,14 @@ static inline void unmap_shared_mapping_range(struct address_space *mapping,
 
 extern void truncate_pagecache(struct inode *inode, loff_t new);
 extern void truncate_setsize(struct inode *inode, loff_t newsize);
+#ifdef CONFIG_MMU
+void pagecache_isize_extended(struct inode *inode, loff_t from, loff_t to);
+#else
+static inline void pagecache_isize_extended(struct inode *inode, loff_t from,
+					    loff_t to)
+{
+}
+#endif
 void truncate_pagecache_range(struct inode *inode, loff_t offset, loff_t end);
 int truncate_inode_page(struct address_space *mapping, struct page *page);
 int generic_error_remove_page(struct address_space *mapping, struct page *page);
diff --git a/mm/truncate.c b/mm/truncate.c
index 96d167372d89..261eaf6e5a19 100644
--- a/mm/truncate.c
+++ b/mm/truncate.c
@@ -20,6 +20,7 @@
 #include <linux/buffer_head.h>	/* grr. try_to_release_page,
 				   do_invalidatepage */
 #include <linux/cleancache.h>
+#include <linux/rmap.h>
 #include "internal.h"
 
 static void clear_exceptional_entry(struct address_space *mapping,
@@ -719,12 +720,68 @@ EXPORT_SYMBOL(truncate_pagecache);
  */
 void truncate_setsize(struct inode *inode, loff_t newsize)
 {
+	loff_t oldsize = inode->i_size;
+
 	i_size_write(inode, newsize);
+	if (newsize > oldsize)
+		pagecache_isize_extended(inode, oldsize, newsize);
 	truncate_pagecache(inode, newsize);
 }
 EXPORT_SYMBOL(truncate_setsize);
 
 /**
+ * pagecache_isize_extended - update pagecache after extension of i_size
+ * @inode:	inode for which i_size was extended
+ * @from:	original inode size
+ * @to:		new inode size
+ *
+ * Handle extension of inode size either caused by extending truncate or by
+ * write starting after current i_size. We mark the page straddling current
+ * i_size RO so that page_mkwrite() is called on the nearest write access to
+ * the page.  This way filesystem can be sure that page_mkwrite() is called on
+ * the page before user writes to the page via mmap after the i_size has been
+ * changed.
+ *
+ * The function must be called after i_size is updated so that page fault
+ * coming after we unlock the page will already see the new i_size.
+ * The function must be called while we still hold i_mutex - this not only
+ * makes sure i_size is stable but also that userspace cannot observe new
+ * i_size value before we are prepared to store mmap writes at new inode size.
+ */
+void pagecache_isize_extended(struct inode *inode, loff_t from, loff_t to)
+{
+	int bsize = 1 << inode->i_blkbits;
+	loff_t rounded_from;
+	struct page *page;
+	pgoff_t index;
+
+	WARN_ON(!mutex_is_locked(&inode->i_mutex));
+	WARN_ON(to > inode->i_size);
+
+	if (from >= to || bsize == PAGE_CACHE_SIZE)
+		return;
+	/* Page straddling @from will not have any hole block created? */
+	rounded_from = round_up(from, bsize);
+	if (to <= rounded_from || !(rounded_from & (PAGE_CACHE_SIZE - 1)))
+		return;
+
+	index = from >> PAGE_CACHE_SHIFT;
+	page = find_lock_page(inode->i_mapping, index);
+	/* Page not cached? Nothing to do */
+	if (!page)
+		return;
+	/*
+	 * See clear_page_dirty_for_io() for details why set_page_dirty()
+	 * is needed.
+	 */
+	if (page_mkclean(page))
+		set_page_dirty(page);
+	unlock_page(page);
+	page_cache_release(page);
+}
+EXPORT_SYMBOL(pagecache_isize_extended);
+
+/**
  * truncate_pagecache_range - unmap and remove pagecache that is hole-punched
  * @inode: inode
  * @lstart: offset of beginning of hole
-- 
1.8.1.4



* Re: [PATCH 1/2] vfs: Fix data corruption when blocksize < pagesize for mmaped data
  2014-09-25 12:41 ` [PATCH 1/2] vfs: Fix data corruption when blocksize < pagesize for mmaped data Jan Kara
@ 2014-10-02  2:06   ` Theodore Ts'o
  0 siblings, 0 replies; 10+ messages in thread
From: Theodore Ts'o @ 2014-10-02  2:06 UTC (permalink / raw)
  To: Jan Kara; +Cc: linux-fsdevel, linux-ext4, xfs, Dave Chinner

On Thu, Sep 25, 2014 at 02:41:55PM +0200, Jan Kara wrote:
> ->page_mkwrite() is used by filesystems to allocate blocks under a page
> which is becoming writeably mmapped in some process' address space. This
> allows a filesystem to return a page fault if there is not enough space
> available, user exceeds quota or similar problem happens, rather than
> silently discarding data later when writepage is called.
> 
> However VFS fails to call ->page_mkwrite() in all the cases where
> filesystems need it when blocksize < pagesize. For example when
> blocksize = 1024, pagesize = 4096 the following is problematic:
>   ftruncate(fd, 0);
>   pwrite(fd, buf, 1024, 0);
>   map = mmap(NULL, 1024, PROT_WRITE, MAP_SHARED, fd, 0);
>   map[0] = 'a';       ----> page_mkwrite() for index 0 is called
>   ftruncate(fd, 10000); /* or even pwrite(fd, buf, 1, 10000) */
>   mremap(map, 1024, 10000, 0);
>   map[4095] = 'a';    ----> no page_mkwrite() called
> 
> At the moment ->page_mkwrite() is called, filesystem can allocate only
> one block for the page because i_size == 1024. Otherwise it would create
> blocks beyond i_size which is generally undesirable. But later at
> ->writepage() time, we also need to store data at offset 4095 but we
> don't have block allocated for it.
> 
> This patch introduces a helper function filesystems can use to have
> ->page_mkwrite() called at all the necessary moments.
> 
> Signed-off-by: Jan Kara <jack@suse.cz>

Applied to the ext4 tree, thanks.

					- Ted


* [PATCH 1/2] vfs: Fix data corruption when blocksize < pagesize for mmaped data
  2014-10-10 14:23 [PATCH 0/2 v2] Fix data corruption when blocksize < pagesize for mmapped data Jan Kara
@ 2014-10-10 14:23 ` Jan Kara
  0 siblings, 0 replies; 10+ messages in thread
From: Jan Kara @ 2014-10-10 14:23 UTC (permalink / raw)
  To: linux-fsdevel
  Cc: Dave Kleikamp, jfs-discussion, tytso, Jeff Mahoney, Mark Fasheh,
	Dave Chinner, reiserfs-devel, xfs, cluster-devel, Joel Becker,
	Jan Kara, linux-ext4, Steven Whitehouse, ocfs2-devel, viro

->page_mkwrite() is used by filesystems to allocate blocks under a page
which is becoming writeably mmapped in some process' address space. This
allows a filesystem to fail the page fault if there is not enough space
available, the user exceeds quota, or a similar problem happens, rather
than silently discarding data later when writepage is called.

However, the VFS fails to call ->page_mkwrite() in all the cases where
filesystems need it when blocksize < pagesize. For example, when
blocksize = 1024 and pagesize = 4096, the following is problematic:
  ftruncate(fd, 0);
  pwrite(fd, buf, 1024, 0);
  map = mmap(NULL, 1024, PROT_WRITE, MAP_SHARED, fd, 0);
  map[0] = 'a';       ----> page_mkwrite() for index 0 is called
  ftruncate(fd, 10000); /* or even pwrite(fd, buf, 1, 10000) */
  mremap(map, 1024, 10000, 0);
  map[4095] = 'a';    ----> no page_mkwrite() called

At the moment ->page_mkwrite() is called, the filesystem can allocate only
one block for the page because i_size == 1024. Otherwise it would create
blocks beyond i_size, which is generally undesirable. But later, at
->writepage() time, we also need to store data at offset 4095 but we
don't have a block allocated for it.

This patch introduces a helper function filesystems can use to have
->page_mkwrite() called at all the necessary moments.

Signed-off-by: Jan Kara <jack@suse.cz>
---
 fs/buffer.c        |  3 +++
 include/linux/mm.h |  8 ++++++++
 mm/truncate.c      | 57 ++++++++++++++++++++++++++++++++++++++++++++++++++++++
 3 files changed, 68 insertions(+)

diff --git a/fs/buffer.c b/fs/buffer.c
index 8f05111bbb8b..3ba5a6a1bc5f 100644
--- a/fs/buffer.c
+++ b/fs/buffer.c
@@ -2080,6 +2080,7 @@ int generic_write_end(struct file *file, struct address_space *mapping,
 			struct page *page, void *fsdata)
 {
 	struct inode *inode = mapping->host;
+	loff_t old_size = inode->i_size;
 	int i_size_changed = 0;
 
 	copied = block_write_end(file, mapping, pos, len, copied, page, fsdata);
@@ -2099,6 +2100,8 @@ int generic_write_end(struct file *file, struct address_space *mapping,
 	unlock_page(page);
 	page_cache_release(page);
 
+	if (old_size < pos)
+		pagecache_isize_extended(inode, old_size, pos);
 	/*
 	 * Don't mark the inode dirty under page lock. First, it unnecessarily
 	 * makes the holding time of page lock longer. Second, it forces lock
diff --git a/include/linux/mm.h b/include/linux/mm.h
index 8981cc882ed2..f0e53e5a3b17 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -1155,6 +1155,14 @@ static inline void unmap_shared_mapping_range(struct address_space *mapping,
 
 extern void truncate_pagecache(struct inode *inode, loff_t new);
 extern void truncate_setsize(struct inode *inode, loff_t newsize);
+#ifdef CONFIG_MMU
+void pagecache_isize_extended(struct inode *inode, loff_t from, loff_t to);
+#else
+static inline void pagecache_isize_extended(struct inode *inode, loff_t from,
+					    loff_t to)
+{
+}
+#endif
 void truncate_pagecache_range(struct inode *inode, loff_t offset, loff_t end);
 int truncate_inode_page(struct address_space *mapping, struct page *page);
 int generic_error_remove_page(struct address_space *mapping, struct page *page);
diff --git a/mm/truncate.c b/mm/truncate.c
index 96d167372d89..261eaf6e5a19 100644
--- a/mm/truncate.c
+++ b/mm/truncate.c
@@ -20,6 +20,7 @@
 #include <linux/buffer_head.h>	/* grr. try_to_release_page,
 				   do_invalidatepage */
 #include <linux/cleancache.h>
+#include <linux/rmap.h>
 #include "internal.h"
 
 static void clear_exceptional_entry(struct address_space *mapping,
@@ -719,12 +720,68 @@ EXPORT_SYMBOL(truncate_pagecache);
  */
 void truncate_setsize(struct inode *inode, loff_t newsize)
 {
+	loff_t oldsize = inode->i_size;
+
 	i_size_write(inode, newsize);
+	if (newsize > oldsize)
+		pagecache_isize_extended(inode, oldsize, newsize);
 	truncate_pagecache(inode, newsize);
 }
 EXPORT_SYMBOL(truncate_setsize);
 
 /**
+ * pagecache_isize_extended - update pagecache after extension of i_size
+ * @inode:	inode for which i_size was extended
+ * @from:	original inode size
+ * @to:		new inode size
+ *
+ * Handle extension of inode size either caused by extending truncate or by
+ * write starting after current i_size. We mark the page straddling current
+ * i_size RO so that page_mkwrite() is called on the nearest write access to
+ * the page.  This way filesystem can be sure that page_mkwrite() is called on
+ * the page before user writes to the page via mmap after the i_size has been
+ * changed.
+ *
+ * The function must be called after i_size is updated so that page fault
+ * coming after we unlock the page will already see the new i_size.
+ * The function must be called while we still hold i_mutex - this not only
+ * makes sure i_size is stable but also that userspace cannot observe new
+ * i_size value before we are prepared to store mmap writes at new inode size.
+ */
+void pagecache_isize_extended(struct inode *inode, loff_t from, loff_t to)
+{
+	int bsize = 1 << inode->i_blkbits;
+	loff_t rounded_from;
+	struct page *page;
+	pgoff_t index;
+
+	WARN_ON(!mutex_is_locked(&inode->i_mutex));
+	WARN_ON(to > inode->i_size);
+
+	if (from >= to || bsize == PAGE_CACHE_SIZE)
+		return;
+	/* Page straddling @from will not have any hole block created? */
+	rounded_from = round_up(from, bsize);
+	if (to <= rounded_from || !(rounded_from & (PAGE_CACHE_SIZE - 1)))
+		return;
+
+	index = from >> PAGE_CACHE_SHIFT;
+	page = find_lock_page(inode->i_mapping, index);
+	/* Page not cached? Nothing to do */
+	if (!page)
+		return;
+	/*
+	 * See clear_page_dirty_for_io() for details why set_page_dirty()
+	 * is needed.
+	 */
+	if (page_mkclean(page))
+		set_page_dirty(page);
+	unlock_page(page);
+	page_cache_release(page);
+}
+EXPORT_SYMBOL(pagecache_isize_extended);
+
+/**
  * truncate_pagecache_range - unmap and remove pagecache that is hole-punched
  * @inode: inode
  * @lstart: offset of beginning of hole
-- 
1.8.1.4




end of thread (newest: 2014-10-10 14:23 UTC)

Thread overview: 10+ messages
2014-09-23 15:03 [PATCH 0/2] Fix data corruption when blocksize < pagesize Jan Kara
2014-09-23 15:03 ` [PATCH 1/2] vfs: Fix data corruption when blocksize < pagesize for mmaped data Jan Kara
2014-09-25  1:32   ` Dave Chinner
2014-09-25  9:34     ` Jan Kara
2014-09-23 15:03 ` [PATCH 2/2] ext4: Fix mmap data corruption when blocksize < pagesize Jan Kara
2014-09-24  8:45   ` Jan Kara
2014-09-24  8:57     ` Jan Kara
  -- strict thread matches above, loose matches on Subject: below --
2014-09-25 12:41 [PATCH 0/2 v2] Fix data corruption when blocksize < pagesize for mmapped data Jan Kara
2014-09-25 12:41 ` [PATCH 1/2] vfs: Fix data corruption when blocksize < pagesize for mmaped data Jan Kara
2014-10-02  2:06   ` Theodore Ts'o
2014-10-10 14:23 [PATCH 0/2 v2] Fix data corruption when blocksize < pagesize for mmapped data Jan Kara
2014-10-10 14:23 ` [PATCH 1/2] vfs: Fix data corruption when blocksize < pagesize for mmaped data Jan Kara
