[patch 1/3] fs: buffer_head writepage no invalidate

linux-mm.kvack.org archive mirror
 help / color / mirror / Atom feed

* [patch 1/3] fs: buffer_head writepage no invalidate
       [not found] <20090710073028.782561541@suse.de>
@ 2009-07-10  9:33 ` Nick Piggin
  2009-07-10  9:34   ` [patch 2/3] fs: buffer_head writepage no zero Nick Piggin
                     ` (2 more replies)
  0 siblings, 3 replies; 6+ messages in thread
From: Nick Piggin @ 2009-07-10  9:33 UTC (permalink / raw)
  To: linux-fsdevel; +Cc: hch, viro, jack, linux-mm

After the previous patchset, this is my progress on the page_mkwrite
thing... These patches are RFC only and have some bugs.
--
invalidate should not be required in the writeout path. The truncate
sequence will first reduce i_size, then clean and discard any existing
pagecache (and no new dirty pagecache can be added because i_size was
reduced and i_mutex is being held), then filesystem data structures
are updated.

Filesystem needs to be able to handle writeout at any point before
the last step, and once the 2nd step completes, there should be no
unfreeable dirty buffers anyway (truncate performs the do_invalidatepage).

Having filesystem changes depend on reading i_size without holding
i_mutex is confusing at least. There is still a case in writepage
paths in buffer.c uses i_size (testing which block to write out), but
this is a small improvement.
---
 fs/buffer.c |   20 ++------------------
 1 file changed, 2 insertions(+), 18 deletions(-)

Index: linux-2.6/fs/buffer.c
===================================================================
--- linux-2.6.orig/fs/buffer.c
+++ linux-2.6/fs/buffer.c
@@ -2663,18 +2663,8 @@ int nobh_writepage(struct page *page, ge
 	/* Is the page fully outside i_size? (truncate in progress) */
 	offset = i_size & (PAGE_CACHE_SIZE-1);
 	if (page->index >= end_index+1 || !offset) {
-		/*
-		 * The page may have dirty, unmapped buffers.  For example,
-		 * they may have been added in ext3_writepage().  Make them
-		 * freeable here, so the page does not leak.
-		 */
-#if 0
-		/* Not really sure about this  - do we need this ? */
-		if (page->mapping->a_ops->invalidatepage)
-			page->mapping->a_ops->invalidatepage(page, offset);
-#endif
 		unlock_page(page);
-		return 0; /* don't care */
+		return 0;
 	}
 
 	/*
@@ -2867,14 +2857,8 @@ int block_write_full_page_endio(struct p
 	/* Is the page fully outside i_size? (truncate in progress) */
 	offset = i_size & (PAGE_CACHE_SIZE-1);
 	if (page->index >= end_index+1 || !offset) {
-		/*
-		 * The page may have dirty, unmapped buffers.  For example,
-		 * they may have been added in ext3_writepage().  Make them
-		 * freeable here, so the page does not leak.
-		 */
-		do_invalidatepage(page, 0);
 		unlock_page(page);
-		return 0; /* don't care */
+		return 0;
 	}
 
 	/*

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 6+ messages in thread

* [patch 2/3] fs: buffer_head writepage no zero
  2009-07-10  9:33 ` [patch 1/3] fs: buffer_head writepage no invalidate Nick Piggin
@ 2009-07-10  9:34   ` Nick Piggin
  2009-07-10 11:46     ` Jan Kara
  2009-07-10  9:35   ` [patch 3/3] fs: buffer_head page_lock i_size relax Nick Piggin
  2009-07-10 11:08   ` [patch 1/3] fs: buffer_head writepage no invalidate Jan Kara
  2 siblings, 1 reply; 6+ messages in thread
From: Nick Piggin @ 2009-07-10  9:34 UTC (permalink / raw)
  To: linux-fsdevel; +Cc: hch, viro, jack, linux-mm


When writing a page to filesystem, buffer.c zeroes out parts of the page past
i_size in an attempt to get zeroes into those blocks on disk, so as to honour
the requirement that an expanding truncate should zero-fill the file.

Unfortunately, this is racy. The reason we can get something other than
zeroes here is via an mmaped write to the block beyond i_size. Zeroing it
out before writepage narrows the window, but it is still possible to store
junk beyond i_size on disk, by storing into the page after writepage zeroes,
but before DMA (or copy) completes. This allows process A to break posix
semantics for process B (or even inadvertently for itsef).

It could also be possible that the filesystem has written data into the
block but not yet expanded the inode size when the system crashes for
some reason. Unless its journal reply / fsck process etc checks for this
condition, it could also cause subsequent breakage in semantics.

---
 fs/buffer.c      |   94 +++++++++++++++++++++++++------------------------------
 fs/ext2/inode.c  |   30 ++++++++++++++++-
 fs/mpage.c       |   13 +------
 mm/filemap_xip.c |   15 ++++----
 mm/truncate.c    |    1 
 5 files changed, 82 insertions(+), 71 deletions(-)

Index: linux-2.6/fs/buffer.c
===================================================================
--- linux-2.6.orig/fs/buffer.c
+++ linux-2.6/fs/buffer.c
@@ -2656,26 +2656,14 @@ int nobh_writepage(struct page *page, ge
 	unsigned offset;
 	int ret;
 
-	/* Is the page fully inside i_size? */
-	if (page->index < end_index)
-		goto out;
-
 	/* Is the page fully outside i_size? (truncate in progress) */
 	offset = i_size & (PAGE_CACHE_SIZE-1);
-	if (page->index >= end_index+1 || !offset) {
+	if (page->index >= end_index &&
+			(page->index >= end_index+1 || !offset)) {
 		unlock_page(page);
 		return 0;
 	}
 
-	/*
-	 * The page straddles i_size.  It must be zeroed out on each and every
-	 * writepage invocation because it may be mmapped.  "A file is mapped
-	 * in multiples of the page size.  For a file that is not a multiple of
-	 * the  page size, the remaining memory is zeroed when mapped, and
-	 * writes to that region are not written out to the file."
-	 */
-	zero_user_segment(page, offset, PAGE_CACHE_SIZE);
-out:
 	ret = mpage_writepage(page, get_block, wbc);
 	if (ret == -EAGAIN)
 		ret = __block_write_full_page(inode, page, get_block, wbc,
@@ -2695,22 +2683,23 @@ int nobh_truncate_page(struct address_sp
 	struct inode *inode = mapping->host;
 	struct page *page;
 	struct buffer_head map_bh;
-	int err;
+	int err = 0;
 
 	blocksize = 1 << inode->i_blkbits;
 	length = offset & (blocksize - 1);
 
 	/* Block boundary? Nothing to do */
 	if (!length)
-		return 0;
+		goto out;
 
 	length = blocksize - length;
 	iblock = (sector_t)index << (PAGE_CACHE_SHIFT - inode->i_blkbits);
 
 	page = grab_cache_page(mapping, index);
-	err = -ENOMEM;
-	if (!page)
+	if (!page) {
+		err = -ENOMEM;
 		goto out;
+	}
 
 	if (page_has_buffers(page)) {
 has_buffers:
@@ -2752,7 +2741,6 @@ has_buffers:
 	}
 	zero_user(page, offset, length);
 	set_page_dirty(page);
-	err = 0;
 
 unlock:
 	unlock_page(page);
@@ -2762,8 +2750,8 @@ out:
 }
 EXPORT_SYMBOL(nobh_truncate_page);
 
-int block_truncate_page(struct address_space *mapping,
-			loff_t from, get_block_t *get_block)
+int __block_truncate_page(struct address_space *mapping,
+			loff_t from, loff_t to, get_block_t *get_block)
 {
 	pgoff_t index = from >> PAGE_CACHE_SHIFT;
 	unsigned offset = from & (PAGE_CACHE_SIZE-1);
@@ -2773,22 +2761,22 @@ int block_truncate_page(struct address_s
 	struct inode *inode = mapping->host;
 	struct page *page;
 	struct buffer_head *bh;
-	int err;
+	int err = 0;
 
-	blocksize = 1 << inode->i_blkbits;
-	length = offset & (blocksize - 1);
+	/* Page boundary? Nothing to do */
+	if (!offset)
+		goto out;
 
-	/* Block boundary? Nothing to do */
-	if (!length)
-		return 0;
+	blocksize = 1 << inode->i_blkbits;
 
-	length = blocksize - length;
+	length = (unsigned)min_t(loff_t, to - from, PAGE_CACHE_SIZE - offset);
 	iblock = (sector_t)index << (PAGE_CACHE_SHIFT - inode->i_blkbits);
 	
 	page = grab_cache_page(mapping, index);
-	err = -ENOMEM;
-	if (!page)
+	if (!page) {
+		err = -ENOMEM;
 		goto out;
+	}
 
 	if (!page_has_buffers(page))
 		create_empty_buffers(page, blocksize, 0);
@@ -2802,15 +2790,20 @@ int block_truncate_page(struct address_s
 		pos += blocksize;
 	}
 
-	err = 0;
 	if (!buffer_mapped(bh)) {
 		WARN_ON(bh->b_size != blocksize);
 		err = get_block(inode, iblock, bh, 0);
 		if (err)
 			goto unlock;
-		/* unmapped? It's a hole - nothing to do */
-		if (!buffer_mapped(bh))
+		/*
+		 * unmapped? It's a hole - must zero out partial
+		 * in the case of an extending truncate where mmap has
+		 * previously written past i_size of the page
+		 */
+		if (!buffer_mapped(bh)) {
+			zero_user(page, offset, length);
 			goto unlock;
+		}
 	}
 
 	/* Ok, it's mapped. Make sure it's up-to-date */
@@ -2818,17 +2811,17 @@ int block_truncate_page(struct address_s
 		set_buffer_uptodate(bh);
 
 	if (!buffer_uptodate(bh) && !buffer_delay(bh) && !buffer_unwritten(bh)) {
-		err = -EIO;
 		ll_rw_block(READ, 1, &bh);
 		wait_on_buffer(bh);
 		/* Uhhuh. Read error. Complain and punt. */
-		if (!buffer_uptodate(bh))
+		if (!buffer_uptodate(bh)) {
+			err = -EIO;
 			goto unlock;
+		}
 	}
 
 	zero_user(page, offset, length);
 	mark_buffer_dirty(bh);
-	err = 0;
 
 unlock:
 	unlock_page(page);
@@ -2836,6 +2829,19 @@ unlock:
 out:
 	return err;
 }
+EXPORT_SYMBOL(__block_truncate_page);
+
+int block_truncate_page(struct address_space *mapping,
+			loff_t from, get_block_t *get_block)
+{
+	struct inode *inode = mapping->host;
+	int err = 0;
+
+	if (from > inode->i_size)
+		err = __block_truncate_page(mapping, inode->i_size, from, get_block);
+
+	return err;
+}
 
 /*
  * The generic ->writepage function for buffer-backed address_spaces
@@ -2849,26 +2855,14 @@ int block_write_full_page_endio(struct p
 	const pgoff_t end_index = i_size >> PAGE_CACHE_SHIFT;
 	unsigned offset;
 
-	/* Is the page fully inside i_size? */
-	if (page->index < end_index)
-		return __block_write_full_page(inode, page, get_block, wbc,
-					       handler);
-
 	/* Is the page fully outside i_size? (truncate in progress) */
 	offset = i_size & (PAGE_CACHE_SIZE-1);
-	if (page->index >= end_index+1 || !offset) {
+	if (page->index >= end_index &&
+			(page->index >= end_index+1 || !offset)) {
 		unlock_page(page);
 		return 0;
 	}
 
-	/*
-	 * The page straddles i_size.  It must be zeroed out on each and every
-	 * writepage invokation because it may be mmapped.  "A file is mapped
-	 * in multiples of the page size.  For a file that is not a multiple of
-	 * the  page size, the remaining memory is zeroed when mapped, and
-	 * writes to that region are not written out to the file."
-	 */
-	zero_user_segment(page, offset, PAGE_CACHE_SIZE);
 	return __block_write_full_page(inode, page, get_block, wbc, handler);
 }
 
Index: linux-2.6/mm/filemap_xip.c
===================================================================
--- linux-2.6.orig/mm/filemap_xip.c
+++ linux-2.6/mm/filemap_xip.c
@@ -440,22 +440,23 @@ EXPORT_SYMBOL_GPL(xip_file_write);
 int
 xip_truncate_page(struct address_space *mapping, loff_t from)
 {
+	struct inode *inode = mapping->host;
 	pgoff_t index = from >> PAGE_CACHE_SHIFT;
 	unsigned offset = from & (PAGE_CACHE_SIZE-1);
 	unsigned blocksize;
 	unsigned length;
 	void *xip_mem;
 	unsigned long xip_pfn;
-	int err;
+	int err = 0;
 
 	BUG_ON(!mapping->a_ops->get_xip_mem);
 
-	blocksize = 1 << mapping->host->i_blkbits;
+	blocksize = 1 << inode->i_blkbits;
 	length = offset & (blocksize - 1);
 
 	/* Block boundary? Nothing to do */
 	if (!length)
-		return 0;
+		goto out;
 
 	length = blocksize - length;
 
@@ -464,11 +465,11 @@ xip_truncate_page(struct address_space *
 	if (unlikely(err)) {
 		if (err == -ENODATA)
 			/* Hole? No need to truncate */
-			return 0;
-		else
-			return err;
+			err = 0;
+		goto out;
 	}
 	memset(xip_mem + offset, 0, length);
-	return 0;
+out:
+	return err;
 }
 EXPORT_SYMBOL_GPL(xip_truncate_page);
Index: linux-2.6/fs/mpage.c
===================================================================
--- linux-2.6.orig/fs/mpage.c
+++ linux-2.6/fs/mpage.c
@@ -559,19 +559,10 @@ static int __mpage_writepage(struct page
 page_is_mapped:
 	end_index = i_size >> PAGE_CACHE_SHIFT;
 	if (page->index >= end_index) {
-		/*
-		 * The page straddles i_size.  It must be zeroed out on each
-		 * and every writepage invokation because it may be mmapped.
-		 * "A file is mapped in multiples of the page size.  For a file
-		 * that is not a multiple of the page size, the remaining memory
-		 * is zeroed when mapped, and writes to that region are not
-		 * written out to the file."
-		 */
 		unsigned offset = i_size & (PAGE_CACHE_SIZE - 1);
 
-		if (page->index > end_index || !offset)
-			goto confused;
-		zero_user_segment(page, offset, PAGE_CACHE_SIZE);
+		if (page->index >= end_index+1 || !offset)
+			goto confused; /* page fully outside i_size */
 	}
 
 	/*
Index: linux-2.6/fs/ext2/inode.c
===================================================================
--- linux-2.6.orig/fs/ext2/inode.c
+++ linux-2.6/fs/ext2/inode.c
@@ -777,14 +777,40 @@ ext2_write_begin(struct file *file, stru
 	return ret;
 }
 
+int __block_truncate_page(struct address_space *mapping,
+			loff_t from, loff_t to, get_block_t *get_block);
 static int ext2_write_end(struct file *file, struct address_space *mapping,
 			loff_t pos, unsigned len, unsigned copied,
 			struct page *page, void *fsdata)
 {
+	struct inode *inode = mapping->host;
 	int ret;
 
-	ret = generic_write_end(file, mapping, pos, len, copied, page, fsdata);
-	if (ret < len) {
+	ret = block_write_end(file, mapping, pos, len, copied, page, fsdata);
+	unlock_page(page);
+	page_cache_release(page);
+        if (pos+copied > inode->i_size) {
+		int err;
+                if (pos > inode->i_size) {
+                        /* expanding a hole */
+			err = __block_truncate_page(mapping, inode->i_size,
+						pos, ext2_get_block);
+			if (err) {
+				ret = err;
+				goto out;
+			}
+			err = __block_truncate_page(mapping, pos+copied,
+						LLONG_MAX, ext2_get_block);
+			if (err) {
+				ret = err;
+				goto out;
+			}
+                }
+                i_size_write(inode, pos+copied);
+                mark_inode_dirty(inode);
+        }
+out:
+	if (ret < 0 || ret < len) {
 		struct inode *inode = mapping->host;
 		loff_t isize = inode->i_size;
 		if (pos + len > isize)
Index: linux-2.6/mm/truncate.c
===================================================================
--- linux-2.6.orig/mm/truncate.c
+++ linux-2.6/mm/truncate.c
@@ -49,7 +49,6 @@ void do_invalidatepage(struct page *page
 
 static inline void truncate_partial_page(struct page *page, unsigned partial)
 {
-	zero_user_segment(page, partial, PAGE_CACHE_SIZE);
 	if (page_has_private(page))
 		do_invalidatepage(page, partial);
 }

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 6+ messages in thread

* [patch 3/3] fs: buffer_head page_lock i_size relax
  2009-07-10  9:33 ` [patch 1/3] fs: buffer_head writepage no invalidate Nick Piggin
  2009-07-10  9:34   ` [patch 2/3] fs: buffer_head writepage no zero Nick Piggin
@ 2009-07-10  9:35   ` Nick Piggin
  2009-07-10 11:08   ` [patch 1/3] fs: buffer_head writepage no invalidate Jan Kara
  2 siblings, 0 replies; 6+ messages in thread
From: Nick Piggin @ 2009-07-10  9:35 UTC (permalink / raw)
  To: linux-fsdevel; +Cc: hch, viro, jack, linux-mm


The previous patch allows us to relax the buffer.c requirement that the page
lock be held in order to avoid writepage zeroing out new data beyond isize.
[actually as I said I think it still has a bug due to writepage requiring
i_size_read]

---
 fs/buffer.c |   28 ++++++++--------------------
 mm/shmem.c  |    6 +++---
 2 files changed, 11 insertions(+), 23 deletions(-)

Index: linux-2.6/fs/buffer.c
===================================================================
--- linux-2.6.orig/fs/buffer.c
+++ linux-2.6/fs/buffer.c
@@ -2049,33 +2049,20 @@ int generic_write_end(struct file *file,
 			struct page *page, void *fsdata)
 {
 	struct inode *inode = mapping->host;
-	int i_size_changed = 0;
 
 	copied = block_write_end(file, mapping, pos, len, copied, page, fsdata);
 
+	unlock_page(page);
+	page_cache_release(page);
+
 	/*
 	 * No need to use i_size_read() here, the i_size
 	 * cannot change under us because we hold i_mutex.
-	 *
-	 * But it's important to update i_size while still holding page lock:
-	 * page writeout could otherwise come in and zero beyond i_size.
 	 */
 	if (pos+copied > inode->i_size) {
 		i_size_write(inode, pos+copied);
-		i_size_changed = 1;
-	}
-
-	unlock_page(page);
-	page_cache_release(page);
-
-	/*
-	 * Don't mark the inode dirty under page lock. First, it unnecessarily
-	 * makes the holding time of page lock longer. Second, it forces lock
-	 * ordering of page lock and transaction start for journaling
-	 * filesystems.
-	 */
-	if (i_size_changed)
 		mark_inode_dirty(inode);
+	}
 
 	return copied;
 }
@@ -2624,14 +2611,15 @@ int nobh_write_end(struct file *file, st
 
 	SetPageUptodate(page);
 	set_page_dirty(page);
+
+	unlock_page(page);
+	page_cache_release(page);
+
 	if (pos+copied > inode->i_size) {
 		i_size_write(inode, pos+copied);
 		mark_inode_dirty(inode);
 	}
 
-	unlock_page(page);
-	page_cache_release(page);
-
 	while (head) {
 		bh = head;
 		head = head->b_this_page;
Index: linux-2.6/mm/shmem.c
===================================================================
--- linux-2.6.orig/mm/shmem.c
+++ linux-2.6/mm/shmem.c
@@ -1631,13 +1631,13 @@ shmem_write_end(struct file *file, struc
 {
 	struct inode *inode = mapping->host;
 
-	if (pos + copied > inode->i_size)
-		i_size_write(inode, pos + copied);
-
 	unlock_page(page);
 	set_page_dirty(page);
 	page_cache_release(page);
 
+	if (pos + copied > inode->i_size)
+		i_size_write(inode, pos + copied);
+
 	return copied;
 }
 

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 6+ messages in thread

* Re: [patch 1/3] fs: buffer_head writepage no invalidate
  2009-07-10  9:33 ` [patch 1/3] fs: buffer_head writepage no invalidate Nick Piggin
  2009-07-10  9:34   ` [patch 2/3] fs: buffer_head writepage no zero Nick Piggin
  2009-07-10  9:35   ` [patch 3/3] fs: buffer_head page_lock i_size relax Nick Piggin
@ 2009-07-10 11:08   ` Jan Kara
  2 siblings, 0 replies; 6+ messages in thread
From: Jan Kara @ 2009-07-10 11:08 UTC (permalink / raw)
  To: Nick Piggin; +Cc: linux-fsdevel, hch, viro, linux-mm

On Fri 10-07-09 11:33:25, Nick Piggin wrote:
> After the previous patchset, this is my progress on the page_mkwrite
> thing... These patches are RFC only and have some bugs.
> --
> invalidate should not be required in the writeout path. The truncate
> sequence will first reduce i_size, then clean and discard any existing
> pagecache (and no new dirty pagecache can be added because i_size was
> reduced and i_mutex is being held), then filesystem data structures
> are updated.
> 
> Filesystem needs to be able to handle writeout at any point before
> the last step, and once the 2nd step completes, there should be no
> unfreeable dirty buffers anyway (truncate performs the do_invalidatepage).
> 
> Having filesystem changes depend on reading i_size without holding
> i_mutex is confusing at least. There is still a case in writepage
> paths in buffer.c uses i_size (testing which block to write out), but
> this is a small improvement.
  This looks good.
Acked-by: Jan Kara <jack@suse.cz>

								Honza

> ---
>  fs/buffer.c |   20 ++------------------
>  1 file changed, 2 insertions(+), 18 deletions(-)
> 
> Index: linux-2.6/fs/buffer.c
> ===================================================================
> --- linux-2.6.orig/fs/buffer.c
> +++ linux-2.6/fs/buffer.c
> @@ -2663,18 +2663,8 @@ int nobh_writepage(struct page *page, ge
>  	/* Is the page fully outside i_size? (truncate in progress) */
>  	offset = i_size & (PAGE_CACHE_SIZE-1);
>  	if (page->index >= end_index+1 || !offset) {
> -		/*
> -		 * The page may have dirty, unmapped buffers.  For example,
> -		 * they may have been added in ext3_writepage().  Make them
> -		 * freeable here, so the page does not leak.
> -		 */
> -#if 0
> -		/* Not really sure about this  - do we need this ? */
> -		if (page->mapping->a_ops->invalidatepage)
> -			page->mapping->a_ops->invalidatepage(page, offset);
> -#endif
>  		unlock_page(page);
> -		return 0; /* don't care */
> +		return 0;
>  	}
>  
>  	/*
> @@ -2867,14 +2857,8 @@ int block_write_full_page_endio(struct p
>  	/* Is the page fully outside i_size? (truncate in progress) */
>  	offset = i_size & (PAGE_CACHE_SIZE-1);
>  	if (page->index >= end_index+1 || !offset) {
> -		/*
> -		 * The page may have dirty, unmapped buffers.  For example,
> -		 * they may have been added in ext3_writepage().  Make them
> -		 * freeable here, so the page does not leak.
> -		 */
> -		do_invalidatepage(page, 0);
>  		unlock_page(page);
> -		return 0; /* don't care */
> +		return 0;
>  	}
>  
>  	/*
-- 
Jan Kara <jack@suse.cz>
SUSE Labs, CR

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 6+ messages in thread

* Re: [patch 2/3] fs: buffer_head writepage no zero
  2009-07-10  9:34   ` [patch 2/3] fs: buffer_head writepage no zero Nick Piggin
@ 2009-07-10 11:46     ` Jan Kara
  2009-07-13  6:54       ` Nick Piggin
  0 siblings, 1 reply; 6+ messages in thread
From: Jan Kara @ 2009-07-10 11:46 UTC (permalink / raw)
  To: Nick Piggin; +Cc: linux-fsdevel, hch, viro, jack, linux-mm

On Fri 10-07-09 11:34:03, Nick Piggin wrote:
> 
> When writing a page to filesystem, buffer.c zeroes out parts of the page past
> i_size in an attempt to get zeroes into those blocks on disk, so as to honour
> the requirement that an expanding truncate should zero-fill the file.
> 
> Unfortunately, this is racy. The reason we can get something other than
> zeroes here is via an mmaped write to the block beyond i_size. Zeroing it
> out before writepage narrows the window, but it is still possible to store
> junk beyond i_size on disk, by storing into the page after writepage zeroes,
> but before DMA (or copy) completes. This allows process A to break posix
> semantics for process B (or even inadvertently for itsef).
> 
> It could also be possible that the filesystem has written data into the
> block but not yet expanded the inode size when the system crashes for
> some reason. Unless its journal reply / fsck process etc checks for this
> condition, it could also cause subsequent breakage in semantics.
  Actually, it should be possible to fix the posix semantics by zeroing out
the page when i_size is going to be extended - hmm, I see you're trying to
do something like that in ext2 code. Ugh. Since we have to lock the
old last page to make mkwrite work anyway, I think we should do it in a
generic code (probably in a separate patch and just note it here...).
  I can include it in my mkwrite fixes when I port them on top of your
patches.

> ---
>  fs/buffer.c      |   94 +++++++++++++++++++++++++------------------------------
>  fs/ext2/inode.c  |   30 ++++++++++++++++-
>  fs/mpage.c       |   13 +------
>  mm/filemap_xip.c |   15 ++++----
>  mm/truncate.c    |    1 
>  5 files changed, 82 insertions(+), 71 deletions(-)
> 
> Index: linux-2.6/fs/buffer.c
> ===================================================================
> --- linux-2.6.orig/fs/buffer.c
> +++ linux-2.6/fs/buffer.c
> @@ -2656,26 +2656,14 @@ int nobh_writepage(struct page *page, ge
>  	unsigned offset;
>  	int ret;
>  
> -	/* Is the page fully inside i_size? */
> -	if (page->index < end_index)
> -		goto out;
> -
>  	/* Is the page fully outside i_size? (truncate in progress) */
>  	offset = i_size & (PAGE_CACHE_SIZE-1);
> -	if (page->index >= end_index+1 || !offset) {
> +	if (page->index >= end_index &&
> +			(page->index >= end_index+1 || !offset)) {
>  		unlock_page(page);
>  		return 0;
>  	}
>  
> -	/*
> -	 * The page straddles i_size.  It must be zeroed out on each and every
> -	 * writepage invocation because it may be mmapped.  "A file is mapped
> -	 * in multiples of the page size.  For a file that is not a multiple of
> -	 * the  page size, the remaining memory is zeroed when mapped, and
> -	 * writes to that region are not written out to the file."
> -	 */
> -	zero_user_segment(page, offset, PAGE_CACHE_SIZE);
> -out:
>  	ret = mpage_writepage(page, get_block, wbc);
>  	if (ret == -EAGAIN)
>  		ret = __block_write_full_page(inode, page, get_block, wbc,
> @@ -2695,22 +2683,23 @@ int nobh_truncate_page(struct address_sp
>  	struct inode *inode = mapping->host;
>  	struct page *page;
>  	struct buffer_head map_bh;
> -	int err;
> +	int err = 0;
>  
>  	blocksize = 1 << inode->i_blkbits;
>  	length = offset & (blocksize - 1);
>  
>  	/* Block boundary? Nothing to do */
>  	if (!length)
> -		return 0;
> +		goto out;
>  
>  	length = blocksize - length;
>  	iblock = (sector_t)index << (PAGE_CACHE_SHIFT - inode->i_blkbits);
>  
>  	page = grab_cache_page(mapping, index);
> -	err = -ENOMEM;
> -	if (!page)
> +	if (!page) {
> +		err = -ENOMEM;
>  		goto out;
> +	}
>  
>  	if (page_has_buffers(page)) {
>  has_buffers:
> @@ -2752,7 +2741,6 @@ has_buffers:
>  	}
>  	zero_user(page, offset, length);
>  	set_page_dirty(page);
> -	err = 0;
>  
>  unlock:
>  	unlock_page(page);
  Above two chunks are just style cleanup, aren't they? Could you maybe separate
it from the logical changes?

> @@ -2762,8 +2750,8 @@ out:
>  }
>  EXPORT_SYMBOL(nobh_truncate_page);
>  
> -int block_truncate_page(struct address_space *mapping,
> -			loff_t from, get_block_t *get_block)
> +int __block_truncate_page(struct address_space *mapping,
> +			loff_t from, loff_t to, get_block_t *get_block)
>  {
>  	pgoff_t index = from >> PAGE_CACHE_SHIFT;
>  	unsigned offset = from & (PAGE_CACHE_SIZE-1);
> @@ -2773,22 +2761,22 @@ int block_truncate_page(struct address_s
>  	struct inode *inode = mapping->host;
>  	struct page *page;
>  	struct buffer_head *bh;
> -	int err;
> +	int err = 0;
>  
> -	blocksize = 1 << inode->i_blkbits;
> -	length = offset & (blocksize - 1);
> +	/* Page boundary? Nothing to do */
> +	if (!offset)
> +		goto out;
>  
> -	/* Block boundary? Nothing to do */
> -	if (!length)
> -		return 0;
> +	blocksize = 1 << inode->i_blkbits;
>  
> -	length = blocksize - length;
> +	length = (unsigned)min_t(loff_t, to - from, PAGE_CACHE_SIZE - offset);
>  	iblock = (sector_t)index << (PAGE_CACHE_SHIFT - inode->i_blkbits);
>  	page = grab_cache_page(mapping, index);
> -	err = -ENOMEM;
> -	if (!page)
> +	if (!page) {
> +		err = -ENOMEM;
>  		goto out;
> +	}
>  
>  	if (!page_has_buffers(page))
>  		create_empty_buffers(page, blocksize, 0);
> @@ -2802,15 +2790,20 @@ int block_truncate_page(struct address_s
>  		pos += blocksize;
>  	}
>  
> -	err = 0;
>  	if (!buffer_mapped(bh)) {
>  		WARN_ON(bh->b_size != blocksize);
>  		err = get_block(inode, iblock, bh, 0);
>  		if (err)
>  			goto unlock;
> -		/* unmapped? It's a hole - nothing to do */
> -		if (!buffer_mapped(bh))
> +		/*
> +		 * unmapped? It's a hole - must zero out partial
> +		 * in the case of an extending truncate where mmap has
> +		 * previously written past i_size of the page
> +		 */
> +		if (!buffer_mapped(bh)) {
> +			zero_user(page, offset, length);
>  			goto unlock;
  Hmm, but who was zeroing out the page previously? Because the end of the
page gets zeroed already now...

> +		}
>  	}
>  
>  	/* Ok, it's mapped. Make sure it's up-to-date */
> @@ -2818,17 +2811,17 @@ int block_truncate_page(struct address_s
>  		set_buffer_uptodate(bh);
>  
>  	if (!buffer_uptodate(bh) && !buffer_delay(bh) && !buffer_unwritten(bh)) {
> -		err = -EIO;
>  		ll_rw_block(READ, 1, &bh);
>  		wait_on_buffer(bh);
>  		/* Uhhuh. Read error. Complain and punt. */
> -		if (!buffer_uptodate(bh))
> +		if (!buffer_uptodate(bh)) {
> +			err = -EIO;
>  			goto unlock;
> +		}
>  	}
>  
>  	zero_user(page, offset, length);
>  	mark_buffer_dirty(bh);
> -	err = 0;
>  
>  unlock:
>  	unlock_page(page);
> @@ -2836,6 +2829,19 @@ unlock:
>  out:
>  	return err;
>  }
> +EXPORT_SYMBOL(__block_truncate_page);
> +
> +int block_truncate_page(struct address_space *mapping,
> +			loff_t from, get_block_t *get_block)
> +{
> +	struct inode *inode = mapping->host;
> +	int err = 0;
> +
> +	if (from > inode->i_size)
> +		err = __block_truncate_page(mapping, inode->i_size, from, get_block);
> +
> +	return err;
> +}
>  
>  /*
>   * The generic ->writepage function for buffer-backed address_spaces
> @@ -2849,26 +2855,14 @@ int block_write_full_page_endio(struct p
>  	const pgoff_t end_index = i_size >> PAGE_CACHE_SHIFT;
>  	unsigned offset;
>  
> -	/* Is the page fully inside i_size? */
> -	if (page->index < end_index)
> -		return __block_write_full_page(inode, page, get_block, wbc,
> -					       handler);
> -
>  	/* Is the page fully outside i_size? (truncate in progress) */
>  	offset = i_size & (PAGE_CACHE_SIZE-1);
> -	if (page->index >= end_index+1 || !offset) {
> +	if (page->index >= end_index &&
> +			(page->index >= end_index+1 || !offset)) {
>  		unlock_page(page);
>  		return 0;
>  	}
>  
> -	/*
> -	 * The page straddles i_size.  It must be zeroed out on each and every
> -	 * writepage invokation because it may be mmapped.  "A file is mapped
> -	 * in multiples of the page size.  For a file that is not a multiple of
> -	 * the  page size, the remaining memory is zeroed when mapped, and
> -	 * writes to that region are not written out to the file."
> -	 */
> -	zero_user_segment(page, offset, PAGE_CACHE_SIZE);
>  	return __block_write_full_page(inode, page, get_block, wbc, handler);
>  }
  I suppose you should also update __block_write_full_page() - there's
comment about zeroing. Also I'm not sure that marking buffer as uptodate
there is a good idea when the buffer isn't zeroed.

> Index: linux-2.6/mm/filemap_xip.c
> ===================================================================
> --- linux-2.6.orig/mm/filemap_xip.c
> +++ linux-2.6/mm/filemap_xip.c
> @@ -440,22 +440,23 @@ EXPORT_SYMBOL_GPL(xip_file_write);
>  int
>  xip_truncate_page(struct address_space *mapping, loff_t from)
>  {
> +	struct inode *inode = mapping->host;
>  	pgoff_t index = from >> PAGE_CACHE_SHIFT;
>  	unsigned offset = from & (PAGE_CACHE_SIZE-1);
>  	unsigned blocksize;
>  	unsigned length;
>  	void *xip_mem;
>  	unsigned long xip_pfn;
> -	int err;
> +	int err = 0;
>  
>  	BUG_ON(!mapping->a_ops->get_xip_mem);
>  
> -	blocksize = 1 << mapping->host->i_blkbits;
> +	blocksize = 1 << inode->i_blkbits;
>  	length = offset & (blocksize - 1);
>  
>  	/* Block boundary? Nothing to do */
>  	if (!length)
> -		return 0;
> +		goto out;
>  
>  	length = blocksize - length;
>  
> @@ -464,11 +465,11 @@ xip_truncate_page(struct address_space *
>  	if (unlikely(err)) {
>  		if (err == -ENODATA)
>  			/* Hole? No need to truncate */
> -			return 0;
> -		else
> -			return err;
> +			err = 0;
> +		goto out;
>  	}
>  	memset(xip_mem + offset, 0, length);
> -	return 0;
> +out:
> +	return err;
>  }
>  EXPORT_SYMBOL_GPL(xip_truncate_page);
  Again, only a style change, right?

> Index: linux-2.6/fs/mpage.c
> ===================================================================
> --- linux-2.6.orig/fs/mpage.c
> +++ linux-2.6/fs/mpage.c
> @@ -559,19 +559,10 @@ static int __mpage_writepage(struct page
>  page_is_mapped:
>  	end_index = i_size >> PAGE_CACHE_SHIFT;
>  	if (page->index >= end_index) {
> -		/*
> -		 * The page straddles i_size.  It must be zeroed out on each
> -		 * and every writepage invokation because it may be mmapped.
> -		 * "A file is mapped in multiples of the page size.  For a file
> -		 * that is not a multiple of the page size, the remaining memory
> -		 * is zeroed when mapped, and writes to that region are not
> -		 * written out to the file."
> -		 */
>  		unsigned offset = i_size & (PAGE_CACHE_SIZE - 1);
>  
> -		if (page->index > end_index || !offset)
> -			goto confused;
> -		zero_user_segment(page, offset, PAGE_CACHE_SIZE);
> +		if (page->index >= end_index+1 || !offset)
> +			goto confused; /* page fully outside i_size */
>  	}
>  
>  	/*
> Index: linux-2.6/fs/ext2/inode.c
> ===================================================================
> --- linux-2.6.orig/fs/ext2/inode.c
> +++ linux-2.6/fs/ext2/inode.c
> @@ -777,14 +777,40 @@ ext2_write_begin(struct file *file, stru
>  	return ret;
>  }
>  
> +int __block_truncate_page(struct address_space *mapping,
> +			loff_t from, loff_t to, get_block_t *get_block);
  Uf, that's ugly... Shouldn't it be in some header?

>  static int ext2_write_end(struct file *file, struct address_space *mapping,
>  			loff_t pos, unsigned len, unsigned copied,
>  			struct page *page, void *fsdata)
>  {
> +	struct inode *inode = mapping->host;
>  	int ret;
>  
> -	ret = generic_write_end(file, mapping, pos, len, copied, page, fsdata);
> -	if (ret < len) {
> +	ret = block_write_end(file, mapping, pos, len, copied, page, fsdata);
> +	unlock_page(page);
> +	page_cache_release(page);
> +        if (pos+copied > inode->i_size) {
> +		int err;
> +                if (pos > inode->i_size) {
> +                        /* expanding a hole */
> +			err = __block_truncate_page(mapping, inode->i_size,
> +						pos, ext2_get_block);
> +			if (err) {
> +				ret = err;
> +				goto out;
> +			}
> +			err = __block_truncate_page(mapping, pos+copied,
> +						LLONG_MAX, ext2_get_block);
> +			if (err) {
> +				ret = err;
> +				goto out;
> +			}
> +                }
> +                i_size_write(inode, pos+copied);
> +                mark_inode_dirty(inode);
> +        }
> +out:
> +	if (ret < 0 || ret < len) {
>  		struct inode *inode = mapping->host;
>  		loff_t isize = inode->i_size;
>  		if (pos + len > isize)
  There are whitespace problems above... Also calling __block_truncate_page()
on old i_size looks strange - we just want to zero-out the page if it
exists (this way we'd unnecessarily read it from disk). Also I think
block_write_end() should do this.
  Finally, zeroing after pos+copied does not make sence to be conditioned
by pos > inode->i_size and again I don't think it's needed...

> Index: linux-2.6/mm/truncate.c
> ===================================================================
> --- linux-2.6.orig/mm/truncate.c
> +++ linux-2.6/mm/truncate.c
> @@ -49,7 +49,6 @@ void do_invalidatepage(struct page *page
>  
>  static inline void truncate_partial_page(struct page *page, unsigned partial)
>  {
> -	zero_user_segment(page, partial, PAGE_CACHE_SIZE);
>  	if (page_has_private(page))
>  		do_invalidatepage(page, partial);
>  }

								Honza
-- 
Jan Kara <jack@suse.cz>
SUSE Labs, CR

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 6+ messages in thread

* Re: [patch 2/3] fs: buffer_head writepage no zero
  2009-07-10 11:46     ` Jan Kara
@ 2009-07-13  6:54       ` Nick Piggin
  0 siblings, 0 replies; 6+ messages in thread
From: Nick Piggin @ 2009-07-13  6:54 UTC (permalink / raw)
  To: Jan Kara; +Cc: linux-fsdevel, hch, viro, linux-mm

On Fri, Jul 10, 2009 at 01:46:51PM +0200, Jan Kara wrote:
> On Fri 10-07-09 11:34:03, Nick Piggin wrote:
> > 
> > When writing a page to filesystem, buffer.c zeroes out parts of the page past
> > i_size in an attempt to get zeroes into those blocks on disk, so as to honour
> > the requirement that an expanding truncate should zero-fill the file.
> > 
> > Unfortunately, this is racy. The reason we can get something other than
> > zeroes here is via an mmaped write to the block beyond i_size. Zeroing it
> > out before writepage narrows the window, but it is still possible to store
> > junk beyond i_size on disk, by storing into the page after writepage zeroes,
> > but before DMA (or copy) completes. This allows process A to break posix
> > semantics for process B (or even inadvertently for itsef).
> > 
> > It could also be possible that the filesystem has written data into the
> > block but not yet expanded the inode size when the system crashes for
> > some reason. Unless its journal reply / fsck process etc checks for this
> > condition, it could also cause subsequent breakage in semantics.
>   Actually, it should be possible to fix the posix semantics by zeroing out
> the page when i_size is going to be extended - hmm, I see you're trying to
> do something like that in ext2 code. Ugh. Since we have to lock the

Yeah, it could probably do it in write_begin in generic code, that
part was a bit ugly.

> old last page to make mkwrite work anyway, I think we should do it in a
> generic code (probably in a separate patch and just note it here...).
>   I can include it in my mkwrite fixes when I port them on top of your
> patches.
> 
> > @@ -2752,7 +2741,6 @@ has_buffers:
> >  	}
> >  	zero_user(page, offset, length);
> >  	set_page_dirty(page);
> > -	err = 0;
> >  
> >  unlock:
> >  	unlock_page(page);
>   Above two chunks are just style cleanup, aren't they? Could you maybe separate
> it from the logical changes?

Yes I think so. They devolved from something that was actually useful,
and I should remove them.

 
> > @@ -2802,15 +2790,20 @@ int block_truncate_page(struct address_s
> >  		pos += blocksize;
> >  	}
> >  
> > -	err = 0;
> >  	if (!buffer_mapped(bh)) {
> >  		WARN_ON(bh->b_size != blocksize);
> >  		err = get_block(inode, iblock, bh, 0);
> >  		if (err)
> >  			goto unlock;
> > -		/* unmapped? It's a hole - nothing to do */
> > -		if (!buffer_mapped(bh))
> > +		/*
> > +		 * unmapped? It's a hole - must zero out partial
> > +		 * in the case of an extending truncate where mmap has
> > +		 * previously written past i_size of the page
> > +		 */
> > +		if (!buffer_mapped(bh)) {
> > +			zero_user(page, offset, length);
> >  			goto unlock;
>   Hmm, but who was zeroing out the page previously? Because the end of the
> page gets zeroed already now...

Yes it does aready get zeroed, however I think ftruncate semantics
say that expanding ftruncate shoud leave the new area with zero
filled. A partial-mmap on the last page could have dirtied these
parts of the page and so break the guarantee.

I guess it could be ignored because such partial mmap writes are
supposed to result in undefined behaviour, however I think it is
a bit wrong (also the result could change based on memory pressure
even when another program opens the file, so I think zeroing here
is best).
 

> > -	/*
> > -	 * The page straddles i_size.  It must be zeroed out on each and every
> > -	 * writepage invokation because it may be mmapped.  "A file is mapped
> > -	 * in multiples of the page size.  For a file that is not a multiple of
> > -	 * the  page size, the remaining memory is zeroed when mapped, and
> > -	 * writes to that region are not written out to the file."
> > -	 */
> > -	zero_user_segment(page, offset, PAGE_CACHE_SIZE);
> >  	return __block_write_full_page(inode, page, get_block, wbc, handler);
> >  }
>   I suppose you should also update __block_write_full_page() - there's
> comment about zeroing. Also I'm not sure that marking buffer as uptodate
> there is a good idea when the buffer isn't zeroed.

Thanks I'll check it out.


> >  EXPORT_SYMBOL_GPL(xip_truncate_page);
>   Again, only a style change, right?

Yes.

> > Index: linux-2.6/fs/ext2/inode.c
> > ===================================================================
> > --- linux-2.6.orig/fs/ext2/inode.c
> > +++ linux-2.6/fs/ext2/inode.c
> > @@ -777,14 +777,40 @@ ext2_write_begin(struct file *file, stru
> >  	return ret;
> >  }
> >  
> > +int __block_truncate_page(struct address_space *mapping,
> > +			loff_t from, loff_t to, get_block_t *get_block);
>   Uf, that's ugly... Shouldn't it be in some header?
> 
> >  static int ext2_write_end(struct file *file, struct address_space *mapping,
> >  			loff_t pos, unsigned len, unsigned copied,
> >  			struct page *page, void *fsdata)
> >  {
> > +	struct inode *inode = mapping->host;
> >  	int ret;
> >  
> > -	ret = generic_write_end(file, mapping, pos, len, copied, page, fsdata);
> > -	if (ret < len) {
> > +	ret = block_write_end(file, mapping, pos, len, copied, page, fsdata);
> > +	unlock_page(page);
> > +	page_cache_release(page);
> > +        if (pos+copied > inode->i_size) {
> > +		int err;
> > +                if (pos > inode->i_size) {
> > +                        /* expanding a hole */
> > +			err = __block_truncate_page(mapping, inode->i_size,
> > +						pos, ext2_get_block);
> > +			if (err) {
> > +				ret = err;
> > +				goto out;
> > +			}
> > +			err = __block_truncate_page(mapping, pos+copied,
> > +						LLONG_MAX, ext2_get_block);
> > +			if (err) {
> > +				ret = err;
> > +				goto out;
> > +			}
> > +                }
> > +                i_size_write(inode, pos+copied);
> > +                mark_inode_dirty(inode);
> > +        }
> > +out:
> > +	if (ret < 0 || ret < len) {
> >  		struct inode *inode = mapping->host;
> >  		loff_t isize = inode->i_size;
> >  		if (pos + len > isize)
>   There are whitespace problems above... Also calling __block_truncate_page()
> on old i_size looks strange - we just want to zero-out the page if it
> exists (this way we'd unnecessarily read it from disk). Also I think
> block_write_end() should do this.
>   Finally, zeroing after pos+copied does not make sence to be conditioned
> by pos > inode->i_size and again I don't think it's needed...

Yeah this part was ugly because it was just a result of working
through bugs and I didn't really try to make it nice. I agree if
we can move as much as possible to generic code it woud be
best.

Thanks for review. I'll try to post another version soon.

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 6+ messages in thread

end of thread, other threads:[~2009-07-13  6:33 UTC | newest]

Thread overview: 6+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
     [not found] <20090710073028.782561541@suse.de>
2009-07-10  9:33 ` [patch 1/3] fs: buffer_head writepage no invalidate Nick Piggin
2009-07-10  9:34   ` [patch 2/3] fs: buffer_head writepage no zero Nick Piggin
2009-07-10 11:46     ` Jan Kara
2009-07-13  6:54       ` Nick Piggin
2009-07-10  9:35   ` [patch 3/3] fs: buffer_head page_lock i_size relax Nick Piggin
2009-07-10 11:08   ` [patch 1/3] fs: buffer_head writepage no invalidate Jan Kara

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).