* [PATCH 0/5] [RFC] Fix page_mkwrite for blocksize < pagesize
@ 2009-08-10 22:20 Jan Kara
2009-08-10 22:20 ` [PATCH 1/5] fs: buffer_head writepage no invalidate Jan Kara
` (4 more replies)
0 siblings, 5 replies; 6+ messages in thread
From: Jan Kara @ 2009-08-10 22:20 UTC (permalink / raw)
To: linux-fsdevel; +Cc: linux-mm, npiggin
Hi,
below is a patch series that is my new approach to solving problems with
page_mkwrite() when blocksize < pagesize. To refresh your memory, the main
issue is as follows:
We'd like to use page_mkwrite() to allocate blocks under a page which is
becoming writeably mmapped in some process address space. This allows a
filesystem to fail the page fault when there is not enough space available,
the user exceeds quota, or a similar problem happens, rather than silently
discarding data later when writepage() is called.
On filesystems where blocksize < pagesize the situation is complicated though.
Think for example that blocksize = 1024, pagesize = 4096 and a process does:
ftruncate(fd, 0);
pwrite(fd, buf, 1024, 0);
map = mmap(NULL, 4096, PROT_WRITE, MAP_SHARED, fd, 0);
map[0] = 'a'; ----> page_mkwrite() for index 0 is called
ftruncate(fd, 10000); /* or even pwrite(fd, buf, 1, 10000) */
fsync(fd); ----> writepage() for index 0 is called
At the moment page_mkwrite() is called, the filesystem can allocate only one
block for the page because i_size == 1024. Otherwise it would create blocks
beyond i_size, which is generally undesirable. But later at writepage() time, we would
like to have blocks allocated for the whole page (and in principle we have to
allocate them because user could have filled the page with data after the
second ftruncate()).
This series is an attempt to fix the above issue. The idea is that we do i_size
update after an extending write or truncate not under the page lock of the page
where the i_size ends up but under the page lock of the page where i_size was
originally. This also allows us to solve a posix compliance issue where we
could have exposed data written via mmap beyond i_size.
I see two disputable things with this approach:
1) set_page_dirty_buffers() and create_empty_buffers() now check i_size.
That's a bit ugly, although not marking buffers dirty beyond i_size makes a lot
of sense to me.
2) to fix the problem with non-zeros written via mmap beyond EOF and then
being exposed by truncate, I've added zeroing to a function doing all the work
when extending i_size (which is essentially the only place where we can reliably
do the work and avoid races with mmap). That's a good fit but basically all
filesystems now have to extend i_size with this function, which I don't find
that pleasing. Does anybody have an idea how to avoid converting every
filesystem, or how to make it less painful?
Thanks for comments in advance.
Honza
* [PATCH 1/5] fs: buffer_head writepage no invalidate
2009-08-10 22:20 [PATCH 0/5] [RFC] Fix page_mkwrite for blocksize < pagesize Jan Kara
@ 2009-08-10 22:20 ` Jan Kara
2009-08-10 22:20 ` [PATCH 2/5] fs: Don't zero out page on writepage() Jan Kara
` (3 subsequent siblings)
4 siblings, 0 replies; 6+ messages in thread
From: Jan Kara @ 2009-08-10 22:20 UTC (permalink / raw)
To: linux-fsdevel; +Cc: linux-mm, npiggin
From: Nick Piggin <npiggin@suse.de>
After the previous patchset, this is my progress on the page_mkwrite
thing... These patches are RFC only and have some bugs.
--
invalidate should not be required in the writeout path. The truncate
sequence will first reduce i_size, then clean and discard any existing
pagecache (and no new dirty pagecache can be added because i_size was
reduced and i_mutex is being held), then filesystem data structures
are updated.
The filesystem needs to be able to handle writeout at any point before
the last step, and once the second step completes, there should be no
unfreeable dirty buffers anyway (truncate performs the do_invalidatepage).
Having filesystem changes depend on reading i_size without holding
i_mutex is confusing at the least. There is still a case in the writepage
paths in buffer.c that uses i_size (testing which block to write out), but
this is a small improvement.
---
fs/buffer.c | 20 ++------------------
1 files changed, 2 insertions(+), 18 deletions(-)
diff --git a/fs/buffer.c b/fs/buffer.c
index a3ef091..c160aa0 100644
--- a/fs/buffer.c
+++ b/fs/buffer.c
@@ -2663,18 +2663,8 @@ int nobh_writepage(struct page *page, get_block_t *get_block,
/* Is the page fully outside i_size? (truncate in progress) */
offset = i_size & (PAGE_CACHE_SIZE-1);
if (page->index >= end_index+1 || !offset) {
- /*
- * The page may have dirty, unmapped buffers. For example,
- * they may have been added in ext3_writepage(). Make them
- * freeable here, so the page does not leak.
- */
-#if 0
- /* Not really sure about this - do we need this ? */
- if (page->mapping->a_ops->invalidatepage)
- page->mapping->a_ops->invalidatepage(page, offset);
-#endif
unlock_page(page);
- return 0; /* don't care */
+ return 0;
}
/*
@@ -2867,14 +2857,8 @@ int block_write_full_page_endio(struct page *page, get_block_t *get_block,
/* Is the page fully outside i_size? (truncate in progress) */
offset = i_size & (PAGE_CACHE_SIZE-1);
if (page->index >= end_index+1 || !offset) {
- /*
- * The page may have dirty, unmapped buffers. For example,
- * they may have been added in ext3_writepage(). Make them
- * freeable here, so the page does not leak.
- */
- do_invalidatepage(page, 0);
unlock_page(page);
- return 0; /* don't care */
+ return 0;
}
/*
--
1.6.0.2
* [PATCH 2/5] fs: Don't zero out page on writepage()
2009-08-10 22:20 [PATCH 0/5] [RFC] Fix page_mkwrite for blocksize < pagesize Jan Kara
2009-08-10 22:20 ` [PATCH 1/5] fs: buffer_head writepage no invalidate Jan Kara
@ 2009-08-10 22:20 ` Jan Kara
2009-08-10 22:20 ` [PATCH 3/5] vfs: Create dirty buffer only inside i_size Jan Kara
` (2 subsequent siblings)
4 siblings, 0 replies; 6+ messages in thread
From: Jan Kara @ 2009-08-10 22:20 UTC (permalink / raw)
To: linux-fsdevel; +Cc: linux-mm, npiggin, Jan Kara
From: Nick Piggin <npiggin@suse.de>
When writing a page to filesystem, vfs zeroes out parts of the page past
i_size in an attempt to get zeroes into those blocks on disk, so as to honour
the requirement that an expanding truncate should zero-fill the file.
Unfortunately, this is racy. The reason we can get something other than
zeroes here is via an mmaped write to the block beyond i_size. Zeroing it
out before writepage narrows the window, but it is still possible to store
junk beyond i_size on disk, by storing into the page after writepage zeroes it,
but before the DMA (or copy) completes. This allows process A to break POSIX
semantics for process B (or even inadvertently for itself).
It could also be possible that the filesystem has written data into the
block but not yet expanded the inode size when the system crashes for
some reason. Unless its journal replay / fsck process etc. checks for this
condition, it could also cause subsequent breakage in semantics.
So just remove this zeroing completely for the time being. It removes the
constraint that i_size has to be updated under the page lock in write_end(),
which helps subsequent patches. The POSIX semantics will be fixed later.
Signed-off-by: Nick Piggin <npiggin@suse.de>
Signed-off-by: Jan Kara <jack@suse.cz>
---
fs/buffer.c | 35 ++++-------------------------------
fs/mpage.c | 13 ++-----------
2 files changed, 6 insertions(+), 42 deletions(-)
diff --git a/fs/buffer.c b/fs/buffer.c
index c160aa0..33da488 100644
--- a/fs/buffer.c
+++ b/fs/buffer.c
@@ -1667,9 +1667,6 @@ static int __block_write_full_page(struct inode *inode, struct page *page,
* this page can be outside i_size when there is a
* truncate in progress.
*/
- /*
- * The buffer was zeroed by block_write_full_page()
- */
clear_buffer_dirty(bh);
set_buffer_uptodate(bh);
} else if ((!buffer_mapped(bh) || buffer_delay(bh)) &&
@@ -2656,26 +2653,14 @@ int nobh_writepage(struct page *page, get_block_t *get_block,
unsigned offset;
int ret;
- /* Is the page fully inside i_size? */
- if (page->index < end_index)
- goto out;
-
/* Is the page fully outside i_size? (truncate in progress) */
offset = i_size & (PAGE_CACHE_SIZE-1);
- if (page->index >= end_index+1 || !offset) {
+ if (page->index >= end_index &&
+ (page->index >= end_index+1 || !offset)) {
unlock_page(page);
return 0;
}
- /*
- * The page straddles i_size. It must be zeroed out on each and every
- * writepage invocation because it may be mmapped. "A file is mapped
- * in multiples of the page size. For a file that is not a multiple of
- * the page size, the remaining memory is zeroed when mapped, and
- * writes to that region are not written out to the file."
- */
- zero_user_segment(page, offset, PAGE_CACHE_SIZE);
-out:
ret = mpage_writepage(page, get_block, wbc);
if (ret == -EAGAIN)
ret = __block_write_full_page(inode, page, get_block, wbc,
@@ -2849,26 +2834,14 @@ int block_write_full_page_endio(struct page *page, get_block_t *get_block,
const pgoff_t end_index = i_size >> PAGE_CACHE_SHIFT;
unsigned offset;
- /* Is the page fully inside i_size? */
- if (page->index < end_index)
- return __block_write_full_page(inode, page, get_block, wbc,
- handler);
-
/* Is the page fully outside i_size? (truncate in progress) */
offset = i_size & (PAGE_CACHE_SIZE-1);
- if (page->index >= end_index+1 || !offset) {
+ if (page->index >= end_index &&
+ (page->index >= end_index + 1 || !offset)) {
unlock_page(page);
return 0;
}
- /*
- * The page straddles i_size. It must be zeroed out on each and every
- * writepage invokation because it may be mmapped. "A file is mapped
- * in multiples of the page size. For a file that is not a multiple of
- * the page size, the remaining memory is zeroed when mapped, and
- * writes to that region are not written out to the file."
- */
- zero_user_segment(page, offset, PAGE_CACHE_SIZE);
return __block_write_full_page(inode, page, get_block, wbc, handler);
}
diff --git a/fs/mpage.c b/fs/mpage.c
index 42381bd..9317762 100644
--- a/fs/mpage.c
+++ b/fs/mpage.c
@@ -559,19 +559,10 @@ static int __mpage_writepage(struct page *page, struct writeback_control *wbc,
page_is_mapped:
end_index = i_size >> PAGE_CACHE_SHIFT;
if (page->index >= end_index) {
- /*
- * The page straddles i_size. It must be zeroed out on each
- * and every writepage invokation because it may be mmapped.
- * "A file is mapped in multiples of the page size. For a file
- * that is not a multiple of the page size, the remaining memory
- * is zeroed when mapped, and writes to that region are not
- * written out to the file."
- */
unsigned offset = i_size & (PAGE_CACHE_SIZE - 1);
- if (page->index > end_index || !offset)
- goto confused;
- zero_user_segment(page, offset, PAGE_CACHE_SIZE);
+ if (page->index >= end_index+1 || !offset)
+ goto confused; /* page fully outside i_size */
}
/*
--
1.6.0.2
* [PATCH 3/5] vfs: Create dirty buffer only inside i_size
2009-08-10 22:20 [PATCH 0/5] [RFC] Fix page_mkwrite for blocksize < pagesize Jan Kara
2009-08-10 22:20 ` [PATCH 1/5] fs: buffer_head writepage no invalidate Jan Kara
2009-08-10 22:20 ` [PATCH 2/5] fs: Don't zero out page on writepage() Jan Kara
@ 2009-08-10 22:20 ` Jan Kara
2009-08-10 22:20 ` [PATCH 4/5] fs: Move i_size update in write_end() from under page lock Jan Kara
2009-08-10 22:20 ` [PATCH 5/5] vfs: Add better VFS support for page_mkwrite when blocksize < pagesize Jan Kara
4 siblings, 0 replies; 6+ messages in thread
From: Jan Kara @ 2009-08-10 22:20 UTC (permalink / raw)
To: linux-fsdevel; +Cc: linux-mm, npiggin, Jan Kara
Change set_page_dirty_buffers() and create_empty_buffers() to create dirty
buffers only inside the current i_size. With this patch, we can rely on
buffer_dirty() to really mean "buffer has data that needs to be written to
disk" in __block_write_full_page(). This lets us distinguish between the cases:
a) block_commit_write() has marked buffers dirty but i_size is not yet updated,
and
b) buffers are beyond i_size and were marked dirty by accident.
So we can write all dirty buffers in __block_write_full_page() and be sure
that we have written all the dirty data a page carries (regardless of whether
the i_size update after a write has already happened or not), while on the
other hand not calling get_block() on buffers that are beyond the end of file.
Signed-off-by: Jan Kara <jack@suse.cz>
---
fs/buffer.c | 38 +++++++++++++++++++++++---------------
1 files changed, 23 insertions(+), 15 deletions(-)
diff --git a/fs/buffer.c b/fs/buffer.c
index 33da488..2b8cabe 100644
--- a/fs/buffer.c
+++ b/fs/buffer.c
@@ -710,13 +710,19 @@ int __set_page_dirty_buffers(struct page *page)
spin_lock(&mapping->private_lock);
if (page_has_buffers(page)) {
+ struct inode *inode = mapping->host;
struct buffer_head *head = page_buffers(page);
struct buffer_head *bh = head;
+ sector_t last_block = (i_size_read(inode) - 1)
+ >> inode->i_blkbits;
+ sector_t block = (sector_t)page->index <<
+ (PAGE_CACHE_SHIFT - inode->i_blkbits);
do {
set_buffer_dirty(bh);
bh = bh->b_this_page;
- } while (bh != head);
+ block++;
+ } while (bh != head && block <= last_block);
}
newly_dirty = !TestSetPageDirty(page);
spin_unlock(&mapping->private_lock);
@@ -1527,7 +1533,9 @@ void create_empty_buffers(struct page *page,
unsigned long blocksize, unsigned long b_state)
{
struct buffer_head *bh, *head, *tail;
+ int dirty = b_state & (1 << BH_Dirty);
+ b_state &= ~(1 << BH_Dirty);
head = alloc_page_buffers(page, blocksize, 1);
bh = head;
do {
@@ -1538,14 +1546,19 @@ void create_empty_buffers(struct page *page,
tail->b_this_page = head;
spin_lock(&page->mapping->private_lock);
- if (PageUptodate(page) || PageDirty(page)) {
+ dirty |= PageDirty(page);
+ if (PageUptodate(page) || dirty) {
+ loff_t size = i_size_read(page->mapping->host);
+ loff_t start = page_offset(page);
+
bh = head;
do {
- if (PageDirty(page))
+ if (dirty && start < size)
set_buffer_dirty(bh);
if (PageUptodate(page))
set_buffer_uptodate(bh);
bh = bh->b_this_page;
+ start += blocksize;
} while (bh != head);
}
attach_page_buffers(page, head);
@@ -1626,7 +1639,6 @@ static int __block_write_full_page(struct inode *inode, struct page *page,
{
int err;
sector_t block;
- sector_t last_block;
struct buffer_head *bh, *head;
const unsigned blocksize = 1 << inode->i_blkbits;
int nr_underway = 0;
@@ -1635,8 +1647,6 @@ static int __block_write_full_page(struct inode *inode, struct page *page,
BUG_ON(!PageLocked(page));
- last_block = (i_size_read(inode) - 1) >> inode->i_blkbits;
-
if (!page_has_buffers(page)) {
create_empty_buffers(page, blocksize,
(1 << BH_Dirty)|(1 << BH_Uptodate));
@@ -1661,15 +1671,7 @@ static int __block_write_full_page(struct inode *inode, struct page *page,
* handle any aliases from the underlying blockdev's mapping.
*/
do {
- if (block > last_block) {
- /*
- * mapped buffers outside i_size will occur, because
- * this page can be outside i_size when there is a
- * truncate in progress.
- */
- clear_buffer_dirty(bh);
- set_buffer_uptodate(bh);
- } else if ((!buffer_mapped(bh) || buffer_delay(bh)) &&
+ if ((!buffer_mapped(bh) || buffer_delay(bh)) &&
buffer_dirty(bh)) {
WARN_ON(bh->b_size != blocksize);
err = get_block(inode, block, bh, 1);
@@ -1703,6 +1705,12 @@ static int __block_write_full_page(struct inode *inode, struct page *page,
redirty_page_for_writepage(wbc, page);
continue;
}
+ /*
+ * We write all mapped & dirty buffers. They can be beyond
+ * current i_size if there's a truncate in progress or if
+ * writepage() got called after block_commit_write() but before
+ * i_size was updated.
+ */
if (test_clear_buffer_dirty(bh)) {
mark_buffer_async_write_endio(bh, handler);
} else {
--
1.6.0.2
* [PATCH 4/5] fs: Move i_size update in write_end() from under page lock
2009-08-10 22:20 [PATCH 0/5] [RFC] Fix page_mkwrite for blocksize < pagesize Jan Kara
` (2 preceding siblings ...)
2009-08-10 22:20 ` [PATCH 3/5] vfs: Create dirty buffer only inside i_size Jan Kara
@ 2009-08-10 22:20 ` Jan Kara
2009-08-10 22:20 ` [PATCH 5/5] vfs: Add better VFS support for page_mkwrite when blocksize < pagesize Jan Kara
4 siblings, 0 replies; 6+ messages in thread
From: Jan Kara @ 2009-08-10 22:20 UTC (permalink / raw)
To: linux-fsdevel; +Cc: linux-mm, npiggin, Jan Kara
From: Nick Piggin <npiggin@suse.de>
The previous patch allows us to relax the requirement that the page lock must
be held in order to avoid writepage zeroing out new data beyond i_size.
Signed-off-by: Nick Piggin <npiggin@suse.de>
Signed-off-by: Jan Kara <jack@suse.cz>
---
fs/buffer.c | 28 ++++++++--------------------
mm/shmem.c | 6 +++---
2 files changed, 11 insertions(+), 23 deletions(-)
diff --git a/fs/buffer.c b/fs/buffer.c
index 2b8cabe..15e7f40 100644
--- a/fs/buffer.c
+++ b/fs/buffer.c
@@ -2049,33 +2049,20 @@ int generic_write_end(struct file *file, struct address_space *mapping,
struct page *page, void *fsdata)
{
struct inode *inode = mapping->host;
- int i_size_changed = 0;
copied = block_write_end(file, mapping, pos, len, copied, page, fsdata);
+ unlock_page(page);
+ page_cache_release(page);
+
/*
* No need to use i_size_read() here, the i_size
* cannot change under us because we hold i_mutex.
- *
- * But it's important to update i_size while still holding page lock:
- * page writeout could otherwise come in and zero beyond i_size.
*/
if (pos+copied > inode->i_size) {
i_size_write(inode, pos+copied);
- i_size_changed = 1;
- }
-
- unlock_page(page);
- page_cache_release(page);
-
- /*
- * Don't mark the inode dirty under page lock. First, it unnecessarily
- * makes the holding time of page lock longer. Second, it forces lock
- * ordering of page lock and transaction start for journaling
- * filesystems.
- */
- if (i_size_changed)
mark_inode_dirty(inode);
+ }
return copied;
}
@@ -2629,14 +2616,15 @@ int nobh_write_end(struct file *file, struct address_space *mapping,
SetPageUptodate(page);
set_page_dirty(page);
+
+ unlock_page(page);
+ page_cache_release(page);
+
if (pos+copied > inode->i_size) {
i_size_write(inode, pos+copied);
mark_inode_dirty(inode);
}
- unlock_page(page);
- page_cache_release(page);
-
while (head) {
bh = head;
head = head->b_this_page;
diff --git a/mm/shmem.c b/mm/shmem.c
index d713239..52ac65c 100644
--- a/mm/shmem.c
+++ b/mm/shmem.c
@@ -1627,13 +1627,13 @@ shmem_write_end(struct file *file, struct address_space *mapping,
{
struct inode *inode = mapping->host;
- if (pos + copied > inode->i_size)
- i_size_write(inode, pos + copied);
-
unlock_page(page);
set_page_dirty(page);
page_cache_release(page);
+ if (pos + copied > inode->i_size)
+ i_size_write(inode, pos + copied);
+
return copied;
}
--
1.6.0.2
* [PATCH 5/5] vfs: Add better VFS support for page_mkwrite when blocksize < pagesize
2009-08-10 22:20 [PATCH 0/5] [RFC] Fix page_mkwrite for blocksize < pagesize Jan Kara
` (3 preceding siblings ...)
2009-08-10 22:20 ` [PATCH 4/5] fs: Move i_size update in write_end() from under page lock Jan Kara
@ 2009-08-10 22:20 ` Jan Kara
4 siblings, 0 replies; 6+ messages in thread
From: Jan Kara @ 2009-08-10 22:20 UTC (permalink / raw)
To: linux-fsdevel; +Cc: linux-mm, npiggin, Jan Kara
page_mkwrite() is meant to be used by filesystems to allocate blocks under a
page which is becoming writeably mmapped in some process address space. This
allows a filesystem to fail the page fault when there is not enough space
available, the user exceeds quota, or a similar problem happens, rather than
silently discarding data later when writepage is called.
On filesystems where blocksize < pagesize the situation is more complicated.
Think for example that blocksize = 1024, pagesize = 4096 and a process does:
ftruncate(fd, 0);
pwrite(fd, buf, 1024, 0);
map = mmap(NULL, 4096, PROT_WRITE, MAP_SHARED, fd, 0);
map[0] = 'a'; ----> page_mkwrite() for index 0 is called
ftruncate(fd, 10000); /* or even pwrite(fd, buf, 1, 10000) */
fsync(fd); ----> writepage() for index 0 is called
At the moment page_mkwrite() is called, the filesystem can allocate only one
block for the page because i_size == 1024. Otherwise it would create blocks
beyond i_size, which is generally undesirable. But later at writepage() time, we would
like to have blocks allocated for the whole page (and in principle we have to
allocate them because user could have filled the page with data after the
second ftruncate()). This patch introduces a framework which allows filesystems
to handle this with a reasonable effort.
Signed-off-by: Jan Kara <jack@suse.cz>
---
fs/buffer.c | 68 +++++++++++++++++++++++++++++++++++++++++-
include/linux/buffer_head.h | 4 ++
include/linux/fs.h | 4 ++-
mm/filemap.c | 10 ++++++-
mm/filemap_xip.c | 3 +-
mm/memory.c | 2 +-
mm/nommu.c | 2 +-
mm/shmem.c | 2 +-
8 files changed, 87 insertions(+), 8 deletions(-)
diff --git a/fs/buffer.c b/fs/buffer.c
index 15e7f40..00f8bdd 100644
--- a/fs/buffer.c
+++ b/fs/buffer.c
@@ -40,6 +40,7 @@
#include <linux/cpu.h>
#include <linux/bitops.h>
#include <linux/mpage.h>
+#include <linux/rmap.h>
#include <linux/bit_spinlock.h>
static int fsync_buffers_list(spinlock_t *lock, struct list_head *list);
@@ -2060,7 +2061,7 @@ int generic_write_end(struct file *file, struct address_space *mapping,
* cannot change under us because we hold i_mutex.
*/
if (pos+copied > inode->i_size) {
- i_size_write(inode, pos+copied);
+ extend_i_size(inode, pos, copied);
mark_inode_dirty(inode);
}
@@ -2360,6 +2361,69 @@ int block_commit_write(struct page *page, unsigned from, unsigned to)
}
/*
+ * Lock inode with I_HOLE_EXTEND if the write is going to create a hole
+ * under a mmapped page. Also mark the page RO so that page_mkwrite()
+ * is called on the nearest write access to the page and clear dirty bits
+ * beyond i_size.
+ *
+ * @pos is the offset to which the write/truncate is happening.
+ */
+void block_extend_i_size(struct inode *inode, loff_t pos, loff_t len)
+{
+ int bsize = 1 << inode->i_blkbits;
+ loff_t rounded_i_size;
+ struct page *page;
+ pgoff_t index;
+ struct buffer_head *head, *bh;
+ sector_t block, last_block;
+ int start, end;
+
+ WARN_ON(!mutex_is_locked(&inode->i_mutex));
+
+ /*
+ * Make sure page_mkwrite() is called on this page before
+ * user is able to write any data beyond current i_size via
+ * mmap.
+ *
+ * See clear_page_dirty_for_io() for details why set_page_dirty()
+ * is needed.
+ */
+ index = inode->i_size >> PAGE_CACHE_SHIFT;
+ page = find_lock_page(inode->i_mapping, index);
+ /* Page not cached? Nothing to do */
+ if (!page)
+ goto write_size;
+ /* Optimize for common case */
+ if (PAGE_CACHE_SIZE == bsize)
+ goto zero_and_write;
+ /* Will no hole blocks be created in the current last page? */
+ rounded_i_size = ALIGN(inode->i_size, bsize);
+ if (pos <= rounded_i_size || !(rounded_i_size & (PAGE_CACHE_SIZE - 1)))
+ goto zero_and_write;
+#ifdef CONFIG_MMU
+ if (page_mkclean(page))
+ set_page_dirty(page);
+#endif
+zero_and_write:
+ /*
+ * Zero out the end of the page as it could have been modified via
+ * mmap and it now falls inside i_size
+ */
+ if (pos > inode->i_size) {
+ start = inode->i_size & (PAGE_CACHE_SIZE - 1);
+ end = min_t(int, PAGE_CACHE_SIZE, pos - inode->i_size + start);
+ zero_user_segment(page, start, end);
+ }
+ i_size_write(inode, pos + len);
+ unlock_page(page);
+ page_cache_release(page);
+ return;
+write_size:
+ i_size_write(inode, pos + len);
+}
+EXPORT_SYMBOL(block_extend_i_size);
+
+/*
* block_page_mkwrite() is not allowed to change the file size as it gets
* called from a page fault handler when a page is first dirtied. Hence we must
* be careful to check for EOF conditions here. We set the page up correctly
@@ -2621,7 +2685,7 @@ int nobh_write_end(struct file *file, struct address_space *mapping,
page_cache_release(page);
if (pos+copied > inode->i_size) {
- i_size_write(inode, pos+copied);
+ extend_i_size(inode, pos, copied);
mark_inode_dirty(inode);
}
diff --git a/include/linux/buffer_head.h b/include/linux/buffer_head.h
index 16ed028..56a0162 100644
--- a/include/linux/buffer_head.h
+++ b/include/linux/buffer_head.h
@@ -219,6 +219,10 @@ int cont_write_begin(struct file *, struct address_space *, loff_t,
get_block_t *, loff_t *);
int generic_cont_expand_simple(struct inode *inode, loff_t size);
int block_commit_write(struct page *page, unsigned from, unsigned to);
+int block_lock_hole_extend(struct inode *inode, loff_t pos);
+void block_unlock_hole_extend(struct inode *inode);
+int block_wait_on_hole_extend(struct inode *inode, loff_t pos);
+void block_extend_i_size(struct inode *inode, loff_t pos, loff_t len);
int block_page_mkwrite(struct vm_area_struct *vma, struct vm_fault *vmf,
get_block_t get_block);
void block_sync_page(struct page *);
diff --git a/include/linux/fs.h b/include/linux/fs.h
index a36ffa5..9440666 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -580,7 +580,7 @@ struct address_space_operations {
int (*write_end)(struct file *, struct address_space *mapping,
loff_t pos, unsigned len, unsigned copied,
struct page *page, void *fsdata);
-
+ void (*extend_i_size)(struct inode *, loff_t pos, loff_t len);
/* Unfortunately this kludge is needed for FIBMAP. Don't use it */
sector_t (*bmap)(struct address_space *, sector_t);
void (*invalidatepage) (struct page *, unsigned long);
@@ -597,6 +597,8 @@ struct address_space_operations {
unsigned long);
};
+void extend_i_size(struct inode *inode, loff_t pos, loff_t len);
+
/*
* pagecache_write_begin/pagecache_write_end must be used by general code
* to write into the pagecache.
diff --git a/mm/filemap.c b/mm/filemap.c
index ccea3b6..bf5e527 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -2102,6 +2102,14 @@ int pagecache_write_end(struct file *file, struct address_space *mapping,
}
EXPORT_SYMBOL(pagecache_write_end);
+void extend_i_size(struct inode *inode, loff_t pos, loff_t len)
+{
+ if (inode->i_mapping->a_ops->extend_i_size)
+ inode->i_mapping->a_ops->extend_i_size(inode, pos, len);
+ else
+ block_extend_i_size(inode, pos, len);
+}
+
ssize_t
generic_file_direct_write(struct kiocb *iocb, const struct iovec *iov,
unsigned long *nr_segs, loff_t pos, loff_t *ppos,
@@ -2162,7 +2170,7 @@ generic_file_direct_write(struct kiocb *iocb, const struct iovec *iov,
if (written > 0) {
loff_t end = pos + written;
if (end > i_size_read(inode) && !S_ISBLK(inode->i_mode)) {
- i_size_write(inode, end);
+ extend_i_size(inode, pos, written);
mark_inode_dirty(inode);
}
*ppos = end;
diff --git a/mm/filemap_xip.c b/mm/filemap_xip.c
index 427dfe3..3f7e15d 100644
--- a/mm/filemap_xip.c
+++ b/mm/filemap_xip.c
@@ -321,6 +321,7 @@ __xip_file_write(struct file *filp, const char __user *buf,
long status = 0;
size_t bytes;
ssize_t written = 0;
+ loff_t orig_pos = pos;
BUG_ON(!mapping->a_ops->get_xip_mem);
@@ -378,7 +379,7 @@ __xip_file_write(struct file *filp, const char __user *buf,
* cannot change under us because we hold i_mutex.
*/
if (pos > inode->i_size) {
- i_size_write(inode, pos);
+ extend_i_size(inode, orig_pos, written);
mark_inode_dirty(inode);
}
diff --git a/mm/memory.c b/mm/memory.c
index aede2ce..e034616 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -2429,7 +2429,7 @@ int vmtruncate(struct inode * inode, loff_t offset)
goto out_sig;
if (offset > inode->i_sb->s_maxbytes)
goto out_big;
- i_size_write(inode, offset);
+ extend_i_size(inode, offset, 0);
} else {
struct address_space *mapping = inode->i_mapping;
diff --git a/mm/nommu.c b/mm/nommu.c
index 53cab10..93c1c03 100644
--- a/mm/nommu.c
+++ b/mm/nommu.c
@@ -111,7 +111,7 @@ do_expand:
goto out_sig;
if (offset > inode->i_sb->s_maxbytes)
goto out;
- i_size_write(inode, offset);
+ extend_i_size(inode, offset, 0);
out_truncate:
if (inode->i_op->truncate)
diff --git a/mm/shmem.c b/mm/shmem.c
index 52ac65c..01df19b 100644
--- a/mm/shmem.c
+++ b/mm/shmem.c
@@ -1632,7 +1632,7 @@ shmem_write_end(struct file *file, struct address_space *mapping,
page_cache_release(page);
if (pos + copied > inode->i_size)
- i_size_write(inode, pos + copied);
+ extend_i_size(inode, pos, copied);
return copied;
}
--
1.6.0.2