* [PATCH RFC] ext3 data=guarded v3
@ 2009-04-15 17:22 Chris Mason
2009-04-15 17:22 ` [PATCH 1/3] Export filemap_write_and_wait_range Chris Mason
` (3 more replies)
0 siblings, 4 replies; 34+ messages in thread
From: Chris Mason @ 2009-04-15 17:22 UTC (permalink / raw)
To: Jan Kara, Linus Torvalds, Theodore Ts'o,
Linux Kernel Developers List, Ext4
Hello everyone,
Here is another version of the data=guarded work for ext3. The main
difference between this code and yesterday's is that the guarded writepage
function now sends any newly allocated block through the old data=ordered code.
This is important because by the time we're walking the buffers, the page
may already be unlocked, so we can't trust anything inside the page. In
general, any allocation done by writepage fills a hole, so the old
data=ordered path is what we want anyway.
This passed a longer stress test and generally seems to be working. I
don't think anyone would recommend it as a default for 2.6.30, but it
may be a good idea to have a review party and decide if it is safe enough
to include so people can experiment with it.
Overall diffstat of the series:
fs/buffer.c | 45 ++-
fs/ext3/Makefile | 3
fs/ext3/fsync.c | 12
fs/ext3/inode.c | 546 +++++++++++++++++++++++++++++++++++++++++++-
fs/ext3/namei.c | 3
fs/ext3/ordered-data.c | 318 +++++++++++++++++++++++++
fs/ext3/super.c | 48 +++
fs/jbd/transaction.c | 1
include/linux/buffer_head.h | 3
include/linux/ext3_fs.h | 33 ++
include/linux/ext3_fs_i.h | 44 +++
include/linux/ext3_fs_sb.h | 6
include/linux/ext3_jbd.h | 11
include/linux/jbd.h | 10
mm/filemap.c | 1
-chris
* [PATCH 1/3] Export filemap_write_and_wait_range
  2009-04-15 17:22 ` Chris Mason

From: Chris Mason @ 2009-04-15 17:22 UTC (permalink / raw)
To: Jan Kara, Linus Torvalds, Theodore Ts'o,
	Linux Kernel Developers List, Ext4
Cc: Chris Mason

This wasn't exported before and is used by the ext3 data=guarded code.

Signed-off-by: Chris Mason <chris.mason@oracle.com>
---
 mm/filemap.c |    1 +
 1 files changed, 1 insertions(+), 0 deletions(-)

diff --git a/mm/filemap.c b/mm/filemap.c
index 2e2d38e..04e2160 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -441,6 +441,7 @@ int filemap_write_and_wait_range(struct address_space *mapping,
 	}
 	return err;
 }
+EXPORT_SYMBOL(filemap_write_and_wait_range);
 
 /**
  * add_to_page_cache_locked - add a locked page to the pagecache
-- 
1.6.0.2
* [PATCH 2/3] Add block_write_full_page_endio for passing endio handler
  2009-04-15 17:22 ` Chris Mason

From: Chris Mason @ 2009-04-15 17:22 UTC (permalink / raw)
To: Jan Kara, Linus Torvalds, Theodore Ts'o,
	Linux Kernel Developers List, Ext4
Cc: Chris Mason

block_write_full_page doesn't allow the caller to control what happens
when the IO is over.  This adds a new call named
block_write_full_page_endio so the buffer head end_io handler can be
provided by the caller.

This will be used by the ext3 data=guarded mode to do i_size updates in
a workqueue based end_io handler.  end_buffer_async_write is also
exported so it can be called to do the dirty work of managing page
writeback for the higher level end_io handler.

Signed-off-by: Chris Mason <chris.mason@oracle.com>
---
 fs/buffer.c                 |   45 ++++++++++++++++++++++++++++++----------
 include/linux/buffer_head.h |    3 ++
 2 files changed, 37 insertions(+), 11 deletions(-)

diff --git a/fs/buffer.c b/fs/buffer.c
index 13edf7a..308865b 100644
--- a/fs/buffer.c
+++ b/fs/buffer.c
@@ -360,7 +360,7 @@ still_busy:
  * Completion handler for block_write_full_page() - pages which are unlocked
  * during I/O, and which have PageWriteback cleared upon I/O completion.
  */
-static void end_buffer_async_write(struct buffer_head *bh, int uptodate)
+void end_buffer_async_write(struct buffer_head *bh, int uptodate)
 {
 	char b[BDEVNAME_SIZE];
 	unsigned long flags;
@@ -438,11 +438,17 @@ static void mark_buffer_async_read(struct buffer_head *bh)
 	set_buffer_async_read(bh);
 }
 
-void mark_buffer_async_write(struct buffer_head *bh)
+void mark_buffer_async_write_endio(struct buffer_head *bh,
+				   bh_end_io_t *handler)
 {
-	bh->b_end_io = end_buffer_async_write;
+	bh->b_end_io = handler;
 	set_buffer_async_write(bh);
 }
+
+void mark_buffer_async_write(struct buffer_head *bh)
+{
+	mark_buffer_async_write_endio(bh, end_buffer_async_write);
+}
 EXPORT_SYMBOL(mark_buffer_async_write);
@@ -1608,7 +1614,8 @@ EXPORT_SYMBOL(unmap_underlying_metadata);
  * unplugging the device queue.
  */
 static int __block_write_full_page(struct inode *inode, struct page *page,
-			get_block_t *get_block, struct writeback_control *wbc)
+			get_block_t *get_block, struct writeback_control *wbc,
+			bh_end_io_t *handler)
 {
 	int err;
 	sector_t block;
@@ -1693,7 +1700,7 @@ static int __block_write_full_page(struct inode *inode, struct page *page,
 			continue;
 		}
 		if (test_clear_buffer_dirty(bh)) {
-			mark_buffer_async_write(bh);
+			mark_buffer_async_write_endio(bh, handler);
 		} else {
 			unlock_buffer(bh);
 		}
@@ -1746,7 +1753,7 @@ recover:
 		if (buffer_mapped(bh) && buffer_dirty(bh) &&
 		    !buffer_delay(bh)) {
 			lock_buffer(bh);
-			mark_buffer_async_write(bh);
+			mark_buffer_async_write_endio(bh, handler);
 		} else {
 			/*
 			 * The buffer may have been set dirty during
@@ -2672,7 +2679,8 @@ int nobh_writepage(struct page *page, get_block_t *get_block,
 out:
 	ret = mpage_writepage(page, get_block, wbc);
 	if (ret == -EAGAIN)
-		ret = __block_write_full_page(inode, page, get_block, wbc);
+		ret = __block_write_full_page(inode, page, get_block, wbc,
+					      end_buffer_async_write);
 	return ret;
 }
 EXPORT_SYMBOL(nobh_writepage);
@@ -2830,9 +2838,10 @@ out:
 
 /*
  * The generic ->writepage function for buffer-backed address_spaces
+ * this form passes in the end_io handler used to finish the IO.
  */
-int block_write_full_page(struct page *page, get_block_t *get_block,
-			struct writeback_control *wbc)
+int block_write_full_page_endio(struct page *page, get_block_t *get_block,
+			struct writeback_control *wbc, bh_end_io_t *handler)
 {
 	struct inode * const inode = page->mapping->host;
 	loff_t i_size = i_size_read(inode);
@@ -2841,7 +2850,8 @@ int block_write_full_page(struct page *page, get_block_t *get_block,
 
 	/* Is the page fully inside i_size? */
 	if (page->index < end_index)
-		return __block_write_full_page(inode, page, get_block, wbc);
+		return __block_write_full_page(inode, page, get_block, wbc,
+					       handler);
 
 	/* Is the page fully outside i_size? (truncate in progress) */
 	offset = i_size & (PAGE_CACHE_SIZE-1);
@@ -2864,9 +2874,20 @@ int block_write_full_page(struct page *page, get_block_t *get_block,
 	 * writes to that region are not written out to the file."
	 */
 	zero_user_segment(page, offset, PAGE_CACHE_SIZE);
-	return __block_write_full_page(inode, page, get_block, wbc);
+	return __block_write_full_page(inode, page, get_block, wbc, handler);
 }
 
+/*
+ * The generic ->writepage function for buffer-backed address_spaces
+ */
+int block_write_full_page(struct page *page, get_block_t *get_block,
+			struct writeback_control *wbc)
+{
+	return block_write_full_page_endio(page, get_block, wbc,
+					   end_buffer_async_write);
+}
+
+
 sector_t generic_block_bmap(struct address_space *mapping, sector_t block,
 			    get_block_t *get_block)
 {
@@ -3335,9 +3356,11 @@ EXPORT_SYMBOL(block_read_full_page);
 EXPORT_SYMBOL(block_sync_page);
 EXPORT_SYMBOL(block_truncate_page);
 EXPORT_SYMBOL(block_write_full_page);
+EXPORT_SYMBOL(block_write_full_page_endio);
 EXPORT_SYMBOL(cont_write_begin);
 EXPORT_SYMBOL(end_buffer_read_sync);
 EXPORT_SYMBOL(end_buffer_write_sync);
+EXPORT_SYMBOL(end_buffer_async_write);
 EXPORT_SYMBOL(file_fsync);
 EXPORT_SYMBOL(generic_block_bmap);
 EXPORT_SYMBOL(generic_cont_expand_simple);
diff --git a/include/linux/buffer_head.h b/include/linux/buffer_head.h
index 7b73bb8..16ed028 100644
--- a/include/linux/buffer_head.h
+++ b/include/linux/buffer_head.h
@@ -155,6 +155,7 @@ void create_empty_buffers(struct page *, unsigned long,
 			unsigned long b_state);
 void end_buffer_read_sync(struct buffer_head *bh, int uptodate);
 void end_buffer_write_sync(struct buffer_head *bh, int uptodate);
+void end_buffer_async_write(struct buffer_head *bh, int uptodate);
 
 /* Things to do with buffers at mapping->private_list */
 void mark_buffer_dirty_inode(struct buffer_head *bh, struct inode *inode);
@@ -197,6 +198,8 @@ extern int buffer_heads_over_limit;
 void block_invalidatepage(struct page *page, unsigned long offset);
 int block_write_full_page(struct page *page, get_block_t *get_block,
 			struct writeback_control *wbc);
+int block_write_full_page_endio(struct page *page, get_block_t *get_block,
+			struct writeback_control *wbc, bh_end_io_t *handler);
 int block_read_full_page(struct page*, get_block_t*);
 int block_is_partially_uptodate(struct page *page, read_descriptor_t *desc,
 				unsigned long from);
-- 
1.6.0.2
* [PATCH 3/3] Add ext3 data=guarded mode
  2009-04-15 17:22 ` Chris Mason

From: Chris Mason @ 2009-04-15 17:22 UTC (permalink / raw)
To: Jan Kara, Linus Torvalds, Theodore Ts'o,
	Linux Kernel Developers List, Ext4
Cc: Chris Mason

ext3 data=ordered mode makes sure that data blocks are on disk before
the metadata that references them, which avoids files full of garbage
or previously deleted data after a crash.  It does this by adding every
dirty buffer onto a list of things that must be written before a commit.

This makes every fsync write out all the dirty data on the entire FS,
which has high latencies and is generally much more expensive than it
needs to be.

Another way to avoid exposing stale data after a crash is to wait until
after the data buffers are written before updating the on-disk record
of the file's size.  If we crash before the data IO is done, i_size
doesn't yet include the new blocks and no stale data is exposed.

This patch adds the delayed i_size update to ext3, along with a new
mount option (data=guarded) to enable it.  The basic mechanism works
like this:

* Change block_write_full_page to take an end_io handler as a
  parameter.  This allows us to make an end_io handler that queues
  buffer heads for a workqueue where the real work of updating the on
  disk i_size is done.

* Add an rbtree to the in-memory ext3 inode for tracking data=guarded
  buffer heads that are waiting to be sent to disk.

* Add an ext3 guarded write_end call to add buffer heads for newly
  allocated blocks into the rbtree.  If we have a newly allocated block
  that is filling a hole inside i_size, this is done as an old style
  data=ordered write instead.

* Add an ext3 guarded writepage call that uses a special buffer head
  end_io handler for buffers that are marked as guarded.  Again, if we
  find newly allocated blocks filling holes, they are sent through
  data=ordered instead of data=guarded.

* When a guarded IO finishes, kick a per-FS workqueue to do the on
  disk i_size updates.  The workqueue function must be very careful.
  We only update the on disk i_size if all of the IO between the old
  on disk i_size and the new on disk i_size is complete.  This is why
  an rbtree is used to track the pending buffers; that way we can
  verify all of the IO is actually done.  The on disk i_size is
  incrementally updated to the largest safe value every time an IO
  completes.

* When we start tracking guarded buffers on a given inode, we put the
  inode into ext3's orphan list.  This way if we do crash, the file
  will be truncated back down to the on disk i_size and we'll free any
  blocks that were not completely written.  The inode is removed from
  the orphan list only after all the guarded buffers are done.
Signed-off-by: Chris Mason <chris.mason@oracle.com>
---
 fs/ext3/Makefile           |    3 +-
 fs/ext3/fsync.c            |   12 +
 fs/ext3/inode.c            |  546 +++++++++++++++++++++++++++++++++++++++++++-
 fs/ext3/namei.c            |    3 +-
 fs/ext3/ordered-data.c     |  318 ++++++++++++++++++++++++++
 fs/ext3/super.c            |   48 ++++-
 fs/jbd/transaction.c       |    1 +
 include/linux/ext3_fs.h    |   33 +++-
 include/linux/ext3_fs_i.h  |   44 ++++
 include/linux/ext3_fs_sb.h |    6 +
 include/linux/ext3_jbd.h   |   11 +
 include/linux/jbd.h        |   10 +
 12 files changed, 1016 insertions(+), 19 deletions(-)
 create mode 100644 fs/ext3/ordered-data.c

diff --git a/fs/ext3/Makefile b/fs/ext3/Makefile
index e77766a..f3a9dc1 100644
--- a/fs/ext3/Makefile
+++ b/fs/ext3/Makefile
@@ -5,7 +5,8 @@ obj-$(CONFIG_EXT3_FS) += ext3.o
 
 ext3-y	:= balloc.o bitmap.o dir.o file.o fsync.o ialloc.o inode.o \
-	   ioctl.o namei.o super.o symlink.o hash.o resize.o ext3_jbd.o
+	   ioctl.o namei.o super.o symlink.o hash.o resize.o ext3_jbd.o \
+	   ordered-data.o
 
 ext3-$(CONFIG_EXT3_FS_XATTR)	 += xattr.o xattr_user.o xattr_trusted.o
 ext3-$(CONFIG_EXT3_FS_POSIX_ACL) += acl.o
diff --git a/fs/ext3/fsync.c b/fs/ext3/fsync.c
index d336341..a50abb4 100644
--- a/fs/ext3/fsync.c
+++ b/fs/ext3/fsync.c
@@ -59,6 +59,11 @@ int ext3_sync_file(struct file * file, struct dentry *dentry, int datasync)
  * sync_inode() will write the inode if it is dirty.  Then the caller's
  * filemap_fdatawait() will wait on the pages.
  *
+ * data=guarded:
+ *	The caller's filemap_fdatawrite will start the IO, and we
+ *	use filemap_fdatawait here to make sure all the disk i_size updates
+ *	are done before we commit the inode.
+ *
  * data=journal:
  *	filemap_fdatawrite won't do anything (the buffers are clean).
* ext3_force_commit will write the file data into the journal and @@ -84,6 +89,13 @@ int ext3_sync_file(struct file * file, struct dentry *dentry, int datasync) .sync_mode = WB_SYNC_ALL, .nr_to_write = 0, /* sys_fsync did this */ }; + /* + * the new disk i_size must be logged before we commit, + * so we wait here for pending writeback + */ + if (ext3_should_guard_data(inode)) + filemap_write_and_wait(inode->i_mapping); + ret = sync_inode(inode, &wbc); } out: diff --git a/fs/ext3/inode.c b/fs/ext3/inode.c index fcfa243..5b88b83 100644 --- a/fs/ext3/inode.c +++ b/fs/ext3/inode.c @@ -38,6 +38,7 @@ #include <linux/bio.h> #include <linux/fiemap.h> #include <linux/namei.h> +#include <linux/workqueue.h> #include "xattr.h" #include "acl.h" @@ -179,6 +180,102 @@ static int ext3_journal_test_restart(handle_t *handle, struct inode *inode) } /* + * after a data=guarded IO is done, we need to update the + * disk i_size to reflect the data we've written. If there are + * no more ordered data extents left in the tree, we need to + * get rid of the orphan entry making sure the file's + * block pointers match the i_size after a crash + * + * When we aren't in data=guarded mode, this just does an ext3_orphan_del. + * + * It returns the result of ext3_orphan_del. + * + * handle may be null if we are just cleaning up the orphan list in + * memory. 
+ * + * pass must_log == 1 when the inode must be logged in order to get + * an i_size update on disk + */ +static int ordered_orphan_del(handle_t *handle, struct inode *inode, + int must_log) +{ + int ret = 0; + + /* fast out when data=guarded isn't on */ + if (!ext3_should_guard_data(inode)) + return ext3_orphan_del(handle, inode); + + ext3_ordered_lock(inode); + if (inode->i_nlink && + RB_EMPTY_ROOT(&EXT3_I(inode)->ordered_tree.tree)) { + /* + * if a handle is not null, this also ends up writing + * our new disk i_size + */ + ext3_ordered_unlock(inode); + ret = ext3_orphan_del(handle, inode); + if (ret || !handle) + goto err; + + /* + * now we check again to see if we might have dropped + * the orphan just after someone added a new ordered extent + */ + ext3_ordered_lock(inode); + if (!RB_EMPTY_ROOT(&EXT3_I(inode)->ordered_tree.tree) && + list_empty(&EXT3_I(inode)->i_orphan)) { + ext3_ordered_unlock(inode); + ret = ext3_orphan_add(handle, inode); + if (ret) + goto err; + } else { + ext3_ordered_unlock(inode); + } + } else if (handle && must_log) { + ext3_ordered_unlock(inode); + + /* + * we need to make sure any updates done by the data=guarded + * code end up in the inode on disk. Log the inode + * here + */ + ext3_mark_inode_dirty(handle, inode); + } else { + ext3_ordered_unlock(inode); + } + +err: + return ret; +} + +/* + * Wrapper around ordered_orphan_del that starts a transaction + */ +static void ordered_orphan_del_trans(struct inode *inode, int must_log) +{ + handle_t *handle; + + handle = ext3_journal_start(inode, 3); + + /* + * uhoh, should we flag the FS as readonly here? ext3_dirty_inode + * doesn't, which is what we're modeling ourselves after. + * + * We do need to make sure to get this inode off the ordered list + * when the transaction start fails though. ordered_orphan_del + * does the right thing. 
+ */ + if (IS_ERR(handle)) { + ordered_orphan_del(NULL, inode, 0); + return; + } + + ordered_orphan_del(handle, inode, must_log); + ext3_journal_stop(handle); +} + + +/* * Called at the last iput() if i_nlink is zero. */ void ext3_delete_inode (struct inode * inode) @@ -204,6 +301,13 @@ void ext3_delete_inode (struct inode * inode) if (IS_SYNC(inode)) handle->h_sync = 1; inode->i_size = 0; + + /* + * make sure we clean up any ordered extents that didn't get + * IO started on them because i_size shrunk down to zero. + */ + ext3_truncate_ordered_extents(inode, 0); + if (inode->i_blocks) ext3_truncate(inode); /* @@ -767,6 +871,24 @@ err_out: } /* + * This protects the disk i_size with the spinlock for the ordered + * extent tree. It returns 1 when the inode needs to be logged + * because the i_disksize has been updated. + */ +static int maybe_update_disk_isize(struct inode *inode, loff_t new_size) +{ + int ret = 0; + + ext3_ordered_lock(inode); + if (EXT3_I(inode)->i_disksize < new_size) { + EXT3_I(inode)->i_disksize = new_size; + ret = 1; + } + ext3_ordered_unlock(inode); + return ret; +} + +/* * Allocation strategy is simple: if we have to allocate something, we will * have to go the whole way to leaf. So let's do it before attaching anything * to tree, set linkage between the newborn blocks, write them if sync is @@ -815,6 +937,7 @@ int ext3_get_blocks_handle(handle_t *handle, struct inode *inode, if (!partial) { first_block = le32_to_cpu(chain[depth - 1].key); clear_buffer_new(bh_result); + clear_buffer_datanew(bh_result); count++; /*map more blocks*/ while (count < maxblocks && count <= blocks_to_boundary) { @@ -873,6 +996,7 @@ int ext3_get_blocks_handle(handle_t *handle, struct inode *inode, if (err) goto cleanup; clear_buffer_new(bh_result); + clear_buffer_datanew(bh_result); goto got_it; } } @@ -915,14 +1039,19 @@ int ext3_get_blocks_handle(handle_t *handle, struct inode *inode, * i_disksize growing is protected by truncate_mutex. 
Don't forget to * protect it if you're about to implement concurrent * ext3_get_block() -bzzz + * + * FIXME, I think this only needs to extend the disk i_size when + * we're filling holes that came from using ftruncate to increase + * i_size. Need to verify. */ - if (!err && extend_disksize && inode->i_size > ei->i_disksize) - ei->i_disksize = inode->i_size; + if (!ext3_should_guard_data(inode) && !err && extend_disksize) + maybe_update_disk_isize(inode, inode->i_size); mutex_unlock(&ei->truncate_mutex); if (err) goto cleanup; set_buffer_new(bh_result); + set_buffer_datanew(bh_result); got_it: map_bh(bh_result, inode->i_sb, le32_to_cpu(chain[depth-1].key)); if (count > blocks_to_boundary) @@ -1079,6 +1208,77 @@ struct buffer_head *ext3_bread(handle_t *handle, struct inode *inode, return NULL; } +/* + * data=guarded updates are handled in a workqueue after the IO + * is done. This runs through the list of buffer heads that are pending + * processing. + */ +void ext3_run_guarded_work(struct work_struct *work) +{ + struct ext3_sb_info *sbi = + container_of(work, struct ext3_sb_info, guarded_work); + struct buffer_head *bh; + struct ext3_ordered_extent *ordered; + struct inode *inode; + struct page *page; + int must_log; + + spin_lock_irq(&sbi->guarded_lock); + while (!list_empty(&sbi->guarded_buffers)) { + ordered = list_entry(sbi->guarded_buffers.next, + struct ext3_ordered_extent, list); + + list_del(&ordered->list); + + bh = ordered->end_io_bh; + ordered->end_io_bh = NULL; + must_log = 0; + + /* we don't need a reference on the buffer head because + * it is locked until the end_io handler his called. + * + * This means the page can't go away, which means the + * inode can't go away + */ + spin_unlock_irq(&sbi->guarded_lock); + + page = bh->b_page; + inode = page->mapping->host; + + ext3_ordered_lock(inode); + if (ordered->bh) { + /* + * someone might have decided this buffer didn't + * really need to be ordered and removed us from + * the rb tree. 
They set ordered->bh to null + * when that happens. + */ + must_log = ext3_ordered_update_i_size(inode, ordered); + ext3_remove_ordered_extent(inode, ordered); + } + ext3_ordered_unlock(inode); + + /* + * drop the reference taken when this ordered extent was + * put onto the guarded_buffers list + */ + ext3_put_ordered_extent(ordered); + + /* + * maybe log the inode and/or cleanup the orphan entry + */ + ordered_orphan_del_trans(inode, must_log > 0); + + /* + * finally, call the real bh end_io function to do + * all the hard work of maintaining page writeback. + */ + end_buffer_async_write(bh, buffer_uptodate(bh)); + spin_lock_irq(&sbi->guarded_lock); + } + spin_unlock_irq(&sbi->guarded_lock); +} + static int walk_page_buffers( handle_t *handle, struct buffer_head *head, unsigned from, @@ -1185,6 +1385,7 @@ retry: ret = walk_page_buffers(handle, page_buffers(page), from, to, NULL, do_journal_get_write_access); } + write_begin_failed: if (ret) { /* @@ -1212,7 +1413,13 @@ out: int ext3_journal_dirty_data(handle_t *handle, struct buffer_head *bh) { - int err = journal_dirty_data(handle, bh); + int err; + + /* don't take buffers from the data=guarded list */ + if (buffer_dataguarded(bh)) + return 0; + + err = journal_dirty_data(handle, bh); if (err) ext3_journal_abort_handle(__func__, __func__, bh, handle, err); @@ -1231,6 +1438,89 @@ static int journal_dirty_data_fn(handle_t *handle, struct buffer_head *bh) return 0; } +/* + * Walk the buffers in a page for data=guarded mode. Buffers that + * are not marked as datanew are ignored. + * + * New buffers outside i_size are sent to the data guarded code + * + * We must do the old data=ordered mode when filling holes in the + * file, since i_size doesn't protect these at all. 
+ */ +static int journal_dirty_data_guarded_fn(handle_t *handle, + struct buffer_head *bh) +{ + u64 offset = page_offset(bh->b_page) + bh_offset(bh); + struct inode *inode = bh->b_page->mapping->host; + int ret = 0; + + /* + * Write could have mapped the buffer but it didn't copy the data in + * yet. So avoid filing such buffer into a transaction. + */ + if (!buffer_mapped(bh) || !buffer_uptodate(bh)) + return 0; + + if (test_clear_buffer_datanew(bh)) { + /* + * if we're filling a hole inside i_size, we need to + * fall back to the old style data=ordered + */ + if (offset < inode->i_size) { + ret = ext3_journal_dirty_data(handle, bh); + goto out; + } + ret = ext3_add_ordered_extent(inode, offset, bh); + + /* if we crash before the IO is done, i_size will be small + * but these blocks will still be allocated to the file. + * + * So, add an orphan entry for the file, which will truncate it + * down to the i_size it finds after the crash. + * + * The orphan is cleaned up when the IO is done. We + * don't add orphans while mount is running the orphan list, + * that seems to corrupt the list. + */ + if (ret == 0 && buffer_dataguarded(bh) && + list_empty(&EXT3_I(inode)->i_orphan) && + !(EXT3_SB(inode->i_sb)->s_mount_state & EXT3_ORPHAN_FS)) { + ret = ext3_orphan_add(handle, inode); + } + } +out: + return ret; +} + +/* + * Walk the buffers in a page for data=guarded mode for writepage. + * + * We must do the old data=ordered mode when filling holes in the + * file, since i_size doesn't protect these at all. + * + * This is actually called after writepage is run and so we can't + * trust anything other than the buffer head (which we have pinned). + * + * Any datanew buffer at writepage time is filling a hole, so we don't need + * extra tests against the inode size. + */ +static int journal_dirty_data_guarded_writepage_fn(handle_t *handle, + struct buffer_head *bh) +{ + int ret = 0; + + /* + * Write could have mapped the buffer but it didn't copy the data in + * yet. 
So avoid filing such buffer into a transaction. + */ + if (!buffer_mapped(bh) || !buffer_uptodate(bh)) + return 0; + + if (test_clear_buffer_datanew(bh)) + ret = ext3_journal_dirty_data(handle, bh); + return ret; +} + /* For write_end() in data=journal mode */ static int write_end_fn(handle_t *handle, struct buffer_head *bh) { @@ -1251,10 +1541,8 @@ static void update_file_sizes(struct inode *inode, loff_t pos, unsigned copied) /* What matters to us is i_disksize. We don't write i_size anywhere */ if (pos + copied > inode->i_size) i_size_write(inode, pos + copied); - if (pos + copied > EXT3_I(inode)->i_disksize) { - EXT3_I(inode)->i_disksize = pos + copied; + if (maybe_update_disk_isize(inode, pos + copied)) mark_inode_dirty(inode); - } } /* @@ -1300,6 +1588,51 @@ static int ext3_ordered_write_end(struct file *file, return ret ? ret : copied; } +static int ext3_guarded_write_end(struct file *file, + struct address_space *mapping, + loff_t pos, unsigned len, unsigned copied, + struct page *page, void *fsdata) +{ + handle_t *handle = ext3_journal_current_handle(); + struct inode *inode = file->f_mapping->host; + unsigned from, to; + int ret = 0, ret2; + + copied = block_write_end(file, mapping, pos, len, copied, + page, fsdata); + + from = pos & (PAGE_CACHE_SIZE - 1); + to = from + copied; + ret = walk_page_buffers(handle, page_buffers(page), + from, to, NULL, journal_dirty_data_guarded_fn); + + /* + * we only update the in-memory i_size. The disk i_size is done + * by the end io handlers + */ + if (ret == 0 && pos + copied > inode->i_size) + i_size_write(inode, pos + copied); + + from = pos & (PAGE_CACHE_SIZE - 1); + to = from + copied; + + /* + * There may be allocated blocks outside of i_size because + * we failed to copy some data. Prepare for truncate. 
+ */ + if (pos + len > inode->i_size) + ext3_orphan_add(handle, inode); + ret2 = ext3_journal_stop(handle); + if (!ret) + ret = ret2; + unlock_page(page); + page_cache_release(page); + + if (pos + len > inode->i_size) + vmtruncate(inode, inode->i_size); + return ret ? ret : copied; +} + static int ext3_writeback_write_end(struct file *file, struct address_space *mapping, loff_t pos, unsigned len, unsigned copied, @@ -1311,6 +1644,7 @@ static int ext3_writeback_write_end(struct file *file, copied = block_write_end(file, mapping, pos, len, copied, page, fsdata); update_file_sizes(inode, pos, copied); + /* * There may be allocated blocks outside of i_size because * we failed to copy some data. Prepare for truncate. @@ -1574,6 +1908,143 @@ out_fail: return ret; } +/* + * Completion handler for block_write_full_page(). This will + * kick off the data=guarded workqueue as the IO finishes. + */ +static void end_buffer_async_write_guarded(struct buffer_head *bh, + int uptodate) +{ + struct ext3_sb_info *sbi; + struct address_space *mapping; + struct ext3_ordered_extent *ordered; + unsigned long flags; + + mapping = bh->b_page->mapping; + if (!mapping || !bh->b_private || !buffer_dataguarded(bh)) { +noguard: + end_buffer_async_write(bh, uptodate); + return; + } + + /* + * the guarded workqueue function checks the uptodate bit on the + * bh and uses that to tell the real end_io handler if things worked + * out or not. + */ + if (uptodate) + set_buffer_uptodate(bh); + else + clear_buffer_uptodate(bh); + + sbi = EXT3_SB(mapping->host->i_sb); + + spin_lock_irqsave(&sbi->guarded_lock, flags); + + /* + * remove any chance that a truncate raced in and cleared + * our dataguard flag, which also freed the ordered extent in + * our b_private. 
+ */ + if (!buffer_dataguarded(bh)) { + spin_unlock_irqrestore(&sbi->guarded_lock, flags); + goto noguard; + } + ordered = bh->b_private; + WARN_ON(ordered->end_io_bh); + + /* + * use the special end_io_bh pointer to make sure that + * some form of end_io handler is run on this bh, even + * if the ordered_extent is removed from the rb tree before + * our workqueue ends up processing it. + */ + ordered->end_io_bh = bh; + list_add_tail(&ordered->list, &sbi->guarded_buffers); + ext3_get_ordered_extent(ordered); + spin_unlock_irqrestore(&sbi->guarded_lock, flags); + + queue_work(sbi->guarded_wq, &sbi->guarded_work); +} + +static int ext3_guarded_writepage(struct page *page, + struct writeback_control *wbc) +{ + struct inode *inode = page->mapping->host; + struct buffer_head *page_bufs; + handle_t *handle = NULL; + int ret = 0; + int err; + + J_ASSERT(PageLocked(page)); + + /* + * We give up here if we're reentered, because it might be for a + * different filesystem. + */ + if (ext3_journal_current_handle()) + goto out_fail; + + if (!page_has_buffers(page)) { + create_empty_buffers(page, inode->i_sb->s_blocksize, + (1 << BH_Dirty)|(1 << BH_Uptodate)); + page_bufs = page_buffers(page); + } else { + page_bufs = page_buffers(page); + if (!walk_page_buffers(NULL, page_bufs, 0, PAGE_CACHE_SIZE, + NULL, buffer_unmapped)) { + /* Provide NULL get_block() to catch bugs if buffers + * weren't really mapped */ + return block_write_full_page_endio(page, NULL, wbc, + end_buffer_async_write_guarded); + } + } + handle = ext3_journal_start(inode, ext3_writepage_trans_blocks(inode)); + + if (IS_ERR(handle)) { + ret = PTR_ERR(handle); + goto out_fail; + } + + walk_page_buffers(handle, page_bufs, 0, + PAGE_CACHE_SIZE, NULL, bget_one); + + ret = block_write_full_page_endio(page, ext3_get_block, wbc, + end_buffer_async_write_guarded); + + /* + * The page can become unlocked at any point now, and + * truncate can then come in and change things. So we + * can't touch *page from now on. 
But *page_bufs is + * safe due to elevated refcount. + */ + + /* + * And attach them to the current transaction. But only if + * block_write_full_page() succeeded. Otherwise they are unmapped, + * and generally junk. + */ + if (ret == 0) { + err = walk_page_buffers(handle, page_bufs, 0, PAGE_CACHE_SIZE, + NULL, journal_dirty_data_guarded_writepage_fn); + if (!ret) + ret = err; + } + walk_page_buffers(handle, page_bufs, 0, + PAGE_CACHE_SIZE, NULL, bput_one); + err = ext3_journal_stop(handle); + if (!ret) + ret = err; + return ret; + +out_fail: + redirty_page_for_writepage(wbc, page); + unlock_page(page); + return ret; +} + + + static int ext3_writeback_writepage(struct page *page, struct writeback_control *wbc) { @@ -1768,8 +2239,10 @@ static ssize_t ext3_direct_IO(int rw, struct kiocb *iocb, ret = PTR_ERR(handle); goto out; } + if (inode->i_nlink) - ext3_orphan_del(handle, inode); + ordered_orphan_del(handle, inode, 0); + if (ret > 0) { loff_t end = offset + ret; if (end > inode->i_size) { @@ -1842,6 +2315,21 @@ static const struct address_space_operations ext3_writeback_aops = { .is_partially_uptodate = block_is_partially_uptodate, }; +static const struct address_space_operations ext3_guarded_aops = { + .readpage = ext3_readpage, + .readpages = ext3_readpages, + .writepage = ext3_guarded_writepage, + .sync_page = block_sync_page, + .write_begin = ext3_write_begin, + .write_end = ext3_guarded_write_end, + .bmap = ext3_bmap, + .invalidatepage = ext3_invalidatepage, + .releasepage = ext3_releasepage, + .direct_IO = ext3_direct_IO, + .migratepage = buffer_migrate_page, + .is_partially_uptodate = block_is_partially_uptodate, +}; + static const struct address_space_operations ext3_journalled_aops = { .readpage = ext3_readpage, .readpages = ext3_readpages, @@ -1860,6 +2348,8 @@ void ext3_set_aops(struct inode *inode) { if (ext3_should_order_data(inode)) inode->i_mapping->a_ops = &ext3_ordered_aops; + else if (ext3_should_guard_data(inode)) + inode->i_mapping->a_ops = 
&ext3_guarded_aops; else if (ext3_should_writeback_data(inode)) inode->i_mapping->a_ops = &ext3_writeback_aops; else @@ -2376,7 +2866,8 @@ void ext3_truncate(struct inode *inode) if (!ext3_can_truncate(inode)) return; - if (inode->i_size == 0 && ext3_should_writeback_data(inode)) + if (inode->i_size == 0 && (ext3_should_writeback_data(inode) || + ext3_should_guard_data(inode))) ei->i_state |= EXT3_STATE_FLUSH_ON_CLOSE; /* @@ -3103,10 +3594,39 @@ int ext3_setattr(struct dentry *dentry, struct iattr *attr) ext3_journal_stop(handle); } + if (S_ISREG(inode->i_mode) && (attr->ia_valid & ATTR_SIZE)) { + /* + * we need to make sure any data=guarded pages + * are on disk before we force a new disk i_size + * down into the inode. The crucial range is + * anything between the disksize on disk now + * and the new size we're going to set. + * + * We're holding i_mutex here, so we know new + * ordered extents are not going to appear in the inode + * + * This must be done both for truncates that make the + * file bigger and smaller because truncate messes around + * with the orphan inode list in both cases. + */ + if (ext3_should_guard_data(inode)) { + filemap_write_and_wait_range(inode->i_mapping, + EXT3_I(inode)->i_disksize, + (loff_t)-1); + /* + * we've written everything, make sure all + * the ordered extents are really gone. 
+ * + * This prevents leaking of ordered extents + * and it also makes sure the ordered extent code + * doesn't mess with the orphan link + */ + ext3_truncate_ordered_extents(inode, 0); + } + } if (S_ISREG(inode->i_mode) && attr->ia_valid & ATTR_SIZE && attr->ia_size < inode->i_size) { handle_t *handle; - handle = ext3_journal_start(inode, 3); if (IS_ERR(handle)) { error = PTR_ERR(handle); @@ -3114,6 +3634,7 @@ int ext3_setattr(struct dentry *dentry, struct iattr *attr) } error = ext3_orphan_add(handle, inode); + EXT3_I(inode)->i_disksize = attr->ia_size; rc = ext3_mark_inode_dirty(handle, inode); if (!error) @@ -3125,8 +3646,11 @@ int ext3_setattr(struct dentry *dentry, struct iattr *attr) /* If inode_setattr's call to ext3_truncate failed to get a * transaction handle at all, we need to clean up the in-core - * orphan list manually. */ - if (inode->i_nlink) + * orphan list manually. Because we've finished off all the + * guarded IO above, this doesn't hurt anything for the guarded + * code + */ + if (inode->i_nlink && (attr->ia_valid & ATTR_SIZE)) ext3_orphan_del(NULL, inode); if (!rc && (ia_valid & ATTR_MODE)) diff --git a/fs/ext3/namei.c b/fs/ext3/namei.c index 6ff7b97..ac3991a 100644 --- a/fs/ext3/namei.c +++ b/fs/ext3/namei.c @@ -2410,7 +2410,8 @@ static int ext3_rename (struct inode * old_dir, struct dentry *old_dentry, ext3_mark_inode_dirty(handle, new_inode); if (!new_inode->i_nlink) ext3_orphan_add(handle, new_inode); - if (ext3_should_writeback_data(new_inode)) + if (ext3_should_writeback_data(new_inode) || + ext3_should_guard_data(new_inode)) flush_file = 1; } retval = 0; diff --git a/fs/ext3/ordered-data.c b/fs/ext3/ordered-data.c new file mode 100644 index 0000000..a772598 --- /dev/null +++ b/fs/ext3/ordered-data.c @@ -0,0 +1,318 @@ +/* + * Copyright (C) 2009 Oracle. All rights reserved. 
+ * + * This program is free software; you can redistribute it and/or + * modify it under the terms of the GNU General Public + * License v2 as published by the Free Software Foundation. + * + * This program is distributed in the hope that it will be useful, + * but WITHOUT ANY WARRANTY; without even the implied warranty of + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU + * General Public License for more details. + * + * You should have received a copy of the GNU General Public + * License along with this program; if not, write to the + * Free Software Foundation, Inc., 59 Temple Place - Suite 330, + * Boston, MA 02111-1307, USA. + */ + +#include <linux/gfp.h> +#include <linux/slab.h> +#include <linux/blkdev.h> +#include <linux/writeback.h> +#include <linux/pagevec.h> +#include <linux/buffer_head.h> +#include <linux/ext3_jbd.h> + + +/* + * These routines actually implement data=guarded mode, but the + * idea is that it may replace data=ordered mode as it becomes stable. 
+ */ + +static u64 entry_end(struct ext3_ordered_extent *entry, unsigned blocksize) +{ + return entry->start + blocksize; +} + +/* + * returns NULL if the insertion worked, or it returns the node it did find + * in the tree + */ +static struct rb_node *tree_insert(struct rb_root *root, u64 start, + unsigned blocksize, struct rb_node *node) +{ + struct rb_node **p = &root->rb_node; + struct rb_node *parent = NULL; + struct ext3_ordered_extent *entry; + + while (*p) { + parent = *p; + entry = rb_entry(parent, struct ext3_ordered_extent, rb_node); + + if (start < entry->start) + p = &(*p)->rb_left; + else if (start >= entry_end(entry, blocksize)) + p = &(*p)->rb_right; + else + return parent; + } + + rb_link_node(node, parent, p); + rb_insert_color(node, root); + return NULL; +} + +/* + * find the ordered struct that has this offset + */ +static inline struct rb_node *tree_search(struct ext3_ordered_inode_tree *tree, + u64 start, unsigned blocksize) +{ + struct rb_node *n = tree->tree.rb_node; + struct ext3_ordered_extent *entry; + + while (n) { + entry = rb_entry(n, struct ext3_ordered_extent, rb_node); + + if (start < entry->start) + n = n->rb_left; + else if (start >= entry_end(entry, blocksize)) + n = n->rb_right; + else + return n; + } + return NULL; +} + +/* allocate and add a new ordered_extent into the per-inode tree. + * start is the logical offset in the file + * + * The tree is given a single reference on the ordered extent that was + * inserted, and it also takes a reference on the buffer head. 
+ */ +int ext3_add_ordered_extent(struct inode *inode, u64 start, + struct buffer_head *bh) +{ + struct ext3_ordered_inode_tree *tree; + struct rb_node *node; + struct ext3_ordered_extent *entry; + int ret = 0; + + lock_buffer(bh); + + /* ordered extent already there, or in old style data=ordered */ + if (bh->b_private) { + ret = 0; + goto out; + } + + tree = &EXT3_I(inode)->ordered_tree; + entry = kzalloc(sizeof(*entry), GFP_NOFS); + if (!entry) { + ret = -ENOMEM; + goto out; + } + + spin_lock(&tree->lock); + entry->start = start; + + get_bh(bh); + entry->bh = bh; + bh->b_private = entry; + set_buffer_dataguarded(bh); + + /* one ref for the tree */ + atomic_set(&entry->refs, 1); + INIT_LIST_HEAD(&entry->list); + node = tree_insert(&tree->tree, start, + inode->i_sb->s_blocksize, &entry->rb_node); + BUG_ON(node); + + spin_unlock(&tree->lock); +out: + unlock_buffer(bh); + return ret; +} + +/* + * used to drop a reference on an ordered extent. This will free + * the extent if the last reference is dropped + */ +int ext3_put_ordered_extent(struct ext3_ordered_extent *entry) +{ + if (atomic_dec_and_test(&entry->refs)) { + WARN_ON(entry->bh); + WARN_ON(entry->end_io_bh); + kfree(entry); + } + return 0; +} + +/* + * remove an ordered extent from the tree. This removes the + * reference held by the rbtree on 'entry' and the + * reference on the buffer head held by the entry. + */ +int ext3_remove_ordered_extent(struct inode *inode, + struct ext3_ordered_extent *entry) +{ + struct ext3_ordered_inode_tree *tree; + struct rb_node *node; + + tree = &EXT3_I(inode)->ordered_tree; + node = &entry->rb_node; + + /* + * the data=guarded end_io handler takes this guarded_lock + * before it puts a given buffer head and its ordered extent + * into the guarded_buffers list. We need to make sure + * we don't race with them, so we take the guarded_lock too. 
+ */ + spin_lock_irq(&EXT3_SB(inode->i_sb)->guarded_lock); + clear_buffer_dataguarded(entry->bh); + entry->bh->b_private = NULL; + brelse(entry->bh); + entry->bh = NULL; + spin_unlock_irq(&EXT3_SB(inode->i_sb)->guarded_lock); + + /* + * we must not clear entry->end_io_bh, that is set by + * the end_io handlers and will be cleared by the end_io + * workqueue + */ + + rb_erase(node, &tree->tree); + RB_CLEAR_NODE(node); + ext3_put_ordered_extent(entry); + return 0; +} + +/* + * After an extent is done, call this to conditionally update the on disk + * i_size. i_size is updated to cover any fully written part of the file. + * + * This returns < 0 on error, zero if no action needs to be taken and + * 1 if the inode must be logged. + */ +int ext3_ordered_update_i_size(struct inode *inode, + struct ext3_ordered_extent *entry) +{ + u64 new_i_size; + u64 i_size_test; + u64 disk_i_size; + struct rb_node *node; + struct ext3_ordered_extent *test; + unsigned blocksize = inode->i_sb->s_blocksize; + int ret = 0; + + disk_i_size = EXT3_I(inode)->i_disksize; + + /* + * if the disk i_size is already at the inode->i_size, or + * this ordered extent is inside the disk i_size, we're done + */ + if (disk_i_size >= inode->i_size || + entry_end(entry, blocksize) <= disk_i_size) + goto out; + + /* + * walk backward from this ordered extent to disk_i_size. + * if we find an ordered extent then we can't update disk i_size + * yet + */ + node = &entry->rb_node; + while (1) { + node = rb_prev(node); + if (!node) + break; + test = rb_entry(node, struct ext3_ordered_extent, rb_node); + if (entry_end(test, blocksize) <= disk_i_size) + break; + if (test->start >= inode->i_size) + break; + if (test->start >= disk_i_size) + goto out; + } + + /* from here, the caller will have to log something, set ret to 1 */ + ret = 1; + + /* + * at this point, we know we can safely update i_size to at least + * the offset from this ordered extent. 
But, we need to + walk forward and see if IOs from higher up in the file have + finished. + */ + node = rb_next(&entry->rb_node); + i_size_test = 0; + if (node) { + /* + * do we have an area where IO might have finished + * between our ordered extent and the next one? + */ + test = rb_entry(node, struct ext3_ordered_extent, rb_node); + if (test->start >= entry_end(entry, blocksize)) + i_size_test = test->start; + } else { + i_size_test = i_size_read(inode); + } + + new_i_size = min_t(u64, i_size_test, i_size_read(inode)); + + EXT3_I(inode)->i_disksize = new_i_size; +out: + return ret; +} + +/* + * during a truncate or delete, we need to get rid of pending + * ordered extents so there isn't a war over who updates disk i_size first. + * This does that, without waiting for any of the IO to actually finish. + * + * When the IO does finish, it will find the ordered extent removed from the + * tree and all will work properly. + */ +void ext3_truncate_ordered_extents(struct inode *inode, u64 offset) +{ + struct ext3_ordered_inode_tree *tree = &EXT3_I(inode)->ordered_tree; + struct rb_node *node; + struct ext3_ordered_extent *test; + + spin_lock(&tree->lock); + node = rb_last(&tree->tree); + while (node) { + /* truncate is going to end up waiting for pages that + * straddle the new i_size. So, we can drop the + * ordered record for anything that starts at or beyond + * the new i_size. + */ + test = rb_entry(node, struct ext3_ordered_extent, rb_node); + if (test->start < offset) + break; + node = rb_prev(&test->rb_node); + + /* + * once this is called, the end_io handler won't run, + * and we won't update disk_i_size to include this buffer. + * + * That's ok for truncates because the truncate code is + * writing a new i_size. 
+ * + * This ignores any IO in flight, which is ok + * because the guarded_buffers list has a reference + * on the ordered extent + */ + ext3_remove_ordered_extent(inode, test); + } + spin_unlock(&tree->lock); + return; + +} + +void ext3_ordered_inode_init(struct ext3_inode_info *ei) +{ + ei->ordered_tree.tree.rb_node = NULL; + spin_lock_init(&ei->ordered_tree.lock); +} + diff --git a/fs/ext3/super.c b/fs/ext3/super.c index 599dbfe..6a94647 100644 --- a/fs/ext3/super.c +++ b/fs/ext3/super.c @@ -37,6 +37,7 @@ #include <linux/quotaops.h> #include <linux/seq_file.h> #include <linux/log2.h> +#include <linux/workqueue.h> #include <asm/uaccess.h> @@ -399,6 +400,9 @@ static void ext3_put_super (struct super_block * sb) struct ext3_super_block *es = sbi->s_es; int i, err; + flush_workqueue(sbi->guarded_wq); + destroy_workqueue(sbi->guarded_wq); + ext3_xattr_put_super(sb); err = journal_destroy(sbi->s_journal); sbi->s_journal = NULL; @@ -468,6 +472,8 @@ static struct inode *ext3_alloc_inode(struct super_block *sb) #endif ei->i_block_alloc_info = NULL; ei->vfs_inode.i_version = 1; + ext3_ordered_inode_init(ei); + return &ei->vfs_inode; } @@ -481,6 +487,8 @@ static void ext3_destroy_inode(struct inode *inode) false); dump_stack(); } + if (EXT3_I(inode)->ordered_tree.tree.rb_node) + printk(KERN_INFO "EXT3 ordered tree not empty\n"); kmem_cache_free(ext3_inode_cachep, EXT3_I(inode)); } @@ -528,6 +536,13 @@ static void ext3_clear_inode(struct inode *inode) EXT3_I(inode)->i_default_acl = EXT3_ACL_NOT_CACHED; } #endif + /* + * If pages got cleaned by truncate, truncate should have + * gotten rid of the ordered extents. Just in case, drop them + * here. 
+ */ + ext3_truncate_ordered_extents(inode, 0); + ext3_discard_reservation(inode); EXT3_I(inode)->i_block_alloc_info = NULL; if (unlikely(rsv)) @@ -634,6 +649,8 @@ static int ext3_show_options(struct seq_file *seq, struct vfsmount *vfs) seq_puts(seq, ",data=journal"); else if (test_opt(sb, DATA_FLAGS) == EXT3_MOUNT_ORDERED_DATA) seq_puts(seq, ",data=ordered"); + else if (test_opt(sb, DATA_FLAGS) == EXT3_MOUNT_GUARDED_DATA) + seq_puts(seq, ",data=guarded"); else if (test_opt(sb, DATA_FLAGS) == EXT3_MOUNT_WRITEBACK_DATA) seq_puts(seq, ",data=writeback"); @@ -790,7 +807,7 @@ enum { Opt_reservation, Opt_noreservation, Opt_noload, Opt_nobh, Opt_bh, Opt_commit, Opt_journal_update, Opt_journal_inum, Opt_journal_dev, Opt_abort, Opt_data_journal, Opt_data_ordered, Opt_data_writeback, - Opt_data_err_abort, Opt_data_err_ignore, + Opt_data_guarded, Opt_data_err_abort, Opt_data_err_ignore, Opt_usrjquota, Opt_grpjquota, Opt_offusrjquota, Opt_offgrpjquota, Opt_jqfmt_vfsold, Opt_jqfmt_vfsv0, Opt_quota, Opt_noquota, Opt_ignore, Opt_barrier, Opt_err, Opt_resize, Opt_usrquota, @@ -832,6 +849,7 @@ static const match_table_t tokens = { {Opt_abort, "abort"}, {Opt_data_journal, "data=journal"}, {Opt_data_ordered, "data=ordered"}, + {Opt_data_guarded, "data=guarded"}, {Opt_data_writeback, "data=writeback"}, {Opt_data_err_abort, "data_err=abort"}, {Opt_data_err_ignore, "data_err=ignore"}, @@ -1034,6 +1052,9 @@ static int parse_options (char *options, struct super_block *sb, case Opt_data_ordered: data_opt = EXT3_MOUNT_ORDERED_DATA; goto datacheck; + case Opt_data_guarded: + data_opt = EXT3_MOUNT_GUARDED_DATA; + goto datacheck; case Opt_data_writeback: data_opt = EXT3_MOUNT_WRITEBACK_DATA; datacheck: @@ -1949,11 +1970,23 @@ static int ext3_fill_super (struct super_block *sb, void *data, int silent) clear_opt(sbi->s_mount_opt, NOBH); } } + + /* + * setup the guarded work list + */ + INIT_LIST_HEAD(&EXT3_SB(sb)->guarded_buffers); + INIT_WORK(&EXT3_SB(sb)->guarded_work, ext3_run_guarded_work); 
+ spin_lock_init(&EXT3_SB(sb)->guarded_lock); + EXT3_SB(sb)->guarded_wq = create_workqueue("ext3-guard"); + if (!EXT3_SB(sb)->guarded_wq) { + printk(KERN_ERR "EXT3-fs: failed to create workqueue\n"); + goto failed_mount_guard; + } + /* * The journal_load will have done any necessary log recovery, * so we can safely mount the rest of the filesystem now. */ - root = ext3_iget(sb, EXT3_ROOT_INO); if (IS_ERR(root)) { printk(KERN_ERR "EXT3-fs: get root inode failed\n"); @@ -1965,6 +1998,7 @@ static int ext3_fill_super (struct super_block *sb, void *data, int silent) printk(KERN_ERR "EXT3-fs: corrupt root inode, run e2fsck\n"); goto failed_mount4; } + sb->s_root = d_alloc_root(root); if (!sb->s_root) { printk(KERN_ERR "EXT3-fs: get root dentry failed\n"); @@ -1974,6 +2008,7 @@ static int ext3_fill_super (struct super_block *sb, void *data, int silent) } ext3_setup_super (sb, es, sb->s_flags & MS_RDONLY); + /* * akpm: core read_super() calls in here with the superblock locked. * That deadlocks, because orphan cleanup needs to lock the superblock @@ -1989,9 +2024,10 @@ static int ext3_fill_super (struct super_block *sb, void *data, int silent) printk (KERN_INFO "EXT3-fs: recovery complete.\n"); ext3_mark_recovery_complete(sb, es); printk (KERN_INFO "EXT3-fs: mounted filesystem with %s data mode.\n", - test_opt(sb,DATA_FLAGS) == EXT3_MOUNT_JOURNAL_DATA ? "journal": - test_opt(sb,DATA_FLAGS) == EXT3_MOUNT_ORDERED_DATA ? "ordered": - "writeback"); + test_opt(sb, DATA_FLAGS) == EXT3_MOUNT_JOURNAL_DATA ? "journal" : + test_opt(sb, DATA_FLAGS) == EXT3_MOUNT_GUARDED_DATA ? "guarded" : + test_opt(sb, DATA_FLAGS) == EXT3_MOUNT_ORDERED_DATA ? 
"ordered" : + "writeback"); lock_kernel(); return 0; @@ -2003,6 +2039,8 @@ cantfind_ext3: goto failed_mount; failed_mount4: + destroy_workqueue(EXT3_SB(sb)->guarded_wq); +failed_mount_guard: journal_destroy(sbi->s_journal); failed_mount3: percpu_counter_destroy(&sbi->s_freeblocks_counter); diff --git a/fs/jbd/transaction.c b/fs/jbd/transaction.c index ed886e6..1354a55 100644 --- a/fs/jbd/transaction.c +++ b/fs/jbd/transaction.c @@ -2018,6 +2018,7 @@ zap_buffer_unlocked: clear_buffer_mapped(bh); clear_buffer_req(bh); clear_buffer_new(bh); + clear_buffer_datanew(bh); bh->b_bdev = NULL; return may_free; } diff --git a/include/linux/ext3_fs.h b/include/linux/ext3_fs.h index 634a5e5..09faa88 100644 --- a/include/linux/ext3_fs.h +++ b/include/linux/ext3_fs.h @@ -18,6 +18,7 @@ #include <linux/types.h> #include <linux/magic.h> +#include <linux/workqueue.h> /* * The second extended filesystem constants/structures @@ -398,7 +399,6 @@ struct ext3_inode { #define EXT3_MOUNT_MINIX_DF 0x00080 /* Mimics the Minix statfs */ #define EXT3_MOUNT_NOLOAD 0x00100 /* Don't use existing journal*/ #define EXT3_MOUNT_ABORT 0x00200 /* Fatal error detected */ -#define EXT3_MOUNT_DATA_FLAGS 0x00C00 /* Mode for data writes: */ #define EXT3_MOUNT_JOURNAL_DATA 0x00400 /* Write data to journal */ #define EXT3_MOUNT_ORDERED_DATA 0x00800 /* Flush data before commit */ #define EXT3_MOUNT_WRITEBACK_DATA 0x00C00 /* No data ordering */ @@ -414,6 +414,12 @@ struct ext3_inode { #define EXT3_MOUNT_GRPQUOTA 0x200000 /* "old" group quota */ #define EXT3_MOUNT_DATA_ERR_ABORT 0x400000 /* Abort on file data write * error in ordered mode */ +#define EXT3_MOUNT_GUARDED_DATA 0x800000 /* guard new writes with + i_size */ +#define EXT3_MOUNT_DATA_FLAGS (EXT3_MOUNT_JOURNAL_DATA | \ + EXT3_MOUNT_ORDERED_DATA | \ + EXT3_MOUNT_WRITEBACK_DATA | \ + EXT3_MOUNT_GUARDED_DATA) /* Compatibility, for having both ext2_fs.h and ext3_fs.h included at once */ #ifndef _LINUX_EXT2_FS_H @@ -892,6 +898,7 @@ extern void 
ext3_get_inode_flags(struct ext3_inode_info *); extern void ext3_set_aops(struct inode *inode); extern int ext3_fiemap(struct inode *inode, struct fiemap_extent_info *fieinfo, u64 start, u64 len); +void ext3_run_guarded_work(struct work_struct *work); /* ioctl.c */ extern long ext3_ioctl(struct file *, unsigned int, unsigned long); @@ -945,7 +952,31 @@ extern const struct inode_operations ext3_special_inode_operations; extern const struct inode_operations ext3_symlink_inode_operations; extern const struct inode_operations ext3_fast_symlink_inode_operations; +/* ordered-data.c */ +int ext3_add_ordered_extent(struct inode *inode, u64 file_offset, + struct buffer_head *bh); +int ext3_put_ordered_extent(struct ext3_ordered_extent *entry); +int ext3_remove_ordered_extent(struct inode *inode, + struct ext3_ordered_extent *entry); +int ext3_ordered_update_i_size(struct inode *inode, + struct ext3_ordered_extent *entry); +void ext3_ordered_inode_init(struct ext3_inode_info *ei); +void ext3_truncate_ordered_extents(struct inode *inode, u64 offset); + +static inline void ext3_ordered_lock(struct inode *inode) +{ + spin_lock(&EXT3_I(inode)->ordered_tree.lock); +} +static inline void ext3_ordered_unlock(struct inode *inode) +{ + spin_unlock(&EXT3_I(inode)->ordered_tree.lock); +} + +static inline void ext3_get_ordered_extent(struct ext3_ordered_extent *entry) +{ + atomic_inc(&entry->refs); +} #endif /* __KERNEL__ */ #endif /* _LINUX_EXT3_FS_H */ diff --git a/include/linux/ext3_fs_i.h b/include/linux/ext3_fs_i.h index 7894dd0..399e3d8 100644 --- a/include/linux/ext3_fs_i.h +++ b/include/linux/ext3_fs_i.h @@ -20,6 +20,7 @@ #include <linux/rbtree.h> #include <linux/seqlock.h> #include <linux/mutex.h> +#include <linux/rbtree.h> /* data type for block offset of block group */ typedef int ext3_grpblk_t; @@ -65,6 +66,47 @@ struct ext3_block_alloc_info { #define rsv_end rsv_window._rsv_end /* + * used to prevent garbage in files after a crash by + * making sure i_size isn't updated 
until after the IO + * is done. + * + * See fs/ext3/ordered-data.c for the code that uses these. + */ +struct buffer_head; +struct ext3_ordered_inode_tree { + /* protects the tree and disk i_size */ + spinlock_t lock; + + /* rb tree of IO that needs to happen before i_size can be updated */ + struct rb_root tree; +}; + +struct ext3_ordered_extent { + /* logical offset of the block in the file */ + u64 start; + + /* buffer head being written */ + struct buffer_head *bh; + + /* + * set at end_io time so we properly + * do IO accounting even when this ordered + * extent struct has been removed from the + * rbtree + */ + struct buffer_head *end_io_bh; + + /* number of refs on this ordered extent */ + atomic_t refs; + + /* for the rb tree */ + struct rb_node rb_node; + + /* list of things being processed by the workqueue */ + struct list_head list; +}; + +/* * third extended file system inode data in memory */ struct ext3_inode_info { @@ -141,6 +183,8 @@ struct ext3_inode_info { * by other means, so we have truncate_mutex. 
*/ struct mutex truncate_mutex; + + struct ext3_ordered_inode_tree ordered_tree; struct inode vfs_inode; }; diff --git a/include/linux/ext3_fs_sb.h b/include/linux/ext3_fs_sb.h index f07f34d..5dbdbeb 100644 --- a/include/linux/ext3_fs_sb.h +++ b/include/linux/ext3_fs_sb.h @@ -21,6 +21,7 @@ #include <linux/wait.h> #include <linux/blockgroup_lock.h> #include <linux/percpu_counter.h> +#include <linux/workqueue.h> #endif #include <linux/rbtree.h> @@ -82,6 +83,11 @@ struct ext3_sb_info { char *s_qf_names[MAXQUOTAS]; /* Names of quota files with journalled quota */ int s_jquota_fmt; /* Format of quota to use */ #endif + + struct workqueue_struct *guarded_wq; + struct work_struct guarded_work; + struct list_head guarded_buffers; + spinlock_t guarded_lock; }; static inline spinlock_t * diff --git a/include/linux/ext3_jbd.h b/include/linux/ext3_jbd.h index cf82d51..45cb4aa 100644 --- a/include/linux/ext3_jbd.h +++ b/include/linux/ext3_jbd.h @@ -212,6 +212,17 @@ static inline int ext3_should_order_data(struct inode *inode) return 0; } +static inline int ext3_should_guard_data(struct inode *inode) +{ + if (!S_ISREG(inode->i_mode)) + return 0; + if (EXT3_I(inode)->i_flags & EXT3_JOURNAL_DATA_FL) + return 0; + if (test_opt(inode->i_sb, GUARDED_DATA) == EXT3_MOUNT_GUARDED_DATA) + return 1; + return 0; +} + static inline int ext3_should_writeback_data(struct inode *inode) { if (!S_ISREG(inode->i_mode)) diff --git a/include/linux/jbd.h b/include/linux/jbd.h index 53ae439..6b38b50 100644 --- a/include/linux/jbd.h +++ b/include/linux/jbd.h @@ -291,6 +291,13 @@ enum jbd_state_bits { BH_State, /* Pins most journal_head state */ BH_JournalHead, /* Pins bh->b_private and jh->b_bh */ BH_Unshadow, /* Dummy bit, for BJ_Shadow wakeup filtering */ + BH_DataGuarded, /* ext3 data=guarded mode buffer + * these have something other than a + * journal_head at b_private */ + BH_DataNew, /* BH_new gets cleared too early for + * data=guarded to use it. So, + * this gets set instead. 
+ */ }; BUFFER_FNS(JBD, jbd) @@ -302,6 +309,9 @@ TAS_BUFFER_FNS(Revoked, revoked) BUFFER_FNS(RevokeValid, revokevalid) TAS_BUFFER_FNS(RevokeValid, revokevalid) BUFFER_FNS(Freed, freed) +BUFFER_FNS(DataGuarded, dataguarded) +BUFFER_FNS(DataNew, datanew) +TAS_BUFFER_FNS(DataNew, datanew) static inline struct buffer_head *jh2bh(struct journal_head *jh) { -- 1.6.0.2
* [PATCH] Add ext3 data=guarded mode 2009-04-15 17:22 ` [PATCH 3/3] Add ext3 data=guarded mode Chris Mason @ 2009-04-16 19:42 ` Chris Mason 2009-04-17 11:04 ` Mike Galbraith ` (2 more replies) 0 siblings, 3 replies; 34+ messages in thread From: Chris Mason @ 2009-04-16 19:42 UTC (permalink / raw) To: Jan Kara Cc: Linus Torvalds, Theodore Ts'o, Linux Kernel Developers List, Ext4 Developers List Hello everyone, Here's an updated (v4) patch for ext3 data=guarded mode. The first two patches in the series are unchanged, and it looks like Linus pulled them in this morning. This version fixes the bug Mike Galbraith hit. The problem was with writes to the last block in the file that made the file bigger but didn't actually add a new block. Since the block wasn't new, it wouldn't get sent through the guarded code and the on disk i_size wasn't always updated. --- ext3 data=ordered mode makes sure that data blocks are on disk before the metadata that references them, which avoids files full of garbage or previously deleted data after a crash. It does this by adding every dirty buffer onto a list of things that must be written before a commit. This makes every fsync write out all the dirty data on the entire FS, which has high latencies and is generally much more expensive than it needs to be. Another way to avoid exposing stale data after a crash is to wait until after the data buffers are written before updating the on-disk record of the file's size. If we crash before the data IO is done, i_size doesn't yet include the new blocks and no stale data is exposed. This patch adds the delayed i_size update to ext3, along with a new mount option (data=guarded) to enable it. The basic mechanism works like this: * Change block_write_full_page to take an end_io handler as a parameter. This allows us to make an end_io handler that queues buffer heads for a workqueue where the real work of updating the on disk i_size is done. 
* Add an rbtree to the in-memory ext3 inode for tracking data=guarded buffer heads that are waiting to be sent to disk. * Add an ext3 guarded write_end call to add buffer heads for newly allocated blocks into the rbtree. If we have a newly allocated block that is filling a hole inside i_size, this is done as an old style data=ordered write instead. * Add an ext3 guarded writepage call that uses a special buffer head end_io handler for buffers that are marked as guarded. Again, if we find newly allocated blocks filling holes, they are sent through data=ordered instead of data=guarded. * When a guarded IO finishes, kick a per-FS workqueue to do the on disk i_size updates. The workqueue function must be very careful. We only update the on disk i_size if all of the IO between the old on disk i_size and the new on disk i_size is complete. This is why an rbtree is used to track the pending buffers; that way we can verify all of the IO is actually done. The on disk i_size is incrementally updated to the largest safe value every time an IO completes. * When we start tracking guarded buffers on a given inode, we put the inode into ext3's orphan list. This way if we do crash, the file will be truncated back down to the on disk i_size and we'll free any blocks that were not completely written. The inode is removed from the orphan list only after all the guarded buffers are done. 
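The i_size rule from the workqueue step above can be modeled in a few lines of userspace C. This is only a sketch under stated assumptions: the `extent` struct, the flat array, and `safe_disk_isize` are hypothetical stand-ins for the patch's rbtree of `ext3_ordered_extent`s; the invariant it demonstrates is the one described above, that the on-disk i_size may only advance over regions whose IO has completed, and never moves backward.

```c
#include <stdint.h>

/* Hypothetical model: one pending guarded write per entry.  The real
 * patch tracks these in a per-inode rbtree of ext3_ordered_extent. */
struct extent {
	uint64_t start;   /* logical offset of the write in the file */
	int done;         /* has the IO completed? */
};

/*
 * Compute the new safe on-disk i_size: it may advance up to the
 * in-memory i_size, but never past the start of any extent whose IO
 * has not finished, and it never shrinks.
 */
static uint64_t safe_disk_isize(const struct extent *e, int n,
				uint64_t disk_isize, uint64_t i_size)
{
	uint64_t limit = i_size;
	int i;

	for (i = 0; i < n; i++)
		if (!e[i].done && e[i].start < limit)
			limit = e[i].start;  /* can't cross pending IO */

	return limit > disk_isize ? limit : disk_isize;
}
```

For example, with 4KiB writes at offsets 0, 4096, and 8192 where only the middle one is still in flight, the disk i_size can advance to 4096 now and to the full size once the middle IO completes, so a crash in between never exposes an unwritten region.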
Signed-off-by: Chris Mason <chris.mason@oracle.com> --- fs/ext3/Makefile | 3 +- fs/ext3/fsync.c | 12 + fs/ext3/inode.c | 564 +++++++++++++++++++++++++++++++++++++++++++- fs/ext3/namei.c | 3 +- fs/ext3/ordered-data.c | 342 +++++++++++++++++++++++++++ fs/ext3/super.c | 48 ++++- fs/jbd/transaction.c | 1 + include/linux/ext3_fs.h | 33 +++- include/linux/ext3_fs_i.h | 44 ++++ include/linux/ext3_fs_sb.h | 6 + include/linux/ext3_jbd.h | 11 + include/linux/jbd.h | 10 + 12 files changed, 1058 insertions(+), 19 deletions(-) diff --git a/fs/ext3/Makefile b/fs/ext3/Makefile index e77766a..f3a9dc1 100644 --- a/fs/ext3/Makefile +++ b/fs/ext3/Makefile @@ -5,7 +5,8 @@ obj-$(CONFIG_EXT3_FS) += ext3.o ext3-y := balloc.o bitmap.o dir.o file.o fsync.o ialloc.o inode.o \ - ioctl.o namei.o super.o symlink.o hash.o resize.o ext3_jbd.o + ioctl.o namei.o super.o symlink.o hash.o resize.o ext3_jbd.o \ + ordered-data.o ext3-$(CONFIG_EXT3_FS_XATTR) += xattr.o xattr_user.o xattr_trusted.o ext3-$(CONFIG_EXT3_FS_POSIX_ACL) += acl.o diff --git a/fs/ext3/fsync.c b/fs/ext3/fsync.c index d336341..a50abb4 100644 --- a/fs/ext3/fsync.c +++ b/fs/ext3/fsync.c @@ -59,6 +59,11 @@ int ext3_sync_file(struct file * file, struct dentry *dentry, int datasync) * sync_inode() will write the inode if it is dirty. Then the caller's * filemap_fdatawait() will wait on the pages. * + * data=guarded: + * The caller's filemap_fdatawrite will start the IO, and we + * use filemap_fdatawait here to make sure all the disk i_size updates + * are done before we commit the inode. + * * data=journal: * filemap_fdatawrite won't do anything (the buffers are clean). 
* ext3_force_commit will write the file data into the journal and @@ -84,6 +89,13 @@ int ext3_sync_file(struct file * file, struct dentry *dentry, int datasync) .sync_mode = WB_SYNC_ALL, .nr_to_write = 0, /* sys_fsync did this */ }; + /* + * the new disk i_size must be logged before we commit, + * so we wait here for pending writeback + */ + if (ext3_should_guard_data(inode)) + filemap_write_and_wait(inode->i_mapping); + ret = sync_inode(inode, &wbc); } out: diff --git a/fs/ext3/inode.c b/fs/ext3/inode.c index fcfa243..94964f7 100644 --- a/fs/ext3/inode.c +++ b/fs/ext3/inode.c @@ -38,6 +38,7 @@ #include <linux/bio.h> #include <linux/fiemap.h> #include <linux/namei.h> +#include <linux/workqueue.h> #include "xattr.h" #include "acl.h" @@ -179,6 +180,105 @@ static int ext3_journal_test_restart(handle_t *handle, struct inode *inode) } /* + * after a data=guarded IO is done, we need to update the + * disk i_size to reflect the data we've written. If there are + * no more ordered data extents left in the tree, we need to + * get rid of the orphan entry making sure the file's + * block pointers match the i_size after a crash + * + * When we aren't in data=guarded mode, this just does an ext3_orphan_del. + * + * It returns the result of ext3_orphan_del. + * + * handle may be null if we are just cleaning up the orphan list in + * memory. + * + * pass must_log == 1 when the inode must be logged in order to get + * an i_size update on disk + */ +static int ordered_orphan_del(handle_t *handle, struct inode *inode, + int must_log) +{ + int ret = 0; + + /* fast out when data=guarded isn't on */ + if (!ext3_should_guard_data(inode)) + return ext3_orphan_del(handle, inode); + + ext3_ordered_lock(inode); + if (inode->i_nlink && + RB_EMPTY_ROOT(&EXT3_I(inode)->ordered_tree.tree)) { + ext3_ordered_unlock(inode); + + /* + * if we aren't actually on the orphan list, the orphan + * del won't log our inode. 
Log it now to make sure + */ + ext3_mark_inode_dirty(handle, inode); + + ret = ext3_orphan_del(handle, inode); + if (ret || !handle) + goto err; + + /* + * now we check again to see if we might have dropped + * the orphan just after someone added a new ordered extent + */ + ext3_ordered_lock(inode); + if (!RB_EMPTY_ROOT(&EXT3_I(inode)->ordered_tree.tree) && + list_empty(&EXT3_I(inode)->i_orphan)) { + ext3_ordered_unlock(inode); + ret = ext3_orphan_add(handle, inode); + if (ret) + goto err; + } else { + ext3_ordered_unlock(inode); + } + } else if (handle && must_log) { + ext3_ordered_unlock(inode); + + /* + * we need to make sure any updates done by the data=guarded + * code end up in the inode on disk. Log the inode + * here + */ + ext3_mark_inode_dirty(handle, inode); + } else { + ext3_ordered_unlock(inode); + } + +err: + return ret; +} + +/* + * Wrapper around ordered_orphan_del that starts a transaction + */ +static void ordered_orphan_del_trans(struct inode *inode, int must_log) +{ + handle_t *handle; + + handle = ext3_journal_start(inode, 3); + + /* + * uhoh, should we flag the FS as readonly here? ext3_dirty_inode + * doesn't, which is what we're modeling ourselves after. + * + * We do need to make sure to get this inode off the ordered list + * when the transaction start fails though. ordered_orphan_del + * does the right thing. + */ + if (IS_ERR(handle)) { + ordered_orphan_del(NULL, inode, 0); + return; + } + + ordered_orphan_del(handle, inode, must_log); + ext3_journal_stop(handle); +} + + +/* * Called at the last iput() if i_nlink is zero. */ void ext3_delete_inode (struct inode * inode) @@ -204,6 +304,13 @@ void ext3_delete_inode (struct inode * inode) if (IS_SYNC(inode)) handle->h_sync = 1; inode->i_size = 0; + + /* + * make sure we clean up any ordered extents that didn't get + * IO started on them because i_size shrunk down to zero. 
+ */ + ext3_truncate_ordered_extents(inode, 0); + if (inode->i_blocks) ext3_truncate(inode); /* @@ -767,6 +874,24 @@ err_out: } /* + * This protects the disk i_size with the spinlock for the ordered + * extent tree. It returns 1 when the inode needs to be logged + * because the i_disksize has been updated. + */ +static int maybe_update_disk_isize(struct inode *inode, loff_t new_size) +{ + int ret = 0; + + ext3_ordered_lock(inode); + if (EXT3_I(inode)->i_disksize < new_size) { + EXT3_I(inode)->i_disksize = new_size; + ret = 1; + } + ext3_ordered_unlock(inode); + return ret; +} + +/* * Allocation strategy is simple: if we have to allocate something, we will * have to go the whole way to leaf. So let's do it before attaching anything * to tree, set linkage between the newborn blocks, write them if sync is @@ -815,6 +940,7 @@ int ext3_get_blocks_handle(handle_t *handle, struct inode *inode, if (!partial) { first_block = le32_to_cpu(chain[depth - 1].key); clear_buffer_new(bh_result); + clear_buffer_datanew(bh_result); count++; /*map more blocks*/ while (count < maxblocks && count <= blocks_to_boundary) { @@ -873,6 +999,7 @@ int ext3_get_blocks_handle(handle_t *handle, struct inode *inode, if (err) goto cleanup; clear_buffer_new(bh_result); + clear_buffer_datanew(bh_result); goto got_it; } } @@ -915,14 +1042,19 @@ int ext3_get_blocks_handle(handle_t *handle, struct inode *inode, * i_disksize growing is protected by truncate_mutex. Don't forget to * protect it if you're about to implement concurrent * ext3_get_block() -bzzz + * + * FIXME, I think this only needs to extend the disk i_size when + * we're filling holes that came from using ftruncate to increase + * i_size. Need to verify. 
*/ - if (!err && extend_disksize && inode->i_size > ei->i_disksize) - ei->i_disksize = inode->i_size; + if (!ext3_should_guard_data(inode) && !err && extend_disksize) + maybe_update_disk_isize(inode, inode->i_size); mutex_unlock(&ei->truncate_mutex); if (err) goto cleanup; set_buffer_new(bh_result); + set_buffer_datanew(bh_result); got_it: map_bh(bh_result, inode->i_sb, le32_to_cpu(chain[depth-1].key)); if (count > blocks_to_boundary) @@ -1079,6 +1211,77 @@ struct buffer_head *ext3_bread(handle_t *handle, struct inode *inode, return NULL; } +/* + * data=guarded updates are handled in a workqueue after the IO + * is done. This runs through the list of buffer heads that are pending + * processing. + */ +void ext3_run_guarded_work(struct work_struct *work) +{ + struct ext3_sb_info *sbi = + container_of(work, struct ext3_sb_info, guarded_work); + struct buffer_head *bh; + struct ext3_ordered_extent *ordered; + struct inode *inode; + struct page *page; + int must_log; + + spin_lock_irq(&sbi->guarded_lock); + while (!list_empty(&sbi->guarded_buffers)) { + ordered = list_entry(sbi->guarded_buffers.next, + struct ext3_ordered_extent, list); + + list_del(&ordered->list); + + bh = ordered->end_io_bh; + ordered->end_io_bh = NULL; + must_log = 0; + + /* we don't need a reference on the buffer head because + * it is locked until the end_io handler is called. + * + * This means the page can't go away, which means the + * inode can't go away + */ + spin_unlock_irq(&sbi->guarded_lock); + + page = bh->b_page; + inode = page->mapping->host; + + ext3_ordered_lock(inode); + if (ordered->bh) { + /* + * someone might have decided this buffer didn't + * really need to be ordered and removed us from + * the rb tree. They set ordered->bh to null + * when that happens. 
+ */ + must_log = ext3_ordered_update_i_size(inode, ordered); + ext3_remove_ordered_extent(inode, ordered); + } + ext3_ordered_unlock(inode); + + /* + * drop the reference taken when this ordered extent was + * put onto the guarded_buffers list + */ + ext3_put_ordered_extent(ordered); + + /* + * maybe log the inode and/or cleanup the orphan entry + */ + ordered_orphan_del_trans(inode, must_log > 0); + + /* + * finally, call the real bh end_io function to do + * all the hard work of maintaining page writeback. + */ + end_buffer_async_write(bh, buffer_uptodate(bh)); + spin_lock_irq(&sbi->guarded_lock); + } + spin_unlock_irq(&sbi->guarded_lock); +} + static int walk_page_buffers( handle_t *handle, struct buffer_head *head, unsigned from, @@ -1185,6 +1388,7 @@ retry: ret = walk_page_buffers(handle, page_buffers(page), from, to, NULL, do_journal_get_write_access); } + write_begin_failed: if (ret) { /* @@ -1212,7 +1416,13 @@ out: int ext3_journal_dirty_data(handle_t *handle, struct buffer_head *bh) { - int err = journal_dirty_data(handle, bh); + int err; + + /* don't take buffers from the data=guarded list */ + if (buffer_dataguarded(bh)) + return 0; + + err = journal_dirty_data(handle, bh); if (err) ext3_journal_abort_handle(__func__, __func__, bh, handle, err); @@ -1231,6 +1441,89 @@ static int journal_dirty_data_fn(handle_t *handle, struct buffer_head *bh) return 0; } +/* + * Walk the buffers in a page for data=guarded mode. Buffers that + * are not marked as datanew are ignored. + * + * New buffers outside i_size are sent to the data guarded code + * + * We must do the old data=ordered mode when filling holes in the + * file, since i_size doesn't protect these at all. + */ +static int journal_dirty_data_guarded_fn(handle_t *handle, + struct buffer_head *bh) +{ + u64 offset = page_offset(bh->b_page) + bh_offset(bh); + struct inode *inode = bh->b_page->mapping->host; + int ret = 0; + + /* + * Write could have mapped the buffer but it didn't copy the data in + * yet. 
So avoid filing such buffer into a transaction. + */ + if (!buffer_mapped(bh) || !buffer_uptodate(bh)) + return 0; + + if (test_clear_buffer_datanew(bh)) { + /* + * if we're filling a hole inside i_size, we need to + * fall back to the old style data=ordered + */ + if (offset < inode->i_size) { + ret = ext3_journal_dirty_data(handle, bh); + goto out; + } + ret = ext3_add_ordered_extent(inode, offset, bh); + + /* if we crash before the IO is done, i_size will be small + * but these blocks will still be allocated to the file. + * + * So, add an orphan entry for the file, which will truncate it + * down to the i_size it finds after the crash. + * + * The orphan is cleaned up when the IO is done. We + * don't add orphans while mount is running the orphan list, + * that seems to corrupt the list. + */ + if (ret == 0 && buffer_dataguarded(bh) && + list_empty(&EXT3_I(inode)->i_orphan) && + !(EXT3_SB(inode->i_sb)->s_mount_state & EXT3_ORPHAN_FS)) { + ret = ext3_orphan_add(handle, inode); + } + } +out: + return ret; +} + +/* + * Walk the buffers in a page for data=guarded mode for writepage. + * + * We must do the old data=ordered mode when filling holes in the + * file, since i_size doesn't protect these at all. + * + * This is actually called after writepage is run and so we can't + * trust anything other than the buffer head (which we have pinned). + * + * Any datanew buffer at writepage time is filling a hole, so we don't need + * extra tests against the inode size. + */ +static int journal_dirty_data_guarded_writepage_fn(handle_t *handle, + struct buffer_head *bh) +{ + int ret = 0; + + /* + * Write could have mapped the buffer but it didn't copy the data in + * yet. So avoid filing such buffer into a transaction. 
+ */ + if (!buffer_mapped(bh) || !buffer_uptodate(bh)) + return 0; + + if (test_clear_buffer_datanew(bh)) + ret = ext3_journal_dirty_data(handle, bh); + return ret; +} + /* For write_end() in data=journal mode */ static int write_end_fn(handle_t *handle, struct buffer_head *bh) { @@ -1251,10 +1544,8 @@ static void update_file_sizes(struct inode *inode, loff_t pos, unsigned copied) /* What matters to us is i_disksize. We don't write i_size anywhere */ if (pos + copied > inode->i_size) i_size_write(inode, pos + copied); - if (pos + copied > EXT3_I(inode)->i_disksize) { - EXT3_I(inode)->i_disksize = pos + copied; + if (maybe_update_disk_isize(inode, pos + copied)) mark_inode_dirty(inode); - } } /* @@ -1300,6 +1591,65 @@ static int ext3_ordered_write_end(struct file *file, return ret ? ret : copied; } +static int ext3_guarded_write_end(struct file *file, + struct address_space *mapping, + loff_t pos, unsigned len, unsigned copied, + struct page *page, void *fsdata) +{ + handle_t *handle = ext3_journal_current_handle(); + struct inode *inode = file->f_mapping->host; + unsigned from, to; + int ret = 0, ret2; + + copied = block_write_end(file, mapping, pos, len, copied, + page, fsdata); + + from = pos & (PAGE_CACHE_SIZE - 1); + to = from + copied; + ret = walk_page_buffers(handle, page_buffers(page), + from, to, NULL, journal_dirty_data_guarded_fn); + + /* + * we only update the in-memory i_size. The disk i_size is done + * by the end io handlers + */ + if (ret == 0 && pos + copied > inode->i_size) { + int must_log; + + i_size_write(inode, pos + copied); + + /* we've updated i_size, but we may have raced with a + * data=guarded end_io handler. + * + * All the guarded IO could have ended while i_size was still + * small, and if we're just adding bytes into an existing block + * in the file, we may not be adding a new guarded IO with this + * write. So, do a check on the disk i_size and make sure it + * is updated to the highest safe value. 
+ */ + ext3_ordered_lock(inode); + must_log = ext3_ordered_update_i_size(inode, NULL); + ext3_ordered_unlock(inode); + ordered_orphan_del_trans(inode, must_log > 0); + } + + /* + * There may be allocated blocks outside of i_size because + * we failed to copy some data. Prepare for truncate. + */ + if (pos + len > inode->i_size) + ext3_orphan_add(handle, inode); + ret2 = ext3_journal_stop(handle); + if (!ret) + ret = ret2; + unlock_page(page); + page_cache_release(page); + + if (pos + len > inode->i_size) + vmtruncate(inode, inode->i_size); + return ret ? ret : copied; +} + static int ext3_writeback_write_end(struct file *file, struct address_space *mapping, loff_t pos, unsigned len, unsigned copied, @@ -1311,6 +1661,7 @@ static int ext3_writeback_write_end(struct file *file, copied = block_write_end(file, mapping, pos, len, copied, page, fsdata); update_file_sizes(inode, pos, copied); + /* * There may be allocated blocks outside of i_size because * we failed to copy some data. Prepare for truncate. @@ -1574,6 +1925,144 @@ out_fail: return ret; } +/* + * Completion handler for block_write_full_page(). This will + * kick off the data=guarded workqueue as the IO finishes. + */ +static void end_buffer_async_write_guarded(struct buffer_head *bh, + int uptodate) +{ + struct ext3_sb_info *sbi; + struct address_space *mapping; + struct ext3_ordered_extent *ordered; + unsigned long flags; + + mapping = bh->b_page->mapping; + if (!mapping || !bh->b_private || !buffer_dataguarded(bh)) { +noguard: + end_buffer_async_write(bh, uptodate); + return; + } + + /* + * the guarded workqueue function checks the uptodate bit on the + * bh and uses that to tell the real end_io handler if things worked + * out or not. 
+ */ + if (uptodate) + set_buffer_uptodate(bh); + else + clear_buffer_uptodate(bh); + + sbi = EXT3_SB(mapping->host->i_sb); + + spin_lock_irqsave(&sbi->guarded_lock, flags); + + /* + * remove any chance that a truncate raced in and cleared + * our dataguard flag, which also freed the ordered extent in + * our b_private. + */ + if (!buffer_dataguarded(bh)) { + spin_unlock_irqrestore(&sbi->guarded_lock, flags); + goto noguard; + } + ordered = bh->b_private; + WARN_ON(ordered->end_io_bh); + + /* + * use the special end_io_bh pointer to make sure that + * some form of end_io handler is run on this bh, even + * if the ordered_extent is removed from the rb tree before + * our workqueue ends up processing it. + */ + ordered->end_io_bh = bh; + list_add_tail(&ordered->list, &sbi->guarded_buffers); + ext3_get_ordered_extent(ordered); + spin_unlock_irqrestore(&sbi->guarded_lock, flags); + + queue_work(sbi->guarded_wq, &sbi->guarded_work); +} + +static int ext3_guarded_writepage(struct page *page, + struct writeback_control *wbc) +{ + struct inode *inode = page->mapping->host; + struct buffer_head *page_bufs; + handle_t *handle = NULL; + int ret = 0; + int err; + + J_ASSERT(PageLocked(page)); + + /* + * We give up here if we're reentered, because it might be for a + * different filesystem. 
+ */ + if (ext3_journal_current_handle()) + goto out_fail; + + if (!page_has_buffers(page)) { + create_empty_buffers(page, inode->i_sb->s_blocksize, + (1 << BH_Dirty)|(1 << BH_Uptodate)); + page_bufs = page_buffers(page); + } else { + page_bufs = page_buffers(page); + if (!walk_page_buffers(NULL, page_bufs, 0, PAGE_CACHE_SIZE, + NULL, buffer_unmapped)) { + /* Provide NULL get_block() to catch bugs if buffers + * weren't really mapped */ + return block_write_full_page_endio(page, NULL, wbc, + end_buffer_async_write_guarded); + } + } + handle = ext3_journal_start(inode, ext3_writepage_trans_blocks(inode)); + + if (IS_ERR(handle)) { + ret = PTR_ERR(handle); + goto out_fail; + } + + walk_page_buffers(handle, page_bufs, 0, + PAGE_CACHE_SIZE, NULL, bget_one); + + ret = block_write_full_page_endio(page, ext3_get_block, wbc, + end_buffer_async_write_guarded); + + /* + * The page can become unlocked at any point now, and + * truncate can then come in and change things. So we + * can't touch *page from now on. But *page_bufs is + * safe due to elevated refcount. + */ + + /* + * And attach them to the current transaction. But only if + * block_write_full_page() succeeded. Otherwise they are unmapped, + * and generally junk. 
+ */ + if (ret == 0) { + err = walk_page_buffers(handle, page_bufs, 0, PAGE_CACHE_SIZE, + NULL, journal_dirty_data_guarded_writepage_fn); + if (!ret) + ret = err; + } + walk_page_buffers(handle, page_bufs, 0, + PAGE_CACHE_SIZE, NULL, bput_one); + err = ext3_journal_stop(handle); + if (!ret) + ret = err; + + return ret; + +out_fail: + redirty_page_for_writepage(wbc, page); + unlock_page(page); + return ret; +} + + + static int ext3_writeback_writepage(struct page *page, struct writeback_control *wbc) { @@ -1768,8 +2257,10 @@ static ssize_t ext3_direct_IO(int rw, struct kiocb *iocb, ret = PTR_ERR(handle); goto out; } + if (inode->i_nlink) - ext3_orphan_del(handle, inode); + ordered_orphan_del(handle, inode, 0); + if (ret > 0) { loff_t end = offset + ret; if (end > inode->i_size) { @@ -1842,6 +2333,21 @@ static const struct address_space_operations ext3_writeback_aops = { .is_partially_uptodate = block_is_partially_uptodate, }; +static const struct address_space_operations ext3_guarded_aops = { + .readpage = ext3_readpage, + .readpages = ext3_readpages, + .writepage = ext3_guarded_writepage, + .sync_page = block_sync_page, + .write_begin = ext3_write_begin, + .write_end = ext3_guarded_write_end, + .bmap = ext3_bmap, + .invalidatepage = ext3_invalidatepage, + .releasepage = ext3_releasepage, + .direct_IO = ext3_direct_IO, + .migratepage = buffer_migrate_page, + .is_partially_uptodate = block_is_partially_uptodate, +}; + static const struct address_space_operations ext3_journalled_aops = { .readpage = ext3_readpage, .readpages = ext3_readpages, @@ -1860,6 +2366,8 @@ void ext3_set_aops(struct inode *inode) { if (ext3_should_order_data(inode)) inode->i_mapping->a_ops = &ext3_ordered_aops; + else if (ext3_should_guard_data(inode)) + inode->i_mapping->a_ops = &ext3_guarded_aops; else if (ext3_should_writeback_data(inode)) inode->i_mapping->a_ops = &ext3_writeback_aops; else @@ -2376,7 +2884,8 @@ void ext3_truncate(struct inode *inode) if (!ext3_can_truncate(inode)) return; 
- if (inode->i_size == 0 && ext3_should_writeback_data(inode)) + if (inode->i_size == 0 && (ext3_should_writeback_data(inode) || + ext3_should_guard_data(inode))) ei->i_state |= EXT3_STATE_FLUSH_ON_CLOSE; /* @@ -3103,10 +3612,39 @@ int ext3_setattr(struct dentry *dentry, struct iattr *attr) ext3_journal_stop(handle); } + if (S_ISREG(inode->i_mode) && (attr->ia_valid & ATTR_SIZE)) { + /* + * we need to make sure any data=guarded pages + * are on disk before we force a new disk i_size + * down into the inode. The crucial range is + * anything between the disksize on disk now + * and the new size we're going to set. + * + * We're holding i_mutex here, so we know new + * ordered extents are not going to appear in the inode + * + * This must be done both for truncates that make the + * file bigger and smaller because truncate messes around + * with the orphan inode list in both cases. + */ + if (ext3_should_guard_data(inode)) { + filemap_write_and_wait_range(inode->i_mapping, + EXT3_I(inode)->i_disksize, + (loff_t)-1); + /* + * we've written everything, make sure all + * the ordered extents are really gone. + * + * This prevents leaking of ordered extents + * and it also makes sure the ordered extent code + * doesn't mess with the orphan link + */ + ext3_truncate_ordered_extents(inode, 0); + } + } if (S_ISREG(inode->i_mode) && attr->ia_valid & ATTR_SIZE && attr->ia_size < inode->i_size) { handle_t *handle; - handle = ext3_journal_start(inode, 3); if (IS_ERR(handle)) { error = PTR_ERR(handle); @@ -3114,6 +3652,7 @@ int ext3_setattr(struct dentry *dentry, struct iattr *attr) } error = ext3_orphan_add(handle, inode); + EXT3_I(inode)->i_disksize = attr->ia_size; rc = ext3_mark_inode_dirty(handle, inode); if (!error) @@ -3125,8 +3664,11 @@ int ext3_setattr(struct dentry *dentry, struct iattr *attr) /* If inode_setattr's call to ext3_truncate failed to get a * transaction handle at all, we need to clean up the in-core - * orphan list manually. 
*/ - if (inode->i_nlink) + * orphan list manually. Because we've finished off all the + * guarded IO above, this doesn't hurt anything for the guarded + * code + */ + if (inode->i_nlink && (attr->ia_valid & ATTR_SIZE)) ext3_orphan_del(NULL, inode); if (!rc && (ia_valid & ATTR_MODE)) diff --git a/fs/ext3/namei.c b/fs/ext3/namei.c index 6ff7b97..ac3991a 100644 --- a/fs/ext3/namei.c +++ b/fs/ext3/namei.c @@ -2410,7 +2410,8 @@ static int ext3_rename (struct inode * old_dir, struct dentry *old_dentry, ext3_mark_inode_dirty(handle, new_inode); if (!new_inode->i_nlink) ext3_orphan_add(handle, new_inode); - if (ext3_should_writeback_data(new_inode)) + if (ext3_should_writeback_data(new_inode) || + ext3_should_guard_data(new_inode)) flush_file = 1; } retval = 0; diff --git a/fs/ext3/ordered-data.c b/fs/ext3/ordered-data.c new file mode 100644 index 0000000..d3ebccb --- /dev/null +++ b/fs/ext3/ordered-data.c @@ -0,0 +1,342 @@ +/* + * Copyright (C) 2009 Oracle. All rights reserved. + * + * This program is free software; you can redistribute it and/or + * modify it under the terms of the GNU General Public + * License v2 as published by the Free Software Foundation. + * + * This program is distributed in the hope that it will be useful, + * but WITHOUT ANY WARRANTY; without even the implied warranty of + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU + * General Public License for more details. + * + * You should have received a copy of the GNU General Public + * License along with this program; if not, write to the + * Free Software Foundation, Inc., 59 Temple Place - Suite 330, + * Boston, MA 021110-1307, USA. + */ + +#include <linux/gfp.h> +#include <linux/slab.h> +#include <linux/blkdev.h> +#include <linux/writeback.h> +#include <linux/pagevec.h> +#include <linux/buffer_head.h> +#include <linux/ext3_jbd.h> + + +/* + * These routines actually implement data=guarded mode, but the + * idea is that it may replace data=ordered mode as it becomes stable. 
+ */ + +static u64 entry_end(struct ext3_ordered_extent *entry, unsigned blocksize) +{ + return entry->start + blocksize; +} + +/* + * returns NULL if the insertion worked, or it returns the node it did find + * in the tree + */ +static struct rb_node *tree_insert(struct rb_root *root, u64 start, + unsigned blocksize, struct rb_node *node) +{ + struct rb_node **p = &root->rb_node; + struct rb_node *parent = NULL; + struct ext3_ordered_extent *entry; + + while (*p) { + parent = *p; + entry = rb_entry(parent, struct ext3_ordered_extent, rb_node); + + if (start < entry->start) + p = &(*p)->rb_left; + else if (start >= entry_end(entry, blocksize)) + p = &(*p)->rb_right; + else + return parent; + } + + rb_link_node(node, parent, p); + rb_insert_color(node, root); + return NULL; +} + +/* + * find the ordered struct that has this offset + */ +static inline struct rb_node *tree_search(struct ext3_ordered_inode_tree *tree, + u64 start, unsigned blocksize) +{ + struct rb_node *n = tree->tree.rb_node; + struct ext3_ordered_extent *entry; + + while (n) { + entry = rb_entry(n, struct ext3_ordered_extent, rb_node); + + if (start < entry->start) + n = n->rb_left; + else if (start >= entry_end(entry, blocksize)) + n = n->rb_right; + else + return n; + } + return NULL; +} + +/* allocate and add a new ordered_extent into the per-inode tree. + * start is the logical offset in the file + * + * The tree is given a single reference on the ordered extent that was + * inserted, and it also takes a reference on the buffer head. 
+ */ +int ext3_add_ordered_extent(struct inode *inode, u64 start, + struct buffer_head *bh) +{ + struct ext3_ordered_inode_tree *tree; + struct rb_node *node; + struct ext3_ordered_extent *entry; + int ret = 0; + + lock_buffer(bh); + + /* ordered extent already there, or in old style data=ordered */ + if (bh->b_private) { + ret = 0; + goto out; + } + + tree = &EXT3_I(inode)->ordered_tree; + entry = kzalloc(sizeof(*entry), GFP_NOFS); + if (!entry) { + ret = -ENOMEM; + goto out; + } + + spin_lock(&tree->lock); + entry->start = start; + + get_bh(bh); + entry->bh = bh; + bh->b_private = entry; + set_buffer_dataguarded(bh); + + /* one ref for the tree */ + atomic_set(&entry->refs, 1); + INIT_LIST_HEAD(&entry->list); + node = tree_insert(&tree->tree, start, + inode->i_sb->s_blocksize, &entry->rb_node); + BUG_ON(node); + + spin_unlock(&tree->lock); +out: + unlock_buffer(bh); + return ret; +} + +/* + * used to drop a reference on an ordered extent. This will free + * the extent if the last reference is dropped + */ +int ext3_put_ordered_extent(struct ext3_ordered_extent *entry) +{ + if (atomic_dec_and_test(&entry->refs)) { + WARN_ON(entry->bh); + WARN_ON(entry->end_io_bh); + kfree(entry); + } + return 0; +} + +/* + * remove an ordered extent from the tree. This removes the + * reference held by the rbtree on 'entry' and the + * reference on the buffer head held by the entry. + */ +int ext3_remove_ordered_extent(struct inode *inode, + struct ext3_ordered_extent *entry) +{ + struct ext3_ordered_inode_tree *tree; + struct rb_node *node; + + tree = &EXT3_I(inode)->ordered_tree; + node = &entry->rb_node; + + /* + * the data=guarded end_io handler takes this guarded_lock + * before it puts a given buffer head and its ordered extent + * into the guarded_buffers list. We need to make sure + * we don't race with them, so we take the guarded_lock too. 
+ */ + spin_lock_irq(&EXT3_SB(inode->i_sb)->guarded_lock); + clear_buffer_dataguarded(entry->bh); + entry->bh->b_private = NULL; + brelse(entry->bh); + entry->bh = NULL; + spin_unlock_irq(&EXT3_SB(inode->i_sb)->guarded_lock); + + /* + * we must not clear entry->end_io_bh, that is set by + * the end_io handlers and will be cleared by the end_io + * workqueue + */ + + rb_erase(node, &tree->tree); + RB_CLEAR_NODE(node); + ext3_put_ordered_extent(entry); + return 0; +} + +/* + * After an extent is done, call this to conditionally update the on disk + * i_size. i_size is updated to cover any fully written part of the file. + * + * This returns < 0 on error, zero if no action needs to be taken and + * 1 if the inode must be logged. + */ +int ext3_ordered_update_i_size(struct inode *inode, + struct ext3_ordered_extent *entry) +{ + u64 new_i_size; + u64 i_size_test; + u64 disk_i_size; + struct rb_node *node; + struct ext3_ordered_extent *test; + unsigned blocksize = inode->i_sb->s_blocksize; + int ret = 0; + + disk_i_size = EXT3_I(inode)->i_disksize; + + /* + * if the disk i_size is already at the inode->i_size, or + * this ordered extent is inside the disk i_size, we're done + */ + if (disk_i_size >= inode->i_size) + goto out; + + /* + * if the caller didn't pass us an entry, check against + * the first entry in the rbtree + */ + if (!entry) { + node = rb_first(&EXT3_I(inode)->ordered_tree.tree); + if (!node) { + /* + * nothing in the tree at all, we're fully up to + * date + */ + i_size_test = i_size_read(inode); + goto out_update; + } + entry = rb_entry(node, struct ext3_ordered_extent, rb_node); + if (entry->start >= disk_i_size) + goto out; + + /* + * the area between disk_i_size and the start of this + * entry is fully written. Update the on disk i_size + */ + i_size_test = entry->start; + goto out_update; + } + + /* + * walk backward from this ordered extent to disk_i_size. 
+ * if we find an ordered extent then we can't update disk i_size + * yet + */ + node = &entry->rb_node; + while (1) { + node = rb_prev(node); + if (!node) + break; + test = rb_entry(node, struct ext3_ordered_extent, rb_node); + if (entry_end(test, blocksize) <= disk_i_size) + break; + if (test->start >= inode->i_size) + break; + if (test->start >= disk_i_size) + goto out; + } + + /* + * at this point, we know we can safely update i_size to at least + * the offset from this ordered extent. But, we need to + * walk forward and see if IOs from higher up in the file have + * finished. + */ + node = rb_next(&entry->rb_node); + if (node) { + /* + * do we have an area where IO might have finished + * between our ordered extent and the next one? + */ + test = rb_entry(node, struct ext3_ordered_extent, rb_node); + i_size_test = test->start; + } else { + i_size_test = i_size_read(inode); + } + +out_update: + new_i_size = min_t(u64, i_size_test, i_size_read(inode)); + + /* the caller needs to log this inode */ + ret = 1; + + EXT3_I(inode)->i_disksize = new_i_size; +out: + return ret; +} + +/* + * during a truncate or delete, we need to get rid of pending + * ordered extents so there isn't a war over who updates disk i_size first. + * This does that, without waiting for any of the IO to actually finish. + * + * When the IO does finish, it will find the ordered extent removed from the + * tree and all will work properly. + */ +void ext3_truncate_ordered_extents(struct inode *inode, u64 offset) +{ + struct ext3_ordered_inode_tree *tree = &EXT3_I(inode)->ordered_tree; + struct rb_node *node; + struct ext3_ordered_extent *test; + + spin_lock(&tree->lock); + node = rb_last(&tree->tree); + while (node) { + /* truncate is going to end up waiting for pages that + * straddle the new i_size. So, we can drop the + * ordered record for anything that starts at or beyond the new + * i_size. 
+ */ + test = rb_entry(node, struct ext3_ordered_extent, rb_node); + if (test->start < offset) + break; + node = rb_prev(&test->rb_node); + + /* + * once this is called, the end_io handler won't run, + * and we won't update disk_i_size to include this buffer. + * + * That's ok for truncates because the truncate code is + * writing a new i_size. + * + * This ignores any IO in flight, which is ok + * because the guarded_buffers list has a reference + * on the ordered extent + */ + ext3_remove_ordered_extent(inode, test); + } + spin_unlock(&tree->lock); + return; + +} + +void ext3_ordered_inode_init(struct ext3_inode_info *ei) +{ + ei->ordered_tree.tree.rb_node = NULL; + spin_lock_init(&ei->ordered_tree.lock); +} + diff --git a/fs/ext3/super.c b/fs/ext3/super.c index 599dbfe..6a94647 100644 --- a/fs/ext3/super.c +++ b/fs/ext3/super.c @@ -37,6 +37,7 @@ #include <linux/quotaops.h> #include <linux/seq_file.h> #include <linux/log2.h> +#include <linux/workqueue.h> #include <asm/uaccess.h> @@ -399,6 +400,9 @@ static void ext3_put_super (struct super_block * sb) struct ext3_super_block *es = sbi->s_es; int i, err; + flush_workqueue(sbi->guarded_wq); + destroy_workqueue(sbi->guarded_wq); + ext3_xattr_put_super(sb); err = journal_destroy(sbi->s_journal); sbi->s_journal = NULL; @@ -468,6 +472,8 @@ static struct inode *ext3_alloc_inode(struct super_block *sb) #endif ei->i_block_alloc_info = NULL; ei->vfs_inode.i_version = 1; + ext3_ordered_inode_init(ei); + return &ei->vfs_inode; } @@ -481,6 +487,8 @@ static void ext3_destroy_inode(struct inode *inode) false); dump_stack(); } + if (EXT3_I(inode)->ordered_tree.tree.rb_node) + printk(KERN_INFO "EXT3 ordered tree not empty\n"); kmem_cache_free(ext3_inode_cachep, EXT3_I(inode)); } @@ -528,6 +536,13 @@ static void ext3_clear_inode(struct inode *inode) EXT3_I(inode)->i_default_acl = EXT3_ACL_NOT_CACHED; } #endif + /* + * If pages got cleaned by truncate, truncate should have + * gotten rid of the ordered extents. 
Just in case, drop them + * here. + */ + ext3_truncate_ordered_extents(inode, 0); + ext3_discard_reservation(inode); EXT3_I(inode)->i_block_alloc_info = NULL; if (unlikely(rsv)) @@ -634,6 +649,8 @@ static int ext3_show_options(struct seq_file *seq, struct vfsmount *vfs) seq_puts(seq, ",data=journal"); else if (test_opt(sb, DATA_FLAGS) == EXT3_MOUNT_ORDERED_DATA) seq_puts(seq, ",data=ordered"); + else if (test_opt(sb, DATA_FLAGS) == EXT3_MOUNT_GUARDED_DATA) + seq_puts(seq, ",data=guarded"); else if (test_opt(sb, DATA_FLAGS) == EXT3_MOUNT_WRITEBACK_DATA) seq_puts(seq, ",data=writeback"); @@ -790,7 +807,7 @@ enum { Opt_reservation, Opt_noreservation, Opt_noload, Opt_nobh, Opt_bh, Opt_commit, Opt_journal_update, Opt_journal_inum, Opt_journal_dev, Opt_abort, Opt_data_journal, Opt_data_ordered, Opt_data_writeback, - Opt_data_err_abort, Opt_data_err_ignore, + Opt_data_guarded, Opt_data_err_abort, Opt_data_err_ignore, Opt_usrjquota, Opt_grpjquota, Opt_offusrjquota, Opt_offgrpjquota, Opt_jqfmt_vfsold, Opt_jqfmt_vfsv0, Opt_quota, Opt_noquota, Opt_ignore, Opt_barrier, Opt_err, Opt_resize, Opt_usrquota, @@ -832,6 +849,7 @@ static const match_table_t tokens = { {Opt_abort, "abort"}, {Opt_data_journal, "data=journal"}, {Opt_data_ordered, "data=ordered"}, + {Opt_data_guarded, "data=guarded"}, {Opt_data_writeback, "data=writeback"}, {Opt_data_err_abort, "data_err=abort"}, {Opt_data_err_ignore, "data_err=ignore"}, @@ -1034,6 +1052,9 @@ static int parse_options (char *options, struct super_block *sb, case Opt_data_ordered: data_opt = EXT3_MOUNT_ORDERED_DATA; goto datacheck; + case Opt_data_guarded: + data_opt = EXT3_MOUNT_GUARDED_DATA; + goto datacheck; case Opt_data_writeback: data_opt = EXT3_MOUNT_WRITEBACK_DATA; datacheck: @@ -1949,11 +1970,23 @@ static int ext3_fill_super (struct super_block *sb, void *data, int silent) clear_opt(sbi->s_mount_opt, NOBH); } } + + /* + * setup the guarded work list + */ + INIT_LIST_HEAD(&EXT3_SB(sb)->guarded_buffers); + 
INIT_WORK(&EXT3_SB(sb)->guarded_work, ext3_run_guarded_work); + spin_lock_init(&EXT3_SB(sb)->guarded_lock); + EXT3_SB(sb)->guarded_wq = create_workqueue("ext3-guard"); + if (!EXT3_SB(sb)->guarded_wq) { + printk(KERN_ERR "EXT3-fs: failed to create workqueue\n"); + goto failed_mount_guard; + } + /* * The journal_load will have done any necessary log recovery, * so we can safely mount the rest of the filesystem now. */ - root = ext3_iget(sb, EXT3_ROOT_INO); if (IS_ERR(root)) { printk(KERN_ERR "EXT3-fs: get root inode failed\n"); @@ -1965,6 +1998,7 @@ static int ext3_fill_super (struct super_block *sb, void *data, int silent) printk(KERN_ERR "EXT3-fs: corrupt root inode, run e2fsck\n"); goto failed_mount4; } + sb->s_root = d_alloc_root(root); if (!sb->s_root) { printk(KERN_ERR "EXT3-fs: get root dentry failed\n"); @@ -1974,6 +2008,7 @@ static int ext3_fill_super (struct super_block *sb, void *data, int silent) } ext3_setup_super (sb, es, sb->s_flags & MS_RDONLY); + /* * akpm: core read_super() calls in here with the superblock locked. * That deadlocks, because orphan cleanup needs to lock the superblock @@ -1989,9 +2024,10 @@ static int ext3_fill_super (struct super_block *sb, void *data, int silent) printk (KERN_INFO "EXT3-fs: recovery complete.\n"); ext3_mark_recovery_complete(sb, es); printk (KERN_INFO "EXT3-fs: mounted filesystem with %s data mode.\n", - test_opt(sb,DATA_FLAGS) == EXT3_MOUNT_JOURNAL_DATA ? "journal": - test_opt(sb,DATA_FLAGS) == EXT3_MOUNT_ORDERED_DATA ? "ordered": - "writeback"); + test_opt(sb, DATA_FLAGS) == EXT3_MOUNT_JOURNAL_DATA ? "journal" : + test_opt(sb, DATA_FLAGS) == EXT3_MOUNT_GUARDED_DATA ? "guarded" : + test_opt(sb, DATA_FLAGS) == EXT3_MOUNT_ORDERED_DATA ? 
"ordered" : + "writeback"); lock_kernel(); return 0; @@ -2003,6 +2039,8 @@ cantfind_ext3: goto failed_mount; failed_mount4: + destroy_workqueue(EXT3_SB(sb)->guarded_wq); +failed_mount_guard: journal_destroy(sbi->s_journal); failed_mount3: percpu_counter_destroy(&sbi->s_freeblocks_counter); diff --git a/fs/jbd/transaction.c b/fs/jbd/transaction.c index ed886e6..1354a55 100644 --- a/fs/jbd/transaction.c +++ b/fs/jbd/transaction.c @@ -2018,6 +2018,7 @@ zap_buffer_unlocked: clear_buffer_mapped(bh); clear_buffer_req(bh); clear_buffer_new(bh); + clear_buffer_datanew(bh); bh->b_bdev = NULL; return may_free; } diff --git a/include/linux/ext3_fs.h b/include/linux/ext3_fs.h index 634a5e5..09faa88 100644 --- a/include/linux/ext3_fs.h +++ b/include/linux/ext3_fs.h @@ -18,6 +18,7 @@ #include <linux/types.h> #include <linux/magic.h> +#include <linux/workqueue.h> /* * The second extended filesystem constants/structures @@ -398,7 +399,6 @@ struct ext3_inode { #define EXT3_MOUNT_MINIX_DF 0x00080 /* Mimics the Minix statfs */ #define EXT3_MOUNT_NOLOAD 0x00100 /* Don't use existing journal*/ #define EXT3_MOUNT_ABORT 0x00200 /* Fatal error detected */ -#define EXT3_MOUNT_DATA_FLAGS 0x00C00 /* Mode for data writes: */ #define EXT3_MOUNT_JOURNAL_DATA 0x00400 /* Write data to journal */ #define EXT3_MOUNT_ORDERED_DATA 0x00800 /* Flush data before commit */ #define EXT3_MOUNT_WRITEBACK_DATA 0x00C00 /* No data ordering */ @@ -414,6 +414,12 @@ struct ext3_inode { #define EXT3_MOUNT_GRPQUOTA 0x200000 /* "old" group quota */ #define EXT3_MOUNT_DATA_ERR_ABORT 0x400000 /* Abort on file data write * error in ordered mode */ +#define EXT3_MOUNT_GUARDED_DATA 0x800000 /* guard new writes with + i_size */ +#define EXT3_MOUNT_DATA_FLAGS (EXT3_MOUNT_JOURNAL_DATA | \ + EXT3_MOUNT_ORDERED_DATA | \ + EXT3_MOUNT_WRITEBACK_DATA | \ + EXT3_MOUNT_GUARDED_DATA) /* Compatibility, for having both ext2_fs.h and ext3_fs.h included at once */ #ifndef _LINUX_EXT2_FS_H @@ -892,6 +898,7 @@ extern void 
ext3_get_inode_flags(struct ext3_inode_info *); extern void ext3_set_aops(struct inode *inode); extern int ext3_fiemap(struct inode *inode, struct fiemap_extent_info *fieinfo, u64 start, u64 len); +void ext3_run_guarded_work(struct work_struct *work); /* ioctl.c */ extern long ext3_ioctl(struct file *, unsigned int, unsigned long); @@ -945,7 +952,31 @@ extern const struct inode_operations ext3_special_inode_operations; extern const struct inode_operations ext3_symlink_inode_operations; extern const struct inode_operations ext3_fast_symlink_inode_operations; +/* ordered-data.c */ +int ext3_add_ordered_extent(struct inode *inode, u64 file_offset, + struct buffer_head *bh); +int ext3_put_ordered_extent(struct ext3_ordered_extent *entry); +int ext3_remove_ordered_extent(struct inode *inode, + struct ext3_ordered_extent *entry); +int ext3_ordered_update_i_size(struct inode *inode, + struct ext3_ordered_extent *entry); +void ext3_ordered_inode_init(struct ext3_inode_info *ei); +void ext3_truncate_ordered_extents(struct inode *inode, u64 offset); + +static inline void ext3_ordered_lock(struct inode *inode) +{ + spin_lock(&EXT3_I(inode)->ordered_tree.lock); +} +static inline void ext3_ordered_unlock(struct inode *inode) +{ + spin_unlock(&EXT3_I(inode)->ordered_tree.lock); +} + +static inline void ext3_get_ordered_extent(struct ext3_ordered_extent *entry) +{ + atomic_inc(&entry->refs); +} #endif /* __KERNEL__ */ #endif /* _LINUX_EXT3_FS_H */ diff --git a/include/linux/ext3_fs_i.h b/include/linux/ext3_fs_i.h index 7894dd0..399e3d8 100644 --- a/include/linux/ext3_fs_i.h +++ b/include/linux/ext3_fs_i.h @@ -20,6 +20,7 @@ #include <linux/rbtree.h> #include <linux/seqlock.h> #include <linux/mutex.h> +#include <linux/rbtree.h> /* data type for block offset of block group */ typedef int ext3_grpblk_t; @@ -65,6 +66,47 @@ struct ext3_block_alloc_info { #define rsv_end rsv_window._rsv_end /* + * used to prevent garbage in files after a crash by + * making sure i_size isn't updated 
until after the IO + * is done. + * + * See fs/ext3/ordered-data.c for the code that uses these. + */ +struct buffer_head; +struct ext3_ordered_inode_tree { + /* protects the tree and disk i_size */ + spinlock_t lock; + + /* rb tree of IO that needs to happen before i_size can be updated */ + struct rb_root tree; +}; + +struct ext3_ordered_extent { + /* logical offset of the block in the file */ + u64 start; + + /* buffer head being written */ + struct buffer_head *bh; + + /* + * set at end_io time so we properly + * do IO accounting even when this ordered + * extent struct has been removed from the + * rbtree + */ + struct buffer_head *end_io_bh; + + /* number of refs on this ordered extent */ + atomic_t refs; + + /* for the rb tree */ + struct rb_node rb_node; + + /* list of things being processed by the workqueue */ + struct list_head list; +}; + +/* * third extended file system inode data in memory */ struct ext3_inode_info { @@ -141,6 +183,8 @@ struct ext3_inode_info { * by other means, so we have truncate_mutex. 
*/ struct mutex truncate_mutex; + + struct ext3_ordered_inode_tree ordered_tree; struct inode vfs_inode; }; diff --git a/include/linux/ext3_fs_sb.h b/include/linux/ext3_fs_sb.h index f07f34d..5dbdbeb 100644 --- a/include/linux/ext3_fs_sb.h +++ b/include/linux/ext3_fs_sb.h @@ -21,6 +21,7 @@ #include <linux/wait.h> #include <linux/blockgroup_lock.h> #include <linux/percpu_counter.h> +#include <linux/workqueue.h> #endif #include <linux/rbtree.h> @@ -82,6 +83,11 @@ struct ext3_sb_info { char *s_qf_names[MAXQUOTAS]; /* Names of quota files with journalled quota */ int s_jquota_fmt; /* Format of quota to use */ #endif + + struct workqueue_struct *guarded_wq; + struct work_struct guarded_work; + struct list_head guarded_buffers; + spinlock_t guarded_lock; }; static inline spinlock_t * diff --git a/include/linux/ext3_jbd.h b/include/linux/ext3_jbd.h index cf82d51..45cb4aa 100644 --- a/include/linux/ext3_jbd.h +++ b/include/linux/ext3_jbd.h @@ -212,6 +212,17 @@ static inline int ext3_should_order_data(struct inode *inode) return 0; } +static inline int ext3_should_guard_data(struct inode *inode) +{ + if (!S_ISREG(inode->i_mode)) + return 0; + if (EXT3_I(inode)->i_flags & EXT3_JOURNAL_DATA_FL) + return 0; + if (test_opt(inode->i_sb, GUARDED_DATA) == EXT3_MOUNT_GUARDED_DATA) + return 1; + return 0; +} + static inline int ext3_should_writeback_data(struct inode *inode) { if (!S_ISREG(inode->i_mode)) diff --git a/include/linux/jbd.h b/include/linux/jbd.h index 53ae439..6b38b50 100644 --- a/include/linux/jbd.h +++ b/include/linux/jbd.h @@ -291,6 +291,13 @@ enum jbd_state_bits { BH_State, /* Pins most journal_head state */ BH_JournalHead, /* Pins bh->b_private and jh->b_bh */ BH_Unshadow, /* Dummy bit, for BJ_Shadow wakeup filtering */ + BH_DataGuarded, /* ext3 data=guarded mode buffer + * these have something other than a + * journal_head at b_private */ + BH_DataNew, /* BH_new gets cleared too early for + * data=guarded to use it. So, + * this gets set instead. 
+ */ }; BUFFER_FNS(JBD, jbd) @@ -302,6 +309,9 @@ TAS_BUFFER_FNS(Revoked, revoked) BUFFER_FNS(RevokeValid, revokevalid) TAS_BUFFER_FNS(RevokeValid, revokevalid) BUFFER_FNS(Freed, freed) +BUFFER_FNS(DataGuarded, dataguarded) +BUFFER_FNS(DataNew, datanew) +TAS_BUFFER_FNS(DataNew, datanew) static inline struct buffer_head *jh2bh(struct journal_head *jh) { ^ permalink raw reply related [flat|nested] 34+ messages in thread
* Re: [PATCH] Add ext3 data=guarded mode 2009-04-16 19:42 ` [PATCH] " Chris Mason @ 2009-04-17 11:04 ` Mike Galbraith 2009-04-17 18:09 ` Amit Shah 2009-04-20 13:44 ` Jan Kara 2 siblings, 0 replies; 34+ messages in thread From: Mike Galbraith @ 2009-04-17 11:04 UTC (permalink / raw) To: Chris Mason Cc: Jan Kara, Linus Torvalds, Theodore Ts'o, Linux Kernel Developers List, Ext4 Developers List On Thu, 2009-04-16 at 15:42 -0400, Chris Mason wrote: > Hello everyone, > > Here's an updated (v4) patch for ext3 data=guarded mode. The first two > patches in the series are unchanged, and it looks like Linus pulled them > in this morning. > > This version fixes the bug Mike Galbraith hit. The problem was with > writes to the last block in the file that made the file bigger but > didn't actually add a new block. Since the block wasn't new, it > wouldn't get sent through the guarded code and the on disk i_size wasn't > always updated. FWIW, I booted my backup fs drive mounted data=guarded, and have been hammering it steadily since 6 A.M. with no hint of trouble. -Mike ^ permalink raw reply [flat|nested] 34+ messages in thread
* Re: [PATCH] Add ext3 data=guarded mode
  2009-04-16 19:42 ` [PATCH] " Chris Mason
  2009-04-17 11:04   ` Mike Galbraith
@ 2009-04-17 18:09   ` Amit Shah
  2009-04-17 20:13     ` Theodore Tso
  2009-04-20 13:44   ` Jan Kara
  2 siblings, 1 reply; 34+ messages in thread
From: Amit Shah @ 2009-04-17 18:09 UTC (permalink / raw)
  To: Chris Mason
  Cc: Jan Kara, Linus Torvalds, Theodore Ts'o,
	Linux Kernel Developers List, Ext4 Developers List

On (Thu) Apr 16 2009 [15:42:01], Chris Mason wrote:
> Hello everyone,
>
> Here's an updated (v4) patch for ext3 data=guarded mode.  The first two
> patches in the series are unchanged, and it looks like Linus pulled them
> in this morning.

I had written a small program that calculates the time needed to
allocate a file and zero it using various methods (posix_fallocate,
mmap, 4k-chunk writes, 8k-chunk writes). I did this on a 20G partition
with each method creating a file 4G in size. The system has 3G RAM.

The program that does this is at
http://fedorapeople.org/gitweb?p=amitshah/public_git/alloc-perf.git;a=blob;f=test-file-zero-alloc-speed.c;hb=HEAD
with the script to run it for the multiple filesystems at
http://fedorapeople.org/gitweb?p=amitshah/public_git/alloc-perf.git;a=blob;f=run_test.sh;hb=HEAD

I have a few results from those runs, time in seconds:

# 4GiB file, kernel b0cbc861a3c05e634520b049b5cc27ad6febb51f

filesystem      posix-fallocate  mmap  chunk-4096  chunk-8192
ext2            74               96    761         81
ext3-writeback  87               97    202         93
ext3-ordered    86               94    134         104
ext4            0                84    120         91
xfs             0                84    274         81
reiserfs        85               84    187         98
btrfs           0                86    121         85

# 4GiB file, kernel 9f76208c33984ab777eace5d07a4e36e88703e02 + ext3-guarded

filesystem      posix-fallocate  mmap  chunk-4096  chunk-8192
ext3-guarded    85               97    459         90
ext3-writeback  86               95    140         94
ext3-ordered    86               96    277         95
btrfs           0                81    499         93
xfs             0                79    184         84

These were with a desktop running with a few terminal sessions and one
konqueror session (to gauge the times a user will actually see while
working on her desktop).
Running the test in single user mode, I get the following results:

# 4GiB file, kernel 9f76208c33984ab777eace5d07a4e36e88703e02 + ext3-guarded

filesystem      posix-fallocate  mmap  chunk-4096  chunk-8192
ext3-guarded    84               86    163         91
ext3-writeback  84               88    217         91
ext3-ordered    84               86    226         91
btrfs           0                76    86          79
ext4            0                73    195         76

		Amit

^ permalink raw reply	[flat|nested] 34+ messages in thread
* Re: [PATCH] Add ext3 data=guarded mode
  2009-04-17 18:09     ` Amit Shah
@ 2009-04-17 20:13       ` Theodore Tso
  2009-04-18  6:03         ` Amit Shah
  [not found]             ` <20090418060312.GA10943@amit-x200.pnq.redhat.com>
  0 siblings, 2 replies; 34+ messages in thread
From: Theodore Tso @ 2009-04-17 20:13 UTC (permalink / raw)
  To: Amit Shah
  Cc: Chris Mason, Jan Kara, Linus Torvalds,
	Linux Kernel Developers List, Ext4 Developers List

On Fri, Apr 17, 2009 at 11:39:06PM +0530, Amit Shah wrote:
> # 4GiB file, kernel 9f76208c33984ab777eace5d07a4e36e88703e02 + ext3-guarded
>
> filesystem      posix-fallocate  mmap  chunk-4096  chunk-8192
> ext3-guarded    85               97    459         90
> ext3-writeback  86               95    140         94
> ext3-ordered    86               96    277         95
>
> Running the test in single user mode, I get the following results:
>
> # 4GiB file, kernel 9f76208c33984ab777eace5d07a4e36e88703e02 + ext3-guarded
>
> filesystem      posix-fallocate  mmap  chunk-4096  chunk-8192
> ext3-guarded    84               86    163         91
> ext3-writeback  84               88    217         91
> ext3-ordered    84               86    226         91

The difference between guarded and writeback in chunk-4096 looking at
your desktop timings and your single user times is.... surprising.

In particular, the fact that the guarded time is 3 times longer than
ext3-writeback when the desktop is running, and 20% faster in single
user mode.  Are these results reproducible?  And do you have any
thoughts as to what might be causing them?

						- Ted

^ permalink raw reply	[flat|nested] 34+ messages in thread
* Re: [PATCH] Add ext3 data=guarded mode
  2009-04-17 20:13       ` Theodore Tso
@ 2009-04-18  6:03         ` Amit Shah
  [not found]             ` <20090418060312.GA10943@amit-x200.pnq.redhat.com>
  1 sibling, 0 replies; 34+ messages in thread
From: Amit Shah @ 2009-04-18 6:03 UTC (permalink / raw)
  To: Theodore Tso, Chris Mason, Jan Kara, Linus Torvalds,
	Linux Kernel Developers List <linux-kernel

On (Fri) Apr 17 2009 [16:13:42], Theodore Tso wrote:
> On Fri, Apr 17, 2009 at 11:39:06PM +0530, Amit Shah wrote:
> > # 4GiB file, kernel 9f76208c33984ab777eace5d07a4e36e88703e02 + ext3-guarded
> >
> > filesystem      posix-fallocate  mmap  chunk-4096  chunk-8192
> > ext3-guarded    85               97    459         90
> > ext3-writeback  86               95    140         94
> > ext3-ordered    86               96    277         95
> >
> > Running the test in single user mode, I get the following results:
> >
> > # 4GiB file, kernel 9f76208c33984ab777eace5d07a4e36e88703e02 + ext3-guarded
> >
> > filesystem      posix-fallocate  mmap  chunk-4096  chunk-8192
> > ext3-guarded    84               86    163         91
> > ext3-writeback  84               88    217         91
> > ext3-ordered    84               86    226         91
>
> The difference between guarded and writeback in chunk-4096 looking at
> your desktop timings and your single user times is.... surprising.

Surely. I re-ran the guarded test immediately after that one and got a
time of 353s with the desktop. Another run much later gave me a 189s
time, so it seems to vary quite a lot. Initially, when I was getting
high numbers, I thought it could be related to the IO scheduler, but it
looks like it's just some background tasks trying to get cpu or io time.
Of course, the whole system becomes sluggish once these tests start.

> In particular, the fact that the guarded time is 3 times longer than
> ext3-writeback when the desktop is running, and 20% faster in single
> user mode.  Are these results reproducible?  And do you have any
> thoughts as to what might be causing them?
I initially thought there was something but I also got lower numbers (189s), so I can't really say what it is even though I call sync before starting the tests. Amit ^ permalink raw reply [flat|nested] 34+ messages in thread
[parent not found: <20090418060312.GA10943@amit-x200.pnq.redhat.com>]
* Re: [PATCH] Add ext3 data=guarded mode [not found] ` <20090418060312.GA10943@amit-x200.pnq.redhat.com> @ 2009-04-18 7:28 ` Mike Galbraith 2009-04-19 6:24 ` Amit Shah 0 siblings, 1 reply; 34+ messages in thread From: Mike Galbraith @ 2009-04-18 7:28 UTC (permalink / raw) To: Amit Shah Cc: Theodore Tso, Chris Mason, Jan Kara, Linus Torvalds, Linux Kernel Developers List, Ext4 Developers List On Sat, 2009-04-18 at 11:33 +0530, Amit Shah wrote: > On (Fri) Apr 17 2009 [16:13:42], Theodore Tso wrote: > > On Fri, Apr 17, 2009 at 11:39:06PM +0530, Amit Shah wrote: > > > # 4GiB file, kernel 9f76208c33984ab777eace5d07a4e36e88703e02 + ext3-guarded > > > > > > filesystem posix-fallocate mmap chunk-4096 chunk-8192 > > > ext3-guarded 85 97 459 90 > > > ext3-writeback 86 95 140 94 > > > ext3-ordered 86 96 277 95 > > > > > > Running the test in single user mode, I get the following results: > > > > > > # 4GiB file, kernel 9f76208c33984ab777eace5d07a4e36e88703e02 + ext3-guarded > > > > > > filesystem posix-fallocate mmap chunk-4096 chunk-8192 > > > ext3-guarded 84 86 163 91 > > > ext3-writeback 84 88 217 91 > > > ext3-ordered 84 86 226 91 > > > > > > The difference between guarded and writeback in chunk-4096 looking at > > your desktop timings and your single user times is.... surprising. > > Surely. I re-ran the guarded test immediately after that one and got a > time of 353s with the desktop. Another run much latergave me a 189s time, > so it seems to vary quite a lot. Initially when I was getting high > numbers, I thought it could be related to the IO scheduler but looks like > it's just some background tasks trying to get cpu or io time. Of course, > the whole system becomes sluggish once these tests start. > > > In particular, the fact that the guarded time is 3 times longer than > > ext3-writeback when the desktop is running, and 20% faster in single > > user mode. Are these results reproducible? And do you have any > > thoughts as to what might be causing them? 
>
> I initially thought there was something but I also got lower numbers
> (189s), so I can't really say what it is even though I call sync before
> starting the tests.

Probably because you're swapping heavily, and that is perturbing your
test?  With my setup, 3GB ram + 2GB swap, I can't even run the 4GB test
without an mmap() failure/abort, but with 3GB size, box swaps insanely.

(If I drop file size to 2GB, I see zip difference for all three mount
modes.  4k chunk time is ~27s for all three.  Actually, all numbers
emitted are around 27s.)

	-Mike

^ permalink raw reply	[flat|nested] 34+ messages in thread
* Re: [PATCH] Add ext3 data=guarded mode 2009-04-18 7:28 ` Mike Galbraith @ 2009-04-19 6:24 ` Amit Shah 2009-04-20 9:07 ` Mike Galbraith 0 siblings, 1 reply; 34+ messages in thread From: Amit Shah @ 2009-04-19 6:24 UTC (permalink / raw) To: Mike Galbraith Cc: Theodore Tso, Chris Mason, Jan Kara, Linus Torvalds, Linux Kernel Developers List, Ext4 Developers List On (Sat) Apr 18 2009 [09:28:21], Mike Galbraith wrote: > On Sat, 2009-04-18 at 11:33 +0530, Amit Shah wrote: > > On (Fri) Apr 17 2009 [16:13:42], Theodore Tso wrote: > > > On Fri, Apr 17, 2009 at 11:39:06PM +0530, Amit Shah wrote: > > > > # 4GiB file, kernel 9f76208c33984ab777eace5d07a4e36e88703e02 + ext3-guarded > > > > > > > > filesystem posix-fallocate mmap chunk-4096 chunk-8192 > > > > ext3-guarded 85 97 459 90 > > > > ext3-writeback 86 95 140 94 > > > > ext3-ordered 86 96 277 95 > > > > > > > > Running the test in single user mode, I get the following results: > > > > > > > > # 4GiB file, kernel 9f76208c33984ab777eace5d07a4e36e88703e02 + ext3-guarded > > > > > > > > filesystem posix-fallocate mmap chunk-4096 chunk-8192 > > > > ext3-guarded 84 86 163 91 > > > > ext3-writeback 84 88 217 91 > > > > ext3-ordered 84 86 226 91 > > > > > > > > > The difference between guarded and writeback in chunk-4096 looking at > > > your desktop timings and your single user times is.... surprising. > > > > Surely. I re-ran the guarded test immediately after that one and got a > > time of 353s with the desktop. Another run much latergave me a 189s time, > > so it seems to vary quite a lot. Initially when I was getting high > > numbers, I thought it could be related to the IO scheduler but looks like > > it's just some background tasks trying to get cpu or io time. Of course, > > the whole system becomes sluggish once these tests start. > > > > > In particular, the fact that the guarded time is 3 times longer than > > > ext3-writeback when the desktop is running, and 20% faster in single > > > user mode. 
Are these results reproducible? And do you have any > > > thoughts as to what might be causing them? > > > > I initially thought there was something but I also got lower numbers > > (189s), so I can't really say what it is even though I call sync before > > starting the tests. > > Probably because you're swapping heavily, and that is perturbing your The variance only affects the 4k test; the other times more or less remain the same. Amit ^ permalink raw reply [flat|nested] 34+ messages in thread
* Re: [PATCH] Add ext3 data=guarded mode
  2009-04-19  6:24             ` Amit Shah
@ 2009-04-20  9:07               ` Mike Galbraith
  2009-04-20  9:26                 ` Jan Kara
  0 siblings, 1 reply; 34+ messages in thread
From: Mike Galbraith @ 2009-04-20 9:07 UTC (permalink / raw)
  To: Amit Shah
  Cc: Theodore Tso, Chris Mason, Jan Kara, Linus Torvalds,
	Linux Kernel Developers List, Ext4 Developers List

On Sun, 2009-04-19 at 11:54 +0530, Amit Shah wrote:
> On (Sat) Apr 18 2009 [09:28:21], Mike Galbraith wrote:
>
> > Probably because you're swapping heavily, and that is perturbing your
>
> The variance only affects the 4k test; the other times more or less
> remain the same.

My box disagrees.

(bumps ulimits to test 4GB... OS+swap live on sdb btw)

./test-file-zero-alloc-speed 4 /dev/sdf3 /media/root ext3 rw,_netdev,noatime,data=foo,acl,user_xattr

foo=guarded
4k  225  141   80  142  361
8k   74   96  362   78   84
mm   55   57   57   57   57

foo=writeback
4k  179  264  187  125   93
8k   94  161   73  334   84
mm   57   58   57   56   57

foo=ordered
4k   81   74   76   80   75
8k   77   76  224   79   79
mm   59   56   60   58   59

foo=journal
4k   95  297   69   83  420
8k   73  139  158   80   78
mm   57   58   56   59   56

./test-file-zero-alloc-speed 2 /dev/sdf3 /media/root ext3 rw,_netdev,noatime,data=foo,acl,user_xattr

foo=guarded
4k   28   27   27   28   28
8k   28   27   27   28   27

foo=writeback
4k   27   27   27   27   27
8k   28   28   27   27   28

All journal modes seem subject to bad throughput under heavy pressure,
though data=ordered seems much less likely to suffer for some reason.
Major difference _seems_ to be that write()+largefile induces very much
swap activity.

	-Mike

^ permalink raw reply	[flat|nested] 34+ messages in thread
* Re: [PATCH] Add ext3 data=guarded mode 2009-04-20 9:07 ` Mike Galbraith @ 2009-04-20 9:26 ` Jan Kara 2009-04-20 12:15 ` Mike Galbraith 0 siblings, 1 reply; 34+ messages in thread From: Jan Kara @ 2009-04-20 9:26 UTC (permalink / raw) To: Mike Galbraith Cc: Amit Shah, Theodore Tso, Chris Mason, Jan Kara, Linus Torvalds, Linux Kernel Developers List, Ext4 Developers List [-- Attachment #1: Type: text/plain, Size: 2385 bytes --] On Mon 20-04-09 11:07:17, Mike Galbraith wrote: > On Sun, 2009-04-19 at 11:54 +0530, Amit Shah wrote: > > On (Sat) Apr 18 2009 [09:28:21], Mike Galbraith wrote: > > > > Probably because you're swapping heavily, and that is perturbing your > > > > The variance only affects the 4k test; the other times more or less > > remain the same. > > My box disagrees. > > (bumps ulimits to test 4BG... OS+swap live on sdb btw) > > ./test-file-zero-alloc-speed 4 /dev/sdf3 /media/root ext3 rw,_netdev,noatime,data=foo,acl,user_xattr > > foo=guarded > 4k 225 141 80 142 361 > 8k 74 96 362 78 84 > mm 55 57 57 57 57 > > foo=writeback > 4k 179 264 187 125 93 > 8k 94 161 73 334 84 > mm 57 58 57 56 57 > > foo=ordered > 4k 81 74 76 80 75 > 8k 77 76 224 79 79 > mm 59 56 60 58 59 > > foo=journal > 4k 95 297 69 83 420 > 8k 73 139 158 80 78 > mm 57 58 56 59 56 > > ./test-file-zero-alloc-speed 2 /dev/sdf3 /media/root ext3 rw,_netdev,noatime,data=foo,acl,user_xattr > > foo=guarded > 4k 28 27 27 28 28 > 8k 28 27 27 28 27 > > foo=writeback > 4k 27 27 27 27 27 > 8k 28 28 27 27 28 > > All journal modes seem subject to bad throughput under heavy pressure, > though data=ordered seems much less likely to suffer for some reason. > Major difference _seems_ to be that write()+largefile induces very much > swap activity. My rough guess is that this depends on the VM writeout behavior. In ordered mode, we forcibly writeout all the dirty data on a transaction commit which happens every 5 seconds so they don't accumulate that much. 
In other journaling modes we don't do that and decisions about writeout (probably how much pdflush manages to write in background vs. how much VM throttles the process to do the writeback itself) cause variances in the run time. But this is just a guess. You could gather blktraces of slow and fast runs and then look if the amount of IO done by different processes significantly differs. If Chris has merged by improvements to Seekwatcher, then you could nicely visualize this (hmm, that doesn't seem to be the case so I'm attaching the diff and a helper script - see comments in the beginning of the script and command helps for usage). Honza -- Jan Kara <jack@suse.cz> SUSE Labs, CR [-- Attachment #2: seekwatcher-tagging.diff --] [-- Type: text/x-patch, Size: 15123 bytes --] --- seekwatcher.orig 2009-01-15 00:04:20.000000000 +0100 +++ seekwatcher 2009-01-26 12:04:21.000000000 +0100 @@ -56,6 +56,7 @@ from optparse import OptionParser blktrace_only = False +tags = { "": 0 } try: from matplotlib import rcParams @@ -136,9 +137,9 @@ return 2.0 sys.stderr.write("unknown command %s\n" % com) -def loaddata(fh,delimiter=None, converters=None): +def loaddata(fh,delimiter=None, converters=None, filter=True): - def iter(fh, delimiter, converters): + def iter(fh, delimiter, converters, filter): global devices_sector_max if converters is None: converters = {} @@ -149,10 +150,20 @@ last_cmd = None last_size = None last_dev = None + last_tag = None for i,line in enumerate(fh): - if not line.startswith('C'): + if filter and not line.startswith('C'): continue - row = [converters.get(i,float)(val) for i,val in enumerate(line.split(delimiter))] + if not filter: + line = "C " + line + this_tag = 0 + row = [] + for i,val in enumerate(line.split(delimiter)): + if i < 9: + row.append(converters.get(i,float)(val)) + else: + this_tag = tags.setdefault(val, len(tags)) + row.append(this_tag) this_time = row[7] this_dev = row[8] this_sector = row[4] @@ -165,7 +176,7 @@ if (last_row and this_rw == 
last_rw and this_dev == last_dev and this_time - last_time < .5 and last_size < 512 and - this_sector == last_end): + this_sector == last_end and this_tag == last_tag): last_end += this_size last_size += this_size last_row[5] += row[5] @@ -182,11 +193,12 @@ last_end = this_sector + this_size last_size = this_size last_dev = this_dev + last_tag = this_tag if last_row: for x in last_row: yield x - X = numpy.fromiter(iter(fh, delimiter, converters), dtype=float) + X = numpy.fromiter(iter(fh, delimiter, converters, filter), dtype=float) return X def sort_by_time(data): @@ -367,7 +379,7 @@ os.remove(os.path.join(root, name)) os.rmdir(png_dir) -def plot_data(ax, rw, data, style, label, alpha=1): +def plot_data(ax, data, style, label, alpha=1): def reduce_plot(): reduce = {} skipped = 0 @@ -380,37 +392,48 @@ y += 1 h = reduce.setdefault(x, {}) h[y] = 1 + yield rbs[i] yield x * x_per_cell yield y * y_per_cell + yield tg[i] xcells = 325.0 * options.io_graph_cell_multi x_per_cell = (xmax - xmin) / xcells ycells = 80.0 * options.io_graph_cell_multi y_per_cell = (yzoommax - yzoommin) / ycells - if rw is None: - if options.reads_only: - rw = 0 - if options.writes_only: - rw = 1 - if rw != None: - if options.reads_only and rw != 0: - return - if options.writes_only and rw != 1: - return - rbs = data[:,1] - data = data[numpy.where(rbs == rw)] + rbs = data[:,1] + if options.reads_only: + data = data[numpy.where(rbs == 0)] + if options.writes_only: + data = data[numpy.where(rbs == 1)] + + if len(data) == 0: + return [] + times = data[:,7] sectors = data[:,4] - if len(times) > 0: - t = numpy.fromiter(reduce_plot(), dtype=float) - t.shape = (len(t)/2, 2) - xdata = t[:,0] - ydata = t[:,1] - lines = ax.plot(t[:,0], t[:,1], options.io_graph_dots, mew=0, - ms=options.io_graph_marker_size, - label=label, alpha=alpha) - return lines - return [] + tg = data[:,9] + t = numpy.fromiter(reduce_plot(), dtype=float) + t.shape = (len(t)/4, 4) + lines = [] + for tag in tags: + at = 
t[numpy.where(t[:,3] == tags[tag])] + if len(at) == 0: + continue + if not options.writes_only: + atr = at[numpy.where(at[:,0] == 0)] + lines.extend(ax.plot(atr[:,1], atr[:,2], style, mew=0, + ms=options.io_graph_marker_size, + alpha=alpha, + label=tag + " Reads " + label)) + if not options.reads_only: + atr = at[numpy.where(at[:,0] == 1)] + lines.extend(ax.plot(atr[:,1], atr[:,2], style, mew=0, + ms=options.io_graph_marker_size, + alpha=alpha, + label=tag + " Writes " + label)) + return lines + def add_roll(roll, max, num): if len(roll) == max: @@ -624,11 +647,11 @@ return data def shapeit(X): - lines = len(X) / 9 - X.shape = (lines, 9) + lines = len(X) / 10 + X.shape = (lines, 10) def unshapeit(X): - lines = len(X) * 9 + lines = len(X) * 10 X.shape = (lines, 1) def getlabel(i): @@ -692,12 +715,49 @@ def translate_sector(dev, sector): return device_translate[dev] + sector; +def process_input(input, type): + global devices_sector_max + global device_translate + global must_sort + + devices_sector_max = {} + if type == 0: + run = run_blkparse(input, converters) + elif type == 1: + if input == "-": + p = sys.stdin + else: + p = open(input, 'r') + run = loaddata(p, converters=converters, filter=False) + + device_translate = {} + total = 0 + if len(devices_sector_max) > 1: + must_sort = True + for x in devices_sector_max: + device_translate[x] = total + devices_sector_max[x] + total += devices_sector_max[x] + shapeit(run) + if len(devices_sector_max) > 1: + for x in run: + sector = x[4] + dev = x[8] + x[4] = device_translate[dev] + sector + + sorted = sort_by_time(run) + run = sorted + + unshapeit(run) + return run + usage = "usage: %prog [options]" parser = OptionParser(usage=usage) parser.add_option("-d", "--device", help="Device for blktrace", default=[], action="append") parser.add_option("-t", "--trace", help="blktrace file", default=[], action="append") +parser.add_option("-f", "--file", help="parsed blktrace file", default=[], + action="append") 
parser.add_option("-p", "--prog", help="exec program", default="") parser.add_option("", "--full-trace", help="Don't filter blktrace events", default=False, action="store_true") @@ -710,7 +770,7 @@ parser.add_option("-o", "--output", help="output file", default="trace.png") parser.add_option("-l", "--label", help="label", default=[], action="append") - parser.add_option("", "--dpi", help="dpi", default=120) + parser.add_option("", "--dpi", help="dpi", default=120, type="float") parser.add_option("", "--io-graph-dots", help="Disk IO dot style", default='s') parser.add_option("", "--io-graph-marker-size", help="Disk IO dot size", @@ -719,6 +779,8 @@ default=2, type="float") parser.add_option("-I", "--no-io-graph", help="Don't create an IO graph", default=False, action="store_true") + parser.add_option("", "--only-io-graph", help="Create only IO graph", + default=False, action="store_true"); parser.add_option("-r", "--rolling-avg", help="Rolling average for seeks and throughput (in seconds)", default=None) @@ -761,7 +823,7 @@ rcParams['interactive'] = 'False' from pylab import * -if not options.trace: +if not options.trace and not options.file: parser.print_help() sys.exit(1) @@ -788,29 +850,16 @@ data = numpy.array([]) runs = [] must_sort = True +devices_sector_max = {} +device_translate = {} for x in options.trace: - devices_sector_max = {} - run = run_blkparse(x, converters) - - device_translate = {} - total = 0 - if len(devices_sector_max) > 1: - must_sort = True - for x in devices_sector_max: - device_translate[x] = total + devices_sector_max[x] - total += devices_sector_max[x] - shapeit(run) - if len(devices_sector_max) > 1: - for x in run: - sector = x[4] - dev = x[8] - x[4] = device_translate[dev] + sector - - sorted = sort_by_time(run) - run = sorted + run = process_input(x, 0) + runs.append(run) + data = numpy.append(data, run) - unshapeit(run) +for x in options.file: + run = process_input(x, 1) runs.append(run) data = numpy.append(data, run) @@ -910,6 
+959,8 @@ if options.no_io_graph: total_graphs = 2 +elif options.only_io_graph: + total_graphs = 1 else: total_graphs = 3 @@ -922,76 +973,78 @@ if options.title: options.title += "\n\n" -# Throughput goes at the botoom -a = subplot(total_graphs, 1, total_graphs) -for i in xrange(len(runs)): - label = getlabel(i) - plot_throughput(a, None, runs[i], '-', label) - +# Prepare ticks # make sure the final second goes on the x axes ticks = list(arange(xmin, xmax, xmax/8)) ticks.append(xmax) xticks = ticks -a.set_xticks(ticks) -a.set_yticklabels( [ "%d" % x for x in ticks ]) if ticks[-1] < 4: xticklabels = [ "%.1f" % x for x in ticks ] else: xticklabels = [ "%d" % x for x in ticks ] -a.set_xticklabels(xticklabels) -# cut down the number of yticks to something more reasonable -ticks = a.get_yticks() -ticks = list(arange(0, ticks[-1] + ticks[-1]/4, ticks[-1]/4)) -a.set_yticks(ticks) +if not options.only_io_graph: + # Throughput goes at the botoom + a = subplot(total_graphs, 1, total_graphs) + for i in xrange(len(runs)): + label = getlabel(i) + plot_throughput(a, None, runs[i], '-', label) -if ticks[-1] < 4: - a.set_yticklabels( [ "%.1f" % x for x in ticks ]) -else: - a.set_yticklabels( [ "%d" % x for x in ticks ]) + a.set_xticks(xticks) + a.set_yticklabels( [ "%d" % x for x in xticks ]) + a.set_xticklabels(xticklabels) -a.set_title('Throughput') -a.set_ylabel('MB/s') + # cut down the number of yticks to something more reasonable + ticks = a.get_yticks() + ticks = list(arange(0, ticks[-1] + ticks[-1]/4, ticks[-1]/4)) + a.set_yticks(ticks) -# the bottom graph gets xticks, set it here -a.set_xlabel('Time (seconds)') -if options.label: - a.legend(loc=(1.01, 0.5), shadow=True, pad=0.5, numpoints=2, - handletextsep = 0.005, - labelsep = 0.01, - prop=FontProperties(size='x-small') ) - -# next is the seek count graph -a = subplot(total_graphs, 1, total_graphs - 1) -for i in xrange(len(runs)): - label = getlabel(i) - plot_seek_count(a, None, runs[i], '-', label) - -# cut down the 
number of yticks to something more reasonable -ticks = a.get_yticks() -ticks = list(arange(0, ticks[-1] + ticks[-1]/4, ticks[-1]/4)) -a.set_yticks(ticks) -a.set_yticklabels( [ str(int(x)) for x in ticks ]) + if ticks[-1] < 4: + a.set_yticklabels( [ "%.1f" % x for x in ticks ]) + else: + a.set_yticklabels( [ "%d" % x for x in ticks ]) -if options.no_io_graph and options.title: - a.set_title(options.title + 'Seek Count') -else: - a.set_title('Seek Count') + a.set_title('Throughput') + a.set_ylabel('MB/s') + + # the bottom graph gets xticks, set it here + a.set_xlabel('Time (seconds)') + if options.label: + a.legend(loc=(1.01, 0.5), shadow=True, pad=0.5, numpoints=2, + handletextsep = 0.005, + labelsep = 0.01, + prop=FontProperties(size='x-small') ) -a.set_ylabel('Seeks / sec') -if options.label: - a.legend(loc=(1.01, 0.5), shadow=True, pad=0.5, numpoints=2, - handletextsep = 0.005, - labelsep = 0.01, - prop=FontProperties(size='x-small') ) + # next is the seek count graph + a = subplot(total_graphs, 1, total_graphs - 1) + for i in xrange(len(runs)): + label = getlabel(i) + plot_seek_count(a, None, runs[i], '-', label) + + # cut down the number of yticks to something more reasonable + ticks = a.get_yticks() + ticks = list(arange(0, ticks[-1] + ticks[-1]/4, ticks[-1]/4)) + a.set_yticks(ticks) + a.set_yticklabels( [ str(int(x)) for x in ticks ]) + + if options.no_io_graph and options.title: + a.set_title(options.title + 'Seek Count') + else: + a.set_title('Seek Count') + + a.set_ylabel('Seeks / sec') + if options.label: + a.legend(loc=(1.01, 0.5), shadow=True, pad=0.5, numpoints=2, + handletextsep = 0.005, + labelsep = 0.01, + prop=FontProperties(size='x-small') ) # and the optional IO graph if not options.no_io_graph: - a = subplot(total_graphs, 1, total_graphs - 2) + a = subplot(total_graphs, 1, 1) for i in xrange(len(runs)): label = getlabel(i) - plot_data(a, 0, runs[i], options.io_graph_dots, label + " Read") - plot_data(a, 1, runs[i], options.io_graph_dots, label + 
" Write") + plot_data(a, runs[i], options.io_graph_dots, label) af = AnnoteFinder(axis=a) connect('button_press_event', af) @@ -1008,17 +1061,27 @@ ticks.append(yzoommax) a.set_yticks(ticks) a.set_yticklabels( [ str(int(x/2048)) for x in ticks ] ) - a.legend(loc=(1.01, 0.5), shadow=True, pad=0.3, numpoints=1, - handletextsep = 0.005, - labelsep = 0.01, - markerscale = 1.1, - prop=FontProperties(size='x-small') ) + if not options.only_io_graph: + a.legend(loc=(1.01, 0.5), shadow=True, pad=0.3, numpoints=1, + handletextsep = 0.005, + labelsep = 0.01, + markerscale = 1.1, + prop=FontProperties(size='x-small') ) + else: + a.legend(loc=(0,-0.25), ncol=4, columnspacing=0.1, + shadow=True, borderpad=0.3, numpoints=1, + handletextpad = 0.005, + labelspacing = 0.01, + markerscale = 1.1, + prop=FontProperties(size='x-small') ) + subplots_adjust(bottom=0.2) a.set_ylim(yzoommin, yzoommax) -# squeeze the graphs over to the left a bit to make room for the -# legends -# -subplots_adjust(right = 0.8, hspace=0.3) +if not options.only_io_graph: + # squeeze the graphs over to the left a bit to make room for the + # legends + # + subplots_adjust(right = 0.8, hspace=0.3) # finally, some global bits for each subplot for x in xrange(1, total_graphs + 1): [-- Attachment #3: tag-by-process --] [-- Type: text/plain, Size: 2648 bytes --] #!/usr/bin/env python # Copyright (C) 2009 Novell. All rights reserved. # # This program is free software; you can redistribute it and/or # modify it under the terms of the GNU General Public # License v2 as published by the Free Software Foundation. # # This program is distributed in the hope that it will be useful, # but WITHOUT ANY WARRANTY; without even the implied warranty of # MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU # General Public License for more details. 
#
# You should have received a copy of the GNU General Public
# License along with this program; if not, write to the
# Free Software Foundation, Inc., 59 Temple Place - Suite 330,
# Boston, MA 021110-1307, USA.
#
# You can use this script to produce files for seekwatcher with
# IO tagged by PID and / or command.  Use option -m <command> to
# merge processes which have the same command name.
#
# Example:
# tag-by-process -t sda | seekwatcher -f - --only-io-graph

import sys, os, signal, time, commands, tempfile, signal
from optparse import OptionParser

def loaddata(fh):
    for i,line in enumerate(fh):
        if not line.startswith('Q'):
            continue
        row = line.split()
        if options.merge.count(row[10]) > 0:
            row[9] = row[10]
        else:
            row[9] = row[10] + "(" + row[9] + ")"
        for i in range(1,10):
            print row[i],
        print

def run_blkparse(trace):
    tracefiles = []
    seen = {}
    sys.stderr.write("run_blkparse on %s\n" % trace)
    if not os.path.exists(trace + "blktrace.0"):
        dirname = os.path.dirname(trace) or "."
        files = os.listdir(dirname)
        joinname = os.path.dirname(trace) or ""
        for x in files:
            x = os.path.join(joinname, x)
            if x.startswith(trace) and ".blktrace." in x:
                i = x.rindex('.blktrace.')
                cur = x[0:i]
                if cur not in seen:
                    tracefiles.append(x[0:i])
                    seen[cur] = 1
    else:
        tracefiles.append(trace)

    for x in tracefiles:
        sys.stderr.write("using tracefile %s\n" % x)
        p = os.popen('blkparse -q -i ' + x +
                     ' -f "%a %d %M %m %S %N %s %5T.%9t %D %p %C\n"')
        cur = loaddata(p)

usage = "usage: %prog [options]"
parser = OptionParser(usage=usage)
parser.add_option("-t", "--trace", help="blktrace file", default=[],
                  action="append")
parser.add_option("-m", "--merge", help="merge commands with given names",
                  default=[], action="append")
(options,args) = parser.parse_args()

for x in options.trace:
    run = run_blkparse(x)

^ permalink raw reply	[flat|nested] 34+ messages in thread
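What the script emits for seekwatcher is easiest to see as a pure function of one blkparse queue event: the pid column gets rewritten into a per-process tag. Here is a Python 3 restatement of the `loaddata()` transform (the script itself is Python 2); the sample event line below is hypothetical, not real trace output:

```python
def tag_fields(fields, merge=()):
    """Python 3 sketch of loaddata()'s per-line transform.

    `fields` are the 11 whitespace-separated fields blkparse emits for
    the format string used above: field 9 is the pid, field 10 the
    command name.  Returns the 9 fields seekwatcher reads, with the pid
    column replaced by a per-process tag.
    """
    row = list(fields)
    if row[10] in merge:
        row[9] = row[10]                       # -m: tag by command only
    else:
        row[9] = row[10] + "(" + row[9] + ")"  # tag by command(pid)
    return row[1:10]

# Hypothetical queue event: action rwbs maj min sector bytes seq time dev pid cmd
line = "Q W 8 0 123456 4096 42 1.000000001 8,0 4131 dd".split()
print(" ".join(tag_fields(line)))  # the pid column becomes "dd(4131)"
```

With `-m dd` the same event would be tagged just `dd`, merging all dd processes into one curve on the IO graph.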
* Re: [PATCH] Add ext3 data=guarded mode
2009-04-20  9:26 ` Jan Kara
@ 2009-04-20 12:15 ` Mike Galbraith
2009-04-20 12:56 ` Amit Shah
0 siblings, 1 reply; 34+ messages in thread
From: Mike Galbraith @ 2009-04-20 12:15 UTC (permalink / raw)
To: Jan Kara
Cc: Amit Shah, Theodore Tso, Chris Mason, Linus Torvalds,
	Linux Kernel Developers List, Ext4 Developers List

On Mon, 2009-04-20 at 11:26 +0200, Jan Kara wrote:
> On Mon 20-04-09 11:07:17, Mike Galbraith wrote:
> > All journal modes seem subject to bad throughput under heavy pressure,
> > though data=ordered seems much less likely to suffer for some reason.
> > Major difference _seems_ to be that write()+largefile induces very much
> > swap activity.
> My rough guess is that this depends on the VM writeout behavior. In
> ordered mode, we forcibly writeout all the dirty data on a transaction
> commit which happens every 5 seconds so they don't accumulate that much.

Aha.

> In other journaling modes we don't do that and decisions about writeout
> (probably how much pdflush manages to write in background vs. how much
> VM throttles the process to do the writeback itself) cause variances in
> the run time. But this is just a guess. You could gather blktraces of
> slow and fast runs and then look if the amount of IO done by different
> processes significantly differs. If Chris has merged my improvements to
> Seekwatcher, then you could nicely visualize this (hmm, that doesn't seem
> to be the case so I'm attaching the diff and a helper script - see comments
> in the beginning of the script and command helps for usage).

Not necessary methinks.  Actually _reading_ the proggy instead of just
glancing at it...

@@ -178,7 +178,7 @@ int do_write_chunks(char *target, int *f
 	unsigned long long remain = len;
 	char *zeros;

-	zeros = calloc(1, len);
+	zeros = calloc(1, data);

I assume that was a booboo.
-Mike

--
To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 34+ messages in thread
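The one-character fix Mike spotted is easy to miss: the test program zero-fills the target in `data`-sized chunks, but sized its zero buffer by the total length `len`, so every run carried a file-sized allocation of zeroes. The intended pattern, sketched in Python rather than the original C (`write_chunks` and its arguments here are illustrative, not the test program's actual code):

```python
import io

def write_chunks(f, length, chunk):
    """Zero-fill `length` bytes of `f` in `chunk`-sized writes.

    The zero buffer is sized by the chunk, not the total length -- the
    equivalent of calloc(1, data) rather than calloc(1, len) above.
    """
    zeros = bytes(chunk)           # one chunk-sized buffer, reused
    remain = length
    while remain > 0:
        n = min(remain, chunk)
        f.write(zeros[:n])
        remain -= n

buf = io.BytesIO()
write_chunks(buf, 10000, 4096)     # three writes: 4096 + 4096 + 1808
print(len(buf.getvalue()))         # 10000
```

A buffer sized by the whole file would also go a long way toward explaining the swap activity the earlier runs showed.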
* Re: [PATCH] Add ext3 data=guarded mode 2009-04-20 12:15 ` Mike Galbraith @ 2009-04-20 12:56 ` Amit Shah 2009-04-20 13:06 ` Mike Galbraith 0 siblings, 1 reply; 34+ messages in thread From: Amit Shah @ 2009-04-20 12:56 UTC (permalink / raw) To: Mike Galbraith Cc: Jan Kara, Theodore Tso, Chris Mason, Linus Torvalds, Linux Kernel Developers List, Ext4 Developers List On (Mon) Apr 20 2009 [14:15:14], Mike Galbraith wrote: > On Mon, 2009-04-20 at 11:26 +0200, Jan Kara wrote: > > On Mon 20-04-09 11:07:17, Mike Galbraith wrote: > > > All journal modes seem subject to bad throughput under heavy pressure, > > > though data=ordered seems much less likely to suffer for some reason. > > > Major difference _seems_ to be that write()+largefile induces very much > > > swap activity. > > > My rough guess is that this depends on the VM writeout behavior. In > > ordered mode, we forcibly writeout all the dirty data on a transaction > > commit which happens every 5 seconds so they don't accumulate that much. > > Aha. > > > In other journaling modes we don't do that and decisions about writeout > > (probably how much pdflush manages to write in background vs. how much > > VM throttles the process to do the writeback itself) cause variances in > > the run time. But this is just a guess. You could gather blktraces of > > slow and fast runs and then look if the amount of IO done by different > > processes significantly differs. If Chris has merged by improvements to > > Seekwatcher, then you could nicely visualize this (hmm, that doesn't seem > > to be the case so I'm attaching the diff and a helper script - see comments > > in the beginning of the script and command helps for usage). > > Not necessary methinks. Actually _reading_ the proggy instead of just > glancing at it... > > @@ -178,7 +178,7 @@ int do_write_chunks(char *target, int *f > unsigned long long remain = len; > char *zeros; > > - zeros = calloc(1, len); > + zeros = calloc(1, data); > > I assume that was a booboo. Ouch! 
thanks for spotting that! I'll re-run the tests later tonight..

		Amit

^ permalink raw reply	[flat|nested] 34+ messages in thread
* Re: [PATCH] Add ext3 data=guarded mode
2009-04-20 12:56 ` Amit Shah
@ 2009-04-20 13:06 ` Mike Galbraith
0 siblings, 0 replies; 34+ messages in thread
From: Mike Galbraith @ 2009-04-20 13:06 UTC (permalink / raw)
To: Amit Shah
Cc: Jan Kara, Theodore Tso, Chris Mason, Linus Torvalds,
	Linux Kernel Developers List, Ext4 Developers List

On Mon, 2009-04-20 at 18:26 +0530, Amit Shah wrote:

> I'll re-run the tests later tonight..

I ran ext3 in all journaling modes (one run), and there was spit
difference.  (the mmap test is a bit slow and unfriendly)

marge:..tmp/tmp # ./test-file-zero-alloc-speed 4 /dev/sdf3 /media/root ext3 rw,_netdev,noatime,data=ordered,acl,user_xattr
posix-fallocate run time:
seconds:microseconds: 1240230249:280060
seconds:microseconds: 1240230303:989879 (approx 54s)
mmap run time:
seconds:microseconds: 1240230304:67286
seconds:microseconds: 1240230363:882957 (approx 59s)
4096-sized chunk run time:
seconds:microseconds: 1240230363:989225
seconds:microseconds: 1240230419:242924 (approx 56s)
8192-sized chunk run time:
seconds:microseconds: 1240230419:365540
seconds:microseconds: 1240230473:307148 (approx 54s)

marge:..tmp/tmp # ./test-file-zero-alloc-speed 4 /dev/sdf3 /media/root ext3 rw,_netdev,noatime,data=writeback,acl,user_xattr
posix-fallocate run time:
seconds:microseconds: 1240230511:276152
seconds:microseconds: 1240230565:40419 (approx 54s)
mmap run time:
seconds:microseconds: 1240230571:139959
seconds:microseconds: 1240230633:254590 (approx 62s)
4096-sized chunk run time:
seconds:microseconds: 1240230639:300132
seconds:microseconds: 1240230693:95665 (approx 54s)
8192-sized chunk run time:
seconds:microseconds: 1240230698:655975
seconds:microseconds: 1240230753:902749 (approx 55s)

marge:..tmp/tmp # ./test-file-zero-alloc-speed 4 /dev/sdf3 /media/root ext3 rw,_netdev,noatime,data=guarded,acl,user_xattr
posix-fallocate run time:
seconds:microseconds: 1240230802:594939
seconds:microseconds: 1240230856:602117 (approx 54s)
mmap run time:
seconds:microseconds: 1240230862:792587
seconds:microseconds: 1240230924:985089 (approx 62s)
4096-sized chunk run time:
seconds:microseconds: 1240230930:827949
seconds:microseconds: 1240230985:663422 (approx 55s)
8192-sized chunk run time:
seconds:microseconds: 1240230991:483085
seconds:microseconds: 1240231046:601340 (approx 55s)

marge:..tmp/tmp # ./test-file-zero-alloc-speed 4 /dev/sdf3 /media/root ext3 rw,_netdev,noatime,data=journal,acl,user_xattr
posix-fallocate run time:
seconds:microseconds: 1240231074:745621
seconds:microseconds: 1240231128:740106 (approx 54s)
mmap run time:
seconds:microseconds: 1240231134:830362
seconds:microseconds: 1240231193:939029 (approx 59s)
4096-sized chunk run time:
seconds:microseconds: 1240231199:800554
seconds:microseconds: 1240231254:600778 (approx 55s)
8192-sized chunk run time:
seconds:microseconds: 1240231260:386834
seconds:microseconds: 1240231315:121745 (approx 55s)

^ permalink raw reply	[flat|nested] 34+ messages in thread
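The "(approx Ns)" annotations above are consistent with simply differencing the seconds fields of the two timestamps; a small helper makes the exact elapsed time explicit (illustrative Python, not part of the test program):

```python
def elapsed(start, end):
    """Wall time between two 'seconds:microseconds' stamps, in seconds."""
    s0, u0 = (int(v) for v in start.split(":"))
    s1, u1 = (int(v) for v in end.split(":"))
    return (s1 - s0) + (u1 - u0) / 1e6

# data=ordered, posix-fallocate run from the first block above
print(elapsed("1240230249:280060", "1240230303:989879"))  # ~54.7 seconds
```

Read that way, the four journaling modes land within a second or two of each other on every sub-test, which supports Mike's conclusion that the earlier variance came from the test program, not the journal mode.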
* Re: [PATCH] Add ext3 data=guarded mode 2009-04-16 19:42 ` [PATCH] " Chris Mason 2009-04-17 11:04 ` Mike Galbraith 2009-04-17 18:09 ` Amit Shah @ 2009-04-20 13:44 ` Jan Kara 2009-04-20 14:18 ` Chris Mason 2 siblings, 1 reply; 34+ messages in thread From: Jan Kara @ 2009-04-20 13:44 UTC (permalink / raw) To: Chris Mason Cc: Linus Torvalds, Theodore Ts'o, Linux Kernel Developers List, Ext4 Developers List Hi Chris, On Thu 16-04-09 15:42:01, Chris Mason wrote: > > ext3 data=ordered mode makes sure that data blocks are on disk before > the metadata that references them, which avoids files full of garbage > or previously deleted data after a crash. It does this by adding every dirty > buffer onto a list of things that must be written before a commit. > > This makes every fsync write out all the dirty data on the entire FS, which > has high latencies and is generally much more expensive than it needs to be. > > Another way to avoid exposing stale data after a crash is to wait until > after the data buffers are written before updating the on-disk record > of the file's size. If we crash before the data IO is done, i_size > doesn't yet include the new blocks and no stale data is exposed. > > This patch adds the delayed i_size update to ext3, along with a new > mount option (data=guarded) to enable it. The basic mechanism works like > this: > > * Change block_write_full_page to take an end_io handler as a parameter. > This allows us to make an end_io handler that queues buffer heads for > a workqueue where the real work of updating the on disk i_size is done. > > * Add an rbtree to the in-memory ext3 inode for tracking data=guarded > buffer heads that are waiting to be sent to disk. > > * Add an ext3 guarded write_end call to add buffer heads for newly > allocated blocks into the rbtree. If we have a newly allocated block that is > filling a hole inside i_size, this is done as an old style data=ordered write > instead. 
>
> * Add an ext3 guarded writepage call that uses a special buffer head
> end_io handler for buffers that are marked as guarded.  Again, if we find
> newly allocated blocks filling holes, they are sent through data=ordered
> instead of data=guarded.
>
> * When a guarded IO finishes, kick a per-FS workqueue to do the
> on disk i_size updates.  The workqueue function must be very careful.  We
> only update the on disk i_size if all of the IO between the old on
> disk i_size and the new on disk i_size is complete.  This is why an
> rbtree is used to track the pending buffers, that way we can verify all
> of the IO is actually done.  The on disk i_size is incrementally updated to
> the largest safe value every time an IO completes.
>
> * When we start tracking guarded buffers on a given inode, we put the
> inode into ext3's orphan list.  This way if we do crash, the file will
> be truncated back down to the on disk i_size and we'll free any blocks that
> were not completely written.  The inode is removed from the orphan list
> only after all the guarded buffers are done.
>
> Signed-off-by: Chris Mason <chris.mason@oracle.com>

I've read the patch. I don't think I've got all the subtleties but before
diving into it more I'd like to ask why we do things in such a complicated
way. Maybe I'm missing some issues so let's see:

1) If I got it right, hole filling goes through standard ordered mode so
we can ignore such writes. So why do we have a special writepage? It should
look just like writepage for ordered mode and we could just tweak
ext3_ordered_writepage() (probably renamed) to do:

	if (ext3_should_order_data(inode))
		err = walk_page_buffers(handle, page_bufs, 0, PAGE_CACHE_SIZE,
					NULL, journal_dirty_data_fn);
	else
		err = walk_page_buffers(handle, page_bufs, 0, PAGE_CACHE_SIZE,
					NULL, journal_dirty_guarded_data_fn);

2) Why is there any RB tree after all? Since guarded writes are just
extending writes, we can have a linked list of guarded buffers.
We always append at the end, we update i_size if the current buffer has no
predecessor in the list.

3) Currently truncate() does filemap_write_and_wait() - is it really
needed? Each guarded bh could carry with itself the i_disksize it should
update to when IO is finished. Extending truncate will just update this
i_disksize at the last member of the list (or update i_disksize when the
list is empty).

Shortening truncate will walk the list of guarded bh's, removing from
the list those beyond new i_size, then it will behave like the extending
truncate (it works even if current i_disksize is larger than new i_size).
Note that before we get to ext3_truncate() mm walks all the pages beyond
i_size and waits for page writeback so by the time ext3_truncate() is
called, all the IO is finished and dirty pages are canceled.

The IO finished callback will update i_disksize to the carried value if
the buffer is the first in the list, otherwise it will copy its value to
the previous member of the list.

4) Do we have to call end_page_writeback() from the work queue? That
could make IO completion times significantly longer on a good disk array,
couldn't it? There is a way to solve this, I believe, although it might
be too hacky / complicated. We have to update i_disksize before calling
end_page_writeback() because of truncate races and generally for
filemap_fdatawrite() to work. So what we could do is:

guarded_end_io():
	update i_disksize
	call something like __mark_inode_dirty(inode, I_DIRTY_DATASYNC) but
	avoid calling ext3_dirty_inode() or somehow make that we immediately
	return from it.
	call end_buffer_async_write()
	queue addition of inode to the transaction / removal from orphan list

								Honza
--
Jan Kara <jack@suse.cz>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 34+ messages in thread
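The bookkeeping the two sides are debating here - that the on-disk i_size may only advance over IO which has completed contiguously, while guarded IOs can finish out of order - can be modeled in a few lines. This is a hypothetical Python sketch of that invariant, with sorted lists standing in for the patch's rbtree; it is not the kernel code:

```python
import bisect

class GuardedSize:
    """Toy model of data=guarded on-disk i_size tracking (not kernel code).

    Pending extents are kept sorted by start offset.  When an IO
    completes, i_disksize may only advance once no still-pending extent
    starts below the completed prefix -- which is why completion order
    matters, and why a plain append-only list stops being sufficient
    once lseek()+write() lets IO complete out of order.
    """

    def __init__(self, disksize=0):
        self.disksize = disksize
        self.pending = []   # sorted (start, end): guarded IO in flight
        self.done = []      # sorted (start, end): completed, not yet safe

    def submit(self, start, end):
        bisect.insort(self.pending, (start, end))

    def complete(self, start, end):
        self.pending.remove((start, end))
        bisect.insort(self.done, (start, end))
        # fold completed extents into the on-disk i_size for as long as
        # nothing still in flight starts below them
        while self.done:
            s, e = self.done[0]
            if self.pending and self.pending[0][0] < e:
                break
            self.done.pop(0)
            self.disksize = max(self.disksize, e)
        return self.disksize

g = GuardedSize()
g.submit(0, 4096)
g.submit(4096, 8192)
print(g.complete(4096, 8192))  # 0: a lower extent is still in flight
print(g.complete(0, 4096))     # 8192: the whole prefix is now done
```

With strictly in-order completion the `done` list never holds more than one extent, which is the case where Jan's simpler linked list would suffice.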
* Re: [PATCH] Add ext3 data=guarded mode 2009-04-20 13:44 ` Jan Kara @ 2009-04-20 14:18 ` Chris Mason 2009-04-20 14:42 ` Jan Kara 0 siblings, 1 reply; 34+ messages in thread From: Chris Mason @ 2009-04-20 14:18 UTC (permalink / raw) To: Jan Kara Cc: Linus Torvalds, Theodore Ts'o, Linux Kernel Developers List, Ext4 Developers List On Mon, 2009-04-20 at 15:44 +0200, Jan Kara wrote: > Hi Chris, > > On Thu 16-04-09 15:42:01, Chris Mason wrote: > > > > ext3 data=ordered mode makes sure that data blocks are on disk before > > the metadata that references them, which avoids files full of garbage > > or previously deleted data after a crash. It does this by adding every dirty > > buffer onto a list of things that must be written before a commit. > > > > This makes every fsync write out all the dirty data on the entire FS, which > > has high latencies and is generally much more expensive than it needs to be. > > > > Another way to avoid exposing stale data after a crash is to wait until > > after the data buffers are written before updating the on-disk record > > of the file's size. If we crash before the data IO is done, i_size > > doesn't yet include the new blocks and no stale data is exposed. > > > > This patch adds the delayed i_size update to ext3, along with a new > > mount option (data=guarded) to enable it. The basic mechanism works like > > this: > > > > * Change block_write_full_page to take an end_io handler as a parameter. > > This allows us to make an end_io handler that queues buffer heads for > > a workqueue where the real work of updating the on disk i_size is done. > > > > * Add an rbtree to the in-memory ext3 inode for tracking data=guarded > > buffer heads that are waiting to be sent to disk. > > > > * Add an ext3 guarded write_end call to add buffer heads for newly > > allocated blocks into the rbtree. If we have a newly allocated block that is > > filling a hole inside i_size, this is done as an old style data=ordered write > > instead. 
> > > > * Add an ext3 guarded writepage call that uses a special buffer head > > end_io handler for buffers that are marked as guarded. Again, if we find > > newly allocated blocks filling holes, they are sent through data=ordered > > instead of data=guarded. > > > > * When a guarded IO finishes, kick a per-FS workqueue to do the > > on disk i_size updates. The workqueue function must be very careful. We > > only update the on disk i_size if all of the IO between the old on > > disk i_size and the new on disk i_size is complete. This is why an > > rbtree is used to track the pending buffers, that way we can verify all > > of the IO is actually done. The on disk i_size is incrementally updated to > > the largest safe value every time an IO completes. > > > > * When we start tracking guarded buffers on a given inode, we put the > > inode into ext3's orphan list. This way if we do crash, the file will > > be truncated back down to the on disk i_size and we'll free any blocks that > > were not completely written. The inode is removed from the orphan list > > only after all the guarded buffers are done. > > > > Signed-off-by: Chris Mason <chris.mason@oracle.com> > I've read the patch. I don't think I've got all the subtleties but before > diving into it more I'd like to ask why do we do the things in so > complicated way? Thanks for reviewing things! > Maybe I'm missing some issues so let's see: > 1) If I got it right, hole filling goes through standard ordered mode so > we can ignore such writes. So why do we have special writepage? I should > look just like writepage for ordered mode and we could just tweak > ext3_ordered_writepage() (probably renamed) to do: > if (ext3_should_order_data(inode)) > err = walk_page_buffers(handle, page_bufs, 0, PAGE_CACHE_SIZE, > NULL, journal_dirty_data_fn); > else > err = walk_page_buffers(handle, page_bufs, 0, PAGE_CACHE_SIZE, > NULL, journal_dirty_guarded_data_fn); That would work. 
My first writepage was more complex, it shrunk as the patch evolved.
Another question is if we want to use exactly the same writepage for
guarded and ordered. I've always thought data=ordered should only order
new blocks...

> 2) Why is there any RB tree after all? Since guarded are just extending
> writes, we can have a linked list of guarded buffers.

A guarded write is anything from write() that is past the disk i_size.
lseek and friends mean it could happen in any order.

> 3) Currently truncate() does filemap_write_and_wait() - is it really
> needed? Each guarded bh could carry with itself i_disksize it should update
> to when IO is finished. Extending truncate will just update this i_disksize
> at the last member of the list (or update i_disksize when the list is
> empty).
>
> Shortening truncate will walk the list of guarded bh's, removing from
> the list those beyond new i_size, then it will behave like the extending
> truncate (it works even if current i_disksize is larger than new i_size).
> Note that before we get to ext3_truncate() mm walks all the pages beyond
> i_size and waits for page writeback so by the time ext3_truncate() is
> called, all the IO is finished and dirty pages are canceled.

The problem here was the disk i_size being updated by ext3_setattr
before vmtruncate calls ext3_truncate(). So the guarded IO might wander
in and change the i_disksize update done by setattr.

It all made me a bit dizzy and I just tossed the write_and_wait in
instead.

At the end of the day, we're waiting for guarded writes only, and we
probably would have ended up waiting on those exact same pages in
vmtruncate anyway. So, I do agree we could avoid the write with more
code, but is this really a performance critical section?
> IO finished callback will update i_disksize to the carried value if the
> buffer is the first in the list, otherwise it will copy its value to the
> previous member of the list.

> 4) Do we have to call end_page_writeback() from the work queue? That
> could make IO completion times significantly longer on a good disk array,
> couldn't it?

My understanding is that XFS is doing something similar with the
workqueue already, without big performance problems.

> There is a way to solve this, I believe, although it might
> be too hacky / complicated. We have to update i_disksize before calling
> end_page_writeback() because of truncate races and generally for
> filemap_fdatawrite() to work. So what we could do is:
> guarded_end_io():
>	update i_disksize
>	call something like __mark_inode_dirty(inode, I_DIRTY_DATASYNC) but
>	avoid calling ext3_dirty_inode() or somehow make that we immediately
>	return from it.
>	call end_buffer_async_write()
>	queue addition of inode to the transaction / removal from orphan list

It could, but we end up with a list of inodes that must be logged before
they can be freed. This was a problem in the past (before the
dirty_inode operation was added) because logging an inode is relatively
expensive, and we have no mechanism to throttle them.

In the past it led to deadlocks because kswapd would try and log all
the dirty inodes, and someone else had the transaction pinned while
waiting on kswapd to find free ram. We might be able to do better
today, but I didn't want to cram that into this patch series as well.

While I was resisting urges, I also didn't make a writepages op. But
with just a little more code....

-chris

^ permalink raw reply	[flat|nested] 34+ messages in thread
* Re: [PATCH] Add ext3 data=guarded mode 2009-04-20 14:18 ` Chris Mason @ 2009-04-20 14:42 ` Jan Kara 2009-04-20 14:58 ` Chris Mason 0 siblings, 1 reply; 34+ messages in thread From: Jan Kara @ 2009-04-20 14:42 UTC (permalink / raw) To: Chris Mason Cc: Linus Torvalds, Theodore Ts'o, Linux Kernel Developers List, Ext4 Developers List On Mon 20-04-09 10:18:25, Chris Mason wrote: > On Mon, 2009-04-20 at 15:44 +0200, Jan Kara wrote: > > Hi Chris, > > > > On Thu 16-04-09 15:42:01, Chris Mason wrote: > > > > > > ext3 data=ordered mode makes sure that data blocks are on disk before > > > the metadata that references them, which avoids files full of garbage > > > or previously deleted data after a crash. It does this by adding every dirty > > > buffer onto a list of things that must be written before a commit. > > > > > > This makes every fsync write out all the dirty data on the entire FS, which > > > has high latencies and is generally much more expensive than it needs to be. > > > > > > Another way to avoid exposing stale data after a crash is to wait until > > > after the data buffers are written before updating the on-disk record > > > of the file's size. If we crash before the data IO is done, i_size > > > doesn't yet include the new blocks and no stale data is exposed. > > > > > > This patch adds the delayed i_size update to ext3, along with a new > > > mount option (data=guarded) to enable it. The basic mechanism works like > > > this: > > > > > > * Change block_write_full_page to take an end_io handler as a parameter. > > > This allows us to make an end_io handler that queues buffer heads for > > > a workqueue where the real work of updating the on disk i_size is done. > > > > > > * Add an rbtree to the in-memory ext3 inode for tracking data=guarded > > > buffer heads that are waiting to be sent to disk. > > > > > > * Add an ext3 guarded write_end call to add buffer heads for newly > > > allocated blocks into the rbtree. 
If we have a newly allocated block that is > > > filling a hole inside i_size, this is done as an old style data=ordered write > > > instead. > > > > > > * Add an ext3 guarded writepage call that uses a special buffer head > > > end_io handler for buffers that are marked as guarded. Again, if we find > > > newly allocated blocks filling holes, they are sent through data=ordered > > > instead of data=guarded. > > > > > > * When a guarded IO finishes, kick a per-FS workqueue to do the > > > on disk i_size updates. The workqueue function must be very careful. We > > > only update the on disk i_size if all of the IO between the old on > > > disk i_size and the new on disk i_size is complete. This is why an > > > rbtree is used to track the pending buffers, that way we can verify all > > > of the IO is actually done. The on disk i_size is incrementally updated to > > > the largest safe value every time an IO completes. > > > > > > * When we start tracking guarded buffers on a given inode, we put the > > > inode into ext3's orphan list. This way if we do crash, the file will > > > be truncated back down to the on disk i_size and we'll free any blocks that > > > were not completely written. The inode is removed from the orphan list > > > only after all the guarded buffers are done. > > > > > > Signed-off-by: Chris Mason <chris.mason@oracle.com> > > I've read the patch. I don't think I've got all the subtleties but before > > diving into it more I'd like to ask why do we do the things in so > > complicated way? > > Thanks for reviewing things! > > > Maybe I'm missing some issues so let's see: > > 1) If I got it right, hole filling goes through standard ordered mode so > > we can ignore such writes. So why do we have special writepage? 
It should
> > look just like writepage for ordered mode and we could just tweak
> > ext3_ordered_writepage() (probably renamed) to do:
> > 	if (ext3_should_order_data(inode))
> > 		err = walk_page_buffers(handle, page_bufs, 0, PAGE_CACHE_SIZE,
> > 					NULL, journal_dirty_data_fn);
> > 	else
> > 		err = walk_page_buffers(handle, page_bufs, 0, PAGE_CACHE_SIZE,
> > 					NULL, journal_dirty_guarded_data_fn);
>
> That would work.  My first writepage was more complex, it shrunk as the
> patch evolved.  Another question is if we want to use exactly the same
> writepage for guarded and ordered.  I've always thought data=ordered
> should only order new blocks...

Hmm, true. After my last change we already don't file data buffers in
ordered_writepage() if the page is fully mapped to disk so doing this in
all the cases is fine. Actually, I'll soon write ext3_page_mkwrite() to
do the block allocation at page fault time so after that we can get rid
of most of the code in ext3_ordered_writepage().

> > 2) Why is there any RB tree after all? Since guarded are just extending
> > writes, we can have a linked list of guarded buffers.
>
> A guarded write is anything from write() that is past the disk i_size.
> lseek and friends mean it could happen in any order.

Are you sure? Looking at ext3_guarded_write_end():
...
+	if (test_clear_buffer_datanew(bh)) {
+		/*
+		 * if we're filling a hole inside i_size, we need to
+		 * fall back to the old style data=ordered
+		 */
+		if (offset < inode->i_size) {
+			ret = ext3_journal_dirty_data(handle, bh);
+			goto out;
+		}
...
So it seems we always do an ordered write unless we are appending / have
blocks allocated. You could have i_disksize in the check but is it really
worth it? IMO getting rid of the RB tree might be better ;)

> > 3) Currently truncate() does filemap_write_and_wait() - is it really
> > needed?
Each guarded bh could carry with itself i_disksize it should update
> > to when IO is finished. Extending truncate will just update this i_disksize
> > at the last member of the list (or update i_disksize when the list is
> > empty).
> >
> > Shortening truncate will walk the list of guarded bh's, removing from
> > the list those beyond new i_size, then it will behave like the extending
> > truncate (it works even if current i_disksize is larger than new i_size).
> > Note that before we get to ext3_truncate() mm walks all the pages beyond
> > i_size and waits for page writeback so by the time ext3_truncate() is
> > called, all the IO is finished and dirty pages are canceled.
>
> The problem here was the disk i_size being updated by ext3_setattr
> before vmtruncate calls ext3_truncate().  So the guarded IO might wander
> in and change the i_disksize update done by setattr.
>
> It all made me a bit dizzy and I just tossed the write_and_wait in
> instead.
>
> At the end of the day, we're waiting for guarded writes only, and we
> probably would have ended up waiting on those exact same pages in
> vmtruncate anyway.  So, I do agree we could avoid the write with more
> code, but is this really a performance critical section?

Well, not really critical but also not negligible - mainly because with
your approach we end up *submitting* new writes that could otherwise
just be canceled. Without fdatawrite(), data of short-lived files need
not ever reach the disk, similarly as in writeback mode (OK, this is
connected with the fact that you actually don't have fdatawrite() before
ext3_truncate() in ext3_delete_inode() and that's what initially puzzled
me).

> > IO finished callback will update i_disksize to the carried value if the
> > buffer is the first in the list, otherwise it will copy its value to the
> > previous member of the list.

> > 4) Do we have to call end_page_writeback() from the work queue?
That > > could make IO completion times significantly longer on a good disk array, > > couldn't it? > > My understanding is that XFS is doing something similar with the > workqueue already, without big performance problems. OK. > > There is a way to solve this, I believe, although it might > > be too hacky / complicated. We have to update i_disksize before calling > > end_page_writeback() because of truncate races and generally for > > filemap_fdatawrite() to work. So what we could do is: > > guarded_end_io(): > > update i_disksize > > call something like __mark_inode_dirty(inode, I_DIRTY_DATASYNC) but > > avoid calling ext3_dirty_inode() or somehow arrange that we immediately > > return from it. > > call end_buffer_async_write() > > queue addition of inode to the transaction / removal from orphan list > > It could, but we end up with a list of inodes that must be logged before > they can be freed. This was a problem in the past (before the > dirty_inode operation was added) because logging an inode is relatively > expensive, and we have no mechanism to throttle them. > > In the past it led to deadlocks because kswapd would try and log all > the dirty inodes, and someone else had the transaction pinned while > waiting on kswapd to find free ram. We might be able to do better > today, but I didn't want to cram that into this patch series as well. Ah, OK. I didn't know this. Anyway, if we find getting rid of the work queue is useful, we can do it later. It would be a rather local change. Honza -- Jan Kara <jack@suse.cz> SUSE Labs, CR ^ permalink raw reply [flat|nested] 34+ messages in thread
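The list scheme Jan proposes above (each guarded bh carries the i_disksize it should publish; a completing buffer publishes that value if it has no predecessor, otherwise hands it back to the predecessor) can be modeled in a few lines of userspace C. This is a hypothetical sketch with invented names, not ext3 code:

```c
#include <stddef.h>

/* Hypothetical userspace model of the proposed scheme. */
struct guarded_bh {
	long long new_disksize;		/* i_disksize to publish when this IO is done */
	struct guarded_bh *prev, *next;
};

struct inode_model {
	long long i_disksize;		/* on-disk size, only advanced when safe */
	struct guarded_bh *head, *tail;	/* guarded buffers in submission order */
};

/* Guarded writes extend the file, so appending at the tail keeps the
 * list sorted by offset. */
static void guard_add(struct inode_model *inode, struct guarded_bh *bh,
		      long long new_disksize)
{
	bh->new_disksize = new_disksize;
	bh->next = NULL;
	bh->prev = inode->tail;
	if (inode->tail)
		inode->tail->next = bh;
	else
		inode->head = bh;
	inode->tail = bh;
}

/* IO completion: with no predecessor, all IO below this buffer is done
 * and its size is safe to publish; otherwise the size is handed to the
 * predecessor and gets published when that IO completes. */
static void guard_end_io(struct inode_model *inode, struct guarded_bh *bh)
{
	if (!bh->prev)
		inode->i_disksize = bh->new_disksize;
	else if (bh->new_disksize > bh->prev->new_disksize)
		bh->prev->new_disksize = bh->new_disksize;

	/* unlink the finished buffer */
	if (bh->prev)
		bh->prev->next = bh->next;
	else
		inode->head = bh->next;
	if (bh->next)
		bh->next->prev = bh->prev;
	else
		inode->tail = bh->prev;
}
```

Completing the middle of three pending IOs publishes nothing; completing the first then publishes everything contiguous behind it, which is the same "largest safe value" behaviour the rbtree was providing.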
* Re: [PATCH] Add ext3 data=guarded mode 2009-04-20 14:42 ` Jan Kara @ 2009-04-20 14:58 ` Chris Mason 2009-04-20 15:50 ` Jan Kara 0 siblings, 1 reply; 34+ messages in thread From: Chris Mason @ 2009-04-20 14:58 UTC (permalink / raw) To: Jan Kara Cc: Linus Torvalds, Theodore Ts'o, Linux Kernel Developers List, Ext4 Developers List On Mon, 2009-04-20 at 16:42 +0200, Jan Kara wrote: > On Mon 20-04-09 10:18:25, Chris Mason wrote: > > On Mon, 2009-04-20 at 15:44 +0200, Jan Kara wrote: > > > Hi Chris, > > > > > > On Thu 16-04-09 15:42:01, Chris Mason wrote: > > > > > > > > ext3 data=ordered mode makes sure that data blocks are on disk before > > > > the metadata that references them, which avoids files full of garbage > > > > or previously deleted data after a crash. It does this by adding every dirty > > > > buffer onto a list of things that must be written before a commit. > > > > > > > > This makes every fsync write out all the dirty data on the entire FS, which > > > > has high latencies and is generally much more expensive than it needs to be. > > > > > > > > Another way to avoid exposing stale data after a crash is to wait until > > > > after the data buffers are written before updating the on-disk record > > > > of the file's size. If we crash before the data IO is done, i_size > > > > doesn't yet include the new blocks and no stale data is exposed. > > > > > > > > This patch adds the delayed i_size update to ext3, along with a new > > > > mount option (data=guarded) to enable it. The basic mechanism works like > > > > this: > > > > > > > > * Change block_write_full_page to take an end_io handler as a parameter. > > > > This allows us to make an end_io handler that queues buffer heads for > > > > a workqueue where the real work of updating the on disk i_size is done. > > > > > > > > * Add an rbtree to the in-memory ext3 inode for tracking data=guarded > > > > buffer heads that are waiting to be sent to disk. 
> > > > > > > > * Add an ext3 guarded write_end call to add buffer heads for newly > > > > allocated blocks into the rbtree. If we have a newly allocated block that is > > > > filling a hole inside i_size, this is done as an old style data=ordered write > > > > instead. > > > > > > > > * Add an ext3 guarded writepage call that uses a special buffer head > > > > end_io handler for buffers that are marked as guarded. Again, if we find > > > > newly allocated blocks filling holes, they are sent through data=ordered > > > > instead of data=guarded. > > > > > > > > * When a guarded IO finishes, kick a per-FS workqueue to do the > > > > on disk i_size updates. The workqueue function must be very careful. We > > > > only update the on disk i_size if all of the IO between the old on > > > > disk i_size and the new on disk i_size is complete. This is why an > > > > rbtree is used to track the pending buffers, that way we can verify all > > > > of the IO is actually done. The on disk i_size is incrementally updated to > > > > the largest safe value every time an IO completes. > > > > > > > > * When we start tracking guarded buffers on a given inode, we put the > > > > inode into ext3's orphan list. This way if we do crash, the file will > > > > be truncated back down to the on disk i_size and we'll free any blocks that > > > > were not completely written. The inode is removed from the orphan list > > > > only after all the guarded buffers are done. > > > > > > > > Signed-off-by: Chris Mason <chris.mason@oracle.com> > > > I've read the patch. I don't think I've got all the subtleties but before > > > diving into it more I'd like to ask why we do things in such a > > > complicated way. > > > > Thanks for reviewing things! > > > > > Maybe I'm missing some issues so let's see: > > > 1) If I got it right, hole filling goes through standard ordered mode so > > > we can ignore such writes. So why do we have a special writepage? 
It should > > > look just like writepage for ordered mode and we could just tweak > > > ext3_ordered_writepage() (probably renamed) to do: > > > if (ext3_should_order_data(inode)) > > > err = walk_page_buffers(handle, page_bufs, 0, PAGE_CACHE_SIZE, > > > NULL, journal_dirty_data_fn); > > > else > > > err = walk_page_buffers(handle, page_bufs, 0, PAGE_CACHE_SIZE, > > > NULL, journal_dirty_guarded_data_fn); > > > > That would work. My first writepage was more complex; it shrunk as the > > patch evolved. Another question is whether we want to use exactly the same > > writepage for guarded and ordered. I've always thought data=ordered > > should only order new blocks... > Hmm, true. After my last change we already don't file data buffers in > ordered_writepage() if the page is fully mapped to disk, so doing this in > all the cases is fine. Actually, I'll soon write ext3_page_mkwrite() to do > the block allocation at page fault time, so after that we can get rid of most > of the code in ext3_ordered_writepage(). Even better. > > > > 2) Why is there any RB tree after all? Since guarded are just extending > > > writes, we can have a linked list of guarded buffers. We always append > > > at the end, we update i_size if the current buffer has no predecessor in the > > > list. > > > > A guarded write is anything from write() that is past the disk i_size. > > lseek and friends mean it could happen in any order. > Are you sure? Looking at ext3_guarded_write_end(): > ... > + if (test_clear_buffer_datanew(bh)) { > + /* > + * if we're filling a hole inside i_size, we need to > + * fall back to the old style data=ordered > + */ > + if (offset < inode->i_size) { > + ret = ext3_journal_dirty_data(handle, bh); > + goto out; > + } > ... > So it seems we always do an ordered write unless we are appending / have > blocks allocated. You could have i_disksize in the check but is it really > worth it? 
IMO getting rid of the RB tree might be better ;) > I had this little todo on my list to change that to i_disksize and retest. But I think you're right, the rbtree isn't worth it at all. Btrfs needs it for different reasons, and I just had it stuck in my head when I copied the code over. I'll redo the patch without the rbtree. It won't change most of the things that make the patch complex (the orphan handling and these truncate interactions), but it'll definitely be nicer. > > > 3) Currently truncate() does filemap_write_and_wait() - is it really > > > needed? Each guarded bh could carry with it the i_disksize it should update > > > to when IO is finished. Extending truncate will just update this i_disksize > > > at the last member of the list (or update i_disksize when the list is > > > empty). > > > > > > Shortening truncate will walk the list of guarded bh's, removing from > > > the list those beyond the new i_size, then it will behave like the extending > > > truncate (it works even if the current i_disksize is larger than the new i_size). > > > Note that before we get to ext3_truncate() mm walks all the pages beyond > > > i_size and waits for page writeback, so by the time ext3_truncate() is > > > called, all the IO is finished and dirty pages are canceled. > > The problem here was the disk i_size being updated by ext3_setattr > > before vmtruncate calls ext3_truncate(). So the guarded IO > > might wander in and change the i_disksize update done by setattr. > > > > It all made me a bit dizzy and I just tossed the write_and_wait in > > instead. > > > > At the end of the day, we're waiting for guarded writes only, and we > > probably would have ended up waiting on those exact same pages in > > vmtruncate anyway. So, I do agree we could avoid the write with more > > code, but is this really a performance critical section? 
> Well, not really critical but also not negligible - mainly because with > your approach we end up *submitting* new writes that could just be canceled > otherwise. Without fdatawrite(), data of short-lived files need not ever > reach the disk, much as in writeback mode (OK, this is connected with > the fact that you actually don't have fdatawrite() before ext3_truncate() > in ext3_delete_inode() and that's what initially puzzled me). When we're going down to zero, we don't need it. The i_disksize gets updated again by ext3_truncate. I'll toss in a special case for that before the write_and_wait. -chris ^ permalink raw reply [flat|nested] 34+ messages in thread
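The write_end decision the thread converges on above (ordered for hole fills, guarded for extending writes, with the comparison done against the on-disk size) reduces to a small classifier. A hypothetical sketch using Jan's suggested i_disksize comparison; the names are invented, and the posted patch as quoted actually compares against inode->i_size:

```c
enum journal_mode { JOURNAL_NONE, JOURNAL_ORDERED, JOURNAL_GUARDED };

/* Decide how one dirty buffer is journaled in a guarded-mode write_end.
 * 'datanew' mirrors test_clear_buffer_datanew(): the block was just
 * allocated by this write. */
static enum journal_mode classify_buffer(int datanew, long long offset,
					 long long i_disksize)
{
	if (!datanew)
		return JOURNAL_NONE;	/* block already existed on disk */
	if (offset < i_disksize)
		return JOURNAL_ORDERED;	/* filling a hole inside the on-disk size */
	return JOURNAL_GUARDED;		/* extending write past the on-disk size */
}
```

With this split, everything that can expose stale data after a crash is either ordered (hole fills) or held back from i_disksize (appends), and rewrites of existing data need no ordering at all.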
* Re: [PATCH] Add ext3 data=guarded mode 2009-04-20 14:58 ` Chris Mason @ 2009-04-20 15:50 ` Jan Kara 0 siblings, 0 replies; 34+ messages in thread From: Jan Kara @ 2009-04-20 15:50 UTC (permalink / raw) To: Chris Mason Cc: Linus Torvalds, Theodore Ts'o, Linux Kernel Developers List, Ext4 Developers List On Mon 20-04-09 10:58:10, Chris Mason wrote: > > > > 3) Currently truncate() does filemap_write_and_wait() - is it really > > > > needed? Each guarded bh could carry with itself i_disksize it should update > > > > to when IO is finished. Extending truncate will just update this i_disksize > > > > at the last member of the list (or update i_disksize when the list is > > > > empty). > > > > > > > > Shortening truncate will walk the list of guarded bh's, removing from > > > > the list those beyond new i_size, then it will behave like the extending > > > > truncate (it works even if current i_disksize is larger than new i_size). > > > > Note, that before we get to ext3_truncate() mm walks all the pages beyond > > > > i_size and waits for page writeback so by the time ext3_truncate() is > > > > called, all the IO is finished and dirty pages are canceled. > > > > > > The problem here was the disk i_size being updated by ext3_setattr > > > before the vmtruncate calls calls ext3_truncate(). So the guarded IO > > > might wander in and change the i_disksize update done by setattr. > > > > > > It all made me a bit dizzy and I just tossed the write_and_wait in > > > instead. > > > > > > At the end of the day, we're waiting for guarded writes only, and we > > > probably would have ended up waiting on those exact same pages in > > > vmtruncate anyway. So, I do agree we could avoid the write with more > > > code, but is this really a performance critical section? > > Well, not really critical but also not negligible - mainly because with > > your approach we end up *submitting* new writes we could just be canceled > > otherwise. 
Without fdatawrite(), data of short-lived files need not ever > > reach the disk, much as in writeback mode (OK, this is connected with > > the fact that you actually don't have fdatawrite() before ext3_truncate() > > in ext3_delete_inode() and that's what initially puzzled me). > > When we're going down to zero, we don't need it. The i_disksize gets > updated again by ext3_truncate. I'll toss in a special case for that > before the write_and_wait. I'm sorry, but why does truncate to zero not need it? If we assume that IO completion can still happen while ext3_truncate() is running, which is what you're afraid of, then I don't see a big difference between truncate to zero, truncate to i_disksize (which is from where you do fdatawrite) or truncate to anything else. Also two other comments: ... @@ -915,14 +1042,19 @@ int ext3_get_blocks_handle(handle_t *handle, struct inode *inode, * i_disksize growing is protected by truncate_mutex. Don't forget * to * protect it if you're about to implement concurrent * ext3_get_block() -bzzz + * + * FIXME, I think this only needs to extend the disk i_size when + * we're filling holes that came from using ftruncate to increase + * i_size. Need to verify. */ - if (!err && extend_disksize && inode->i_size > ei->i_disksize) - ei->i_disksize = inode->i_size; + if (!ext3_should_guard_data(inode) && !err && extend_disksize) + maybe_update_disk_isize(inode, inode->i_size); This is kind of confusing. extend_disksize is used only from ext3_getblk() which is called only for directories, so the first condition is always true - and if it weren't sometime in the future, you'd have a hard time tracking why i_disksize is not updated. So I'd rather add something like WARN_ON(extend_disksize && ext3_should_guard_data(inode)); if you wish to keep the check. 
Also I think you can have races between direct IO writing to the file (updating i_disksize) and your completion handler updating i_disksize - direct IO plays tricks with i_disksize to truncate allocated blocks in case of a failed write... It's all nasty ;( Probably we should somehow set clear rules about i_disksize updates and clean up the code to obey them. Otherwise we'll be hunting nasty races for another two years... Honza -- Jan Kara <jack@suse.cz> SUSE Labs, CR ^ permalink raw reply [flat|nested] 34+ messages in thread
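One way to get the "clear rules" Jan asks for is a single forward-only helper that every i_disksize writer calls under the same lock. The helper name below is taken from the hunk quoted above, but this userspace body is an assumption for illustration, not the actual ext3 code:

```c
#include <pthread.h>

struct model_inode {
	long long i_disksize;
	pthread_mutex_t truncate_mutex;	/* stand-in for ext3's truncate_mutex */
};

/* Only ever move the on-disk size forward, and only under the lock.
 * A rule like this keeps a stale guarded completion from publishing a
 * smaller size, but only helps if *every* writer goes through it. */
static void maybe_update_disk_isize(struct model_inode *inode,
				    long long new_size)
{
	pthread_mutex_lock(&inode->truncate_mutex);
	if (new_size > inode->i_disksize)
		inode->i_disksize = new_size;
	pthread_mutex_unlock(&inode->truncate_mutex);
}
```

Note that the direct IO failure case Jan mentions *shrinks* i_disksize, so it would need a separate shrink path taken under the same lock; a forward-only helper alone does not cover that race.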
* Re: [PATCH RFC] ext3 data=guarded v3 2009-04-15 17:22 [PATCH RFC] ext3 data=guarded v3 Chris Mason 2009-04-15 17:22 ` [PATCH 1/3] Export filemap_write_and_wait_range Chris Mason @ 2009-04-15 19:10 ` Eric Sandeen 2009-04-15 20:35 ` Linus Torvalds 2009-04-16 11:39 ` Mike Galbraith 3 siblings, 0 replies; 34+ messages in thread From: Eric Sandeen @ 2009-04-15 19:10 UTC (permalink / raw) To: Chris Mason Cc: Jan Kara, Linus Torvalds, Theodore Ts'o, Linux Kernel Developers List, Ext4 Developers List Chris Mason wrote: > Hello everyone, > > Here is another version of the data=guarded work for ext3. The main > difference between this code and yesterday's is the guarded writepage > function now sends any newly allocated block through the old data=ordered code. > > This is important because at the time we're walking the buffers, the page > may be unlocked, so we can't trust anything inside the page. In general, > any allocation done by writepage is to fill a hole, so the old data=ordered > is what we want anyway. > > This passed a longer stress test and generally seems to be working. I > don't think anyone would recommend it as a default for 2.6.30, but it > may be a good idea to have a review party and decide if it is safe enough > to include so people can experiment with it. Chris, thanks for getting this going. I think this is a great idea, as it gives us most of the performance benefits of writeback without the security issues. I'm under the gun on some other deadlines at the moment so will have to do a detailed review later, but this seems like a very good approach, and in my testing it does indeed solve the security problem of writeback mode. 
-Eric > Overall diffstat of the series: > > fs/buffer.c | 45 ++- > fs/ext3/Makefile | 3 > fs/ext3/fsync.c | 12 > fs/ext3/inode.c | 546 +++++++++++++++++++++++++++++++++++++++++++- > fs/ext3/namei.c | 3 > fs/ext3/ordered-data.c | 318 +++++++++++++++++++++++++ > fs/ext3/super.c | 48 +++ > fs/jbd/transaction.c | 1 > include/linux/buffer_head.h | 3 > include/linux/ext3_fs.h | 33 ++ > include/linux/ext3_fs_i.h | 44 +++ > include/linux/ext3_fs_sb.h | 6 > include/linux/ext3_jbd.h | 11 > include/linux/jbd.h | 10 > mm/filemap.c | 1 > > -chris > > > -- > To unsubscribe from this list: send the line "unsubscribe linux-ext4" in > the body of a message to majordomo@vger.kernel.org > More majordomo info at http://vger.kernel.org/majordomo-info.html ^ permalink raw reply [flat|nested] 34+ messages in thread
* Re: [PATCH RFC] ext3 data=guarded v3 2009-04-15 17:22 [PATCH RFC] ext3 data=guarded v3 Chris Mason 2009-04-15 17:22 ` [PATCH 1/3] Export filemap_write_and_wait_range Chris Mason 2009-04-15 19:10 ` [PATCH RFC] ext3 data=guarded v3 Eric Sandeen @ 2009-04-15 20:35 ` Linus Torvalds 2009-04-15 21:09 ` Theodore Tso ` (2 more replies) 2009-04-16 11:39 ` Mike Galbraith 3 siblings, 3 replies; 34+ messages in thread From: Linus Torvalds @ 2009-04-15 20:35 UTC (permalink / raw) To: Chris Mason Cc: Jan Kara, Theodore Ts'o, Linux Kernel Developers List, Ext4 Developers List On Wed, 15 Apr 2009, Chris Mason wrote: > > Here is another version of the data=guarded work for ext3. The main > difference between this code and yesterday's is the guarded writepage > function now sends any newly allocated block through the old data=ordered code. I'm inclined to apply the first two patches as infrastructure, since they seem to make sense regardless of data=ordered. The ability to get a callback when IO ends sounds like something that a number of cases might find intriguing, and it's obviously how the actual IO has worked internally anyway. Comments? Linus ^ permalink raw reply [flat|nested] 34+ messages in thread
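The infrastructure Linus refers to is patch 2, which gives block_write_full_page an end_io handler parameter instead of hard-coding one. The general shape of such an API, sketched in plain C; the names and the NULL-means-default convention here are assumptions for illustration, not the patch itself:

```c
#include <stddef.h>

typedef void (*end_io_fn)(void *bh, int uptodate);

static int default_end_io_calls;

/* Stand-in for the existing completion path (end_buffer_async_write). */
static void default_end_io(void *bh, int uptodate)
{
	(void)bh;
	(void)uptodate;
	default_end_io_calls++;
}

/* Callers that don't care pass NULL and keep today's behaviour; a
 * data=guarded caller passes a handler that queues the buffer for the
 * workqueue doing the on-disk i_size update. */
static int write_full_page(void *page, end_io_fn handler)
{
	if (!handler)
		handler = default_end_io;
	/* ...submit the IO; when it completes, the block layer invokes: */
	handler(page, 1);
	return 0;
}
```

The appeal as infrastructure is exactly what Linus notes: the completion hook already exists internally, so exposing it as a parameter costs nothing for existing callers while letting new ones observe IO completion.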
* Re: [PATCH RFC] ext3 data=guarded v3 2009-04-15 20:35 ` Linus Torvalds @ 2009-04-15 21:09 ` Theodore Tso 2009-04-16 8:44 ` Jan Kara 2009-04-16 18:09 ` Nick Piggin 2 siblings, 0 replies; 34+ messages in thread From: Theodore Tso @ 2009-04-15 21:09 UTC (permalink / raw) To: Linus Torvalds Cc: Chris Mason, Jan Kara, Linux Kernel Developers List, Ext4 Developers List On Wed, Apr 15, 2009 at 01:35:23PM -0700, Linus Torvalds wrote: > I'm inclined to apply the first two patches as infrastructure, since they > seem to make sense regardless of data=ordered. The ability to get a > callback when IO ends sounds like something that a number of cases might > find intriguing, and it's obviously how the actual IO has worked > internally anyway. The first two patches look good to me. I haven't had time to go through the last patch carefully (and that's the most complicated one of course). They don't currently have any in-kernel users, so it fails the traditional test for inclusion, but the basic functionality they provide makes a lot of sense to me. - Ted ^ permalink raw reply [flat|nested] 34+ messages in thread
* Re: [PATCH RFC] ext3 data=guarded v3 2009-04-15 20:35 ` Linus Torvalds 2009-04-15 21:09 ` Theodore Tso @ 2009-04-16 8:44 ` Jan Kara 2009-04-16 18:09 ` Nick Piggin 2 siblings, 0 replies; 34+ messages in thread From: Jan Kara @ 2009-04-16 8:44 UTC (permalink / raw) To: Linus Torvalds Cc: Chris Mason, Theodore Ts'o, Linux Kernel Developers List, Ext4 Developers List On Wed 15-04-09 13:35:23, Linus Torvalds wrote: > > > On Wed, 15 Apr 2009, Chris Mason wrote: > > > > Here is another version of the data=guarded work for ext3. The main > > difference between this code and yesterday's is the guarded writepage > > function now sends any newly allocated block through the old data=ordered code. > > I'm inclined to apply the first two patches as infrastructure, since they > seem to make sense regardless of data=ordered. The ability to get a > callback when IO ends sounds like something that a number of cases might > find intriguing, and it's obviously how the actual IO has worked > internally anyway. > > Comments? Yes, the first two patches look fine to me. I tried to review the third patch yesterday but my mind has blown up when trying to track all the possible interactions and I started doing something else to preserve last bits of sanity ;). Will retry later... Honza -- Jan Kara <jack@suse.cz> SUSE Labs, CR ^ permalink raw reply [flat|nested] 34+ messages in thread
* Re: [PATCH RFC] ext3 data=guarded v3 2009-04-15 20:35 ` Linus Torvalds 2009-04-15 21:09 ` Theodore Tso 2009-04-16 8:44 ` Jan Kara @ 2009-04-16 18:09 ` Nick Piggin 2 siblings, 0 replies; 34+ messages in thread From: Nick Piggin @ 2009-04-16 18:09 UTC (permalink / raw) To: Linus Torvalds, linux-fsdevel Cc: Chris Mason, Jan Kara, Theodore Ts'o, Linux Kernel Developers List, Ext4 Developers List On Thursday 16 April 2009 06:35:23 Linus Torvalds wrote: > > On Wed, 15 Apr 2009, Chris Mason wrote: > > > > Here is another version of the data=guarded work for ext3. The main > > difference between this code and yesterday's is the guarded writepage > > function now sends any newly allocated block through the old data=ordered code. > > I'm inclined to apply the first two patches as infrastructure, since they > seem to make sense regardless of data=ordered. The ability to get a > callback when IO ends sounds like something that a number of cases might > find intriguing, and it's obviously how the actual IO has worked > internally anyway. > > Comments? I think also fscache really should be using proper read completed notifications rather than hijacking the page lock waitqueue for this. ^ permalink raw reply [flat|nested] 34+ messages in thread
* Re: [PATCH RFC] ext3 data=guarded v3 2009-04-15 17:22 [PATCH RFC] ext3 data=guarded v3 Chris Mason ` (2 preceding siblings ...) 2009-04-15 20:35 ` Linus Torvalds @ 2009-04-16 11:39 ` Mike Galbraith 2009-04-16 11:40 ` Mike Galbraith 2009-04-16 14:56 ` Chris Mason 3 siblings, 2 replies; 34+ messages in thread From: Mike Galbraith @ 2009-04-16 11:39 UTC (permalink / raw) To: Chris Mason Cc: Jan Kara, Linus Torvalds, Theodore Ts'o, Linux Kernel Developers List, Ext4 Developers List On Wed, 2009-04-15 at 13:22 -0400, Chris Mason wrote: > Hello everyone, Greetings, > This passed a longer stress test and generally seems to be working. I > don't think anyone would recommend it as a default for 2.6.30, but it > may be a good idea to have a review party and decide if it is safe enough > to include so people can experiment with it. I know you didn't say RFT, but I did some anyway, and found a 100% repeatable corruption scenario wrt git+umount. /dev/sdf3 on /media/root type ext3 (rw,_netdev,noatime,data=guarded,acl,user_xattr) <user git> cd /media/root/home/git/testo git init cp /usr/local/src/kernel/linux-2.6.30.git/.git/config .git/. 
git fetch git pull cd <user root> umount /media/root mount /media/root <user git> cd /media/root/home/git/testo git checkout -f git@marge:..git/testo> git checkout -f error: packfile .git/objects/pack/pack-5897f5ab55217236982c43fd644e854d0c42fa97.pack does not match index error: packfile .git/objects/pack/pack-5897f5ab55217236982c43fd644e854d0c42fa97.pack cannot be accessed error: packfile .git/objects/pack/pack-5897f5ab55217236982c43fd644e854d0c42fa97.pack does not match index error: packfile .git/objects/pack/pack-5897f5ab55217236982c43fd644e854d0c42fa97.pack cannot be accessed fatal: You are on a branch yet to be born git@marge:..git/testo> md5sum .git/objects/pack/pack-5897f5ab55217236982c43fd644e854d0c42fa97.pack .git.good/objects/pack/pack-5897f5ab55217236982c43fd644e854d0c42fa97.pack a54de2e5ed5f2dd8925169b404463261 .git/objects/pack/pack-5897f5ab55217236982c43fd644e854d0c42fa97.pack 88ca6b0eac30ef18e623b3a6f6490299 .git.good/objects/pack/pack-5897f5ab55217236982c43fd644e854d0c42fa97.pack t@marge:..git/testo> md5sum .git/index .git.good/index f2f9a1ffd2544841985f112e0831d2a5 .git/index 1f88bfbe71adfd649254e3a4bd91e7fa .git.good/index git version 1.6.2.3 .git/config [core] repositoryformatversion = 0 filemode = true bare = false logallrefupdates = true [remote "master"] url = git://localhost/~git/linux-2.6 fetch = +refs/heads/master:refs/remotes/master/master [branch "master"] remote = master merge = refs/heads/master If I move a .git created while mounted data=writeback into this directory, I can mount/umount/checkout to my hearts content while mounted data=guarded, but if I do the fetch/pull/umount while data=guarded, after umount/mount, the repo is corrupt. Prior to umount, all seems fine, trees build, git fsck is happy etc. -Mike ^ permalink raw reply [flat|nested] 34+ messages in thread
* Re: [PATCH RFC] ext3 data=guarded v3 2009-04-16 11:39 ` Mike Galbraith @ 2009-04-16 11:40 ` Mike Galbraith 2009-04-16 14:56 ` Chris Mason 1 sibling, 0 replies; 34+ messages in thread From: Mike Galbraith @ 2009-04-16 11:40 UTC (permalink / raw) To: Chris Mason Cc: Jan Kara, Linus Torvalds, Theodore Ts'o, Linux Kernel Developers List, Ext4 Developers List (btw, patches were plugged into .tip.this_morning) ^ permalink raw reply [flat|nested] 34+ messages in thread
* Re: [PATCH RFC] ext3 data=guarded v3 2009-04-16 11:39 ` Mike Galbraith @ 2009-04-16 14:56 ` Chris Mason 2009-04-16 17:12 ` Chris Mason 2009-04-16 18:00 ` Mike Galbraith 1 sibling, 2 replies; 34+ messages in thread From: Chris Mason @ 2009-04-16 14:56 UTC (permalink / raw) To: Mike Galbraith Cc: Jan Kara, Linus Torvalds, Theodore Ts'o, Linux Kernel Developers List, Ext4 Developers List On Thu, 2009-04-16 at 13:39 +0200, Mike Galbraith wrote: > On Wed, 2009-04-15 at 13:22 -0400, Chris Mason wrote: > > Hello everyone, > > Greetings, > > > This passed a longer stress test and generally seems to be working. I > > don't think anyone would recommend it as a default for 2.6.30, but it > > may be a good idea to have a review party and decide if it is safe enough > > to include so people can experiment with it. > > I know you didn't say RFT, but I did some anyway, and found a 100% > repeatable corruption scenario wrt git+umount. > Well, that's a surprise. I can trigger it here too, trying to figure out how this is different from the fsx and fsstress hammering. It looks like git is just doing good old fashioned writes, so I must be losing one or two of them. -chris ^ permalink raw reply [flat|nested] 34+ messages in thread
* Re: [PATCH RFC] ext3 data=guarded v3 2009-04-16 14:56 ` Chris Mason @ 2009-04-16 17:12 ` Chris Mason 2009-04-16 18:25 ` Mike Galbraith 2009-04-16 18:37 ` Linus Torvalds 1 sibling, 2 replies; 34+ messages in thread From: Chris Mason @ 2009-04-16 17:12 UTC (permalink / raw) To: Mike Galbraith Cc: Jan Kara, Linus Torvalds, Theodore Ts'o, Linux Kernel Developers List, Ext4 Developers List On Thu, 2009-04-16 at 10:56 -0400, Chris Mason wrote: > On Thu, 2009-04-16 at 13:39 +0200, Mike Galbraith wrote: > > On Wed, 2009-04-15 at 13:22 -0400, Chris Mason wrote: > > > Hello everyone, > > > > Greetings, > > > > > This passed a longer stress test and generally seems to be working. I > > > > don't think anyone would recommend it as a default for 2.6.30, but it > > > > may be a good idea to have a review party and decide if it is safe enough > > > > to include so people can experiment with it. > > > > I know you didn't say RFT, but I did some anyway, and found a 100% > > repeatable corruption scenario wrt git+umount. > > > > Well, that's a surprise. I can trigger it here too, trying to figure > out how this is different from the fsx and fsstress hammering. It looks > like git is just doing good old fashioned writes, so I must be losing > one or two of them. Ah ok, it is just a missed i_size update. Basically because file_write doesn't wait for page writeback to finish, someone can be updating i_size at the same time the end_io handler for the last page is running. Git triggers this when it does the sha1flush just before closing the file. I'm testing the fixes here, will resend in a little bit. Thanks! -chris ^ permalink raw reply [flat|nested] 34+ messages in thread
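The race Chris describes, an append moving i_size while the end_io handler for the last page runs, goes away if each guarded IO records the i_size it actually covers at submission time instead of reading the live i_size at completion. A hypothetical userspace illustration of that idea, not the fix Chris actually tested:

```c
struct guarded_io {
	long long size_at_submit;	/* how far this IO actually guards */
};

static void guard_submit(struct guarded_io *io, long long i_size)
{
	io->size_at_submit = i_size;	/* snapshot before the IO is issued */
}

/* A racy completion would publish the live i_size, exposing bytes a
 * concurrent append added but whose data is not yet on disk.
 * Publishing the snapshot only ever advances the on-disk size over
 * data this IO really wrote. */
static void guard_io_done(struct guarded_io *io, long long *i_disksize)
{
	if (io->size_at_submit > *i_disksize)
		*i_disksize = io->size_at_submit;
}
```

In the sketch, an append that grows the live i_size after submission has no effect on what this completion publishes; the appended bytes are covered by their own, later guarded IO.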
* Re: [PATCH RFC] ext3 data=guarded v3 2009-04-16 17:12 ` Chris Mason @ 2009-04-16 18:25 ` Mike Galbraith 2009-04-16 18:37 ` Linus Torvalds 1 sibling, 0 replies; 34+ messages in thread From: Mike Galbraith @ 2009-04-16 18:25 UTC (permalink / raw) To: Chris Mason Cc: Jan Kara, Linus Torvalds, Theodore Ts'o, Linux Kernel Developers List, Ext4 Developers List On Thu, 2009-04-16 at 13:12 -0400, Chris Mason wrote: > On Thu, 2009-04-16 at 10:56 -0400, Chris Mason wrote: > > On Thu, 2009-04-16 at 13:39 +0200, Mike Galbraith wrote: > > > On Wed, 2009-04-15 at 13:22 -0400, Chris Mason wrote: > > > > Hello everyone, > > > > > > Greetings, > > > > > > > This passed a longer stress test and generally seems to be working. I > > > > don't think anyone would recommend it as a default for 2.6.30, but it > > > > may be a good idea to have a review party and decide if it is safe enough > > > > to include so people can experiment with it. > > > > > > I know you didn't say RFT, but I did some anyway, and found a 100% > > > repeatable corruption scenario wrt git+umount. > > > > > > > Well, that's a surprise. I can trigger it here too, trying to figure > > out how this is different from the fsx and fsstress hammering. It looks > > like git is just going good old fashioned writes, so I must be losing > > one or two of them. > > Ah ok, it is just a missed i_size update. Basically because file_write > doesn't wait for page writeback to finish, someone can be updating > i_size at the same time the end_io handler for the last page is running. > > Git triggers this when it does the sha1flush just before closing the > file. A brief note (preferably with 8x10 color glossies with circles and arrows) on how you figured that out so quick would be a good LKML archive investment :) -Mike ^ permalink raw reply [flat|nested] 34+ messages in thread
* Re: [PATCH RFC] ext3 data=guarded v3 2009-04-16 17:12 ` Chris Mason 2009-04-16 18:25 ` Mike Galbraith @ 2009-04-16 18:37 ` Linus Torvalds 2009-04-16 19:38 ` Chris Mason 1 sibling, 1 reply; 34+ messages in thread From: Linus Torvalds @ 2009-04-16 18:37 UTC (permalink / raw) To: Chris Mason Cc: Mike Galbraith, Jan Kara, Theodore Ts'o, Linux Kernel Developers List, Ext4 Developers List On Thu, 16 Apr 2009, Chris Mason wrote: > > Ah ok, it is just a missed i_size update. Basically because file_write > doesn't wait for page writeback to finish, someone can be updating > i_size at the same time the end_io handler for the last page is running. > > Git triggers this when it does the sha1flush just before closing the > file. Can you say exactly what the IO pattern is? One of the original git design issues was to actually never _ever_ do anything even half-way strange in the filesystem patterns, exactly because I've seen so many filesystem bugs over the years. Now, it turns out that "original design intent" and "actual code" then don't always match, and git did some things that are unusual and triggered bugs. Example: in order to be extra safe, git does "fchown()" after doing all the writes to the file descriptor just before closing it. I wanted git to make it hard to corrupt things by mistake, and marking all the files that only get written once (which is most of them) read-only as soon as possible seemed to be a great safety feature. Except, in the process it triggers a network filesystem bug where earlier writes were still writeback cached and data hadn't made it to the server yet, and then the client would do the whole "mark it read-only" before the writes had even been done. Oops. We had a few other issues with just renaming files around (basic rule: only rename files _within_ one directory if you want to avoid filesystem bugs) and with using "pread/pwrite" (basic rule: pread/pwrite is unusual, and is apparently buggy on some operating systems. So avoid them). 
Anyway, what was the exact pattern that caused this to show, and maybe I can find yet another place where git could just be even more anally safe by not doing anything half-way odd? Linus ^ permalink raw reply [flat|nested] 34+ messages in thread
* Re: [PATCH RFC] ext3 data=guarded v3
  2009-04-16 18:37           ` Linus Torvalds
@ 2009-04-16 19:38             ` Chris Mason
  0 siblings, 0 replies; 34+ messages in thread
From: Chris Mason @ 2009-04-16 19:38 UTC (permalink / raw)
To: Linus Torvalds
Cc: Mike Galbraith, Jan Kara, Theodore Ts'o,
	Linux Kernel Developers List, Ext4 Developers List

On Thu, 2009-04-16 at 11:37 -0700, Linus Torvalds wrote:
> 
> On Thu, 16 Apr 2009, Chris Mason wrote:
> > 
> > Ah ok, it is just a missed i_size update.  Basically because file_write
> > doesn't wait for page writeback to finish, someone can be updating
> > i_size at the same time the end_io handler for the last page is running.
> > 
> > Git triggers this when it does the sha1flush just before closing the
> > file.
> 
> Can you say exactly what the IO pattern is?
> 
> One of the original git design issues was to actually never _ever_ do 
> anything even half-way strange in the filesystem patterns, exactly because 
> I've seen so many filesystem bugs over the years.
> 

I wish it were git doing something fancy, but this is a good old
fashioned bug in the guarded code.  I was missing the on disk update
when you appended onto the end of the file without adding a new block.

The race in my code I mentioned was there too, but when doing small
appends to the file without other writes, the size of the race window
is infinite.

The bug was usually hidden because the updated i_size was usually copied
out when the orphan link was deleted, but small appends with no other
file traffic didn't trigger that code.  This is the old "works fine
under load but fails when lightly used" problem.

Git made the whole thing much easier to track down.  I didn't have to
read the git code to know the packed files from the good data=ordered
and bad data=guarded run were supposed to be the same, the sha was right
in the filename ;)  Just a quick diff of the hexdumps made it clear
where the bug had to be.

-chris

^ permalink raw reply	[flat|nested] 34+ messages in thread
* Re: [PATCH RFC] ext3 data=guarded v3
  2009-04-16 14:56         ` Chris Mason
  2009-04-16 17:12           ` Chris Mason
@ 2009-04-16 18:00           ` Mike Galbraith
  1 sibling, 0 replies; 34+ messages in thread
From: Mike Galbraith @ 2009-04-16 18:00 UTC (permalink / raw)
To: Chris Mason
Cc: Jan Kara, Linus Torvalds, Theodore Ts'o,
	Linux Kernel Developers List, Ext4 Developers List

On Thu, 2009-04-16 at 10:56 -0400, Chris Mason wrote:
> On Thu, 2009-04-16 at 13:39 +0200, Mike Galbraith wrote:
> > On Wed, 2009-04-15 at 13:22 -0400, Chris Mason wrote:
> > > Hello everyone,
> > 
> > Greetings,
> > 
> > > This passed a longer stress test and generally seems to be working.  I
> > > don't think anyone would recommend it as a default for 2.6.30, but it
> > > may be a good idea to have a review party and decide if it is safe enough
> > > to include so people can experiment with it.
> > 
> > I know you didn't say RFT, but I did some anyway, and found a 100%
> > repeatable corruption scenario wrt git+umount.
> 
> Well, that's a surprise.  I can trigger it here too, trying to figure
> out how this is different from the fsx and fsstress hammering.  It looks
> like git is just doing good old fashioned writes, so I must be losing
> one or two of them.

Yeah, I spent 6 hours doing a lot of this and that, and all was peachy.
This was the only hiccup, and I totally accidentally hit it.

	-Mike

^ permalink raw reply	[flat|nested] 34+ messages in thread
end of thread, other threads:[~2009-04-20 15:50 UTC | newest]
Thread overview: 34+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2009-04-15 17:22 [PATCH RFC] ext3 data=guarded v3 Chris Mason
2009-04-15 17:22 ` [PATCH 1/3] Export filemap_write_and_wait_range Chris Mason
2009-04-15 17:22 ` [PATCH 2/3] Add block_write_full_page_endio for passing endio handler Chris Mason
2009-04-15 17:22 ` [PATCH 3/3] Add ext3 data=guarded mode Chris Mason
2009-04-16 19:42 ` [PATCH] " Chris Mason
2009-04-17 11:04 ` Mike Galbraith
2009-04-17 18:09 ` Amit Shah
2009-04-17 20:13 ` Theodore Tso
2009-04-18 6:03 ` Amit Shah
[not found] ` <20090418060312.GA10943@amit-x200.pnq.redhat.com>
2009-04-18 7:28 ` Mike Galbraith
2009-04-19 6:24 ` Amit Shah
2009-04-20 9:07 ` Mike Galbraith
2009-04-20 9:26 ` Jan Kara
2009-04-20 12:15 ` Mike Galbraith
2009-04-20 12:56 ` Amit Shah
2009-04-20 13:06 ` Mike Galbraith
2009-04-20 13:44 ` Jan Kara
2009-04-20 14:18 ` Chris Mason
2009-04-20 14:42 ` Jan Kara
2009-04-20 14:58 ` Chris Mason
2009-04-20 15:50 ` Jan Kara
2009-04-15 19:10 ` [PATCH RFC] ext3 data=guarded v3 Eric Sandeen
2009-04-15 20:35 ` Linus Torvalds
2009-04-15 21:09 ` Theodore Tso
2009-04-16 8:44 ` Jan Kara
2009-04-16 18:09 ` Nick Piggin
2009-04-16 11:39 ` Mike Galbraith
2009-04-16 11:40 ` Mike Galbraith
2009-04-16 14:56 ` Chris Mason
2009-04-16 17:12 ` Chris Mason
2009-04-16 18:25 ` Mike Galbraith
2009-04-16 18:37 ` Linus Torvalds
2009-04-16 19:38 ` Chris Mason
2009-04-16 18:00 ` Mike Galbraith