[PATCH v4 0/3] dioread

* [PATCH v4 0/3] dioread_nolock patch
@ 2010-01-15 19:30 Theodore Ts'o
  2010-01-15 19:30 ` [PATCH v4 1/3] ext4: mechanical change on dio get_block code in prepare for it to be used by buffer write Theodore Ts'o
                   ` (5 more replies)
  0 siblings, 6 replies; 23+ messages in thread
From: Theodore Ts'o @ 2010-01-15 19:30 UTC (permalink / raw)
  To: Ext4 Developers List; +Cc: Theodore Ts'o

I've worked with Jiaying to ready this patch for submission.

It's currently a mount option for maximum safety, but after we do some
benchmarking to make sure it doesn't degrade performance for buffered
writes, we may want to make this the default.  One really nice side
effect of this patch is that it effectively gives us "guarded mode" by
default, since the blocks are marked as uninitialized and only converted
to be initialized when the I/O has completed for both buffered and
direct I/O writes now.  This means that we could possibly change the
default mode to be data=writeback if the extents feature is enabled,
since data=ordered would only be needed for safety when writing new
old-style indirect blocks.
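
As a rough illustration of the "guarded mode" effect described above, here
is a toy model in plain C.  It is not ext4 code: struct toy_extent and the
toy_* helpers are made-up names, and a single flag stands in for the
uninitialized-extent bit.  It only shows the ordering that matters: the
extent is converted after the write completes, so a reader in between gets
zeros rather than stale blocks.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Toy model only: these names are illustrative, not ext4 structures. */
struct toy_extent {
	bool uninit;	/* allocated, but new data not yet known to be on disk */
	char data[8];	/* stands in for the on-disk blocks */
};

/* Allocation before a write: the extent starts out uninitialized. */
static void toy_alloc(struct toy_extent *ex)
{
	ex->uninit = true;
	memset(ex->data, 'S', sizeof(ex->data));	/* 'S' = stale contents */
}

/* The write puts new data in the blocks; the flag is left untouched. */
static void toy_write(struct toy_extent *ex, char fill)
{
	memset(ex->data, fill, sizeof(ex->data));
}

/* I/O completion: only now is the extent marked initialized. */
static void toy_end_io(struct toy_extent *ex)
{
	ex->uninit = false;
}

/* A read of an uninitialized extent returns zeros, never block contents. */
static void toy_read(const struct toy_extent *ex, char *out, size_t len)
{
	if (len > sizeof(ex->data))
		len = sizeof(ex->data);
	if (ex->uninit)
		memset(out, 0, len);
	else
		memcpy(out, ex->data, len);
}

/* Walk through the sequence: allocate, write, read, complete, read. */
static void toy_check(void)
{
	struct toy_extent ex;
	char buf[sizeof(ex.data)];

	toy_alloc(&ex);
	toy_write(&ex, 'N');			/* write issued, not yet completed */
	toy_read(&ex, buf, sizeof(buf));
	assert(buf[0] == 0);			/* reader sees zeros */

	toy_end_io(&ex);			/* completion converts the extent */
	toy_read(&ex, buf, sizeof(buf));
	assert(buf[0] == 'N');			/* reader now sees the new data */
}
```

If the conversion happened at allocation time instead, the first read in
toy_check() could return whatever was in the blocks before the write
landed, which is the exposure data=ordered otherwise guards against.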

The plan is to merge this for 2.6.34.  I've looked this over pretty
carefully, but another pair of eyes would be appreciated, especially if
we make this the default.  Beyond the advantages of being able to use
data=writeback, I believe this should be a major win for database
workloads.

					- Ted

Theodore Ts'o (3):
  ext4: mechanical change on dio get_block code in prepare for it to be
    used by buffer write
  ext4: use ext4_get_block_write in buffer write
  ext4: Use direct_IO_no_locking in ext4 dio read.

 fs/ext4/ext4.h      |   28 +++++---
 fs/ext4/ext4_jbd2.h |   24 +++++++
 fs/ext4/extents.c   |   36 +++++-----
 fs/ext4/fsync.c     |    2 +-
 fs/ext4/inode.c     |  192 +++++++++++++++++++++++++++++++++-----------------
 fs/ext4/super.c     |   32 +++++++--
 6 files changed, 217 insertions(+), 97 deletions(-)

* [PATCH v4 1/3] ext4: mechanical change on dio get_block code in prepare for it to be used by buffer write
  2010-01-15 19:30 [PATCH v4 0/3] dioread_nolock patch Theodore Ts'o
@ 2010-01-15 19:30 ` Theodore Ts'o
  2010-01-17 14:36   ` Aneesh Kumar K. V
  2010-01-15 19:30 ` [PATCH v4 2/3] ext4: use ext4_get_block_write in " Theodore Ts'o
                   ` (4 subsequent siblings)
  5 siblings, 1 reply; 23+ messages in thread
From: Theodore Ts'o @ 2010-01-15 19:30 UTC (permalink / raw)
  To: Ext4 Developers List; +Cc: Theodore Ts'o, Jiaying Zhang

Renaming the dio block allocation flags, variables, and functions
introduced in Mingming's "Direct IO for holes and fallocate"
patches so that they can be used by ext4 buffer write as well.
Also changed the related function comments accordingly to cover
both direct write and buffer write cases.

Signed-off-by: Jiaying Zhang <jiayingz@google.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
---
 fs/ext4/ext4.h    |   18 ++++++------
 fs/ext4/extents.c |   24 +++++++-------
 fs/ext4/fsync.c   |    2 +-
 fs/ext4/inode.c   |   84 ++++++++++++++++++++++++-----------------------------
 fs/ext4/super.c   |    2 +-
 5 files changed, 61 insertions(+), 69 deletions(-)

diff --git a/fs/ext4/ext4.h b/fs/ext4/ext4.h
index 2ca1b41..b1dcbb7 100644
--- a/fs/ext4/ext4.h
+++ b/fs/ext4/ext4.h
@@ -133,7 +133,7 @@ struct mpage_da_data {
 	int pages_written;
 	int retval;
 };
-#define	DIO_AIO_UNWRITTEN	0x1
+#define	EXT4_IO_UNWRITTEN	0x1
 typedef struct ext4_io_end {
 	struct list_head	list;		/* per-file finished AIO list */
 	struct inode		*inode;		/* file being written to */
@@ -364,13 +364,13 @@ struct ext4_new_group_data {
 	/* caller is from the direct IO path, request to creation of an
 	unitialized extents if not allocated, split the uninitialized
 	extent if blocks has been preallocated already*/
-#define EXT4_GET_BLOCKS_DIO			0x0008
+#define EXT4_GET_BLOCKS_PRE_IO			0x0008
 #define EXT4_GET_BLOCKS_CONVERT			0x0010
-#define EXT4_GET_BLOCKS_DIO_CREATE_EXT		(EXT4_GET_BLOCKS_DIO|\
+#define EXT4_GET_BLOCKS_IO_CREATE_EXT		(EXT4_GET_BLOCKS_PRE_IO|\
 					 EXT4_GET_BLOCKS_CREATE_UNINIT_EXT)
-	/* Convert extent to initialized after direct IO complete */
-#define EXT4_GET_BLOCKS_DIO_CONVERT_EXT		(EXT4_GET_BLOCKS_CONVERT|\
-					 EXT4_GET_BLOCKS_DIO_CREATE_EXT)
+	/* Convert extent to initialized after IO complete */
+#define EXT4_GET_BLOCKS_IO_CONVERT_EXT		(EXT4_GET_BLOCKS_CONVERT|\
+					 EXT4_GET_BLOCKS_IO_CREATE_EXT)
 
 /*
  * Flags used by ext4_free_blocks
@@ -709,8 +709,8 @@ struct ext4_inode_info {
 	qsize_t i_reserved_quota;
 #endif
 
-	/* completed async DIOs that might need unwritten extents handling */
-	struct list_head i_aio_dio_complete_list;
+	/* completed IOs that might need unwritten extents handling */
+	struct list_head i_completed_io_list;
 	/* current io_end structure for async DIO write*/
 	ext4_io_end_t *cur_aio_dio;
 
@@ -1440,7 +1440,7 @@ extern int ext4_block_truncate_page(handle_t *handle,
 		struct address_space *mapping, loff_t from);
 extern int ext4_page_mkwrite(struct vm_area_struct *vma, struct vm_fault *vmf);
 extern qsize_t *ext4_get_reserved_space(struct inode *inode);
-extern int flush_aio_dio_completed_IO(struct inode *inode);
+extern int flush_completed_IO(struct inode *inode);
 extern void ext4_da_update_reserve_space(struct inode *inode,
 					int used, int quota_claim);
 /* ioctl.c */
diff --git a/fs/ext4/extents.c b/fs/ext4/extents.c
index 8a20a5e..e3eddc0 100644
--- a/fs/ext4/extents.c
+++ b/fs/ext4/extents.c
@@ -1618,7 +1618,7 @@ int ext4_ext_insert_extent(handle_t *handle, struct inode *inode,
 	BUG_ON(path[depth].p_hdr == NULL);
 
 	/* try to insert block into found extent and return */
-	if (ex && (flag != EXT4_GET_BLOCKS_DIO_CREATE_EXT)
+	if (ex && (flag != EXT4_GET_BLOCKS_PRE_IO)
 		&& ext4_can_extents_be_merged(inode, ex, newext)) {
 		ext_debug("append [%d]%d block to %d:[%d]%d (from %llu)\n",
 				ext4_ext_is_uninitialized(newext),
@@ -1739,7 +1739,7 @@ has_space:
 
 merge:
 	/* try to merge extents to the right */
-	if (flag != EXT4_GET_BLOCKS_DIO_CREATE_EXT)
+	if (flag != EXT4_GET_BLOCKS_PRE_IO)
 		ext4_ext_try_to_merge(inode, path, nearex);
 
 	/* try to merge extents to the left */
@@ -2983,7 +2983,7 @@ fix_extent_len:
 	ext4_ext_dirty(handle, inode, path + depth);
 	return err;
 }
-static int ext4_convert_unwritten_extents_dio(handle_t *handle,
+static int ext4_convert_unwritten_extents_endio(handle_t *handle,
 					      struct inode *inode,
 					      struct ext4_ext_path *path)
 {
@@ -3055,8 +3055,8 @@ ext4_ext_handle_uninitialized_extents(handle_t *handle, struct inode *inode,
 		  flags, allocated);
 	ext4_ext_show_leaf(inode, path);
 
-	/* DIO get_block() before submit the IO, split the extent */
-	if (flags == EXT4_GET_BLOCKS_DIO_CREATE_EXT) {
+	/* get_block() before submit the IO, split the extent */
+	if (flags == EXT4_GET_BLOCKS_PRE_IO) {
 		ret = ext4_split_unwritten_extents(handle,
 						inode, path, iblock,
 						max_blocks, flags);
@@ -3066,14 +3066,14 @@ ext4_ext_handle_uninitialized_extents(handle_t *handle, struct inode *inode,
 		 * completed
 		 */
 		if (io)
-			io->flag = DIO_AIO_UNWRITTEN;
+			io->flag = EXT4_IO_UNWRITTEN;
 		else
 			EXT4_I(inode)->i_state |= EXT4_STATE_DIO_UNWRITTEN;
 		goto out;
 	}
-	/* async DIO end_io complete, convert the filled extent to written */
-	if (flags == EXT4_GET_BLOCKS_DIO_CONVERT_EXT) {
-		ret = ext4_convert_unwritten_extents_dio(handle, inode,
+	/* IO end_io complete, convert the filled extent to written */
+	if (flags == EXT4_GET_BLOCKS_CONVERT) {
+		ret = ext4_convert_unwritten_extents_endio(handle, inode,
 							path);
 		if (ret >= 0)
 			ext4_update_inode_fsync_trans(handle, inode, 1);
@@ -3338,9 +3338,9 @@ int ext4_ext_get_blocks(handle_t *handle, struct inode *inode,
 		 * For non asycn direct IO case, flag the inode state
 		 * that we need to perform convertion when IO is done.
 		 */
-		if (flags == EXT4_GET_BLOCKS_DIO_CREATE_EXT) {
+		if (flags == EXT4_GET_BLOCKS_PRE_IO) {
 			if (io)
-				io->flag = DIO_AIO_UNWRITTEN;
+				io->flag = EXT4_IO_UNWRITTEN;
 			else
 				EXT4_I(inode)->i_state |=
 					EXT4_STATE_DIO_UNWRITTEN;;
@@ -3617,7 +3617,7 @@ int ext4_convert_unwritten_extents(struct inode *inode, loff_t offset,
 		map_bh.b_state = 0;
 		ret = ext4_get_blocks(handle, inode, block,
 				      max_blocks, &map_bh,
-				      EXT4_GET_BLOCKS_DIO_CONVERT_EXT);
+				      EXT4_GET_BLOCKS_IO_CONVERT_EXT);
 		if (ret <= 0) {
 			WARN_ON(ret <= 0);
 			printk(KERN_ERR "%s: ext4_ext_get_blocks "
diff --git a/fs/ext4/fsync.c b/fs/ext4/fsync.c
index 98bd140..0d0c323 100644
--- a/fs/ext4/fsync.c
+++ b/fs/ext4/fsync.c
@@ -63,7 +63,7 @@ int ext4_sync_file(struct file *file, struct dentry *dentry, int datasync)
 	if (inode->i_sb->s_flags & MS_RDONLY)
 		return 0;
 
-	ret = flush_aio_dio_completed_IO(inode);
+	ret = flush_completed_IO(inode);
 	if (ret < 0)
 		return ret;
 	
diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
index ce8d007..a3a5149 100644
--- a/fs/ext4/inode.c
+++ b/fs/ext4/inode.c
@@ -3445,7 +3445,7 @@ out:
 	return ret;
 }
 
-static int ext4_get_block_dio_write(struct inode *inode, sector_t iblock,
+static int ext4_get_block_write(struct inode *inode, sector_t iblock,
 		   struct buffer_head *bh_result, int create)
 {
 	handle_t *handle = NULL;
@@ -3453,28 +3453,14 @@ static int ext4_get_block_dio_write(struct inode *inode, sector_t iblock,
 	unsigned max_blocks = bh_result->b_size >> inode->i_blkbits;
 	int dio_credits;
 
-	ext4_debug("ext4_get_block_dio_write: inode %lu, create flag %d\n",
+	ext4_debug("ext4_get_block_write: inode %lu, create flag %d\n",
 		   inode->i_ino, create);
 	/*
-	 * DIO VFS code passes create = 0 flag for write to
-	 * the middle of file. It does this to avoid block
-	 * allocation for holes, to prevent expose stale data
-	 * out when there is parallel buffered read (which does
-	 * not hold the i_mutex lock) while direct IO write has
-	 * not completed. DIO request on holes finally falls back
-	 * to buffered IO for this reason.
-	 *
-	 * For ext4 extent based file, since we support fallocate,
-	 * new allocated extent as uninitialized, for holes, we
-	 * could fallocate blocks for holes, thus parallel
-	 * buffered IO read will zero out the page when read on
-	 * a hole while parallel DIO write to the hole has not completed.
-	 *
-	 * when we come here, we know it's a direct IO write to
-	 * to the middle of file (<i_size)
-	 * so it's safe to override the create flag from VFS.
+	 * ext4_get_block in prepare for a DIO write or buffer write.
+	 * We allocate an uinitialized extent if blocks haven't been allocated.
+	 * The extent will be converted to initialized after IO complete.
 	 */
-	create = EXT4_GET_BLOCKS_DIO_CREATE_EXT;
+	create = EXT4_GET_BLOCKS_IO_CREATE_EXT;
 
 	if (max_blocks > DIO_MAX_BLOCKS)
 		max_blocks = DIO_MAX_BLOCKS;
@@ -3501,19 +3487,20 @@ static void ext4_free_io_end(ext4_io_end_t *io)
 	iput(io->inode);
 	kfree(io);
 }
-static void dump_aio_dio_list(struct inode * inode)
+
+static void dump_completed_IO(struct inode * inode)
 {
 #ifdef	EXT4_DEBUG
 	struct list_head *cur, *before, *after;
 	ext4_io_end_t *io, *io0, *io1;
 
-	if (list_empty(&EXT4_I(inode)->i_aio_dio_complete_list)){
-		ext4_debug("inode %lu aio dio list is empty\n", inode->i_ino);
+	if (list_empty(&EXT4_I(inode)->i_completed_io_list)){
+		ext4_debug("inode %lu completed_io list is empty\n", inode->i_ino);
 		return;
 	}
 
-	ext4_debug("Dump inode %lu aio_dio_completed_IO list \n", inode->i_ino);
-	list_for_each_entry(io, &EXT4_I(inode)->i_aio_dio_complete_list, list){
+	ext4_debug("Dump inode %lu completed_io list \n", inode->i_ino);
+	list_for_each_entry(io, &EXT4_I(inode)->i_completed_io_list, list){
 		cur = &io->list;
 		before = cur->prev;
 		io0 = container_of(before, ext4_io_end_t, list);
@@ -3529,21 +3516,21 @@ static void dump_aio_dio_list(struct inode * inode)
 /*
  * check a range of space and convert unwritten extents to written.
  */
-static int ext4_end_aio_dio_nolock(ext4_io_end_t *io)
+static int ext4_end_io_nolock(ext4_io_end_t *io)
 {
 	struct inode *inode = io->inode;
 	loff_t offset = io->offset;
 	size_t size = io->size;
 	int ret = 0;
 
-	ext4_debug("end_aio_dio_onlock: io 0x%p from inode %lu,list->next 0x%p,"
+	ext4_debug("ext4_end_io_nolock: io 0x%p from inode %lu,list->next 0x%p,"
 		   "list->prev 0x%p\n",
 	           io, inode->i_ino, io->list.next, io->list.prev);
 
 	if (list_empty(&io->list))
 		return ret;
 
-	if (io->flag != DIO_AIO_UNWRITTEN)
+	if (io->flag != EXT4_IO_UNWRITTEN)
 		return ret;
 
 	if (offset + size <= i_size_read(inode))
@@ -3561,17 +3548,18 @@ static int ext4_end_aio_dio_nolock(ext4_io_end_t *io)
 	io->flag = 0;
 	return ret;
 }
+
 /*
  * work on completed aio dio IO, to convert unwritten extents to extents
  */
-static void ext4_end_aio_dio_work(struct work_struct *work)
+static void ext4_end_io_work(struct work_struct *work)
 {
 	ext4_io_end_t *io  = container_of(work, ext4_io_end_t, work);
 	struct inode *inode = io->inode;
 	int ret = 0;
 
 	mutex_lock(&inode->i_mutex);
-	ret = ext4_end_aio_dio_nolock(io);
+	ret = ext4_end_io_nolock(io);
 	if (ret >= 0) {
 		if (!list_empty(&io->list))
 			list_del_init(&io->list);
@@ -3579,32 +3567,35 @@ static void ext4_end_aio_dio_work(struct work_struct *work)
 	}
 	mutex_unlock(&inode->i_mutex);
 }
+
 /*
  * This function is called from ext4_sync_file().
  *
- * When AIO DIO IO is completed, the work to convert unwritten
- * extents to written is queued on workqueue but may not get immediately
+ * When IO is completed, the work to convert unwritten extents to
+ * written is queued on workqueue but may not get immediately
  * scheduled. When fsync is called, we need to ensure the
  * conversion is complete before fsync returns.
- * The inode keeps track of a list of completed AIO from DIO path
- * that might needs to do the conversion. This function walks through
- * the list and convert the related unwritten extents to written.
+ * The inode keeps track of a list of pending/completed IO that
+ * might needs to do the conversion. This function walks through
+ * the list and convert the related unwritten extents for completed IO
+ * to written.
+ * The function return the number of pending IOs on success.
  */
-int flush_aio_dio_completed_IO(struct inode *inode)
+int flush_completed_IO(struct inode *inode)
 {
 	ext4_io_end_t *io;
 	int ret = 0;
 	int ret2 = 0;
 
-	if (list_empty(&EXT4_I(inode)->i_aio_dio_complete_list))
+	if (list_empty(&EXT4_I(inode)->i_completed_io_list))
 		return ret;
 
-	dump_aio_dio_list(inode);
-	while (!list_empty(&EXT4_I(inode)->i_aio_dio_complete_list)){
-		io = list_entry(EXT4_I(inode)->i_aio_dio_complete_list.next,
+	dump_completed_IO(inode);
+	while (!list_empty(&EXT4_I(inode)->i_completed_io_list)){
+		io = list_entry(EXT4_I(inode)->i_completed_io_list.next,
 				ext4_io_end_t, list);
 		/*
-		 * Calling ext4_end_aio_dio_nolock() to convert completed
+		 * Calling ext4_end_io_nolock() to convert completed
 		 * IO to written.
 		 *
 		 * When ext4_sync_file() is called, run_queue() may already
@@ -3617,7 +3608,7 @@ int flush_aio_dio_completed_IO(struct inode *inode)
 		 * avoid double converting from both fsync and background work
 		 * queue work.
 		 */
-		ret = ext4_end_aio_dio_nolock(io);
+		ret = ext4_end_io_nolock(io);
 		if (ret < 0)
 			ret2 = ret;
 		else
@@ -3639,7 +3630,7 @@ static ext4_io_end_t *ext4_init_io_end (struct inode *inode)
 		io->offset = 0;
 		io->size = 0;
 		io->error = 0;
-		INIT_WORK(&io->work, ext4_end_aio_dio_work);
+		INIT_WORK(&io->work, ext4_end_io_work);
 		INIT_LIST_HEAD(&io->list);
 	}
 
@@ -3662,7 +3653,7 @@ static void ext4_end_io_dio(struct kiocb *iocb, loff_t offset,
 		  size);
 
 	/* if not aio dio with unwritten extents, just free io and return */
-	if (io_end->flag != DIO_AIO_UNWRITTEN){
+	if (io_end->flag != EXT4_IO_UNWRITTEN){
 		ext4_free_io_end(io_end);
 		iocb->private = NULL;
 		return;
@@ -3677,9 +3668,10 @@ static void ext4_end_io_dio(struct kiocb *iocb, loff_t offset,
 
 	/* Add the io_end to per-inode completed aio dio list*/
 	list_add_tail(&io_end->list,
-		 &EXT4_I(io_end->inode)->i_aio_dio_complete_list);
+		 &EXT4_I(io_end->inode)->i_completed_io_list);
 	iocb->private = NULL;
 }
+
 /*
  * For ext4 extent files, ext4 will do direct-io write to holes,
  * preallocated extents, and those write extend the file, no need to
@@ -3749,7 +3741,7 @@ static ssize_t ext4_ext_direct_IO(int rw, struct kiocb *iocb,
 		ret = blockdev_direct_IO(rw, iocb, inode,
 					 inode->i_sb->s_bdev, iov,
 					 offset, nr_segs,
-					 ext4_get_block_dio_write,
+					 ext4_get_block_write,
 					 ext4_end_io_dio);
 		if (iocb->private)
 			EXT4_I(inode)->cur_aio_dio = NULL;
diff --git a/fs/ext4/super.c b/fs/ext4/super.c
index 735c20d..2a64aeb 100644
--- a/fs/ext4/super.c
+++ b/fs/ext4/super.c
@@ -708,7 +708,7 @@ static struct inode *ext4_alloc_inode(struct super_block *sb)
 #ifdef CONFIG_QUOTA
 	ei->i_reserved_quota = 0;
 #endif
-	INIT_LIST_HEAD(&ei->i_aio_dio_complete_list);
+	INIT_LIST_HEAD(&ei->i_completed_io_list);
 	ei->cur_aio_dio = NULL;
 	ei->i_sync_tid = 0;
 	ei->i_datasync_tid = 0;
-- 
1.6.5.216.g5288a.dirty


* Re: [PATCH v4 1/3] ext4: mechanical change on dio get_block code in prepare for it to be used by buffer write
  2010-01-15 19:30 ` [PATCH v4 1/3] ext4: mechanical change on dio get_block code in prepare for it to be used by buffer write Theodore Ts'o
@ 2010-01-17 14:36   ` Aneesh Kumar K. V
  2010-01-17 16:19     ` Eric Sandeen
  0 siblings, 1 reply; 23+ messages in thread
From: Aneesh Kumar K. V @ 2010-01-17 14:36 UTC (permalink / raw)
  To: Theodore Ts'o, Ext4 Developers List; +Cc: Theodore Ts'o, Jiaying Zhang

On Fri, 15 Jan 2010 14:30:10 -0500, "Theodore Ts'o" <tytso@mit.edu> wrote:
> Renaming the dio block allocation flags, variables, and functions
> introduced in Mingming's "Direct IO for holes and fallocate"
> patches so that they can be used by ext4 buffer write as well.
> Also changed the related function comments accordingly to cover
> both direct write and buffer write cases.
> 
> Signed-off-by: Jiaying Zhang <jiayingz@google.com>
> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
> ---
>  fs/ext4/ext4.h    |   18 ++++++------
>  fs/ext4/extents.c |   24 +++++++-------
>  fs/ext4/fsync.c   |    2 +-
>  fs/ext4/inode.c   |   84 ++++++++++++++++++++++++-----------------------------
>  fs/ext4/super.c   |    2 +-
>  5 files changed, 61 insertions(+), 69 deletions(-)
> 
> diff --git a/fs/ext4/ext4.h b/fs/ext4/ext4.h
> index 2ca1b41..b1dcbb7 100644
> --- a/fs/ext4/ext4.h
> +++ b/fs/ext4/ext4.h
> @@ -133,7 +133,7 @@ struct mpage_da_data {
>  	int pages_written;
>  	int retval;
>  };
> -#define	DIO_AIO_UNWRITTEN	0x1
> +#define	EXT4_IO_UNWRITTEN	0x1
>  typedef struct ext4_io_end {
>  	struct list_head	list;		/* per-file finished AIO list */
>  	struct inode		*inode;		/* file being written to */
> @@ -364,13 +364,13 @@ struct ext4_new_group_data {
>  	/* caller is from the direct IO path, request to creation of an
>  	unitialized extents if not allocated, split the uninitialized
>  	extent if blocks has been preallocated already*/
> -#define EXT4_GET_BLOCKS_DIO			0x0008
> +#define EXT4_GET_BLOCKS_PRE_IO			0x0008
>  #define EXT4_GET_BLOCKS_CONVERT			0x0010
> -#define EXT4_GET_BLOCKS_DIO_CREATE_EXT		(EXT4_GET_BLOCKS_DIO|\
> +#define EXT4_GET_BLOCKS_IO_CREATE_EXT		(EXT4_GET_BLOCKS_PRE_IO|\
>  					 EXT4_GET_BLOCKS_CREATE_UNINIT_EXT)
> -	/* Convert extent to initialized after direct IO complete */
> -#define EXT4_GET_BLOCKS_DIO_CONVERT_EXT		(EXT4_GET_BLOCKS_CONVERT|\
> -					 EXT4_GET_BLOCKS_DIO_CREATE_EXT)
> +	/* Convert extent to initialized after IO complete */
> +#define EXT4_GET_BLOCKS_IO_CONVERT_EXT		(EXT4_GET_BLOCKS_CONVERT|\
> +					 EXT4_GET_BLOCKS_IO_CREATE_EXT)
>

All these flags are really confusing. I guess we can make it much
cleaner. For example, why is EXT4_GET_BLOCKS_IO_CONVERT_EXT enabling
EXT4_GET_BLOCKS_CREATE_UNINIT_EXT? The renaming to PRE_IO made it
better, but I guess these names should be self-documenting.

How about

EXT4_GET_BLOCKS_CREATE. Indicate we should do block
allocation. But that flag alone doesn't say whether we are supposed
to create init or uninit extent.

EXT4_GET_BLOCKS_UNINIT_EXT -> Request the creation of uninit extent

EXT4_GET_BLOCKS_CREATE_UNINIT_EXT -> EXT4_GET_BLOCKS_CREATE|EXT4_GET_BLOCKS_UNINIT_EXT;

EXT4_GET_BLOCKS_DELALLOC_RESERVE -> Request for delayed allocation
reservation

EXT4_GET_BLOCKS_PRE_IO  -> 0x0008 -> Indicate that we should do all
necessary extent splits and make the requested range into a single extent.

EXT4_GET_BLOCKS_CONVERT_IO -> Convert the specified range which should be a
single extent into init and then try to merge the extent to left/right

EXT4_GET_BLOCKS_IO_CREATE_EXT -> EXT4_GET_BLOCKS_PRE_IO | EXT4_GET_BLOCKS_CREATE_UNINIT_EXT

EXT4_GET_BLOCKS_IO_CONVERT_EXT -> EXT4_GET_BLOCKS_CREATE | EXT4_GET_BLOCKS_CONVERT_IO; 

So from the above list it is only the last flag that is different from
what is already there. But I guess we need more documentation around
these flags.
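
Written out as definitions, the proposed layout might look like the sketch
below.  Only 0x0008 (PRE_IO) and 0x0010 (the CONVERT bit in the patch) come
from this thread; the other numeric values are assumed for illustration,
and proposed_wants_uninit() and proposed_check() are hypothetical helpers,
not existing ext4 functions.

```c
#include <assert.h>

/* Single-bit flags.  Values marked "assumed" do not come from the thread. */
#define EXT4_GET_BLOCKS_CREATE			0x0001	/* assumed value */
#define EXT4_GET_BLOCKS_UNINIT_EXT		0x0002	/* assumed value */
#define EXT4_GET_BLOCKS_DELALLOC_RESERVE	0x0004	/* assumed value */
#define EXT4_GET_BLOCKS_PRE_IO			0x0008
#define EXT4_GET_BLOCKS_CONVERT_IO		0x0010	/* proposed name */

/* Aggregates, as listed in the proposal. */
#define EXT4_GET_BLOCKS_CREATE_UNINIT_EXT \
	(EXT4_GET_BLOCKS_CREATE | EXT4_GET_BLOCKS_UNINIT_EXT)
#define EXT4_GET_BLOCKS_IO_CREATE_EXT \
	(EXT4_GET_BLOCKS_PRE_IO | EXT4_GET_BLOCKS_CREATE_UNINIT_EXT)
#define EXT4_GET_BLOCKS_IO_CONVERT_EXT \
	(EXT4_GET_BLOCKS_CREATE | EXT4_GET_BLOCKS_CONVERT_IO)

/* Hypothetical helper: does this request ask for an uninit extent? */
static int proposed_wants_uninit(int flags)
{
	return (flags & EXT4_GET_BLOCKS_UNINIT_EXT) != 0;
}

/* With this layout the convert request no longer implies UNINIT_EXT. */
static void proposed_check(void)
{
	assert(proposed_wants_uninit(EXT4_GET_BLOCKS_IO_CREATE_EXT));
	assert(!proposed_wants_uninit(EXT4_GET_BLOCKS_IO_CONVERT_EXT));
}
```

The point of the last definition is that a conversion request carries
CREATE and CONVERT_IO only, so it no longer asks for an uninitialized
extent by accident.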

-aneesh

* Re: [PATCH v4 1/3] ext4: mechanical change on dio get_block code in prepare for it to be used by buffer write
  2010-01-17 14:36   ` Aneesh Kumar K. V
@ 2010-01-17 16:19     ` Eric Sandeen
  2010-01-17 16:42       ` Aneesh Kumar K. V
  2010-01-18  3:57       ` tytso
  0 siblings, 2 replies; 23+ messages in thread
From: Eric Sandeen @ 2010-01-17 16:19 UTC (permalink / raw)
  To: Aneesh Kumar K. V; +Cc: Theodore Ts'o, Ext4 Developers List, Jiaying Zhang

Aneesh Kumar K. V wrote:
> On Fri, 15 Jan 2010 14:30:10 -0500, "Theodore Ts'o" <tytso@mit.edu> wrote:
>> Renaming the dio block allocation flags, variables, and functions
>> introduced in Mingming's "Direct IO for holes and fallocate"
>> patches so that they can be used by ext4 buffer write as well.
>> Also changed the related function comments accordingly to cover
>> both direct write and buffer write cases.
>>
>> Signed-off-by: Jiaying Zhang <jiayingz@google.com>
>> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
>> ---
>>  fs/ext4/ext4.h    |   18 ++++++------
>>  fs/ext4/extents.c |   24 +++++++-------
>>  fs/ext4/fsync.c   |    2 +-
>>  fs/ext4/inode.c   |   84 ++++++++++++++++++++++++-----------------------------
>>  fs/ext4/super.c   |    2 +-
>>  5 files changed, 61 insertions(+), 69 deletions(-)
>>
>> diff --git a/fs/ext4/ext4.h b/fs/ext4/ext4.h
>> index 2ca1b41..b1dcbb7 100644
>> --- a/fs/ext4/ext4.h
>> +++ b/fs/ext4/ext4.h
>> @@ -133,7 +133,7 @@ struct mpage_da_data {
>>  	int pages_written;
>>  	int retval;
>>  };
>> -#define	DIO_AIO_UNWRITTEN	0x1
>> +#define	EXT4_IO_UNWRITTEN	0x1
>>  typedef struct ext4_io_end {
>>  	struct list_head	list;		/* per-file finished AIO list */
>>  	struct inode		*inode;		/* file being written to */
>> @@ -364,13 +364,13 @@ struct ext4_new_group_data {
>>  	/* caller is from the direct IO path, request to creation of an
>>  	unitialized extents if not allocated, split the uninitialized
>>  	extent if blocks has been preallocated already*/
>> -#define EXT4_GET_BLOCKS_DIO			0x0008
>> +#define EXT4_GET_BLOCKS_PRE_IO			0x0008
>>  #define EXT4_GET_BLOCKS_CONVERT			0x0010
>> -#define EXT4_GET_BLOCKS_DIO_CREATE_EXT		(EXT4_GET_BLOCKS_DIO|\
>> +#define EXT4_GET_BLOCKS_IO_CREATE_EXT		(EXT4_GET_BLOCKS_PRE_IO|\
>>  					 EXT4_GET_BLOCKS_CREATE_UNINIT_EXT)
>> -	/* Convert extent to initialized after direct IO complete */
>> -#define EXT4_GET_BLOCKS_DIO_CONVERT_EXT		(EXT4_GET_BLOCKS_CONVERT|\
>> -					 EXT4_GET_BLOCKS_DIO_CREATE_EXT)
>> +	/* Convert extent to initialized after IO complete */
>> +#define EXT4_GET_BLOCKS_IO_CONVERT_EXT		(EXT4_GET_BLOCKS_CONVERT|\
>> +					 EXT4_GET_BLOCKS_IO_CREATE_EXT)
>>
> 
> All these flags are really confusing. I guess we can make it much
> cleaner. For example, why is EXT4_GET_BLOCKS_IO_CONVERT_EXT enabling
> EXT4_GET_BLOCKS_CREATE_UNINIT_EXT? The renaming to PRE_IO made it
> better, but I guess these names should be self-documenting.


> How about
> 
> EXT4_GET_BLOCKS_CREATE. Indicate we should do block
> allocation. But that flag alone doesn't say whether we are supposed
> to create init or uninit extent.
> 
> EXT4_GET_BLOCKS_UNINIT_EXT -> Request the creation of uninit extent
> 
> EXT4_GET_BLOCKS_CREATE_UNINIT_EXT -> EXT4_GET_BLOCKS_CREATE|EXT4_GET_BLOCKS_UNINIT_EXT;
> 
> EXT4_GET_BLOCKS_DELALLOC_RESERVE -> Request for delayed allocation
> reservation
> 
> EXT4_GET_BLOCKS_PRE_IO  -> 0x0008 -> Indicate that we should do all
> necessary extent splits and make the requested range into a single extent.
> 
> EXT4_GET_BLOCKS_CONVERT_IO -> Convert the specified range which should be a
> single extent into init and then try to merge the extent to left/right
> 
> EXT4_GET_BLOCKS_IO_CREATE_EXT -> EXT4_GET_BLOCKS_PRE_IO | EXT4_GET_BLOCKS_CREATE_UNINIT_EXT
> 
> EXT4_GET_BLOCKS_IO_CONVERT_EXT -> EXT4_GET_BLOCKS_CREATE | EXT4_GET_BLOCKS_CONVERT_IO;

 
In addition to Aneesh's suggestions, I'm not sure of the value of
creating more

#define FLAG_A (FLAG_B|FLAG_C)

flag macros; unless you have this all in your head you just have to
go look up the flag definition anyway, since we usually test individual
flags not the aggregates.  I'm wondering if it might be better to just
explicitly send in the OR'd flags rather than creating a new one, to
see the code flow better.

Maybe it saves space, but at the cost of easy understanding IMHO.
At least that's been my experience.

-Eric

> So from the above list it is only the last flag that is different from
> what is already there. But I guess we need more documentation around
> these flags.
> 
> -aneesh


* Re: [PATCH v4 1/3] ext4: mechanical change on dio get_block code in prepare for it to be used by buffer write
  2010-01-17 16:19     ` Eric Sandeen
@ 2010-01-17 16:42       ` Aneesh Kumar K. V
  2010-01-18  3:57       ` tytso
  1 sibling, 0 replies; 23+ messages in thread
From: Aneesh Kumar K. V @ 2010-01-17 16:42 UTC (permalink / raw)
  To: Eric Sandeen; +Cc: Theodore Ts'o, Ext4 Developers List, Jiaying Zhang

On Sun, 17 Jan 2010 10:19:30 -0600, Eric Sandeen <sandeen@redhat.com> wrote:
> Aneesh Kumar K. V wrote:
> 
> > How about
> > 
> > EXT4_GET_BLOCKS_CREATE. Indicate we should do block
> > allocation. But that flag alone doesn't say whether we are supposed
> > to create init or uninit extent.
> > 
> > EXT4_GET_BLOCKS_UNINIT_EXT -> Request the creation of uninit extent
> > 
> > EXT4_GET_BLOCKS_CREATE_UNINIT_EXT -> EXT4_GET_BLOCKS_CREATE|EXT4_GET_BLOCKS_UNINIT_EXT;
> > 
> > EXT4_GET_BLOCKS_DELALLOC_RESERVE -> Request for delayed allocation
> > reservation
> > 
> > EXT4_GET_BLOCKS_PRE_IO  -> 0x0008 -> Indicate that we should do all
> > necessary extent splits and make the requested range into a single extent.
> > 
> > EXT4_GET_BLOCKS_CONVERT_IO -> Convert the specified range which should be a
> > single extent into init and then try to merge the extent to left/right
> > 
> > EXT4_GET_BLOCKS_IO_CREATE_EXT -> EXT4_GET_BLOCKS_PRE_IO | EXT4_GET_BLOCKS_CREATE_UNINIT_EXT
> > 
> > EXT4_GET_BLOCKS_IO_CONVERT_EXT -> EXT4_GET_BLOCKS_CREATE | EXT4_GET_BLOCKS_CONVERT_IO;
> 
> 
> In addition to Aneesh's suggestions, I'm not sure of the value of
> creating more
> 
> #define FLAG_A (FLAG_B|FLAG_C)
> 
> flag macros; unless you have this all in your head you just have to
> go look up the flag definition anyway, since we usually test individual
> flags not the aggregates.  I'm wondering if it might be better to just
> explicitly send in the OR'd flags rather than creating a new one, to
> see the code flow better.
> 
> Maybe it saves space, but at the cost of easy understanding IMHO.
> At least that's been my experience.


It helps us to do things like below

  if (flag & FLAG_B)
     /* we need to do things for flag B */

 if (flag & FLAG_C)
     /* things for flag C */

instead of

  if ((flag & FLAG_A) || (flag & FLAG_D))
    /* things related to previous flag B */

So it simplifies the if condition.
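
A compilable version of that pattern, as a sketch only: the bit values and
the DID_* result bits are invented for the example, and FLAG_D is assumed
to be a second aggregate that shares FLAG_B.

```c
#include <assert.h>

/* Invented values; FLAG_A and FLAG_D are aggregates that both include B. */
#define FLAG_B	0x1
#define FLAG_C	0x2
#define FLAG_E	0x4			/* some unrelated single bit */
#define FLAG_A	(FLAG_B | FLAG_C)
#define FLAG_D	(FLAG_B | FLAG_E)

enum { DID_B = 1, DID_C = 2 };

/* The callee tests single bits and never needs to know the aggregates. */
static int handle_flags(int flag)
{
	int done = 0;

	if (flag & FLAG_B)
		done |= DID_B;	/* things for flag B */
	if (flag & FLAG_C)
		done |= DID_C;	/* things for flag C */
	return done;
}

/* Callers passing either aggregate get the B work done. */
static void handle_flags_check(void)
{
	assert(handle_flags(FLAG_A) == (DID_B | DID_C));
	assert(handle_flags(FLAG_D) == DID_B);
}
```

Adding a further aggregate that includes FLAG_B would need no change in
handle_flags(), which is the simplification being argued for.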

-aneesh

* Re: [PATCH v4 1/3] ext4: mechanical change on dio get_block code in prepare for it to be used by buffer write
  2010-01-17 16:19     ` Eric Sandeen
  2010-01-17 16:42       ` Aneesh Kumar K. V
@ 2010-01-18  3:57       ` tytso
  1 sibling, 0 replies; 23+ messages in thread
From: tytso @ 2010-01-18  3:57 UTC (permalink / raw)
  To: Eric Sandeen; +Cc: Aneesh Kumar K. V, Ext4 Developers List, Jiaying Zhang

On Sun, Jan 17, 2010 at 10:19:30AM -0600, Eric Sandeen wrote:

> In addition to Aneesh's suggestions, I'm not sure of the value of
> creating more
> 
> #define FLAG_A (FLAG_B|FLAG_C)
> 
> flag macros; unless you have this all in your head you just have to
> go look up the flag definition anyway, since we usually test individual
> flags not the aggregates.  I'm wondering if it might be better to just
> explicitly send in the OR'd flags rather than creating a new one, to
> see the code flow better.

I'd agree with that.  The other reason why it's good to avoid
aggregates is that if you don't realize that FLAG_A is an
aggregate, you can end up doing this:

	if (flag & FLAG_A) {
		...
	}

and then be surprised when this tests true not just when someone passed
in FLAG_A, but also if someone passes in FLAG_B or FLAG_C...

   	       	       	       	      	 - Ted

* [PATCH v4 2/3] ext4: use ext4_get_block_write in buffer write
  2010-01-15 19:30 [PATCH v4 0/3] dioread_nolock patch Theodore Ts'o
  2010-01-15 19:30 ` [PATCH v4 1/3] ext4: mechanical change on dio get_block code in prepare for it to be used by buffer write Theodore Ts'o
@ 2010-01-15 19:30 ` Theodore Ts'o
  2010-01-16  2:17   ` tytso
  2010-01-17 14:21   ` Aneesh Kumar K. V
  2010-01-15 19:30 ` [PATCH v4 3/3] ext4: Use direct_IO_no_locking in ext4 dio read Theodore Ts'o
                   ` (3 subsequent siblings)
  5 siblings, 2 replies; 23+ messages in thread
From: Theodore Ts'o @ 2010-01-15 19:30 UTC (permalink / raw)
  To: Ext4 Developers List; +Cc: Theodore Ts'o, Jiaying Zhang

Allocate uninitialized extent before ext4 buffer write and
convert the extent to initialized after io completes.
The purpose is to make sure an extent can only be marked
initialized after it has been written with new data so
we can safely drop the i_mutex lock in ext4 DIO read without
exposing stale data. This helps to improve multi-thread DIO
read performance on high-speed disks.

Skip the nobh and data=journal mount cases to make things simple for now.

Signed-off-by: Jiaying Zhang <jiayingz@google.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
---
 fs/ext4/ext4.h      |   12 +++++-
 fs/ext4/ext4_jbd2.h |   24 ++++++++++++
 fs/ext4/extents.c   |   22 ++++++-----
 fs/ext4/inode.c     |  105 ++++++++++++++++++++++++++++++++++++++++----------
 fs/ext4/super.c     |   30 +++++++++++++--
 5 files changed, 157 insertions(+), 36 deletions(-)

diff --git a/fs/ext4/ext4.h b/fs/ext4/ext4.h
index b1dcbb7..b8b4887 100644
--- a/fs/ext4/ext4.h
+++ b/fs/ext4/ext4.h
@@ -134,6 +134,7 @@ struct mpage_da_data {
 	int retval;
 };
 #define	EXT4_IO_UNWRITTEN	0x1
+#define	EXT4_IO_WRITTEN		0x2
 typedef struct ext4_io_end {
 	struct list_head	list;		/* per-file finished AIO list */
 	struct inode		*inode;		/* file being written to */
@@ -370,7 +371,7 @@ struct ext4_new_group_data {
 					 EXT4_GET_BLOCKS_CREATE_UNINIT_EXT)
 	/* Convert extent to initialized after IO complete */
 #define EXT4_GET_BLOCKS_IO_CONVERT_EXT		(EXT4_GET_BLOCKS_CONVERT|\
-					 EXT4_GET_BLOCKS_IO_CREATE_EXT)
+					 EXT4_GET_BLOCKS_CREATE_UNINIT_EXT)
 
 /*
  * Flags used by ext4_free_blocks
@@ -761,6 +762,7 @@ struct ext4_inode_info {
 #define EXT4_MOUNT_QUOTA		0x80000 /* Some quota option set */
 #define EXT4_MOUNT_USRQUOTA		0x100000 /* "old" user quota */
 #define EXT4_MOUNT_GRPQUOTA		0x200000 /* "old" group quota */
+#define EXT4_MOUNT_DIOREAD_NOLOCK	0x400000 /* Enable support for dio read nolocking */
 #define EXT4_MOUNT_JOURNAL_CHECKSUM	0x800000 /* Journal checksums */
 #define EXT4_MOUNT_JOURNAL_ASYNC_COMMIT	0x1000000 /* Journal Async Commit */
 #define EXT4_MOUNT_I_VERSION            0x2000000 /* i_version support */
@@ -1774,6 +1776,14 @@ static inline void set_bitmap_uptodate(struct buffer_head *bh)
 	set_bit(BH_BITMAP_UPTODATE, &(bh)->b_state);
 }
 
+/* BH_Uninit flag: blocks are allocated but uninitialized on disk */
+enum ext4_state_bits {
+	BH_Uninit	/* blocks are allocated but uninitialized on disk */
+	  = BH_JBDPrivateStart,
+};
+
+BUFFER_FNS(Uninit, uninit)
+
 /*
  * __unmap_underlying_bh_blocks - just a helper function to unmap
  * set of blocks described by @bh
diff --git a/fs/ext4/ext4_jbd2.h b/fs/ext4/ext4_jbd2.h
index 05eca81..dd58020 100644
--- a/fs/ext4/ext4_jbd2.h
+++ b/fs/ext4/ext4_jbd2.h
@@ -304,4 +304,28 @@ static inline int ext4_should_writeback_data(struct inode *inode)
 	return 0;
 }
 
+/*
+ * This function controls whether or not we should try to go down the
+ * dioread_nolock code paths, which makes it safe to avoid taking
+ * i_mutex for direct I/O reads.  This only works for extent-based
+ * files, and it doesn't work for nobh or if data journaling is
+ * enabled, since the dioread_nolock code uses b_private to pass
+ * information back to the I/O completion handler, and this conflicts
+ * with the jbd's use of b_private.
+ */
+static inline int ext4_should_dioread_nolock(struct inode *inode)
+{
+	if (!test_opt(inode->i_sb, DIOREAD_NOLOCK))
+		return 0;
+	if (test_opt(inode->i_sb, NOBH))
+		return 0;
+	if (!S_ISREG(inode->i_mode))
+		return 0;
+	if (EXT4_I(inode)->i_flags & EXT4_EXTENTS_FL)
+		return 0;
+	if (ext4_should_journal_data(inode))
+		return 0;
+	return 1;
+}
+
 #endif	/* _EXT4_JBD2_H */
diff --git a/fs/ext4/extents.c b/fs/ext4/extents.c
index e3eddc0..eb9bce0 100644
--- a/fs/ext4/extents.c
+++ b/fs/ext4/extents.c
@@ -1618,7 +1618,7 @@ int ext4_ext_insert_extent(handle_t *handle, struct inode *inode,
 	BUG_ON(path[depth].p_hdr == NULL);
 
 	/* try to insert block into found extent and return */
-	if (ex && (flag != EXT4_GET_BLOCKS_PRE_IO)
+	if (ex && !(flag & EXT4_GET_BLOCKS_PRE_IO)
 		&& ext4_can_extents_be_merged(inode, ex, newext)) {
 		ext_debug("append [%d]%d block to %d:[%d]%d (from %llu)\n",
 				ext4_ext_is_uninitialized(newext),
@@ -1739,7 +1739,7 @@ has_space:
 
 merge:
 	/* try to merge extents to the right */
-	if (flag != EXT4_GET_BLOCKS_PRE_IO)
+	if (!(flag & EXT4_GET_BLOCKS_PRE_IO))
 		ext4_ext_try_to_merge(inode, path, nearex);
 
 	/* try to merge extents to the left */
@@ -3056,7 +3056,7 @@ ext4_ext_handle_uninitialized_extents(handle_t *handle, struct inode *inode,
 	ext4_ext_show_leaf(inode, path);
 
 	/* get_block() before submit the IO, split the extent */
-	if (flags == EXT4_GET_BLOCKS_PRE_IO) {
+	if ((flags & EXT4_GET_BLOCKS_PRE_IO)) {
 		ret = ext4_split_unwritten_extents(handle,
 						inode, path, iblock,
 						max_blocks, flags);
@@ -3069,10 +3069,12 @@ ext4_ext_handle_uninitialized_extents(handle_t *handle, struct inode *inode,
 			io->flag = EXT4_IO_UNWRITTEN;
 		else
 			EXT4_I(inode)->i_state |= EXT4_STATE_DIO_UNWRITTEN;
+		if (ext4_should_dioread_nolock(inode))
+			set_buffer_uninit(bh_result);
 		goto out;
 	}
 	/* IO end_io complete, convert the filled extent to written */
-	if (flags == EXT4_GET_BLOCKS_CONVERT) {
+	if ((flags & EXT4_GET_BLOCKS_CONVERT)) {
 		ret = ext4_convert_unwritten_extents_endio(handle, inode,
 							path);
 		if (ret >= 0)
@@ -3330,21 +3332,21 @@ int ext4_ext_get_blocks(handle_t *handle, struct inode *inode,
 	if (flags & EXT4_GET_BLOCKS_UNINIT_EXT){
 		ext4_ext_mark_uninitialized(&newex);
 		/*
-		 * io_end structure was created for every async
-		 * direct IO write to the middle of the file.
-		 * To avoid unecessary convertion for every aio dio rewrite
-		 * to the mid of file, here we flag the IO that is really
-		 * need the convertion.
+		 * io_end structure was created for every IO write to an
+		 * uninitialized extent. To avoid unecessary convertion,
+		 * here we flag the IO that really needs the convertion.
 		 * For non asycn direct IO case, flag the inode state
 		 * that we need to perform convertion when IO is done.
 		 */
-		if (flags == EXT4_GET_BLOCKS_PRE_IO) {
+		if ((flags & EXT4_GET_BLOCKS_PRE_IO)) {
 			if (io)
 				io->flag = EXT4_IO_UNWRITTEN;
 			else
 				EXT4_I(inode)->i_state |=
 					EXT4_STATE_DIO_UNWRITTEN;;
 		}
+		if (ext4_should_dioread_nolock(inode))
+			set_buffer_uninit(bh_result);
 	}
 	err = ext4_ext_insert_extent(handle, inode, path, &newex, flags);
 	if (err) {
diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
index a3a5149..1f56484 100644
--- a/fs/ext4/inode.c
+++ b/fs/ext4/inode.c
@@ -1535,6 +1535,8 @@ static void ext4_truncate_failed_write(struct inode *inode)
 	ext4_truncate(inode);
 }
 
+static int ext4_get_block_write(struct inode *inode, sector_t iblock,
+		   struct buffer_head *bh_result, int create);
 static int ext4_write_begin(struct file *file, struct address_space *mapping,
 			    loff_t pos, unsigned len, unsigned flags,
 			    struct page **pagep, void **fsdata)
@@ -1576,8 +1578,12 @@ retry:
 	}
 	*pagep = page;
 
-	ret = block_write_begin(file, mapping, pos, len, flags, pagep, fsdata,
-				ext4_get_block);
+	if (ext4_should_dioread_nolock(inode))
+		ret = block_write_begin(file, mapping, pos, len, flags, pagep,
+				fsdata, ext4_get_block_write);
+	else
+		ret = block_write_begin(file, mapping, pos, len, flags, pagep,
+				fsdata, ext4_get_block);
 
 	if (!ret && ext4_should_journal_data(inode)) {
 		ret = walk_page_buffers(handle, page_buffers(page),
@@ -2105,6 +2111,8 @@ static void mpage_put_bnr_to_bhs(struct mpage_da_data *mpd, sector_t logical,
 				} else if (buffer_mapped(bh))
 					BUG_ON(bh->b_blocknr != pblock);
 
+				if (buffer_uninit(exbh))
+					set_buffer_uninit(bh);
 				cur_logical++;
 				pblock++;
 			} while ((bh = bh->b_this_page) != head);
@@ -2218,6 +2226,8 @@ static int mpage_da_map_blocks(struct mpage_da_data *mpd)
 	 */
 	new.b_state = 0;
 	get_blocks_flags = EXT4_GET_BLOCKS_CREATE;
+	if (ext4_should_dioread_nolock(mpd->inode))
+		get_blocks_flags |= EXT4_GET_BLOCKS_IO_CREATE_EXT;
 	if (mpd->b_state & (1 << BH_Delay))
 		get_blocks_flags |= EXT4_GET_BLOCKS_DELALLOC_RESERVE;
 
@@ -2633,6 +2643,9 @@ out:
 	return ret;
 }
 
+static int ext4_set_bh_endio(struct buffer_head *bh, struct inode *inode);
+static void ext4_end_io_buffer_write(struct buffer_head *bh, int uptodate);
+
 /*
  * Note that we don't need to start a transaction unless we're journaling data
  * because we should have holes filled from ext4_page_mkwrite(). We even don't
@@ -2680,7 +2693,7 @@ static int ext4_writepage(struct page *page,
 	int ret = 0;
 	loff_t size;
 	unsigned int len;
-	struct buffer_head *page_bufs;
+	struct buffer_head *page_bufs = NULL;
 	struct inode *inode = page->mapping->host;
 
 	trace_ext4_writepage(inode, page);
@@ -2756,7 +2769,11 @@ static int ext4_writepage(struct page *page,
 
 	if (test_opt(inode->i_sb, NOBH) && ext4_should_writeback_data(inode))
 		ret = nobh_writepage(page, noalloc_get_block_write, wbc);
-	else
+	else if (page_bufs && buffer_uninit(page_bufs)) {
+		ext4_set_bh_endio(page_bufs, inode);
+		ret = block_write_full_page_endio(page, noalloc_get_block_write,
+					    wbc, ext4_end_io_buffer_write);
+	} else
 		ret = block_write_full_page(page, noalloc_get_block_write,
 					    wbc);
 
@@ -3448,10 +3465,11 @@ out:
 static int ext4_get_block_write(struct inode *inode, sector_t iblock,
 		   struct buffer_head *bh_result, int create)
 {
-	handle_t *handle = NULL;
+	handle_t *handle = ext4_journal_current_handle();
 	int ret = 0;
 	unsigned max_blocks = bh_result->b_size >> inode->i_blkbits;
 	int dio_credits;
+	int started = 0;
 
 	ext4_debug("ext4_get_block_write: inode %lu, create flag %d\n",
 		   inode->i_ino, create);
@@ -3462,21 +3480,26 @@ static int ext4_get_block_write(struct inode *inode, sector_t iblock,
 	 */
 	create = EXT4_GET_BLOCKS_IO_CREATE_EXT;
 
-	if (max_blocks > DIO_MAX_BLOCKS)
-		max_blocks = DIO_MAX_BLOCKS;
-	dio_credits = ext4_chunk_trans_blocks(inode, max_blocks);
-	handle = ext4_journal_start(inode, dio_credits);
-	if (IS_ERR(handle)) {
-		ret = PTR_ERR(handle);
-		goto out;
+	if (!handle) {
+		if (max_blocks > DIO_MAX_BLOCKS)
+			max_blocks = DIO_MAX_BLOCKS;
+		dio_credits = ext4_chunk_trans_blocks(inode, max_blocks);
+		handle = ext4_journal_start(inode, dio_credits);
+		if (IS_ERR(handle)) {
+			ret = PTR_ERR(handle);
+			goto out;
+		}
+		started = 1;
 	}
+
 	ret = ext4_get_blocks(handle, inode, iblock, max_blocks, bh_result,
 			      create);
 	if (ret > 0) {
 		bh_result->b_size = (ret << inode->i_blkbits);
 		ret = 0;
 	}
-	ext4_journal_stop(handle);
+	if (started)
+		ext4_journal_stop(handle);
 out:
 	return ret;
 }
@@ -3530,12 +3553,10 @@ static int ext4_end_io_nolock(ext4_io_end_t *io)
 	if (list_empty(&io->list))
 		return ret;
 
-	if (io->flag != EXT4_IO_UNWRITTEN)
+	if (io->flag != EXT4_IO_WRITTEN)
 		return ret;
 
-	if (offset + size <= i_size_read(inode))
-		ret = ext4_convert_unwritten_extents(inode, offset, size);
-
+	ret = ext4_convert_unwritten_extents(inode, offset, size);
 	if (ret < 0) {
 		printk(KERN_EMERG "%s: failed to convert unwritten"
 			"extents to written extents, error is %d"
@@ -3583,7 +3604,7 @@ static void ext4_end_io_work(struct work_struct *work)
  */
 int flush_completed_IO(struct inode *inode)
 {
-	ext4_io_end_t *io;
+	ext4_io_end_t *io, *tmp;
 	int ret = 0;
 	int ret2 = 0;
 
@@ -3591,9 +3612,10 @@ int flush_completed_IO(struct inode *inode)
 		return ret;
 
 	dump_completed_IO(inode);
-	while (!list_empty(&EXT4_I(inode)->i_completed_io_list)){
-		io = list_entry(EXT4_I(inode)->i_completed_io_list.next,
-				ext4_io_end_t, list);
+	list_for_each_entry_safe(io, tmp,
+			&EXT4_I(inode)->i_completed_io_list, list) {
+		if (io->flag == EXT4_IO_UNWRITTEN)
+			continue;
 		/*
 		 * Calling ext4_end_io_nolock() to convert completed
 		 * IO to written.
@@ -3661,6 +3683,7 @@ static void ext4_end_io_dio(struct kiocb *iocb, loff_t offset,
 
 	io_end->offset = offset;
 	io_end->size = size;
+	io_end->flag = EXT4_IO_WRITTEN;
 	wq = EXT4_SB(io_end->inode->i_sb)->dio_unwritten_wq;
 
 	/* queue the work to convert unwritten extents to written */
@@ -3672,6 +3695,46 @@ static void ext4_end_io_dio(struct kiocb *iocb, loff_t offset,
 	iocb->private = NULL;
 }
 
+static void ext4_end_io_buffer_write(struct buffer_head *bh, int uptodate)
+{
+	ext4_io_end_t *io_end = bh->b_private;
+	struct workqueue_struct *wq;
+
+	if (!io_end)
+		goto out;
+	io_end->flag = EXT4_IO_WRITTEN;
+	wq = EXT4_SB(io_end->inode->i_sb)->dio_unwritten_wq;
+	/* queue the work to convert unwritten extents to written */
+	queue_work(wq, &io_end->work);
+out:
+	bh->b_private = NULL;
+	bh->b_end_io = NULL;
+	clear_buffer_uninit(bh);
+	end_buffer_async_write(bh, uptodate);
+}
+
+static int ext4_set_bh_endio(struct buffer_head *bh, struct inode *inode)
+{
+	ext4_io_end_t *io_end;
+	struct page *page = bh->b_page;
+	loff_t offset = (sector_t)page->index << PAGE_CACHE_SHIFT;
+	size_t size = bh->b_size;
+
+	io_end = ext4_init_io_end(inode);
+	if (!io_end)
+		return -ENOMEM;
+	io_end->offset = offset;
+	io_end->size = size;
+	io_end->flag = EXT4_IO_UNWRITTEN;
+	/* Add the io_end to per-inode completed io list*/
+	list_add_tail(&io_end->list,
+		 &EXT4_I(io_end->inode)->i_completed_io_list);
+
+	bh->b_private = io_end;
+	bh->b_end_io = ext4_end_io_buffer_write;
+	return 0;
+}
+
 /*
  * For ext4 extent files, ext4 will do direct-io write to holes,
  * preallocated extents, and those write extend the file, no need to
diff --git a/fs/ext4/super.c b/fs/ext4/super.c
index 2a64aeb..20f18d8 100644
--- a/fs/ext4/super.c
+++ b/fs/ext4/super.c
@@ -926,6 +926,9 @@ static int ext4_show_options(struct seq_file *seq, struct vfsmount *vfs)
 	if (test_opt(sb, NOLOAD))
 		seq_puts(seq, ",norecovery");
 
+	if (test_opt(sb, DIOREAD_NOLOCK))
+		seq_puts(seq, ",dioread_nolock");
+
 	ext4_show_quota_options(seq, sb);
 
 	return 0;
@@ -1109,6 +1112,7 @@ enum {
 	Opt_stripe, Opt_delalloc, Opt_nodelalloc,
 	Opt_block_validity, Opt_noblock_validity,
 	Opt_inode_readahead_blks, Opt_journal_ioprio,
+	Opt_dioread_nolock, Opt_dioread_lock,
 	Opt_discard, Opt_nodiscard,
 };
 
@@ -1176,6 +1180,8 @@ static const match_table_t tokens = {
 	{Opt_auto_da_alloc, "auto_da_alloc=%u"},
 	{Opt_auto_da_alloc, "auto_da_alloc"},
 	{Opt_noauto_da_alloc, "noauto_da_alloc"},
+	{Opt_dioread_nolock, "dioread_nolock"},
+	{Opt_dioread_lock, "dioread_lock"},
 	{Opt_discard, "discard"},
 	{Opt_nodiscard, "nodiscard"},
 	{Opt_err, NULL},
@@ -1609,6 +1615,12 @@ set_qf_format:
 		case Opt_nodiscard:
 			clear_opt(sbi->s_mount_opt, DISCARD);
 			break;
+		case Opt_dioread_nolock:
+			set_opt(sbi->s_mount_opt, DIOREAD_NOLOCK);
+			break;
+		case Opt_dioread_lock:
+			clear_opt(sbi->s_mount_opt, DIOREAD_NOLOCK);
+			break;
 		default:
 			ext4_msg(sb, KERN_ERR,
 			       "Unrecognized mount option \"%s\" "
@@ -2766,7 +2778,7 @@ static int ext4_fill_super(struct super_block *sb, void *data, int silent)
 	      EXT4_HAS_INCOMPAT_FEATURE(sb, EXT4_FEATURE_INCOMPAT_RECOVER)) {
 		ext4_msg(sb, KERN_ERR, "required journal recovery "
 		       "suppressed and not mounted read-only");
-		goto failed_mount4;
+		goto failed_mount_wq;
 	} else {
 		clear_opt(sbi->s_mount_opt, DATA_FLAGS);
 		set_opt(sbi->s_mount_opt, WRITEBACK_DATA);
@@ -2779,7 +2791,7 @@ static int ext4_fill_super(struct super_block *sb, void *data, int silent)
 	    !jbd2_journal_set_features(EXT4_SB(sb)->s_journal, 0, 0,
 				       JBD2_FEATURE_INCOMPAT_64BIT)) {
 		ext4_msg(sb, KERN_ERR, "Failed to set 64-bit journal feature");
-		goto failed_mount4;
+		goto failed_mount_wq;
 	}
 
 	if (test_opt(sb, JOURNAL_ASYNC_COMMIT)) {
@@ -2818,7 +2830,7 @@ static int ext4_fill_super(struct super_block *sb, void *data, int silent)
 		    (sbi->s_journal, 0, 0, JBD2_FEATURE_INCOMPAT_REVOKE)) {
 			ext4_msg(sb, KERN_ERR, "Journal does not support "
 			       "requested data journaling mode");
-			goto failed_mount4;
+			goto failed_mount_wq;
 		}
 	default:
 		break;
@@ -2826,13 +2838,17 @@ static int ext4_fill_super(struct super_block *sb, void *data, int silent)
 	set_task_ioprio(sbi->s_journal->j_task, journal_ioprio);
 
 no_journal:
-
 	if (test_opt(sb, NOBH)) {
 		if (!(test_opt(sb, DATA_FLAGS) == EXT4_MOUNT_WRITEBACK_DATA)) {
 			ext4_msg(sb, KERN_WARNING, "Ignoring nobh option - "
 				"its supported only with writeback mode");
 			clear_opt(sbi->s_mount_opt, NOBH);
 		}
+		if (test_opt(sb, DIOREAD_NOLOCK)) {
+			ext4_msg(sb, KERN_WARNING, "dioread_nolock option is "
+				"not supported with nobh mode");
+			goto failed_mount_wq;
+		}
 	}
 	EXT4_SB(sb)->dio_unwritten_wq = create_workqueue("ext4-dio-unwritten");
 	if (!EXT4_SB(sb)->dio_unwritten_wq) {
@@ -2897,6 +2913,12 @@ no_journal:
 			 "requested data journaling mode");
 		clear_opt(sbi->s_mount_opt, DELALLOC);
 	}
+	if (test_opt(sb, DIOREAD_NOLOCK) &&
+	    (test_opt(sb, DATA_FLAGS) == EXT4_MOUNT_JOURNAL_DATA)) {
+		ext4_msg(sb, KERN_WARNING, "Ignoring dioread_nolock option - "
+			 "requested data journaling mode");
+		clear_opt(sbi->s_mount_opt, DIOREAD_NOLOCK);
+	}
 
 	err = ext4_setup_system_zone(sb);
 	if (err) {
-- 
1.6.5.216.g5288a.dirty
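
[Editorial sketch, not part of the patch] The io_end flag handling in
this patch can be modelled in userspace roughly as follows. This is a
minimal sketch under stated assumptions: every model_* name is
hypothetical, the per-inode list is replaced by a plain array, and the
real code removes and frees an io_end once it has been converted rather
than setting a "converted" field. Only the two flag values are taken
from the patch itself.

```c
#include <assert.h>
#include <stddef.h>

/* Same values as EXT4_IO_UNWRITTEN / EXT4_IO_WRITTEN in the patch. */
enum model_io_flag {
	MODEL_IO_UNWRITTEN = 0x1,	/* queued, data not yet on disk */
	MODEL_IO_WRITTEN   = 0x2,	/* I/O done, extent may be converted */
};

/* Hypothetical stand-in for ext4_io_end_t; only what the sketch needs. */
struct model_io_end {
	int flag;
	int converted;	/* 1 once the extent is marked initialized */
};

/* Submission side, modelled on ext4_set_bh_endio(). */
static void model_submit(struct model_io_end *io)
{
	assert(io != NULL);
	io->flag = MODEL_IO_UNWRITTEN;
	io->converted = 0;
}

/* Completion side, modelled on ext4_end_io_buffer_write(). */
static void model_complete(struct model_io_end *io)
{
	assert(io != NULL);
	io->flag = MODEL_IO_WRITTEN;
}

/*
 * Flush side, modelled on flush_completed_IO(): entries whose I/O is
 * still in flight are skipped, completed ones are converted.
 * Returns the number of entries converted by this call.
 */
static size_t model_flush(struct model_io_end *ios, size_t n)
{
	size_t i, done = 0;

	assert(ios != NULL || n == 0);
	for (i = 0; i < n; i++) {
		if (ios[i].flag != MODEL_IO_WRITTEN)
			continue;
		if (!ios[i].converted) {
			ios[i].converted = 1;
			done++;
		}
	}
	return done;
}
```

The point of the two-state flag is that an entry can sit on the
completed-io list before its data has reached the disk, so the flush
path has to tell the two cases apart before converting anything.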



* Re: [PATCH v4 2/3] ext4: use ext4_get_block_write in buffer write
  2010-01-15 19:30 ` [PATCH v4 2/3] ext4: use ext4_get_block_write in " Theodore Ts'o
@ 2010-01-16  2:17   ` tytso
  2010-01-17 14:21   ` Aneesh Kumar K. V
  1 sibling, 0 replies; 23+ messages in thread
From: tytso @ 2010-01-16  2:17 UTC (permalink / raw)
  To: Ext4 Developers List; +Cc: Jiaying Zhang

On Fri, Jan 15, 2010 at 02:30:11PM -0500, Theodore Ts'o wrote:
> diff --git a/fs/ext4/ext4_jbd2.h b/fs/ext4/ext4_jbd2.h
> index 05eca81..dd58020 100644
> --- a/fs/ext4/ext4_jbd2.h
> +++ b/fs/ext4/ext4_jbd2.h
> @@ -304,4 +304,28 @@ static inline int ext4_should_writeback_data(struct inode *inode)
>  	return 0;
>  }
>  
> +/*
> + * This function controls whether or not we should try to go down the
> + * dioread_nolock code paths, which makes it safe to avoid taking
> + * i_mutex for direct I/O reads.  This only works for extent-based
> + * files, and it doesn't work for nobh or if data journaling is
> + * enabled, since the dioread_nolock code uses b_private to pass
> + * information back to the I/O completion handler, and this conflicts
> + * with the jbd's use of b_private.
> + */
> +static inline int ext4_should_dioread_nolock(struct inode *inode)
> +{
> +	if (!test_opt(inode->i_sb, DIOREAD_NOLOCK))
> +		return 0;
> +	if (test_opt(inode->i_sb, NOBH))
> +		return 0;
> +	if (!S_ISREG(inode->i_mode))
> +		return 0;
> +	if (EXT4_I(inode)->i_flags & EXT4_EXTENTS_FL)

Oops, this was an embarrassing typo.  This should have been:

+	if (!(EXT4_I(inode)->i_flags & EXT4_EXTENTS_FL))

Thanks to Jiaying for pointing this out.

						- Ted
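
[Editorial sketch, not part of the original mail] With the correction
above applied, the predicate can be modelled in userspace as shown
below. This is only a sketch under stated assumptions: struct
model_inode and its fields are hypothetical stand-ins for the mount
options and inode state that the kernel function reads through
test_opt(), S_ISREG(), i_flags and ext4_should_journal_data().

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for the state the kernel predicate inspects. */
struct model_inode {
	bool opt_dioread_nolock;	/* mounted with dioread_nolock */
	bool opt_nobh;			/* mounted with nobh */
	bool is_regular;		/* S_ISREG(inode->i_mode) */
	bool uses_extents;		/* i_flags & EXT4_EXTENTS_FL */
	bool journals_data;		/* ext4_should_journal_data(inode) */
};

/* Returns 1 if the lockless direct I/O read path may be used. */
static int model_should_dioread_nolock(const struct model_inode *ino)
{
	assert(ino != NULL);
	if (!ino->opt_dioread_nolock)
		return 0;
	if (ino->opt_nobh)
		return 0;
	if (!ino->is_regular)
		return 0;
	if (!ino->uses_extents)		/* the corrected, negated test */
		return 0;
	if (ino->journals_data)
		return 0;
	return 1;
}
```

Without the negation, the check rejected exactly the extent-mapped
files the feature is meant for and accepted indirect-block files, which
cannot carry the uninitialized state.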


* Re: [PATCH v4 2/3] ext4: use ext4_get_block_write in buffer write
  2010-01-15 19:30 ` [PATCH v4 2/3] ext4: use ext4_get_block_write in " Theodore Ts'o
  2010-01-16  2:17   ` tytso
@ 2010-01-17 14:21   ` Aneesh Kumar K. V
  2010-01-18  5:25     ` Jiaying Zhang
  1 sibling, 1 reply; 23+ messages in thread
From: Aneesh Kumar K. V @ 2010-01-17 14:21 UTC (permalink / raw)
  To: Theodore Ts'o, Ext4 Developers List; +Cc: Theodore Ts'o, Jiaying Zhang

On Fri, 15 Jan 2010 14:30:11 -0500, "Theodore Ts'o" <tytso@mit.edu> wrote:
> Allocate uninitialized extent before ext4 buffer write and
> convert the extent to initialized after io completes.
> The purpose is to make sure an extent can only be marked
> initialized after it has been written with new data so
> we can safely drop the i_mutex lock in ext4 DIO read without
> exposing stale data. This helps to improve multi-thread DIO
> read performance on high-speed disks.
> 
> Skip the nobh and data=journal mount cases to make things simple for now.
> 
> Signed-off-by: Jiaying Zhang <jiayingz@google.com>
> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
> ---
>  fs/ext4/ext4.h      |   12 +++++-
>  fs/ext4/ext4_jbd2.h |   24 ++++++++++++
>  fs/ext4/extents.c   |   22 ++++++-----
>  fs/ext4/inode.c     |  105 ++++++++++++++++++++++++++++++++++++++++----------
>  fs/ext4/super.c     |   30 +++++++++++++--
>  5 files changed, 157 insertions(+), 36 deletions(-)
> 
> diff --git a/fs/ext4/ext4.h b/fs/ext4/ext4.h
> index b1dcbb7..b8b4887 100644
> --- a/fs/ext4/ext4.h
> +++ b/fs/ext4/ext4.h
> @@ -134,6 +134,7 @@ struct mpage_da_data {
>  	int retval;
>  };
>  #define	EXT4_IO_UNWRITTEN	0x1
> +#define	EXT4_IO_WRITTEN		0x2
>  typedef struct ext4_io_end {
>  	struct list_head	list;		/* per-file finished AIO list */
>  	struct inode		*inode;		/* file being written to */
> @@ -370,7 +371,7 @@ struct ext4_new_group_data {
>  					 EXT4_GET_BLOCKS_CREATE_UNINIT_EXT)
>  	/* Convert extent to initialized after IO complete */
>  #define EXT4_GET_BLOCKS_IO_CONVERT_EXT		(EXT4_GET_BLOCKS_CONVERT|\
> -					 EXT4_GET_BLOCKS_IO_CREATE_EXT)
> +					 EXT4_GET_BLOCKS_CREATE_UNINIT_EXT)
> 
>  /*
>   * Flags used by ext4_free_blocks
> @@ -761,6 +762,7 @@ struct ext4_inode_info {
>  #define EXT4_MOUNT_QUOTA		0x80000 /* Some quota option set */
>  #define EXT4_MOUNT_USRQUOTA		0x100000 /* "old" user quota */
>  #define EXT4_MOUNT_GRPQUOTA		0x200000 /* "old" group quota */
> +#define EXT4_MOUNT_DIOREAD_NOLOCK	0x400000 /* Enable support for dio read nolocking */
>  #define EXT4_MOUNT_JOURNAL_CHECKSUM	0x800000 /* Journal checksums */
>  #define EXT4_MOUNT_JOURNAL_ASYNC_COMMIT	0x1000000 /* Journal Async Commit */
>  #define EXT4_MOUNT_I_VERSION            0x2000000 /* i_version support */
> @@ -1774,6 +1776,14 @@ static inline void set_bitmap_uptodate(struct buffer_head *bh)
>  	set_bit(BH_BITMAP_UPTODATE, &(bh)->b_state);
>  }
> 
> +/* BH_Uninit flag: blocks are allocated but uninitialized on disk */
> +enum ext4_state_bits {
> +	BH_Uninit	/* blocks are allocated but uninitialized on disk */
> +	  = BH_JBDPrivateStart,
> +};
> +
> +BUFFER_FNS(Uninit, uninit)
> +


I asked this in the last post. Why do we need a new buffer head flag?
Why can't we use the unwritten flag?

-aneesh
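
[Editorial sketch, not part of the original mail] For readers who do
not know the macro being discussed: BUFFER_FNS(Uninit, uninit) generates
set_buffer_uninit(), clear_buffer_uninit() and buffer_uninit() for the
new bit. A simplified userspace model of that pattern might look like
the code below. The model_* names are hypothetical, the bit number 20 is
an assumed placeholder for BH_JBDPrivateStart, and the real helpers use
atomic bit operations, which this sketch does not.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for struct buffer_head; only b_state is modelled. */
struct model_bh {
	unsigned long b_state;
};

/*
 * Simplified, non-atomic imitation of the kernel's BUFFER_FNS() macro:
 * it generates a set, a clear and a test helper for one state bit.
 */
#define MODEL_BUFFER_FNS(bit, name)					\
static inline void set_model_buffer_##name(struct model_bh *bh)	\
{									\
	assert(bh != NULL);						\
	bh->b_state |= 1UL << (bit);					\
}									\
static inline void clear_model_buffer_##name(struct model_bh *bh)	\
{									\
	assert(bh != NULL);						\
	bh->b_state &= ~(1UL << (bit));					\
}									\
static inline int model_buffer_##name(const struct model_bh *bh)	\
{									\
	assert(bh != NULL);						\
	return (int)((bh->b_state >> (bit)) & 1UL);			\
}

/* Assumed placeholder for the first filesystem-private bit. */
enum { MODEL_BH_PRIVATE_START = 20 };

enum model_state_bits {
	MODEL_BH_Uninit = MODEL_BH_PRIVATE_START,
};

MODEL_BUFFER_FNS(MODEL_BH_Uninit, uninit)
```

Reusing the existing unwritten flag would avoid spending a private bit,
at the cost of overloading a flag that ext4_get_block already gives a
meaning to.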



* Re: [PATCH v4 2/3] ext4: use ext4_get_block_write in buffer write
  2010-01-17 14:21   ` Aneesh Kumar K. V
@ 2010-01-18  5:25     ` Jiaying Zhang
  0 siblings, 0 replies; 23+ messages in thread
From: Jiaying Zhang @ 2010-01-18  5:25 UTC (permalink / raw)
  To: Aneesh Kumar K. V; +Cc: Theodore Ts'o, Ext4 Developers List

I agree that the unwritten flag would be a better choice. I considered
using it at the beginning but found it tricky to get it to work.
See e.g. the unwritten flag usage in the current ext4_get_block.
I guess that at some later point we should clean up the buffer head
flag usage.

Jiaying

On Sun, Jan 17, 2010 at 6:21 AM, Aneesh Kumar K. V
<aneesh.kumar@linux.vnet.ibm.com> wrote:
>
> On Fri, 15 Jan 2010 14:30:11 -0500, "Theodore Ts'o" <tytso@mit.edu> wrote:
> > Allocate uninitialized extent before ext4 buffer write and
> > convert the extent to initialized after io completes.
> > The purpose is to make sure an extent can only be marked
> > initialized after it has been written with new data so
> > we can safely drop the i_mutex lock in ext4 DIO read without
> > exposing stale data. This helps to improve multi-thread DIO
> > read performance on high-speed disks.
> >
> > Skip the nobh and data=journal mount cases to make things simple for now.
> >
> > Signed-off-by: Jiaying Zhang <jiayingz@google.com>
> > Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
> > ---
> >  fs/ext4/ext4.h      |   12 +++++-
> >  fs/ext4/ext4_jbd2.h |   24 ++++++++++++
> >  fs/ext4/extents.c   |   22 ++++++-----
> >  fs/ext4/inode.c     |  105 ++++++++++++++++++++++++++++++++++++++++----------
> >  fs/ext4/super.c     |   30 +++++++++++++--
> >  5 files changed, 157 insertions(+), 36 deletions(-)
> >
> > diff --git a/fs/ext4/ext4.h b/fs/ext4/ext4.h
> > index b1dcbb7..b8b4887 100644
> > --- a/fs/ext4/ext4.h
> > +++ b/fs/ext4/ext4.h
> > @@ -134,6 +134,7 @@ struct mpage_da_data {
> >       int retval;
> >  };
> >  #define      EXT4_IO_UNWRITTEN       0x1
> > +#define      EXT4_IO_WRITTEN         0x2
> >  typedef struct ext4_io_end {
> >       struct list_head        list;           /* per-file finished AIO list */
> >       struct inode            *inode;         /* file being written to */
> > @@ -370,7 +371,7 @@ struct ext4_new_group_data {
> >                                        EXT4_GET_BLOCKS_CREATE_UNINIT_EXT)
> >       /* Convert extent to initialized after IO complete */
> >  #define EXT4_GET_BLOCKS_IO_CONVERT_EXT               (EXT4_GET_BLOCKS_CONVERT|\
> > -                                      EXT4_GET_BLOCKS_IO_CREATE_EXT)
> > +                                      EXT4_GET_BLOCKS_CREATE_UNINIT_EXT)
> >
> >  /*
> >   * Flags used by ext4_free_blocks
> > @@ -761,6 +762,7 @@ struct ext4_inode_info {
> >  #define EXT4_MOUNT_QUOTA             0x80000 /* Some quota option set */
> >  #define EXT4_MOUNT_USRQUOTA          0x100000 /* "old" user quota */
> >  #define EXT4_MOUNT_GRPQUOTA          0x200000 /* "old" group quota */
> > +#define EXT4_MOUNT_DIOREAD_NOLOCK    0x400000 /* Enable support for dio read nolocking */
> >  #define EXT4_MOUNT_JOURNAL_CHECKSUM  0x800000 /* Journal checksums */
> >  #define EXT4_MOUNT_JOURNAL_ASYNC_COMMIT      0x1000000 /* Journal Async Commit */
> >  #define EXT4_MOUNT_I_VERSION            0x2000000 /* i_version support */
> > @@ -1774,6 +1776,14 @@ static inline void set_bitmap_uptodate(struct buffer_head *bh)
> >       set_bit(BH_BITMAP_UPTODATE, &(bh)->b_state);
> >  }
> >
> > +/* BH_Uninit flag: blocks are allocated but uninitialized on disk */
> > +enum ext4_state_bits {
> > +     BH_Uninit       /* blocks are allocated but uninitialized on disk */
> > +       = BH_JBDPrivateStart,
> > +};
> > +
> > +BUFFER_FNS(Uninit, uninit)
> > +
>
>
> I asked this in the last post. Why do we need a new buffer head flag?
> Why can't we use the unwritten flag?
>
> -aneesh
>
--
To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


* [PATCH v4 3/3] ext4: Use direct_IO_no_locking in ext4 dio read.
  2010-01-15 19:30 [PATCH v4 0/3] dioread_nolock patch Theodore Ts'o
  2010-01-15 19:30 ` [PATCH v4 1/3] ext4: mechanical change on dio get_block code in prepare for it to be used by buffer write Theodore Ts'o
  2010-01-15 19:30 ` [PATCH v4 2/3] ext4: use ext4_get_block_write in " Theodore Ts'o
@ 2010-01-15 19:30 ` Theodore Ts'o
  2010-01-17 14:19   ` Aneesh Kumar K. V
  2010-01-15 19:39 ` [PATCH v4 0/3] dioread_nolock patch Ric Wheeler
                   ` (2 subsequent siblings)
  5 siblings, 1 reply; 23+ messages in thread
From: Theodore Ts'o @ 2010-01-15 19:30 UTC (permalink / raw)
  To: Ext4 Developers List; +Cc: Theodore Ts'o, Jiaying Zhang

Signed-off-by: Jiaying Zhang <jiayingz@google.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
---
 fs/ext4/inode.c |    9 ++++++++-
 1 files changed, 8 insertions(+), 1 deletions(-)

diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
index 1f56484..ec0bbdd 100644
--- a/fs/ext4/inode.c
+++ b/fs/ext4/inode.c
@@ -3419,7 +3419,14 @@ static ssize_t ext4_ind_direct_IO(int rw, struct kiocb *iocb,
 	}
 
 retry:
-	ret = blockdev_direct_IO(rw, iocb, inode, inode->i_sb->s_bdev, iov,
+	if (rw == READ && ext4_should_dioread_nolock(inode))
+		ret = blockdev_direct_IO_no_locking(rw, iocb, inode,
+				 inode->i_sb->s_bdev, iov,
+				 offset, nr_segs,
+				 ext4_get_block, NULL);
+	else
+		ret = blockdev_direct_IO(rw, iocb, inode,
+				 inode->i_sb->s_bdev, iov,
 				 offset, nr_segs,
 				 ext4_get_block, NULL);
 	if (ret == -ENOSPC && ext4_should_retry_alloc(inode->i_sb, &retries))
-- 
1.6.5.216.g5288a.dirty



* Re: [PATCH v4 3/3] ext4: Use direct_IO_no_locking in ext4 dio read.
  2010-01-15 19:30 ` [PATCH v4 3/3] ext4: Use direct_IO_no_locking in ext4 dio read Theodore Ts'o
@ 2010-01-17 14:19   ` Aneesh Kumar K. V
  0 siblings, 0 replies; 23+ messages in thread
From: Aneesh Kumar K. V @ 2010-01-17 14:19 UTC (permalink / raw)
  To: Theodore Ts'o, Ext4 Developers List; +Cc: Theodore Ts'o, Jiaying Zhang

On Fri, 15 Jan 2010 14:30:12 -0500, "Theodore Ts'o" <tytso@mit.edu> wrote:
> Signed-off-by: Jiaying Zhang <jiayingz@google.com>
> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
> ---
>  fs/ext4/inode.c |    9 ++++++++-
>  1 files changed, 8 insertions(+), 1 deletions(-)



This needs a commit message explaining why we can use
direct_IO_no_locking now.



> 
> diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
> index 1f56484..ec0bbdd 100644
> --- a/fs/ext4/inode.c
> +++ b/fs/ext4/inode.c
> @@ -3419,7 +3419,14 @@ static ssize_t ext4_ind_direct_IO(int rw, struct kiocb *iocb,
>  	}
> 
>  retry:
> -	ret = blockdev_direct_IO(rw, iocb, inode, inode->i_sb->s_bdev, iov,
> +	if (rw == READ && ext4_should_dioread_nolock(inode))
> +		ret = blockdev_direct_IO_no_locking(rw, iocb, inode,
> +				 inode->i_sb->s_bdev, iov,
> +				 offset, nr_segs,
> +				 ext4_get_block, NULL);
> +	else
> +		ret = blockdev_direct_IO(rw, iocb, inode,
> +				 inode->i_sb->s_bdev, iov,
>  				 offset, nr_segs,
>  				 ext4_get_block, NULL);
>  	if (ret == -ENOSPC && ext4_should_retry_alloc(inode->i_sb, &retries))
> -- 
> 1.6.5.216.g5288a.dirty


-aneesh
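
[Editorial sketch, not part of the original mail] The reasoning such a
commit message would need is the one given in patch 2/3: a block is
allocated as uninitialized, and only marked initialized after the new
data is on disk, so a reader that skips i_mutex sees either zeros or the
new data, never the stale contents. A toy userspace model of that
ordering, assuming hypothetical model_* names and a 4-byte "block",
could be:

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

#define MODEL_BLOCK 4

/* Hypothetical one-block extent: a state bit plus the on-disk bytes. */
struct model_extent {
	bool initialized;
	unsigned char disk[MODEL_BLOCK];
};

/* Step 1: allocate as uninitialized; old disk contents are still there. */
static void model_allocate(struct model_extent *ex)
{
	assert(ex != NULL);
	ex->initialized = false;
}

/* Step 2: the write I/O puts the new data on disk. */
static void model_write_io(struct model_extent *ex, const unsigned char *data)
{
	assert(ex != NULL && data != NULL);
	memcpy(ex->disk, data, MODEL_BLOCK);
}

/* Step 3: the completion handler converts the extent to initialized. */
static void model_end_io(struct model_extent *ex)
{
	assert(ex != NULL);
	ex->initialized = true;
}

/* Lockless reader: an uninitialized extent reads back as zeros. */
static void model_dio_read(const struct model_extent *ex, unsigned char *out)
{
	assert(ex != NULL && out != NULL);
	if (!ex->initialized)
		memset(out, 0, MODEL_BLOCK);
	else
		memcpy(out, ex->disk, MODEL_BLOCK);
}
```

At whichever of the three steps the reader runs, the stale bytes left
in the block by its previous owner are never returned.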


* Re: [PATCH v4 0/3] dioread_nolock patch
  2010-01-15 19:30 [PATCH v4 0/3] dioread_nolock patch Theodore Ts'o
                   ` (2 preceding siblings ...)
  2010-01-15 19:30 ` [PATCH v4 3/3] ext4: Use direct_IO_no_locking in ext4 dio read Theodore Ts'o
@ 2010-01-15 19:39 ` Ric Wheeler
  2010-01-15 19:52 ` Eric Sandeen
  2010-02-16 21:07 ` Darrick J. Wong
  5 siblings, 0 replies; 23+ messages in thread
From: Ric Wheeler @ 2010-01-15 19:39 UTC (permalink / raw)
  To: Theodore Ts'o; +Cc: Ext4 Developers List, Chris Mason

On 01/15/2010 02:30 PM, Theodore Ts'o wrote:
> I've worked with Jiaying to ready this patch for submission.
>
> It's currently a mount option for maximum safety, but after we do some
> benchmarking to make sure it doesn't degrade performance for buffered
> writes, we may want to make this the default.  Once really nice side
> effect of this patch is that it effectively gives us "guarded mode" by
> default, since the blocks are marked as uninitialized and only converted
> to be initialized when the I/O has completed for both buffered and
> direct I/O writes now.  This means that we could possibly change the
> default mode to be data=writeback if the extents feature is enabled,
> since data=ordered would only needed for safety when writing new
> old-style indirect blocks.
>
> The plan is to merge this for 2.6.34.  I've looked this over pretty
> carefully, but another pair of eyes would be appreciated, especially if
> we make this the default.  Beyond the advantages of being able to use
> data=writeback, I believe this should be a major win for database
> workloads.
>
> 					- Ted
>    

I would be really cautious about turning this on unless we are 100% 
certain that we have not introduced data integrity issues. Performance 
testing is great, but we need to work hard on the power failure testing, 
etc as well....

Whatever happened to guarded mode? Is it still lurking out there?

ric


> Theodore Ts'o (3):
>    ext4: mechanical change on dio get_block code in prepare for it to be
>      used by buffer write
>    ext4: use ext4_get_block_write in buffer write
>    ext4: Use direct_IO_no_locking in ext4 dio read.
>
>   fs/ext4/ext4.h      |   28 +++++---
>   fs/ext4/ext4_jbd2.h |   24 +++++++
>   fs/ext4/extents.c   |   36 +++++-----
>   fs/ext4/fsync.c     |    2 +-
>   fs/ext4/inode.c     |  192 +++++++++++++++++++++++++++++++++-----------------
>   fs/ext4/super.c     |   32 +++++++--
>   6 files changed, 217 insertions(+), 97 deletions(-)
>



* Re: [PATCH v4 0/3] dioread_nolock patch
  2010-01-15 19:30 [PATCH v4 0/3] dioread_nolock patch Theodore Ts'o
                   ` (3 preceding siblings ...)
  2010-01-15 19:39 ` [PATCH v4 0/3] dioread_nolock patch Ric Wheeler
@ 2010-01-15 19:52 ` Eric Sandeen
  2010-01-15 20:15   ` tytso
  2010-02-16 21:07 ` Darrick J. Wong
  5 siblings, 1 reply; 23+ messages in thread
From: Eric Sandeen @ 2010-01-15 19:52 UTC (permalink / raw)
  To: Theodore Ts'o; +Cc: Ext4 Developers List

Theodore Ts'o wrote:
> I've worked with Jiaying to ready this patch for submission.
> 
> It's currently a mount option for maximum safety, but after we do some
> benchmarking to make sure it doesn't degrade performance for buffered
> writes, we may want to make this the default.  Once really nice side
> effect of this patch is that it effectively gives us "guarded mode" by
> default, since the blocks are marked as uninitialized and only converted
> to be initialized when the I/O has completed for both buffered and
> direct I/O writes now.  This means that we could possibly change the
> default mode to be data=writeback if the extents feature is enabled,
> since data=ordered would only needed for safety when writing new
> old-style indirect blocks.

At least as far as that last bit goes, simply having the extents
feature is not sufficient; we allow both formats of files to exist
on a filesystem with the extents feature turned on.

As to the general idea I'll have to give it more thought. :)

-Eric



* Re: [PATCH v4 0/3] dioread_nolock patch
  2010-01-15 19:52 ` Eric Sandeen
@ 2010-01-15 20:15   ` tytso
  2010-01-15 20:17     ` Eric Sandeen
  0 siblings, 1 reply; 23+ messages in thread
From: tytso @ 2010-01-15 20:15 UTC (permalink / raw)
  To: Eric Sandeen; +Cc: Ext4 Developers List

On Fri, Jan 15, 2010 at 01:52:45PM -0600, Eric Sandeen wrote:
> 
> At least as far as that last bit goes, simply having the extents
> feature is not sufficient; we allow both formats of files to exist
> on a filesystem with the extents feature turned on.

... and I guess someone could be appending to a legacy file when the
system crashes.  I suppose we can at least exempt extent files from
ordered mode handling.

> As to the general idea I'll have to give it more thought. :)

Yeah, and we need to do a lot of performance and functional testing.
Jiaying has done a lot of testing of this in the past couple of
months, but more testing, especially power fail testing, is definitely
a good thing.  I also want to do power fail testing for journal
checksums and async commits so we can turn that feature on by default,
since with those features enabled, it almost doubles fs_mark
performance.  (Async commit is now badly named; what it does is
reduce the number of write barriers needed from two per commit to
just one.  But we do need to test it some more...)

This was more of a statement of intentions than a "we'll turn this on
by default in 2.6.34".  I figure we'll merge first, and then change
the default later, and still later we'll simplify the code paths by
removing the old code path.

Speaking of which, something more to think about --- does anybody
still care about nobh mode?  It was necessary to preserve lowmem for
32-bit kernels with lots of memory, and it was mainly useful for
database workloads.  But with 64-bit kernels, it's not clear the
tradeoffs of not caching the block number are really worth it any
more.  What would people think about potentially dropping the nobh
option and write paths from ext4?

     	       	     	     - Ted


* Re: [PATCH v4 0/3] dioread_nolock patch
  2010-01-15 20:15   ` tytso
@ 2010-01-15 20:17     ` Eric Sandeen
  2010-01-15 21:47       ` Michael Rubin
  0 siblings, 1 reply; 23+ messages in thread
From: Eric Sandeen @ 2010-01-15 20:17 UTC (permalink / raw)
  To: tytso; +Cc: Ext4 Developers List

tytso@mit.edu wrote:
> On Fri, Jan 15, 2010 at 01:52:45PM -0600, Eric Sandeen wrote:
>> At least as far as that last bit goes, simply having the extents
>> feature is not sufficient; we allow both formats of files to exist
>> on a filesystem with the extents feature turned on.
> 
> ... and I guess someone could be appending to a legacy file when the
> system crashes.  I suppose we can at least exempt extent files from
> ordered mode handling.
> 
>> As to the general idea I'll have to give it more thought. :)
> 
> Yeah, and we need to do a lot of performance and functional testing.
> Jiaying has done a lot of testing of this in the past couple of
> months, but more testing, especially power fail testing, is definitely
> a good thing.  I also want to do power fail testing for journal
> checksums and async commits so we can turn that feature on by default,
> since with those features enabled, it almost doubles fs_mark
> performance.  (Async commit is now badly named, what it does is
> reduces the number of write barriers needed from two per commit to
> just one.  But we do need to test it some more...)

At one point google was planning to devise a power-fail test
harness.  Any news on that?

> This was more of a statement of intentions than a "we'll turn this on
> by default in 2.6.34".  I figure we'll merge first, and then change
> the default later, and still later we'll simplify the code paths by
> removing the old code path.
> 
> Speaking of which, something more to think about --- does anybody
> still care about nobh mode?  It was necessary to preserve lowmem for
> 32-bit kernels with lots of memory, and it was mainly useful for
> database workloads.  But with 64-bit kernels, it's not clear the
> tradeoffs of not caching the block number are really worth it any
> more.  What would people think about potentially dropping the nobh
> option and write paths from ext4?

I have no special love for it personally, and I don't run into
fedora users or red hat customers using it, as far as I know.

-Eric



* Re: [PATCH v4 0/3] dioread_nolock patch
  2010-01-15 20:17     ` Eric Sandeen
@ 2010-01-15 21:47       ` Michael Rubin
  2010-01-22 20:47         ` Valerie Aurora
  0 siblings, 1 reply; 23+ messages in thread
From: Michael Rubin @ 2010-01-15 21:47 UTC (permalink / raw)
  To: Eric Sandeen, Jiaying Zhang; +Cc: tytso, Ext4 Developers List

On Fri, Jan 15, 2010 at 12:17 PM, Eric Sandeen <sandeen@redhat.com> wrote:
> tytso@mit.edu wrote:
> At one point google was planning to devise a power-fail test
> harness.  Any news on that?

We completed the tests. But there is good news and bad news. The good
news is that we were able to shake out a lot of bugs in the no journal
case (which have already been submitted upstream). We now can fairly
quickly and easily drive a lot of traffic to a system and then cut the
power, issue a panic or other event.

The bad news is that I was hoping to use open source tools to drive
the traffic. The goal would be to allow everyone to reproduce the
experiment. We got a little short handed on resources and ended up
using Google closed source workloads instead. This was mostly since we
were in a rush and able to get the network traffic up in one day with
those tools.

I was holding back on publishing the results since I was hoping we
would be able to generate traffic in a more open manner. But if you
are interested in the results anyway then give me or someone on the
team a day or two to dig them up.

In any case we have powerfail testing as part of our validation and
once Jiayingz is satisfied with the patch we will be running those
tests.

mrubin

* Re: [PATCH v4 0/3] dioread_nolock patch
  2010-01-15 21:47       ` Michael Rubin
@ 2010-01-22 20:47         ` Valerie Aurora
  2010-02-20  0:56           ` Michael Rubin
  0 siblings, 1 reply; 23+ messages in thread
From: Valerie Aurora @ 2010-01-22 20:47 UTC (permalink / raw)
  To: Michael Rubin; +Cc: Eric Sandeen, Jiaying Zhang, tytso, Ext4 Developers List

On Fri, Jan 15, 2010 at 01:47:26PM -0800, Michael Rubin wrote:
> On Fri, Jan 15, 2010 at 12:17 PM, Eric Sandeen <sandeen@redhat.com> wrote:
> > tytso@mit.edu wrote:
> > At one point google was planning to devise a power-fail test
> > harness.  Any news on that?
> 
> We completed the tests. But there is good news and bad news. The good
> news is that we were able to shake out a lot of bugs in the no journal
> case (which have already been submitted upstream). We now can fairly
> quickly and easily drive a lot of traffic to a system and then cut the
> power, issue a panic or other event.
> 
> The bad news is that I was hoping to use open source tools to drive
> the traffic. The goal would be to allow everyone to reproduce the
> experiment. We got a little short handed on resources and ended up
> using Google closed source workloads instead. This was mostly since we
> were in a rush and able to get the network traffic up in one day with
> those tools.
> 
> I was holding back on publishing the results since I was hoping we
> would be able to generate traffic in a more open manner. But if you
> are interested in the results anyway then give me or some one on the
> team a day or two to dig them up.

Don't let the perfect be the enemy of the good. :) I'd love to see
what you've got now, even if you have to leave out the closed part of it.

-VAL

> In any case we have powerfail testing as part of our validation and
> once Jiayingz is satisfied with the patch we will be running those
> tests.
> 
> mrubin

* Re: [PATCH v4 0/3] dioread_nolock patch
  2010-01-22 20:47         ` Valerie Aurora
@ 2010-02-20  0:56           ` Michael Rubin
  2010-02-23  0:36             ` Andreas Dilger
  0 siblings, 1 reply; 23+ messages in thread
From: Michael Rubin @ 2010-02-20  0:56 UTC (permalink / raw)
  To: Valerie Aurora; +Cc: Eric Sandeen, Jiaying Zhang, tytso, Ext4 Developers List

On Fri, Jan 22, 2010 at 12:47 PM, Valerie Aurora <vaurora@redhat.com> wrote:
> Don't let the perfect be the enemy of the good. :) I'd love to see
> what you've got now, even if you have to leave out the closed part of it.
>

We are currently reviewing a paper to send out this data.
Sorry to take so long but we have been very busy with the ext4 upgrade.
We plan on publishing other papers about our experiences with ext4
this summer also.
Out of curiosity is there anything anyone is curious about in the file
system space?

mrubin


* Re: [PATCH v4 0/3] dioread_nolock patch
  2010-02-20  0:56           ` Michael Rubin
@ 2010-02-23  0:36             ` Andreas Dilger
  0 siblings, 0 replies; 23+ messages in thread
From: Andreas Dilger @ 2010-02-23  0:36 UTC (permalink / raw)
  To: Michael Rubin
  Cc: Valerie Aurora, Eric Sandeen, Jiaying Zhang, tytso,
	Ext4 Developers List

On 2010-02-19, at 17:56, Michael Rubin wrote:
> On Fri, Jan 22, 2010 at 12:47 PM, Valerie Aurora  
> <vaurora@redhat.com> wrote:
>> Don't let the perfect be the enemy of the good. :) I'd love to see
>> what you've got now, even if you have to leave out the closed part  
>> of it.
>
> We are currently reviewing a paper to send out this data.
> Sorry to take so long but we have been very busy with the ext4  
> upgrade.
> We plan on publishing other papers about our experiences with ext4
> this summer also.
> Out of curiosity is there anything anyone is curious about in the file
> system space?


What are you asking for, in particular?  I'd of course be interested  
to know if you have numbers for bandwidth/latency improvements for  
ext4, e2fsck time improvements (if the system is doing this), etc.

Cheers, Andreas
--
Andreas Dilger
Sr. Staff Engineer, Lustre Group
Sun Microsystems of Canada, Inc.



* Re: [PATCH v4 0/3] dioread_nolock patch
  2010-01-15 19:30 [PATCH v4 0/3] dioread_nolock patch Theodore Ts'o
                   ` (4 preceding siblings ...)
  2010-01-15 19:52 ` Eric Sandeen
@ 2010-02-16 21:07 ` Darrick J. Wong
  2010-02-17 19:34   ` Jiaying Zhang
  5 siblings, 1 reply; 23+ messages in thread
From: Darrick J. Wong @ 2010-02-16 21:07 UTC (permalink / raw)
  To: Theodore Ts'o; +Cc: Ext4 Developers List

On Fri, Jan 15, 2010 at 02:30:09PM -0500, Theodore Ts'o wrote:

> The plan is to merge this for 2.6.34.  I've looked this over pretty
> carefully, but another pair of eyes would be appreciated, especially if

I don't have a high speed disk but it was suggested that I give this patchset a
whirl anyway, so down the rabbit hole I went.  I created a 16GB ext4 image in
an equally big tmpfs, then ran the read/readall directio tests in ffsb to see
if I could observe any difference.  The kernel is 2.6.33-rc8, and the machine
in question has 2 Xeon E5335 processors and 24GB of RAM.  I reran the test
several times, with varying thread counts, to produce the table below.  The
units are MB/s.

For the dio_lock case, mount options were: rw,relatime,barrier=1,data=ordered.
For the dio_nolock case, they were: rw,relatime,barrier=1,data=ordered,dioread_nolock.

         dio_nolock        dio_lock
threads  read    readall   read    readall
1        37.6    149       39      159
2        59.2    245       62.4    246
4        114     453       112     445
8        111     444       115     459
16       109     442       113     448
32       114     443       121     484
64       106     422       108     434
128      104     417       101     393
256      101     412       90.5    366
512      93.3    377       84.8    349
1000     87.1    353       88.7    348

It would seem that the old code paths are faster with a small number of
threads, but the new patch seems to be faster when the thread counts become
very high.  That said, I'm not all that familiar with what exactly tmpfs does,
or how well it mimics an SSD (though I wouldn't be surprised to hear
"poorly").  This of course makes me wonder--do other people see results like
this, or is this particular to my harebrained setup?
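
The comparison described above can be made explicit by computing, for each
thread count, how far the dio_nolock readall figure sits from the dio_lock
one.  What follows is only a minimal sketch: the numbers are copied from the
readall columns of the table, while the function names are made up for
illustration and are not part of ffsb or the patch set.

```python
# Readall throughput in MB/s, copied from the table above.
THREADS = [1, 2, 4, 8, 16, 32, 64, 128, 256, 512, 1000]
NOLOCK_READALL = [149, 245, 453, 444, 442, 443, 422, 417, 412, 377, 353]
LOCK_READALL = [159, 246, 445, 459, 448, 484, 434, 393, 366, 349, 348]


def relative_change(nolock, lock):
    """Percent change of dio_nolock relative to the dio_lock baseline."""
    return (nolock - lock) / lock * 100.0


def summarize(threads, nolock, lock):
    """Return (thread count, percent change rounded to one decimal) pairs."""
    return [
        (t, round(relative_change(n, l), 1))
        for t, n, l in zip(threads, nolock, lock)
    ]


for count, pct in summarize(THREADS, NOLOCK_READALL, LOCK_READALL):
    print(f"{count:>5} threads: {pct:+.1f}%")
```

A negative value means the old locked path was faster at that thread count.
Without the per-run spread these are single-sample differences, so they
should be read as a rough indication only.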

For that matter, do I need to have more patches than just 2.6.33-rc8 and the
four posted in this thread?

I also observed that I could make the kernel spit up "Process hung for more
than 120s!" messages if I happened to be running ffsb on a real disk during a
heavy directio write load.  I'll poke around on that a little more and write
back when I have more details.

For poweroff testing, could one simulate a power failure by running IO
workloads in a VM and then SIGKILLing the VM?  I don't remember seeing any sort
of powerfail test suite from the Googlers, but my mail client has been drinking
out of firehoses lately. ;)
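
The SIGKILL idea could be sketched roughly as below.  This is a hypothetical
helper, not an existing test suite: the function name is invented and the
command to run (the hypervisor command line that boots the guest and its IO
workload) is left entirely to the caller.  One caveat worth stating: killing
the hypervisor process throws away guest memory, but anything the hypervisor
has already written into the host page cache survives, so unless the disk
image is opened in a mode that bypasses the host cache this is probably
closer to a guest crash than to a real power cut.

```python
import signal
import subprocess
import sys
import time


def run_then_sigkill(cmd, delay_s):
    """Start cmd, let it run for delay_s seconds, then SIGKILL it.

    Returns the exit status reported by wait(); on POSIX a process killed
    by a signal reports the negated signal number.
    """
    proc = subprocess.Popen(cmd)
    time.sleep(delay_s)
    # Since Python 3.9, send_signal() does nothing if the process has
    # already exited.
    proc.send_signal(signal.SIGKILL)
    return proc.wait()


# Stand-in for the VM: a child process that would otherwise sleep for 30s.
stand_in = [sys.executable, "-c", "import time; time.sleep(30)"]
print("exit status:", run_then_sigkill(stand_in, 0.2))
```

After the kill, the disk image would be checked with e2fsck and the file
contents compared against what the workload believes it had synced.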

--D


* Re: [PATCH v4 0/3] dioread_nolock patch
  2010-02-16 21:07 ` Darrick J. Wong
@ 2010-02-17 19:34   ` Jiaying Zhang
  2010-02-19 21:25     ` Darrick J. Wong
  0 siblings, 1 reply; 23+ messages in thread
From: Jiaying Zhang @ 2010-02-17 19:34 UTC (permalink / raw)
  To: djwong; +Cc: Theodore Ts'o, Ext4 Developers List

Hi Darrick,

Thank you for running these tests!

On Tue, Feb 16, 2010 at 1:07 PM, Darrick J. Wong <djwong@us.ibm.com> wrote:
> On Fri, Jan 15, 2010 at 02:30:09PM -0500, Theodore Ts'o wrote:
>
>> The plan is to merge this for 2.6.34.  I've looked this over pretty
>> carefully, but another pair of eyes would be appreciated, especially if
>
> I don't have a high speed disk but it was suggested that I give this patchset a
> whirl anyway, so down the rabbit hole I went.  I created a 16GB ext4 image in
> an equally big tmpfs, then ran the read/readall directio tests in ffsb to see
> if I could observe any difference.  The kernel is 2.6.33-rc8, and the machine
> in question has 2 Xeon E5335 processors and 24GB of RAM.  I reran the test
> several times, with varying thread counts, to produce the table below.  The
> units are MB/s.
>
> For the dio_lock case, mount options were: rw,relatime,barrier=1,data=ordered.
> For the dio_nolock case, they were: rw,relatime,barrier=1,data=ordered,dioread_nolock.
>
>        dio_nolock      dio_lock
> threads read    readall read    readall
> 1       37.6    149     39      159
> 2       59.2    245     62.4    246
> 4       114     453     112     445
> 8       111     444     115     459
> 16      109     442     113     448
> 32      114     443     121     484
> 64      106     422     108     434
> 128     104     417     101     393
> 256     101     412     90.5    366
> 512     93.3    377     84.8    349
> 1000    87.1    353     88.7    348
>
> It would seem that the old code paths are faster with a small number of
> threads, but the new patch seems to be faster when the thread counts become
> very high.  That said, I'm not all that familiar with what exactly tmpfs does,
> or how well it mimics an SSD (though I wouldn't be surprised to hear
> "poorly").  This of course makes me wonder--do other people see results like
> this, or is this particular to my harebrained setup?

The dioread_nolock patch set eliminates the need to hold the i_mutex lock
during DIO reads. That is why we usually see larger improvements as the number
of threads increases on high-speed SSDs. The performance difference also
becomes more obvious as the bandwidth of the device increases.

I am surprised to see a performance drop of around 6% in the single-thread case.
The dioread_nolock patches change the ext4 buffered write code path a lot, but on
the DIO read code path the only change is to not grab the i_mutex lock.
I haven't seen such a difference in my tests. I mostly use fio for performance
comparison. I will give the ffsb test a try.

Meanwhile, could you also post the stdev numbers?

>
> For that matter, do I need to have more patches than just 2.6.33-rc8 and the
> four posted in this thread?
>
> I also observed that I could make the kernel spit up "Process hung for more
> than 120s!" messages if I happened to be running ffsb on a real disk during a
> heavy directio write load.  I'll poke around on that a little more and write
> back when I have more details.

Did the hang happen only with dioread_nolock, or did it also happen without
the patches applied? It is not surprising to see such messages on a slow disk
since the processes are all waiting for IOs.

>
> For poweroff testing, could one simulate a power failure by running IO
> workloads in a VM and then SIGKILLing the VM?  I don't remember seeing any sort
> of powerfail test suite from the Googlers, but my mail client has been drinking
> out of firehoses lately. ;)
As far as I know, these numbers are not posted yet but will come out soon.

Jiaying
>
> --D

* Re: [PATCH v4 0/3] dioread_nolock patch
  2010-02-17 19:34   ` Jiaying Zhang
@ 2010-02-19 21:25     ` Darrick J. Wong
  0 siblings, 0 replies; 23+ messages in thread
From: Darrick J. Wong @ 2010-02-19 21:25 UTC (permalink / raw)
  To: Jiaying Zhang; +Cc: Theodore Ts'o, Ext4 Developers List

On Wed, Feb 17, 2010 at 11:34:32AM -0800, Jiaying Zhang wrote:
> Hi Darrick,
> 
> Thank you for running these tests!

No problem.

> On Tue, Feb 16, 2010 at 1:07 PM, Darrick J. Wong <djwong@us.ibm.com> wrote:
> > On Fri, Jan 15, 2010 at 02:30:09PM -0500, Theodore Ts'o wrote:
> >
> >> The plan is to merge this for 2.6.34.  I've looked this over pretty
> >> carefully, but another pair of eyes would be appreciated, especially if
> >
> > I don't have a high speed disk but it was suggested that I give this patchset a
> > whirl anyway, so down the rabbit hole I went.  I created a 16GB ext4 image in
> > an equally big tmpfs, then ran the read/readall directio tests in ffsb to see
> > if I could observe any difference.  The kernel is 2.6.33-rc8, and the machine
> > in question has 2 Xeon E5335 processors and 24GB of RAM.  I reran the test
> > several times, with varying thread counts, to produce the table below.  The
> > units are MB/s.
> >
> > For the dio_lock case, mount options were: rw,relatime,barrier=1,data=ordered.
> > For the dio_nolock case, they were: rw,relatime,barrier=1,data=ordered,dioread_nolock.
> >
> >        dio_nolock      dio_lock
> > threads read    readall read    readall
> > 1       37.6    149     39      159
> > 2       59.2    245     62.4    246
> > 4       114     453     112     445
> > 8       111     444     115     459
> > 16      109     442     113     448
> > 32      114     443     121     484
> > 64      106     422     108     434
> > 128     104     417     101     393
> > 256     101     412     90.5    366
> > 512     93.3    377     84.8    349
> > 1000    87.1    353     88.7    348
> >
> > It would seem that the old code paths are faster with a small number of
> > threads, but the new patch seems to be faster when the thread counts become
> > very high.  That said, I'm not all that familiar with what exactly tmpfs does,
> > or how well it mimics an SSD (though I wouldn't be surprised to hear
> > "poorly").  This of course makes me wonder--do other people see results like
> > this, or is this particular to my harebrained setup?
> The dioread_nolock patch set is to eliminate the need of holding i_mutex lock
> during DIO read. That is why we usually see more improvements as the number
> of threads increases on high-speed SSDs. The performance difference is
> also more obvious as the bandwidth of device increases.

Running my streaming profiler, it looks like I can "get" 1500MB/s off the
ramdisk.

> I am surprised to see around 6% performance drop on single thread case.
> The dioread_nolock patches change the ext4 buffer write code path a lot but on
> the dio read code path, the only change is to not grab the i_mutex lock.
> I haven't seen such difference in my tests. I mostly use fio test for performance
> comparison. I will give ffsb test a try.

Ok, I'll attach the config file and script I was using.  Make sure /mnt is the
filesystem to test, and then you can run the script via:

$ ./readwrite 1 2 4 8 16 32 64 128 256 512

> Meanwhile, could you also post the stdev numbers?

I don't have that spreadsheet on this computer, but I recall that the std
deviations weren't more than about 10 for the first run.

Oddly, I tried a second computer, and saw very little difference (units MB/s):

threads  lock avg  nolock avg  lock stdev  nolock stdev
1        235       214         1           5.57
2        318       316.67      3           2.52
4        589.67    581.67      8.14        22.14
8        594.67    583         15.7        4
16       596.67    576         8.96        8.72
32       578       576.67      7.81        5.69
64       570.33    575.67      1.15        7.51
128      573.67    573.67      10.69       10.69
256      575.33    570         8.14        6.08
512      539.67    544.33      3.21        4.04
1000     479.33    482         3.21        2

This one has somewhat faster RAM (ECC registered vs FBDIMMs) and 8x 2.5GHz Xeon
L5420 CPUs.
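
One rough way to read "very little difference" against the spread is to
divide each difference in averages by the two standard deviations combined
in quadrature.  The sketch below does that with the rows copied from the
table above; the helper name and the threshold of two are arbitrary choices
made for illustration, and since the number of runs behind each average is
not stated this is a yardstick rather than a proper significance test.

```python
import math

# (threads, lock avg, nolock avg, lock stdev, nolock stdev), all in MB/s.
ROWS = [
    (1, 235, 214, 1, 5.57),
    (2, 318, 316.67, 3, 2.52),
    (4, 589.67, 581.67, 8.14, 22.14),
    (8, 594.67, 583, 15.7, 4),
    (16, 596.67, 576, 8.96, 8.72),
    (32, 578, 576.67, 7.81, 5.69),
    (64, 570.33, 575.67, 1.15, 7.51),
    (128, 573.67, 573.67, 10.69, 10.69),
    (256, 575.33, 570, 8.14, 6.08),
    (512, 539.67, 544.33, 3.21, 4.04),
    (1000, 479.33, 482, 3.21, 2),
]


def diff_in_stdevs(lock_avg, nolock_avg, lock_sd, nolock_sd):
    """Difference (nolock - lock) in units of the combined stdev."""
    combined = math.sqrt(lock_sd ** 2 + nolock_sd ** 2)
    return (nolock_avg - lock_avg) / combined


def rows_beyond(rows, threshold):
    """Thread counts whose difference exceeds threshold combined stdevs."""
    return [t for t, la, na, ls, ns in rows
            if abs(diff_in_stdevs(la, na, ls, ns)) > threshold]


for t, la, na, ls, ns in ROWS:
    print(f"{t:>5} threads: {diff_in_stdevs(la, na, ls, ns):+.2f}")
print("beyond 2 combined stdevs:", rows_beyond(ROWS, 2))
```

By this yardstick only the single-thread row stands clear of the noise; the
other rows all fall within two combined standard deviations.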

> > For that matter, do I need to have more patches than just 2.6.33-rc8 and the
> > four posted in this thread?
> >
> > I also observed that I could make the kernel spit up "Process hung for more
> > than 120s!" messages if I happened to be running ffsb on a real disk during a
> > heavy directio write load.  I'll poke around on that a little more and write
> > back when I have more details.
> 
> Did the hang happen only with dioread_nolock or it also happened without
> the patches applied? It is not surprising to see such messages on slow disk
> since the processes are all waiting for IOs.

To clarify: Nothing hung; I simply got the "hung task" warning.  It
happened only with the patches applied, though for all I know without the
patches applied the tasks could be starving for 119s.

> > For poweroff testing, could one simulate a power failure by running IO
> > workloads in a VM and then SIGKILLing the VM?  I don't remember seeing any sort
> > of powerfail test suite from the Googlers, but my mail client has been drinking
> > out of firehoses lately. ;)
> As far as I know, these numbers are not posted yet but will come out soon.

Uh... I was more curious if anyone had a testing suite, not results necessarily.

--D

[-- Attachment #2: djwong-readwrite.ffsb --]
[-- Type: text/plain, Size: 1433 bytes --]

# djwong playground

time=300
alignio=1
directio=1

#callout=/usr/local/src/ffsb-6.0-rc2/ltc_tests/dwrite_all

[filesystem0]
	location=/mnt/ffsb1
	num_files=1000
	num_dirs=10
	reuse=1

	# File sizes range from 1kB to 1MB.
#	size_weight 1KB 10
#	size_weight 2KB 15
#	size_weight 4KB 16
#	size_weight 8KB 16
#	size_weight 16KB 15
#	size_weight 32KB 10
#	size_weight 64KB 8
#	size_weight 128KB 4
#	size_weight 256KB 3
#	size_weight 512KB 2
#	size_weight 1MB 1
	size_weight 16MB 1

#	size_weight 1GB 1
#	size_weight 2GB 1
#	size_weight 4GB 1
[end0]

[threadgroup0]
	num_threads=%THREADS%

	readall_weight=4
#	writeall_weight=4
#	create_weight=4
#	delete_weight=4
#	append_weight=4
	read_weight=4
#	write_weight=4

#	write_size=4MB
#	write_blocksize=4KB

	read_size=4MB
	read_blocksize=4KB

	[stats]
		enable_stats=0
		enable_range=0

		msec_range    0.00      0.01
		msec_range    0.01      0.02
		msec_range    0.02      0.05
		msec_range    0.05      0.10
		msec_range    0.10      0.20
		msec_range    0.20      0.50
		msec_range    0.50      1.00
		msec_range    1.00      2.00
		msec_range    2.00      5.00
		msec_range    5.00     10.00
		msec_range   10.00     20.00
		msec_range   20.00     50.00
		msec_range   50.00    100.00
		msec_range  100.00    200.00
		msec_range  200.00    500.00
		msec_range  500.00   1000.00
		msec_range 1000.00   2000.00
		msec_range 2000.00   5000.00
		msec_range 5000.00  10000.00
	[end]
[end0]

[-- Attachment #3: readwrite.sh --]
[-- Type: application/x-sh, Size: 252 bytes --]


end of thread

Thread overview: 23+ messages
2010-01-15 19:30 [PATCH v4 0/3] dioread_nolock patch Theodore Ts'o
2010-01-15 19:30 ` [PATCH v4 1/3] ext4: mechanical change on dio get_block code in prepare for it to be used by buffer write Theodore Ts'o
2010-01-17 14:36   ` Aneesh Kumar K. V
2010-01-17 16:19     ` Eric Sandeen
2010-01-17 16:42       ` Aneesh Kumar K. V
2010-01-18  3:57       ` tytso
2010-01-15 19:30 ` [PATCH v4 2/3] ext4: use ext4_get_block_write in " Theodore Ts'o
2010-01-16  2:17   ` tytso
2010-01-17 14:21   ` Aneesh Kumar K. V
2010-01-18  5:25     ` Jiaying Zhang
2010-01-15 19:30 ` [PATCH v4 3/3] ext4: Use direct_IO_no_locking in ext4 dio read Theodore Ts'o
2010-01-17 14:19   ` Aneesh Kumar K. V
2010-01-15 19:39 ` [PATCH v4 0/3] dioread_nolock patch Ric Wheeler
2010-01-15 19:52 ` Eric Sandeen
2010-01-15 20:15   ` tytso
2010-01-15 20:17     ` Eric Sandeen
2010-01-15 21:47       ` Michael Rubin
2010-01-22 20:47         ` Valerie Aurora
2010-02-20  0:56           ` Michael Rubin
2010-02-23  0:36             ` Andreas Dilger
2010-02-16 21:07 ` Darrick J. Wong
2010-02-17 19:34   ` Jiaying Zhang
2010-02-19 21:25     ` Darrick J. Wong
