[PATCH 03/17] vfs: Remove syncing from generic_file_direct_write() and generic_file_buffered

public inbox for linux-xfs@vger.kernel.org
 help / color / mirror / Atom feed

* [PATCH 03/17] vfs: Remove syncing from generic_file_direct_write() and generic_file_buffered_write()
       [not found] <1250697884-22288-1-git-send-email-jack@suse.cz>
@ 2009-08-19 16:04 ` Jan Kara
  2009-08-19 16:18   ` Christoph Hellwig
  2009-08-19 16:04 ` [PATCH 07/17] vfs: Introduce new helpers for syncing after writing to O_SYNC file or IS_SYNC inode Jan Kara
  2009-08-19 16:04 ` [PATCH 14/17] xfs: Use new syncing helper Jan Kara
  2 siblings, 1 reply; 17+ messages in thread
From: Jan Kara @ 2009-08-19 16:04 UTC (permalink / raw)
  To: LKML; +Cc: Jan Kara, Joel Becker, xfs, hch, ocfs2-devel

generic_file_direct_write() and generic_file_buffered_write() called
generic_osync_inode() if it was called on O_SYNC file or IS_SYNC inode. But
this is superfluous since generic_file_aio_write() does the syncing as well.
Also XFS and OCFS2 which call these functions directly handle syncing
themselves. So let's have a single place where syncing happens:
generic_file_aio_write().

CC: ocfs2-devel@oss.oracle.com
CC: Joel Becker <joel.becker@oracle.com>
CC: Felix Blyakher <felixb@sgi.com>
CC: xfs@oss.sgi.com
Signed-off-by: Jan Kara <jack@suse.cz>
---
 mm/filemap.c |   25 -------------------------
 1 files changed, 0 insertions(+), 25 deletions(-)

diff --git a/mm/filemap.c b/mm/filemap.c
index 554a396..3bee198 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -2187,20 +2187,7 @@ generic_file_direct_write(struct kiocb *iocb, const struct iovec *iov,
 		}
 		*ppos = end;
 	}
-
-	/*
-	 * Sync the fs metadata but not the minor inode changes and
-	 * of course not the data as we did direct DMA for the IO.
-	 * i_mutex is held, which protects generic_osync_inode() from
-	 * livelocking.  AIO O_DIRECT ops attempt to sync metadata here.
-	 */
 out:
-	if ((written >= 0 || written == -EIOCBQUEUED) &&
-	    ((file->f_flags & O_SYNC) || IS_SYNC(inode))) {
-		int err = generic_osync_inode(inode, mapping, OSYNC_METADATA);
-		if (err < 0)
-			written = err;
-	}
 	return written;
 }
 EXPORT_SYMBOL(generic_file_direct_write);
@@ -2332,8 +2319,6 @@ generic_file_buffered_write(struct kiocb *iocb, const struct iovec *iov,
 {
 	struct file *file = iocb->ki_filp;
 	struct address_space *mapping = file->f_mapping;
-	const struct address_space_operations *a_ops = mapping->a_ops;
-	struct inode *inode = mapping->host;
 	ssize_t status;
 	struct iov_iter i;
 
@@ -2343,16 +2328,6 @@ generic_file_buffered_write(struct kiocb *iocb, const struct iovec *iov,
 	if (likely(status >= 0)) {
 		written += status;
 		*ppos = pos + status;
-
-		/*
-		 * For now, when the user asks for O_SYNC, we'll actually give
-		 * O_DSYNC
-		 */
-		if (unlikely((file->f_flags & O_SYNC) || IS_SYNC(inode))) {
-			if (!a_ops->writepage || !is_sync_kiocb(iocb))
-				status = generic_osync_inode(inode, mapping,
-						OSYNC_METADATA|OSYNC_DATA);
-		}
   	}
 	
 	/*
-- 
1.6.0.2

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs

^ permalink raw reply related	[flat|nested] 17+ messages in thread

* [PATCH 07/17] vfs: Introduce new helpers for syncing after writing to O_SYNC file or IS_SYNC inode
       [not found] <1250697884-22288-1-git-send-email-jack@suse.cz>
  2009-08-19 16:04 ` [PATCH 03/17] vfs: Remove syncing from generic_file_direct_write() and generic_file_buffered_write() Jan Kara
@ 2009-08-19 16:04 ` Jan Kara
  2009-08-19 16:26   ` Christoph Hellwig
  2009-08-19 16:04 ` [PATCH 14/17] xfs: Use new syncing helper Jan Kara
  2 siblings, 1 reply; 17+ messages in thread
From: Jan Kara @ 2009-08-19 16:04 UTC (permalink / raw)
  To: LKML
  Cc: tytso, Jan Kara, linux-ntfs-dev, Joel Becker, xfs, hch,
	Anton Altaparmakov, OGAWA Hirofumi, Evgeniy Polyakov, linux-ext4,
	ocfs2-devel

Introduce new function for generic inode syncing (generic_sync_file) and use
it from fsync() path. Introduce also new helper for syncing after a sync
write (generic_write_sync) using the generic function.

Use these new helpers for syncing from generic VFS functions.

CC: Evgeniy Polyakov <zbr@ioremap.net>
CC: ocfs2-devel@oss.oracle.com
CC: Joel Becker <joel.becker@oracle.com>
CC: Felix Blyakher <felixb@sgi.com>
CC: xfs@oss.sgi.com
CC: Anton Altaparmakov <aia21@cantab.net>
CC: linux-ntfs-dev@lists.sourceforge.net
CC: OGAWA Hirofumi <hirofumi@mail.parknet.co.jp>
CC: linux-ext4@vger.kernel.org
CC: tytso@mit.edu
Signed-off-by: Jan Kara <jack@suse.cz>
---
 fs/splice.c        |   22 ++++----------
 fs/sync.c          |   81 +++++++++++++++++++++++++++++++++++++++++++---------
 include/linux/fs.h |    7 ++++
 mm/filemap.c       |   16 ++++------
 4 files changed, 86 insertions(+), 40 deletions(-)

diff --git a/fs/splice.c b/fs/splice.c
index 73766d2..8190237 100644
--- a/fs/splice.c
+++ b/fs/splice.c
@@ -976,25 +976,15 @@ generic_file_splice_write(struct pipe_inode_info *pipe, struct file *out,
 
 	if (ret > 0) {
 		unsigned long nr_pages;
+		int err;
 
-		*ppos += ret;
 		nr_pages = (ret + PAGE_CACHE_SIZE - 1) >> PAGE_CACHE_SHIFT;
 
-		/*
-		 * If file or inode is SYNC and we actually wrote some data,
-		 * sync it.
-		 */
-		if (unlikely((out->f_flags & O_SYNC) || IS_SYNC(inode))) {
-			int err;
-
-			mutex_lock(&inode->i_mutex);
-			err = generic_osync_inode(inode, mapping,
-						  OSYNC_METADATA|OSYNC_DATA);
-			mutex_unlock(&inode->i_mutex);
-
-			if (err)
-				ret = err;
-		}
+		err = generic_write_sync(out, *ppos, ret);
+		if (err)
+			ret = err;
+		else
+			*ppos += ret;
 		balance_dirty_pages_ratelimited_nr(mapping, nr_pages);
 	}
 
diff --git a/fs/sync.c b/fs/sync.c
index 3422ba6..fc320aa 100644
--- a/fs/sync.c
+++ b/fs/sync.c
@@ -176,23 +176,30 @@ int file_fsync(struct file *filp, struct dentry *dentry, int datasync)
 }
 
 /**
- * vfs_fsync - perform a fsync or fdatasync on a file
+ * generic_sync_file - helper to sync data & metadata to disk
  * @file:		file to sync
  * @dentry:		dentry of @file
- * @data:		only perform a fdatasync operation
+ * @start:		offset in bytes of the beginning of data range to sync
+ * @end:		offset in bytes of the end of data range (inclusive)
+ * @what:		what should be synced
  *
- * Write back data and metadata for @file to disk.  If @datasync is
- * set only metadata needed to access modified file data is written.
+ * What the function exactly does is controlled by the @what parameter:
  *
- * In case this function is called from nfsd @file may be %NULL and
- * only @dentry is set.  This can only happen when the filesystem
- * implements the export_operations API.
+ * If SYNC_SUBMIT_DATA is set, the function submits all pages in the given
+ * range to disk.
+ *
+ * The function always calls ->fsync() callback of the filesystem. If
+ * SYNC_INODE is not set, we pass down the fact that it is just a datasync.
+ *
+ * If SYNC_WAIT_DATA is set, the function waits for writeback to finish
+ * in the given range.
  */
-int vfs_fsync(struct file *file, struct dentry *dentry, int datasync)
+int generic_sync_file(struct file *file, struct dentry *dentry, loff_t start,
+		      loff_t end, int what)
 {
 	const struct file_operations *fop;
 	struct address_space *mapping;
-	int err, ret;
+	int err, ret = 0;
 
 	/*
 	 * Get mapping and operations from the file in case we have
@@ -212,23 +219,50 @@ int vfs_fsync(struct file *file, struct dentry *dentry, int datasync)
 		goto out;
 	}
 
-	ret = filemap_fdatawrite(mapping);
+	if (what & SYNC_SUBMIT_DATA)
+		ret = filemap_fdatawrite_range(mapping, start, end);
 
 	/*
 	 * We need to protect against concurrent writers, which could cause
 	 * livelocks in fsync_buffers_list().
 	 */
 	mutex_lock(&mapping->host->i_mutex);
-	err = fop->fsync(file, dentry, datasync);
+	err = fop->fsync(file, dentry, !(what & SYNC_INODE));
 	if (!ret)
 		ret = err;
 	mutex_unlock(&mapping->host->i_mutex);
-	err = filemap_fdatawait(mapping);
-	if (!ret)
-		ret = err;
+
+	if (what & SYNC_WAIT_DATA) {
+		err = filemap_fdatawait_range(mapping, start, end);
+		if (!ret)
+			ret = err;
+	}
 out:
 	return ret;
 }
+EXPORT_SYMBOL(generic_sync_file);
+
+/**
+ * vfs_fsync - perform a fsync or fdatasync on a file
+ * @file:		file to sync
+ * @dentry:		dentry of @file
+ * @datasync:		only perform a fdatasync operation
+ *
+ * Write back data and metadata for @file to disk.  If @datasync is
+ * set only metadata needed to access modified file data is written.
+ *
+ * In case this function is called from nfsd @file may be %NULL and
+ * only @dentry is set.  This can only happen when the filesystem
+ * implements the export_operations API.
+ */
+int vfs_fsync(struct file *file, struct dentry *dentry, int datasync)
+{
+	int what = SYNC_SUBMIT_DATA | SYNC_WAIT_DATA;
+
+	if (!datasync)
+		what |= SYNC_INODE;
+	return generic_sync_file(file, dentry, 0, LLONG_MAX, what);
+}
 EXPORT_SYMBOL(vfs_fsync);
 
 static int do_fsync(unsigned int fd, int datasync)
@@ -254,6 +288,25 @@ SYSCALL_DEFINE1(fdatasync, unsigned int, fd)
 	return do_fsync(fd, 1);
 }
 
+/**
+ * generic_write_sync - perform syncing after a write if file / inode is sync
+ * @file:	file to which the write happened
+ * @pos:	offset where the write started
+ * @count:	length of the write
+ *
+ * This is just a simple wrapper about our general syncing function.
+ * FIXME: Make it inline?
+ */
+int generic_write_sync(struct file *file, loff_t pos, loff_t count)
+{
+	if (!(file->f_flags & O_SYNC) && !IS_SYNC(file->f_mapping->host))
+		return 0;
+	return generic_sync_file(file, file->f_path.dentry, pos,
+				 pos + count - 1,
+				 SYNC_SUBMIT_DATA | SYNC_WAIT_DATA);
+}
+EXPORT_SYMBOL(generic_write_sync);
+
 /*
  * sys_sync_file_range() permits finely controlled syncing over a segment of
  * a file in the range offset .. (offset+nbytes-1) inclusive.  If nbytes is
diff --git a/include/linux/fs.h b/include/linux/fs.h
index 1edd159..0f13049 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -2088,7 +2088,14 @@ extern int __filemap_fdatawrite_range(struct address_space *mapping,
 extern int filemap_fdatawrite_range(struct address_space *mapping,
 				loff_t start, loff_t end);
 
+/* Flags for generic_sync_file */
+#define SYNC_INODE		1
+#define SYNC_SUBMIT_DATA	2
+#define SYNC_WAIT_DATA		4
+extern int generic_sync_file(struct file *file, struct dentry *dentry,
+			   loff_t start, loff_t end, int what);
 extern int vfs_fsync(struct file *file, struct dentry *dentry, int datasync);
+extern int generic_write_sync(struct file *file, loff_t pos, loff_t count);
 extern void sync_supers(void);
 extern void emergency_sync(void);
 extern void emergency_remount(void);
diff --git a/mm/filemap.c b/mm/filemap.c
index dffdc60..a29b35b 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -39,11 +39,10 @@
 /*
  * FIXME: remove all knowledge of the buffer layer from the core VM
  */
-#include <linux/buffer_head.h> /* for generic_osync_inode */
+#include <linux/buffer_head.h> /* for try_to_free_buffers */
 
 #include <asm/mman.h>
 
-
 /*
  * Shared mappings implemented 30.11.1994. It's not fully working yet,
  * though.
@@ -2452,18 +2451,16 @@ ssize_t generic_file_aio_write_nolock(struct kiocb *iocb,
 		const struct iovec *iov, unsigned long nr_segs, loff_t pos)
 {
 	struct file *file = iocb->ki_filp;
-	struct address_space *mapping = file->f_mapping;
-	struct inode *inode = mapping->host;
 	ssize_t ret;
 
 	BUG_ON(iocb->ki_pos != pos);
 
 	ret = __generic_file_aio_write(iocb, iov, nr_segs, &iocb->ki_pos);
 
-	if (ret > 0 && ((file->f_flags & O_SYNC) || IS_SYNC(inode))) {
+	if (ret > 0) {
 		ssize_t err;
 
-		err = sync_page_range(inode, mapping, pos, ret);
+		err = generic_write_sync(file, pos, ret);
 		if (err < 0)
 			ret = err;
 	}
@@ -2486,8 +2483,7 @@ ssize_t generic_file_aio_write(struct kiocb *iocb, const struct iovec *iov,
 		unsigned long nr_segs, loff_t pos)
 {
 	struct file *file = iocb->ki_filp;
-	struct address_space *mapping = file->f_mapping;
-	struct inode *inode = mapping->host;
+	struct inode *inode = file->f_mapping->host;
 	ssize_t ret;
 
 	BUG_ON(iocb->ki_pos != pos);
@@ -2496,10 +2492,10 @@ ssize_t generic_file_aio_write(struct kiocb *iocb, const struct iovec *iov,
 	ret = __generic_file_aio_write(iocb, iov, nr_segs, &iocb->ki_pos);
 	mutex_unlock(&inode->i_mutex);
 
-	if (ret > 0 && ((file->f_flags & O_SYNC) || IS_SYNC(inode))) {
+	if (ret > 0) {
 		ssize_t err;
 
-		err = sync_page_range(inode, mapping, pos, ret);
+		err = generic_write_sync(file, pos, ret);
 		if (err < 0)
 			ret = err;
 	}
-- 
1.6.0.2

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs

^ permalink raw reply related	[flat|nested] 17+ messages in thread

* [PATCH 14/17] xfs: Use new syncing helper
       [not found] <1250697884-22288-1-git-send-email-jack@suse.cz>
  2009-08-19 16:04 ` [PATCH 03/17] vfs: Remove syncing from generic_file_direct_write() and generic_file_buffered_write() Jan Kara
  2009-08-19 16:04 ` [PATCH 07/17] vfs: Introduce new helpers for syncing after writing to O_SYNC file or IS_SYNC inode Jan Kara
@ 2009-08-19 16:04 ` Jan Kara
  2009-08-19 16:33   ` Christoph Hellwig
  2 siblings, 1 reply; 17+ messages in thread
From: Jan Kara @ 2009-08-19 16:04 UTC (permalink / raw)
  To: LKML; +Cc: hch, Jan Kara, xfs

Use new generic_write_sync() helper from xfs_write().

CC: Felix Blyakher <felixb@sgi.com>
CC: xfs@oss.sgi.com
Signed-off-by: Jan Kara <jack@suse.cz>
---
 fs/xfs/linux-2.6/xfs_lrw.c |    2 +-
 1 files changed, 1 insertions(+), 1 deletions(-)

diff --git a/fs/xfs/linux-2.6/xfs_lrw.c b/fs/xfs/linux-2.6/xfs_lrw.c
index 7078974..aeb5a39 100644
--- a/fs/xfs/linux-2.6/xfs_lrw.c
+++ b/fs/xfs/linux-2.6/xfs_lrw.c
@@ -817,7 +817,7 @@ write_retry:
 		xfs_iunlock(xip, iolock);
 		if (need_i_mutex)
 			mutex_unlock(&inode->i_mutex);
-		error2 = sync_page_range(inode, mapping, pos, ret);
+		error2 = generic_write_sync(file, pos, ret);
 		if (!error)
 			error = error2;
 		if (need_i_mutex)
-- 
1.6.0.2

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs

^ permalink raw reply related	[flat|nested] 17+ messages in thread

* Re: [PATCH 03/17] vfs: Remove syncing from generic_file_direct_write() and generic_file_buffered_write()
  2009-08-19 16:04 ` [PATCH 03/17] vfs: Remove syncing from generic_file_direct_write() and generic_file_buffered_write() Jan Kara
@ 2009-08-19 16:18   ` Christoph Hellwig
  2009-08-20 13:31     ` Jan Kara
  0 siblings, 1 reply; 17+ messages in thread
From: Christoph Hellwig @ 2009-08-19 16:18 UTC (permalink / raw)
  To: Jan Kara; +Cc: LKML, xfs, hch, Joel Becker, ocfs2-devel

On Wed, Aug 19, 2009 at 06:04:30PM +0200, Jan Kara wrote:
> generic_file_direct_write() and generic_file_buffered_write() called
> generic_osync_inode() if it was called on O_SYNC file or IS_SYNC inode. But
> this is superfluous since generic_file_aio_write() does the syncing as well.
> Also XFS and OCFS2 which call these functions directly handle syncing
> themselves. So let's have a single place where syncing happens:
> generic_file_aio_write().

Yeah, this is something that never made any sense to me.

> @@ -2187,20 +2187,7 @@ generic_file_direct_write(struct kiocb *iocb, const struct iovec *iov,
>  		}
>  		*ppos = end;
>  	}
> -
> -	/*
> -	 * Sync the fs metadata but not the minor inode changes and
> -	 * of course not the data as we did direct DMA for the IO.
> -	 * i_mutex is held, which protects generic_osync_inode() from
> -	 * livelocking.  AIO O_DIRECT ops attempt to sync metadata here.
> -	 */
>  out:
> -	if ((written >= 0 || written == -EIOCBQUEUED) &&
> -	    ((file->f_flags & O_SYNC) || IS_SYNC(inode))) {
> -		int err = generic_osync_inode(inode, mapping, OSYNC_METADATA);
> -		if (err < 0)
> -			written = err;
> -	}
>  	return written;

Here we check (written >= 0 || written == -EIOCBQUEUED), but
generic_file_aio_write only cares about positive return values.  We
defintively do have a change here for partial AIO requests.  The
question is if the previous behaviour made in sense.  If do have an
O_SYNC aio+dio request we would have to flush out the metadata after the
request has completed and not here.

> @@ -2343,16 +2328,6 @@ generic_file_buffered_write(struct kiocb *iocb, const struct iovec *iov,
>  	if (likely(status >= 0)) {
>  		written += status;
>  		*ppos = pos + status;
> -
> -		/*
> -		 * For now, when the user asks for O_SYNC, we'll actually give
> -		 * O_DSYNC
> -		 */
> -		if (unlikely((file->f_flags & O_SYNC) || IS_SYNC(inode))) {
> -			if (!a_ops->writepage || !is_sync_kiocb(iocb))
> -				status = generic_osync_inode(inode, mapping,
> -						OSYNC_METADATA|OSYNC_DATA);
> -		}
>    	}

No problem with -EIOCBQUEUED here, but we change from doing
generic_osync_inode with OSYNC_DATA which does a full writeout of the
data to sync_page_range which only does the range writeout here.  That
should be fine (as we only need to sync that range), but should probably
be documented in the patch description.

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH 07/17] vfs: Introduce new helpers for syncing after writing to O_SYNC file or IS_SYNC inode
  2009-08-19 16:04 ` [PATCH 07/17] vfs: Introduce new helpers for syncing after writing to O_SYNC file or IS_SYNC inode Jan Kara
@ 2009-08-19 16:26   ` Christoph Hellwig
  2009-08-20 12:15     ` Jan Kara
  0 siblings, 1 reply; 17+ messages in thread
From: Christoph Hellwig @ 2009-08-19 16:26 UTC (permalink / raw)
  To: Jan Kara
  Cc: linux-ext4, tytso, linux-ntfs-dev, LKML, Joel Becker, hch,
	Anton Altaparmakov, OGAWA Hirofumi, Evgeniy Polyakov, xfs,
	ocfs2-devel

Looks good to me.  Eventually we should use those SYNC_ flags also all
through the fsync codepath, but I'll see if I can incorporate that in my
planned fsync rewrite.

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH 14/17] xfs: Use new syncing helper
  2009-08-19 16:04 ` [PATCH 14/17] xfs: Use new syncing helper Jan Kara
@ 2009-08-19 16:33   ` Christoph Hellwig
  2009-08-20 12:22     ` Jan Kara
  0 siblings, 1 reply; 17+ messages in thread
From: Christoph Hellwig @ 2009-08-19 16:33 UTC (permalink / raw)
  To: Jan Kara; +Cc: hch, LKML, xfs

On Wed, Aug 19, 2009 at 06:04:41PM +0200, Jan Kara wrote:
> diff --git a/fs/xfs/linux-2.6/xfs_lrw.c b/fs/xfs/linux-2.6/xfs_lrw.c
> index 7078974..aeb5a39 100644
> --- a/fs/xfs/linux-2.6/xfs_lrw.c
> +++ b/fs/xfs/linux-2.6/xfs_lrw.c
> @@ -817,7 +817,7 @@ write_retry:
>  		xfs_iunlock(xip, iolock);
>  		if (need_i_mutex)
>  			mutex_unlock(&inode->i_mutex);
> -		error2 = sync_page_range(inode, mapping, pos, ret);
> +		error2 = generic_write_sync(file, pos, ret);

I don't think this is very optimal for XFS.  This first starts an
asynchronous writeout of the range in generic_write_sync,
then calls into ->fsync which waits for all I/O on the file to finish,
then forces the log inside xfs_fsync,  then waits for the range again in
generic_write_sync, and after this code calls into
xfs_write_sync_logforce which forces to log _again_.

We should be fine just doing your new filemap_fdatawrite_range helper
here and let xfs_write_sync_logforce do the work.  Long term we should
just get rid of this stupid submit writes before syncing the iode, then
wait crap which is a pain for all modern filesystems.

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH 07/17] vfs: Introduce new helpers for syncing after writing to O_SYNC file or IS_SYNC inode
  2009-08-19 16:26   ` Christoph Hellwig
@ 2009-08-20 12:15     ` Jan Kara
  2009-08-20 16:27       ` Christoph Hellwig
  0 siblings, 1 reply; 17+ messages in thread
From: Jan Kara @ 2009-08-20 12:15 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: tytso, linux-ext4, Jan Kara, linux-ntfs-dev, LKML, Joel Becker,
	Anton Altaparmakov, OGAWA Hirofumi, Evgeniy Polyakov, xfs,
	ocfs2-devel

On Wed 19-08-09 12:26:38, Christoph Hellwig wrote:
> Looks good to me.  Eventually we should use those SYNC_ flags also all
> through the fsync codepath, but I'll see if I can incorporate that in my
> planned fsync rewrite.
  Yes, I thought I'll leave that for later. BTW it should be fairly easy to
teach generic_sync_file() to do fdatawait() before calling ->fsync() if the
filesystem sets some flag in inode->i_mapping (or somewhere else) as is
needed for XFS, btrfs, etc.

								Honza
-- 
Jan Kara <jack@suse.cz>
SUSE Labs, CR

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH 14/17] xfs: Use new syncing helper
  2009-08-19 16:33   ` Christoph Hellwig
@ 2009-08-20 12:22     ` Jan Kara
  0 siblings, 0 replies; 17+ messages in thread
From: Jan Kara @ 2009-08-20 12:22 UTC (permalink / raw)
  To: Christoph Hellwig; +Cc: Jan Kara, LKML, xfs

On Wed 19-08-09 12:33:15, Christoph Hellwig wrote:
> On Wed, Aug 19, 2009 at 06:04:41PM +0200, Jan Kara wrote:
> > diff --git a/fs/xfs/linux-2.6/xfs_lrw.c b/fs/xfs/linux-2.6/xfs_lrw.c
> > index 7078974..aeb5a39 100644
> > --- a/fs/xfs/linux-2.6/xfs_lrw.c
> > +++ b/fs/xfs/linux-2.6/xfs_lrw.c
> > @@ -817,7 +817,7 @@ write_retry:
> >  		xfs_iunlock(xip, iolock);
> >  		if (need_i_mutex)
> >  			mutex_unlock(&inode->i_mutex);
> > -		error2 = sync_page_range(inode, mapping, pos, ret);
> > +		error2 = generic_write_sync(file, pos, ret);
> 
> I don't think this is very optimal for XFS.  This first starts an
> asynchronous writeout of the range in generic_write_sync,
> then calls into ->fsync which waits for all I/O on the file to finish,
> then forces the log inside xfs_fsync,  then waits for the range again in
> generic_write_sync, and after this code calls into
> xfs_write_sync_logforce which forces to log _again_.
> 
> We should be fine just doing your new filemap_fdatawrite_range helper
> here and let xfs_write_sync_logforce do the work.  Long term we should
> just get rid of this stupid submit writes before syncing the iode, then
> wait crap which is a pain for all modern filesystems.
  OK, I'll happily change this. I was just doing the straightforward
conversion and sync_page_range() essentially did what generic_write_sync()
does since it called fdatawrite(), write_inode_now(inode, 1) (here XFS was
forced to wait for pages and force the log), fdatawait().

								Honza
-- 
Jan Kara <jack@suse.cz>
SUSE Labs, CR

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH 03/17] vfs: Remove syncing from generic_file_direct_write() and generic_file_buffered_write()
  2009-08-19 16:18   ` Christoph Hellwig
@ 2009-08-20 13:31     ` Jan Kara
  0 siblings, 0 replies; 17+ messages in thread
From: Jan Kara @ 2009-08-20 13:31 UTC (permalink / raw)
  To: Christoph Hellwig; +Cc: Jan Kara, LKML, xfs, Joel Becker, ocfs2-devel

On Wed 19-08-09 12:18:53, Christoph Hellwig wrote:
> On Wed, Aug 19, 2009 at 06:04:30PM +0200, Jan Kara wrote:
> > generic_file_direct_write() and generic_file_buffered_write() called
> > generic_osync_inode() if it was called on O_SYNC file or IS_SYNC inode. But
> > this is superfluous since generic_file_aio_write() does the syncing as well.
> > Also XFS and OCFS2 which call these functions directly handle syncing
> > themselves. So let's have a single place where syncing happens:
> > generic_file_aio_write().
> 
> Yeah, this is something that never made any sense to me.
> 
> > @@ -2187,20 +2187,7 @@ generic_file_direct_write(struct kiocb *iocb, const struct iovec *iov,
> >  		}
> >  		*ppos = end;
> >  	}
> > -
> > -	/*
> > -	 * Sync the fs metadata but not the minor inode changes and
> > -	 * of course not the data as we did direct DMA for the IO.
> > -	 * i_mutex is held, which protects generic_osync_inode() from
> > -	 * livelocking.  AIO O_DIRECT ops attempt to sync metadata here.
> > -	 */
> >  out:
> > -	if ((written >= 0 || written == -EIOCBQUEUED) &&
> > -	    ((file->f_flags & O_SYNC) || IS_SYNC(inode))) {
> > -		int err = generic_osync_inode(inode, mapping, OSYNC_METADATA);
> > -		if (err < 0)
> > -			written = err;
> > -	}
> >  	return written;
> 
> Here we check (written >= 0 || written == -EIOCBQUEUED), but
> generic_file_aio_write only cares about positive return values.  We
> defintively do have a change here for partial AIO requests.
  Ah, that's a good point.

> The question is if the previous behaviour made in sense.  If do have an
> O_SYNC aio+dio request we would have to flush out the metadata after the
> request has completed and not here.
  Yes, that would be a correct behavior but I see no good way of doing
that. Flushing on EIOCBQUEUED mostly works for simple filesystems like
ext[23] where we don't do anything from end_io callback. The only risk
is if we crash after the transaction commits but before the direct IO is
done (we could expose stale data) but I'm not sure that is an issue with
real HW.
  Anyway, the question is what we should do about it.  For now, I'd call
generic_write_sync() even in case EIOCBQUEUED is returned to preserve old
behavior (EIOCBQUEUED does not seem to be returned from buffered write path
so we really just preserve the old behavior).

> > @@ -2343,16 +2328,6 @@ generic_file_buffered_write(struct kiocb *iocb, const struct iovec *iov,
> >  	if (likely(status >= 0)) {
> >  		written += status;
> >  		*ppos = pos + status;
> > -
> > -		/*
> > -		 * For now, when the user asks for O_SYNC, we'll actually give
> > -		 * O_DSYNC
> > -		 */
> > -		if (unlikely((file->f_flags & O_SYNC) || IS_SYNC(inode))) {
> > -			if (!a_ops->writepage || !is_sync_kiocb(iocb))
> > -				status = generic_osync_inode(inode, mapping,
> > -						OSYNC_METADATA|OSYNC_DATA);
> > -		}
> >    	}
> 
> No problem with -EIOCBQUEUED here, but we change from doing
> generic_osync_inode with OSYNC_DATA which does a full writeout of the
> data to sync_page_range which only does the range writeout here.  That
> should be fine (as we only need to sync that range), but should probably
> be documented in the patch description.
  Yes, will do.

									Honza
-- 
Jan Kara <jack@suse.cz>
SUSE Labs, CR

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH 07/17] vfs: Introduce new helpers for syncing after writing to O_SYNC file or IS_SYNC inode
  2009-08-20 12:15     ` Jan Kara
@ 2009-08-20 16:27       ` Christoph Hellwig
  2009-08-21 15:23         ` Jan Kara
  2009-08-26 18:22         ` Christoph Hellwig
  0 siblings, 2 replies; 17+ messages in thread
From: Christoph Hellwig @ 2009-08-20 16:27 UTC (permalink / raw)
  To: Jan Kara
  Cc: linux-ext4, tytso, linux-ntfs-dev, LKML, Joel Becker,
	Christoph Hellwig, Anton Altaparmakov, OGAWA Hirofumi,
	Evgeniy Polyakov, xfs, ocfs2-devel

On Thu, Aug 20, 2009 at 02:15:31PM +0200, Jan Kara wrote:
> On Wed 19-08-09 12:26:38, Christoph Hellwig wrote:
> > Looks good to me.  Eventually we should use those SYNC_ flags also all
> > through the fsync codepath, but I'll see if I can incorporate that in my
> > planned fsync rewrite.
>   Yes, I thought I'll leave that for later. BTW it should be fairly easy to
> teach generic_sync_file() to do fdatawait() before calling ->fsync() if the
> filesystem sets some flag in inode->i_mapping (or somewhere else) as is
> needed for XFS, btrfs, etc.

Maybe you can help brain storming, but I still can't see any way in that
the

  - write data
  - write inode
  - wait for data

actually is a benefit in terms of semantics (I agree that it could be
faster in theory, but even that is debatable with todays seek latencies
in disks)

Think about a simple non-journaling filesystem like ext2:

 (1) block get allocated during ->write before putting data in
      - this dirties the inode because we update i_block/i_size/etc
 (2) we call fsync (or the O_SNC handling code for that matter)
      - we start writeout of the data, which takes forever because the
	file is very large
      - then we write out the inode, including the i_size/i_blocks
	update
      - due to some reason this gets reordered before the data writeout
	finishes (without that happening there would be no benefit to
	this ordering anyway)
 (3) no we call filemap_fdatawait to wait for data I/O to finish


Now the system crashes between (2) and (3).  After that we we do have
stale data in the inode in the area not written yet.

Is there some case between that simple filesystem and the i_size update
from I/O completion handler in XFS/ext4 where this behaviour actually
buys us anything?  Any ext3 magic maybe?

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH 07/17] vfs: Introduce new helpers for syncing after writing to O_SYNC file or IS_SYNC inode
  2009-08-20 16:27       ` Christoph Hellwig
@ 2009-08-21 15:23         ` Jan Kara
  2009-08-21 15:32           ` Christoph Hellwig
  2009-08-26 18:22         ` Christoph Hellwig
  1 sibling, 1 reply; 17+ messages in thread
From: Jan Kara @ 2009-08-21 15:23 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: tytso, linux-ext4, Jan Kara, linux-ntfs-dev, LKML, Joel Becker,
	Anton Altaparmakov, OGAWA Hirofumi, Evgeniy Polyakov, xfs,
	ocfs2-devel

On Thu 20-08-09 12:27:29, Christoph Hellwig wrote:
> On Thu, Aug 20, 2009 at 02:15:31PM +0200, Jan Kara wrote:
> > On Wed 19-08-09 12:26:38, Christoph Hellwig wrote:
> > > Looks good to me.  Eventually we should use those SYNC_ flags also all
> > > through the fsync codepath, but I'll see if I can incorporate that in my
> > > planned fsync rewrite.
> >   Yes, I thought I'll leave that for later. BTW it should be fairly easy to
> > teach generic_sync_file() to do fdatawait() before calling ->fsync() if the
> > filesystem sets some flag in inode->i_mapping (or somewhere else) as is
> > needed for XFS, btrfs, etc.
> 
> Maybe you can help brain storming, but I still can't see any way in that
> the
> 
>   - write data
>   - write inode
>   - wait for data
> 
> actually is a benefit in terms of semantics (I agree that it could be
> faster in theory, but even that is debatable with todays seek latencies
> in disks)
> 
> Think about a simple non-journaling filesystem like ext2:
> 
>  (1) block get allocated during ->write before putting data in
>       - this dirties the inode because we update i_block/i_size/etc
>  (2) we call fsync (or the O_SNC handling code for that matter)
>       - we start writeout of the data, which takes forever because the
> 	file is very large
>       - then we write out the inode, including the i_size/i_blocks
> 	update
>       - due to some reason this gets reordered before the data writeout
> 	finishes (without that happening there would be no benefit to
> 	this ordering anyway)
>  (3) no we call filemap_fdatawait to wait for data I/O to finish
> 
> Now the system crashes between (2) and (3).  After that we we do have
> stale data in the inode in the area not written yet.
  Yes, that's true.

> Is there some case between that simple filesystem and the i_size update
> from I/O completion handler in XFS/ext4 where this behaviour actually
> buys us anything?  Any ext3 magic maybe?
  Hmm, I can imagine it would buy us something in two cases (but looking at
the code, neither is implemented in such a way that it would really help
us in any way):
1) when an inode and it's data are stored in one block (e.g. OCFS2 or UDF) do
   this.
2) when we journal data
  In the first case we would wait for block with data to be written only to
submit it again because inode was still dirty.
  In the second case, it would make sence if we waited for transaction
commit in fdatawait() because only then data is really on disk. But I don't
know about a fs which would do it - ext3 in data=journal mode just adds
page buffers to the current transaction in writepage and never sets
PageWriteback so fdatawait() is nop for it. The page is pinned in memory
only by the fact that its buffer heads are part of a transaction and thus
cannot be freed.
  So currently I don't know about real cases where fdatawait after ->fsync()
would buy us anything...

									Honza
-- 
Jan Kara <jack@suse.cz>
SUSE Labs, CR

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH 07/17] vfs: Introduce new helpers for syncing after writing to O_SYNC file or IS_SYNC inode
  2009-08-21 15:23         ` Jan Kara
@ 2009-08-21 15:32           ` Christoph Hellwig
  2009-08-21 15:48             ` Jan Kara
  0 siblings, 1 reply; 17+ messages in thread
From: Christoph Hellwig @ 2009-08-21 15:32 UTC (permalink / raw)
  To: Jan Kara
  Cc: linux-ext4, tytso, linux-ntfs-dev, LKML, Joel Becker,
	Christoph Hellwig, Anton Altaparmakov, OGAWA Hirofumi,
	Evgeniy Polyakov, xfs, ocfs2-devel

On Fri, Aug 21, 2009 at 05:23:39PM +0200, Jan Kara wrote:
>   Hmm, I can imagine it would buy us something in two cases (but looking at
> the code, neither is implemented in such a way that it would really help
> us in any way):
> 1) when an inode and it's data are stored in one block (e.g. OCFS2 or UDF) do
>    this.

>   In the first case we would wait for block with data to be written only to
> submit it again because inode was still dirty.

But we would not actually store in the pagecache or at least never
writeback it via the pacache in that case, but always just write it as
part of the inode.

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH 07/17] vfs: Introduce new helpers for syncing after writing to O_SYNC file or IS_SYNC inode
  2009-08-21 15:32           ` Christoph Hellwig
@ 2009-08-21 15:48             ` Jan Kara
  0 siblings, 0 replies; 17+ messages in thread
From: Jan Kara @ 2009-08-21 15:48 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: tytso, linux-ext4, Jan Kara, linux-ntfs-dev, LKML, Joel Becker,
	Anton Altaparmakov, OGAWA Hirofumi, Evgeniy Polyakov, xfs,
	ocfs2-devel

On Fri 21-08-09 11:32:35, Christoph Hellwig wrote:
> On Fri, Aug 21, 2009 at 05:23:39PM +0200, Jan Kara wrote:
> >   Hmm, I can imagine it would buy us something in two cases (but looking at
> > the code, neither is implemented in such a way that it would really help
> > us in any way):
> > 1) when an inode and it's data are stored in one block (e.g. OCFS2 or UDF) do
> >    this.
> 
> >   In the first case we would wait for block with data to be written only to
> > submit it again because inode was still dirty.
> 
> But we would not actually store in the pagecache or at least never
> writeback it via the pacache in that case, but always just write it as
> part of the inode.
  Yes, probably. Actually, that's what UDF does. It's just that then the
sequence fdatawrite(), fdatawait() does not really send the data to disk.
But that probably does not matter.

									Honza
-- 
Jan Kara <jack@suse.cz>
SUSE Labs, CR

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs

^ permalink raw reply	[flat|nested] 17+ messages in thread

* [PATCH 03/17] vfs: Remove syncing from generic_file_direct_write() and generic_file_buffered_write()
       [not found] <1250874001-15483-1-git-send-email-jack@suse.cz>
@ 2009-08-21 16:59 ` Jan Kara
  0 siblings, 0 replies; 17+ messages in thread
From: Jan Kara @ 2009-08-21 16:59 UTC (permalink / raw)
  To: LKML; +Cc: Jan Kara, Joel Becker, xfs, linux-fsdevel, hch, ocfs2-devel

generic_file_direct_write() and generic_file_buffered_write() called
generic_osync_inode() if it was called on O_SYNC file or IS_SYNC inode. But
this is superfluous since generic_file_aio_write() does the syncing as well.
Also XFS and OCFS2 which call these functions directly handle syncing
themselves. So let's have a single place where syncing happens:
generic_file_aio_write().

We slightly change the behavior by syncing only the range of file to which the
write happened for buffered writes but that should be all that is required.

CC: ocfs2-devel@oss.oracle.com
CC: Joel Becker <joel.becker@oracle.com>
CC: Felix Blyakher <felixb@sgi.com>
CC: xfs@oss.sgi.com
Signed-off-by: Jan Kara <jack@suse.cz>
---
 mm/filemap.c |   31 ++++---------------------------
 1 files changed, 4 insertions(+), 27 deletions(-)

diff --git a/mm/filemap.c b/mm/filemap.c
index 554a396..b523e42 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -2187,20 +2187,7 @@ generic_file_direct_write(struct kiocb *iocb, const struct iovec *iov,
 		}
 		*ppos = end;
 	}
-
-	/*
-	 * Sync the fs metadata but not the minor inode changes and
-	 * of course not the data as we did direct DMA for the IO.
-	 * i_mutex is held, which protects generic_osync_inode() from
-	 * livelocking.  AIO O_DIRECT ops attempt to sync metadata here.
-	 */
 out:
-	if ((written >= 0 || written == -EIOCBQUEUED) &&
-	    ((file->f_flags & O_SYNC) || IS_SYNC(inode))) {
-		int err = generic_osync_inode(inode, mapping, OSYNC_METADATA);
-		if (err < 0)
-			written = err;
-	}
 	return written;
 }
 EXPORT_SYMBOL(generic_file_direct_write);
@@ -2332,8 +2319,6 @@ generic_file_buffered_write(struct kiocb *iocb, const struct iovec *iov,
 {
 	struct file *file = iocb->ki_filp;
 	struct address_space *mapping = file->f_mapping;
-	const struct address_space_operations *a_ops = mapping->a_ops;
-	struct inode *inode = mapping->host;
 	ssize_t status;
 	struct iov_iter i;
 
@@ -2343,16 +2328,6 @@ generic_file_buffered_write(struct kiocb *iocb, const struct iovec *iov,
 	if (likely(status >= 0)) {
 		written += status;
 		*ppos = pos + status;
-
-		/*
-		 * For now, when the user asks for O_SYNC, we'll actually give
-		 * O_DSYNC
-		 */
-		if (unlikely((file->f_flags & O_SYNC) || IS_SYNC(inode))) {
-			if (!a_ops->writepage || !is_sync_kiocb(iocb))
-				status = generic_osync_inode(inode, mapping,
-						OSYNC_METADATA|OSYNC_DATA);
-		}
   	}
 	
 	/*
@@ -2514,7 +2489,8 @@ ssize_t generic_file_aio_write_nolock(struct kiocb *iocb,
 
 	ret = __generic_file_aio_write(iocb, iov, nr_segs, &iocb->ki_pos);
 
-	if (ret > 0 && ((file->f_flags & O_SYNC) || IS_SYNC(inode))) {
+	if ((ret > 0 || ret == -EIOCBQUEUED) &&
+	    ((file->f_flags & O_SYNC) || IS_SYNC(inode))) {
 		ssize_t err;
 
 		err = sync_page_range_nolock(inode, mapping, pos, ret);
@@ -2550,7 +2526,8 @@ ssize_t generic_file_aio_write(struct kiocb *iocb, const struct iovec *iov,
 	ret = __generic_file_aio_write(iocb, iov, nr_segs, &iocb->ki_pos);
 	mutex_unlock(&inode->i_mutex);
 
-	if (ret > 0 && ((file->f_flags & O_SYNC) || IS_SYNC(inode))) {
+	if ((ret > 0 || ret == -EIOCBQUEUED) &&
+	    ((file->f_flags & O_SYNC) || IS_SYNC(inode))) {
 		ssize_t err;
 
 		err = sync_page_range(inode, mapping, pos, ret);
-- 
1.6.0.2

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs

^ permalink raw reply related	[flat|nested] 17+ messages in thread

* [PATCH 03/17] vfs: Remove syncing from generic_file_direct_write() and generic_file_buffered_write()
       [not found] <1250875447-15622-1-git-send-email-jack@suse.cz>
@ 2009-08-21 17:23 ` Jan Kara
  0 siblings, 0 replies; 17+ messages in thread
From: Jan Kara @ 2009-08-21 17:23 UTC (permalink / raw)
  To: LKML; +Cc: Jan Kara, Joel Becker, xfs, linux-fsdevel, hch, ocfs2-devel

generic_file_direct_write() and generic_file_buffered_write() called
generic_osync_inode() if it was called on O_SYNC file or IS_SYNC inode. But
this is superfluous since generic_file_aio_write() does the syncing as well.
Also XFS and OCFS2 which call these functions directly handle syncing
themselves. So let's have a single place where syncing happens:
generic_file_aio_write().

We slightly change the behavior by syncing only the range of file to which the
write happened for buffered writes but that should be all that is required.

CC: ocfs2-devel@oss.oracle.com
CC: Joel Becker <joel.becker@oracle.com>
CC: Felix Blyakher <felixb@sgi.com>
CC: xfs@oss.sgi.com
Signed-off-by: Jan Kara <jack@suse.cz>
---
 mm/filemap.c |   35 ++++++-----------------------------
 1 files changed, 6 insertions(+), 29 deletions(-)

diff --git a/mm/filemap.c b/mm/filemap.c
index 554a396..f863e1d 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -2187,20 +2187,7 @@ generic_file_direct_write(struct kiocb *iocb, const struct iovec *iov,
 		}
 		*ppos = end;
 	}
-
-	/*
-	 * Sync the fs metadata but not the minor inode changes and
-	 * of course not the data as we did direct DMA for the IO.
-	 * i_mutex is held, which protects generic_osync_inode() from
-	 * livelocking.  AIO O_DIRECT ops attempt to sync metadata here.
-	 */
 out:
-	if ((written >= 0 || written == -EIOCBQUEUED) &&
-	    ((file->f_flags & O_SYNC) || IS_SYNC(inode))) {
-		int err = generic_osync_inode(inode, mapping, OSYNC_METADATA);
-		if (err < 0)
-			written = err;
-	}
 	return written;
 }
 EXPORT_SYMBOL(generic_file_direct_write);
@@ -2332,8 +2319,6 @@ generic_file_buffered_write(struct kiocb *iocb, const struct iovec *iov,
 {
 	struct file *file = iocb->ki_filp;
 	struct address_space *mapping = file->f_mapping;
-	const struct address_space_operations *a_ops = mapping->a_ops;
-	struct inode *inode = mapping->host;
 	ssize_t status;
 	struct iov_iter i;
 
@@ -2343,16 +2328,6 @@ generic_file_buffered_write(struct kiocb *iocb, const struct iovec *iov,
 	if (likely(status >= 0)) {
 		written += status;
 		*ppos = pos + status;
-
-		/*
-		 * For now, when the user asks for O_SYNC, we'll actually give
-		 * O_DSYNC
-		 */
-		if (unlikely((file->f_flags & O_SYNC) || IS_SYNC(inode))) {
-			if (!a_ops->writepage || !is_sync_kiocb(iocb))
-				status = generic_osync_inode(inode, mapping,
-						OSYNC_METADATA|OSYNC_DATA);
-		}
   	}
 	
 	/*
@@ -2514,11 +2489,12 @@ ssize_t generic_file_aio_write_nolock(struct kiocb *iocb,
 
 	ret = __generic_file_aio_write(iocb, iov, nr_segs, &iocb->ki_pos);
 
-	if (ret > 0 && ((file->f_flags & O_SYNC) || IS_SYNC(inode))) {
+	if ((ret > 0 || ret == -EIOCBQUEUED) &&
+	    ((file->f_flags & O_SYNC) || IS_SYNC(inode))) {
 		ssize_t err;
 
 		err = sync_page_range_nolock(inode, mapping, pos, ret);
-		if (err < 0)
+		if (err < 0 && ret > 0)
 			ret = err;
 	}
 	return ret;
@@ -2550,11 +2526,12 @@ ssize_t generic_file_aio_write(struct kiocb *iocb, const struct iovec *iov,
 	ret = __generic_file_aio_write(iocb, iov, nr_segs, &iocb->ki_pos);
 	mutex_unlock(&inode->i_mutex);
 
-	if (ret > 0 && ((file->f_flags & O_SYNC) || IS_SYNC(inode))) {
+	if ((ret > 0 || ret == -EIOCBQUEUED) &&
+	    ((file->f_flags & O_SYNC) || IS_SYNC(inode))) {
 		ssize_t err;
 
 		err = sync_page_range(inode, mapping, pos, ret);
-		if (err < 0)
+		if (err < 0 && ret > 0)
 			ret = err;
 	}
 	return ret;
-- 
1.6.0.2

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs

^ permalink raw reply related	[flat|nested] 17+ messages in thread

* Re: [PATCH 07/17] vfs: Introduce new helpers for syncing after writing to O_SYNC file or IS_SYNC inode
  2009-08-20 16:27       ` Christoph Hellwig
  2009-08-21 15:23         ` Jan Kara
@ 2009-08-26 18:22         ` Christoph Hellwig
  2009-08-27  0:04           ` Christoph Hellwig
  1 sibling, 1 reply; 17+ messages in thread
From: Christoph Hellwig @ 2009-08-26 18:22 UTC (permalink / raw)
  To: Jan Kara
  Cc: linux-ext4, tytso, linux-ntfs-dev, LKML, Joel Becker,
	Christoph Hellwig, Anton Altaparmakov, OGAWA Hirofumi,
	Evgeniy Polyakov, xfs, ocfs2-devel

On Thu, Aug 20, 2009 at 12:27:29PM -0400, Christoph Hellwig wrote:
> Maybe you can help brain storming, but I still can't see any way in that
> the
> 
>   - write data
>   - write inode
>   - wait for data
> 
> actually is a benefit in terms of semantics (I agree that it could be
> faster in theory, but even that is debatable with todays seek latencies
> in disks)

Btw, another reason why our current default is actively harmful:

	barriers

With volatile write caches we do have to flush the disk write cache in
->fsync, either implicitly by a metadata operation, or explicitly if
only data changed.  Unless the filesystems waits itself for the data
to hit the disk like XFS or btrfs will be issue the cache flush
potentially before the data write has actually reached the disk cache.

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH 07/17] vfs: Introduce new helpers for syncing after writing to O_SYNC file or IS_SYNC inode
  2009-08-26 18:22         ` Christoph Hellwig
@ 2009-08-27  0:04           ` Christoph Hellwig
  0 siblings, 0 replies; 17+ messages in thread
From: Christoph Hellwig @ 2009-08-27  0:04 UTC (permalink / raw)
  To: Jan Kara
  Cc: linux-ext4, tytso, linux-ntfs-dev, LKML, Joel Becker,
	Christoph Hellwig, Anton Altaparmakov, OGAWA Hirofumi,
	Evgeniy Polyakov, xfs, ocfs2-devel

On Wed, Aug 26, 2009 at 02:22:36PM -0400, Christoph Hellwig wrote:
> On Thu, Aug 20, 2009 at 12:27:29PM -0400, Christoph Hellwig wrote:
> > Maybe you can help brain storming, but I still can't see any way in that
> > the
> > 
> >   - write data
> >   - write inode
> >   - wait for data
> > 
> > actually is a benefit in terms of semantics (I agree that it could be
> > faster in theory, but even that is debatable with todays seek latencies
> > in disks)
> 
> Btw, another reason why our current default is actively harmful:
> 
> 	barriers
> 
> With volatile write caches we do have to flush the disk write cache in
> ->fsync, either implicitly by a metadata operation, or explicitly if
> only data changed.  Unless the filesystems waits itself for the data
> to hit the disk like XFS or btrfs will be issue the cache flush
> potentially before the data write has actually reached the disk cache.

Ok, this one failed the reality check - no matter how hard I tried
I could not reproduce that case in my test harness.  It turns out
cache flush request are quite sensibly treated as barriers by the block
layer and thus we drain the queue before issueing the cache flush.

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs

^ permalink raw reply	[flat|nested] 17+ messages in thread

end of thread, other threads:[~2009-08-27  0:03 UTC | newest]

Thread overview: 17+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
     [not found] <1250697884-22288-1-git-send-email-jack@suse.cz>
2009-08-19 16:04 ` [PATCH 03/17] vfs: Remove syncing from generic_file_direct_write() and generic_file_buffered_write() Jan Kara
2009-08-19 16:18   ` Christoph Hellwig
2009-08-20 13:31     ` Jan Kara
2009-08-19 16:04 ` [PATCH 07/17] vfs: Introduce new helpers for syncing after writing to O_SYNC file or IS_SYNC inode Jan Kara
2009-08-19 16:26   ` Christoph Hellwig
2009-08-20 12:15     ` Jan Kara
2009-08-20 16:27       ` Christoph Hellwig
2009-08-21 15:23         ` Jan Kara
2009-08-21 15:32           ` Christoph Hellwig
2009-08-21 15:48             ` Jan Kara
2009-08-26 18:22         ` Christoph Hellwig
2009-08-27  0:04           ` Christoph Hellwig
2009-08-19 16:04 ` [PATCH 14/17] xfs: Use new syncing helper Jan Kara
2009-08-19 16:33   ` Christoph Hellwig
2009-08-20 12:22     ` Jan Kara
     [not found] <1250874001-15483-1-git-send-email-jack@suse.cz>
2009-08-21 16:59 ` [PATCH 03/17] vfs: Remove syncing from generic_file_direct_write() and generic_file_buffered_write() Jan Kara
     [not found] <1250875447-15622-1-git-send-email-jack@suse.cz>
2009-08-21 17:23 ` Jan Kara

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox