[PATCH 0/17] Make O_SYNC handling use standard syncing path

linux-kernel.vger.kernel.org archive mirror
 help / color / mirror / Atom feed

* [PATCH 0/17] Make O_SYNC handling use standard syncing path
@ 2009-08-19 16:04 Jan Kara
  2009-08-19 16:04 ` [PATCH 01/17] vfs: Introduce filemap_fdatawait_range Jan Kara
                   ` (17 more replies)
  0 siblings, 18 replies; 66+ messages in thread
From: Jan Kara @ 2009-08-19 16:04 UTC (permalink / raw)
  To: LKML; +Cc: hch

  Hi,

  below comes a patch series that unifies O_SYNC handling with code doing
fsync(). After this, we have just one place forcing a single file to disk so
filesystems like ext3 / ext4 don't have to force a transaction commit in
ext?_file_write for O_SYNC files / IS_SYNC inodes.
  The code is also cleaner this way (actually about 150 lines shorter), we
don't sync the inode several times as it happened previously etc.
  The patchset touches quite some filesystems. Hopefully I've added all the
relevant people to CC for the relevant patches. I've tested via blktrace that
for ext?, xfs, ocfs2 and fat, the file is indeed forced to disk as expected.
Maintainers of pohmelfs and NTFS might want to check that everything works
as expected for them after applying the patches.
  Comments / review welcome.

								Honza

^ permalink raw reply	[flat|nested] 66+ messages in thread

* [PATCH 01/17] vfs: Introduce filemap_fdatawait_range
  2009-08-19 16:04 [PATCH 0/17] Make O_SYNC handling use standard syncing path Jan Kara
@ 2009-08-19 16:04 ` Jan Kara
  2009-08-19 16:10   ` Christoph Hellwig
  2009-08-19 16:04 ` [PATCH 02/17] vfs: Export __generic_file_aio_write() and add some comments Jan Kara
                   ` (16 subsequent siblings)
  17 siblings, 1 reply; 66+ messages in thread
From: Jan Kara @ 2009-08-19 16:04 UTC (permalink / raw)
  To: LKML; +Cc: hch, Jan Kara

This simple helper saves some filesystems conversion from byte offset
to page numbers and also makes the fdata* interface more complete.

Signed-off-by: Jan Kara <jack@suse.cz>
---
 include/linux/fs.h |    2 ++
 mm/filemap.c       |   20 ++++++++++++++++++++
 2 files changed, 22 insertions(+), 0 deletions(-)

diff --git a/include/linux/fs.h b/include/linux/fs.h
index 67888a9..cb365ad 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -2076,6 +2076,8 @@ extern int write_inode_now(struct inode *, int);
 extern int filemap_fdatawrite(struct address_space *);
 extern int filemap_flush(struct address_space *);
 extern int filemap_fdatawait(struct address_space *);
+extern int filemap_fdatawait_range(struct address_space *, loff_t lstart,
+				   loff_t lend);
 extern int filemap_write_and_wait(struct address_space *mapping);
 extern int filemap_write_and_wait_range(struct address_space *mapping,
 				        loff_t lstart, loff_t lend);
diff --git a/mm/filemap.c b/mm/filemap.c
index ccea3b6..65b2e50 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -307,6 +307,26 @@ int wait_on_page_writeback_range(struct address_space *mapping,
 }
 
 /**
+ * filemap_fdatawait_range - wait for all under-writeback pages to complete in a given range
+ * @mapping: address space structure to wait for
+ * @start:	offset in bytes where the range starts
+ * @end:	offset in bytes where the range ends (inclusive)
+ *
+ * Walk the list of under-writeback pages of the given address space
+ * in the given range and wait for all of them.
+ *
+ * This is just a simple wrapper so that callers don't have to convert offsets
+ * to page indexes themselves
+ */
+int filemap_fdatawait_range(struct address_space *mapping, loff_t start,
+			    loff_t end)
+{
+	return wait_on_page_writeback_range(mapping, start >> PAGE_CACHE_SHIFT,
+					    end >> PAGE_CACHE_SHIFT);
+}
+EXPORT_SYMBOL(filemap_fdatawait_range);
+
+/**
  * sync_page_range - write and wait on all pages in the passed range
  * @inode:	target inode
  * @mapping:	target address_space
-- 
1.6.0.2


^ permalink raw reply related	[flat|nested] 66+ messages in thread

* [PATCH 02/17] vfs: Export __generic_file_aio_write() and add some comments
  2009-08-19 16:04 [PATCH 0/17] Make O_SYNC handling use standard syncing path Jan Kara
  2009-08-19 16:04 ` [PATCH 01/17] vfs: Introduce filemap_fdatawait_range Jan Kara
@ 2009-08-19 16:04 ` Jan Kara
  2009-08-19 16:11   ` Christoph Hellwig
  2009-08-19 20:22   ` Evgeniy Polyakov
  2009-08-19 16:04 ` [PATCH 03/17] vfs: Remove syncing from generic_file_direct_write() and generic_file_buffered_write() Jan Kara
                   ` (15 subsequent siblings)
  17 siblings, 2 replies; 66+ messages in thread
From: Jan Kara @ 2009-08-19 16:04 UTC (permalink / raw)
  To: LKML; +Cc: hch, Jan Kara, Evgeniy Polyakov, ocfs2-devel, Joel Becker

Rename __generic_file_aio_write_nolock() to __generic_file_aio_write(), add
comments to write helpers explaining how they should be used and export
__generic_file_aio_write() since it will be used by some filesystems.

CC: Evgeniy Polyakov <zbr@ioremap.net>
CC: ocfs2-devel@oss.oracle.com
CC: Joel Becker <joel.becker@oracle.com>
Signed-off-by: Jan Kara <jack@suse.cz>
---
 include/linux/fs.h |    2 +
 mm/filemap.c       |   57 +++++++++++++++++++++++++++++++++++++++++++++------
 2 files changed, 52 insertions(+), 7 deletions(-)

diff --git a/include/linux/fs.h b/include/linux/fs.h
index cb365ad..1edd159 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -2195,6 +2195,8 @@ extern int generic_file_readonly_mmap(struct file *, struct vm_area_struct *);
 extern int file_read_actor(read_descriptor_t * desc, struct page *page, unsigned long offset, unsigned long size);
 int generic_write_checks(struct file *file, loff_t *pos, size_t *count, int isblk);
 extern ssize_t generic_file_aio_read(struct kiocb *, const struct iovec *, unsigned long, loff_t);
+extern ssize_t __generic_file_aio_write(struct kiocb *, const struct iovec *, unsigned long,
+		loff_t *);
 extern ssize_t generic_file_aio_write(struct kiocb *, const struct iovec *, unsigned long, loff_t);
 extern ssize_t generic_file_aio_write_nolock(struct kiocb *, const struct iovec *,
 		unsigned long, loff_t);
diff --git a/mm/filemap.c b/mm/filemap.c
index 65b2e50..554a396 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -2368,9 +2368,27 @@ generic_file_buffered_write(struct kiocb *iocb, const struct iovec *iov,
 }
 EXPORT_SYMBOL(generic_file_buffered_write);
 
-static ssize_t
-__generic_file_aio_write_nolock(struct kiocb *iocb, const struct iovec *iov,
-				unsigned long nr_segs, loff_t *ppos)
+/**
+ * __generic_file_aio_write - write data to a file
+ * @iocb:	IO state structure (file, offset, etc.)
+ * @iov:	vector with data to write
+ * @nr_segs:	number of segments in the vector
+ * @ppos:	position where to write
+ *
+ * This function does all the work needed for actually writing data to a
+ * file. It does all basic checks, removes SUID from the file, updates
+ * modification times and calls proper subroutines depending on whether we
+ * do direct IO or a standard buffered write.
+ *
+ * It expects i_mutex to be grabbed unless we work on a block device or similar
+ * object which does not need locking at all.
+ *
+ * This function does *not* take care of syncing data in case of O_SYNC write.
+ * A caller has to handle it. This is mainly due to the fact that we want to
+ * avoid syncing under i_mutex.
+ */
+ssize_t __generic_file_aio_write(struct kiocb *iocb, const struct iovec *iov,
+				 unsigned long nr_segs, loff_t *ppos)
 {
 	struct file *file = iocb->ki_filp;
 	struct address_space * mapping = file->f_mapping;
@@ -2467,7 +2485,23 @@ out:
 	current->backing_dev_info = NULL;
 	return written ? written : err;
 }
+EXPORT_SYMBOL(__generic_file_aio_write);
+
 
+/**
+ * generic_file_aio_write_nolock - write data, usually to a device
+ * @iocb:	IO state structure
+ * @iov:	vector with data to write
+ * @nr_segs:	number of segments in the vector
+ * @pos:	position in file where to write
+ *
+ * This is a wrapper around __generic_file_aio_write() which takes care of
+ * syncing the file in case of O_SYNC file. It does not take i_mutex for the
+ * write itself but may do so during syncing. It is meant for users like block
+ * devices which do not need i_mutex during write. If your filesystem needs to
+ * do a write but already holds i_mutex, use __generic_file_aio_write()
+ * directly and then sync the file like generic_file_aio_write().
+ */
 ssize_t generic_file_aio_write_nolock(struct kiocb *iocb,
 		const struct iovec *iov, unsigned long nr_segs, loff_t pos)
 {
@@ -2478,8 +2512,7 @@ ssize_t generic_file_aio_write_nolock(struct kiocb *iocb,
 
 	BUG_ON(iocb->ki_pos != pos);
 
-	ret = __generic_file_aio_write_nolock(iocb, iov, nr_segs,
-			&iocb->ki_pos);
+	ret = __generic_file_aio_write(iocb, iov, nr_segs, &iocb->ki_pos);
 
 	if (ret > 0 && ((file->f_flags & O_SYNC) || IS_SYNC(inode))) {
 		ssize_t err;
@@ -2492,6 +2525,17 @@ ssize_t generic_file_aio_write_nolock(struct kiocb *iocb,
 }
 EXPORT_SYMBOL(generic_file_aio_write_nolock);
 
+/**
+ * generic_file_aio_write - write data to a file
+ * @iocb:	IO state structure
+ * @iov:	vector with data to write
+ * @nr_segs:	number of segments in the vector
+ * @pos:	position in file where to write
+ *
+ * This is a wrapper around __generic_file_aio_write() to be used by most
+ * filesystems. It takes care of syncing the file in case of O_SYNC file
+ * and acquires i_mutex as needed.
+ */
 ssize_t generic_file_aio_write(struct kiocb *iocb, const struct iovec *iov,
 		unsigned long nr_segs, loff_t pos)
 {
@@ -2503,8 +2547,7 @@ ssize_t generic_file_aio_write(struct kiocb *iocb, const struct iovec *iov,
 	BUG_ON(iocb->ki_pos != pos);
 
 	mutex_lock(&inode->i_mutex);
-	ret = __generic_file_aio_write_nolock(iocb, iov, nr_segs,
-			&iocb->ki_pos);
+	ret = __generic_file_aio_write(iocb, iov, nr_segs, &iocb->ki_pos);
 	mutex_unlock(&inode->i_mutex);
 
 	if (ret > 0 && ((file->f_flags & O_SYNC) || IS_SYNC(inode))) {
-- 
1.6.0.2


^ permalink raw reply related	[flat|nested] 66+ messages in thread

* [PATCH 03/17] vfs: Remove syncing from generic_file_direct_write() and generic_file_buffered_write()
  2009-08-19 16:04 [PATCH 0/17] Make O_SYNC handling use standard syncing path Jan Kara
  2009-08-19 16:04 ` [PATCH 01/17] vfs: Introduce filemap_fdatawait_range Jan Kara
  2009-08-19 16:04 ` [PATCH 02/17] vfs: Export __generic_file_aio_write() and add some comments Jan Kara
@ 2009-08-19 16:04 ` Jan Kara
  2009-08-19 16:18   ` Christoph Hellwig
  2009-08-19 16:04 ` [PATCH 04/17] pohmelfs: Use __generic_file_aio_write instead of generic_file_aio_write_nolock Jan Kara
                   ` (14 subsequent siblings)
  17 siblings, 1 reply; 66+ messages in thread
From: Jan Kara @ 2009-08-19 16:04 UTC (permalink / raw)
  To: LKML; +Cc: hch, Jan Kara, ocfs2-devel, Joel Becker, Felix Blyakher, xfs

generic_file_direct_write() and generic_file_buffered_write() called
generic_osync_inode() if it was called on O_SYNC file or IS_SYNC inode. But
this is superfluous since generic_file_aio_write() does the syncing as well.
Also XFS and OCFS2 which call these functions directly handle syncing
themselves. So let's have a single place where syncing happens:
generic_file_aio_write().

CC: ocfs2-devel@oss.oracle.com
CC: Joel Becker <joel.becker@oracle.com>
CC: Felix Blyakher <felixb@sgi.com>
CC: xfs@oss.sgi.com
Signed-off-by: Jan Kara <jack@suse.cz>
---
 mm/filemap.c |   25 -------------------------
 1 files changed, 0 insertions(+), 25 deletions(-)

diff --git a/mm/filemap.c b/mm/filemap.c
index 554a396..3bee198 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -2187,20 +2187,7 @@ generic_file_direct_write(struct kiocb *iocb, const struct iovec *iov,
 		}
 		*ppos = end;
 	}
-
-	/*
-	 * Sync the fs metadata but not the minor inode changes and
-	 * of course not the data as we did direct DMA for the IO.
-	 * i_mutex is held, which protects generic_osync_inode() from
-	 * livelocking.  AIO O_DIRECT ops attempt to sync metadata here.
-	 */
 out:
-	if ((written >= 0 || written == -EIOCBQUEUED) &&
-	    ((file->f_flags & O_SYNC) || IS_SYNC(inode))) {
-		int err = generic_osync_inode(inode, mapping, OSYNC_METADATA);
-		if (err < 0)
-			written = err;
-	}
 	return written;
 }
 EXPORT_SYMBOL(generic_file_direct_write);
@@ -2332,8 +2319,6 @@ generic_file_buffered_write(struct kiocb *iocb, const struct iovec *iov,
 {
 	struct file *file = iocb->ki_filp;
 	struct address_space *mapping = file->f_mapping;
-	const struct address_space_operations *a_ops = mapping->a_ops;
-	struct inode *inode = mapping->host;
 	ssize_t status;
 	struct iov_iter i;
 
@@ -2343,16 +2328,6 @@ generic_file_buffered_write(struct kiocb *iocb, const struct iovec *iov,
 	if (likely(status >= 0)) {
 		written += status;
 		*ppos = pos + status;
-
-		/*
-		 * For now, when the user asks for O_SYNC, we'll actually give
-		 * O_DSYNC
-		 */
-		if (unlikely((file->f_flags & O_SYNC) || IS_SYNC(inode))) {
-			if (!a_ops->writepage || !is_sync_kiocb(iocb))
-				status = generic_osync_inode(inode, mapping,
-						OSYNC_METADATA|OSYNC_DATA);
-		}
   	}
 	
 	/*
-- 
1.6.0.2


^ permalink raw reply related	[flat|nested] 66+ messages in thread

* [PATCH 04/17] pohmelfs: Use __generic_file_aio_write instead of generic_file_aio_write_nolock
  2009-08-19 16:04 [PATCH 0/17] Make O_SYNC handling use standard syncing path Jan Kara
                   ` (2 preceding siblings ...)
  2009-08-19 16:04 ` [PATCH 03/17] vfs: Remove syncing from generic_file_direct_write() and generic_file_buffered_write() Jan Kara
@ 2009-08-19 16:04 ` Jan Kara
  2009-08-19 16:04 ` [PATCH 05/17] ocfs2: " Jan Kara
                   ` (13 subsequent siblings)
  17 siblings, 0 replies; 66+ messages in thread
From: Jan Kara @ 2009-08-19 16:04 UTC (permalink / raw)
  To: LKML; +Cc: hch, Jan Kara, Evgeniy Polyakov

Use new helper __generic_file_aio_write(). Since the fs takes care of syncing
by itself afterwards, there are no more changes needed.

CC: Evgeniy Polyakov <zbr@ioremap.net>
Signed-off-by: Jan Kara <jack@suse.cz>
---
 drivers/staging/pohmelfs/inode.c |    2 +-
 1 files changed, 1 insertions(+), 1 deletions(-)

diff --git a/drivers/staging/pohmelfs/inode.c b/drivers/staging/pohmelfs/inode.c
index 7b60579..17801a5 100644
--- a/drivers/staging/pohmelfs/inode.c
+++ b/drivers/staging/pohmelfs/inode.c
@@ -921,7 +921,7 @@ ssize_t pohmelfs_write(struct file *file, const char __user *buf,
 	if (ret)
 		goto err_out_unlock;
 
-	ret = generic_file_aio_write_nolock(&kiocb, &iov, 1, pos);
+	ret = __generic_file_aio_write(&kiocb, &iov, 1, &kiocb.ki_pos);
 	*ppos = kiocb.ki_pos;
 
 	mutex_unlock(&inode->i_mutex);
-- 
1.6.0.2


^ permalink raw reply related	[flat|nested] 66+ messages in thread

* [PATCH 05/17] ocfs2: Use __generic_file_aio_write instead of generic_file_aio_write_nolock
  2009-08-19 16:04 [PATCH 0/17] Make O_SYNC handling use standard syncing path Jan Kara
                   ` (3 preceding siblings ...)
  2009-08-19 16:04 ` [PATCH 04/17] pohmelfs: Use __generic_file_aio_write instead of generic_file_aio_write_nolock Jan Kara
@ 2009-08-19 16:04 ` Jan Kara
  2009-08-19 16:04 ` [PATCH 06/17] vfs: Remove sync_page_range_nolock Jan Kara
                   ` (12 subsequent siblings)
  17 siblings, 0 replies; 66+ messages in thread
From: Jan Kara @ 2009-08-19 16:04 UTC (permalink / raw)
  To: LKML; +Cc: hch, Jan Kara, ocfs2-devel, Joel Becker

Use the new helper. We have to submit data pages ourselves in case of O_SYNC
write because __generic_file_aio_write does not do it for us. OCFS2 developpers
might think about moving the sync out of i_mutex which seems to be easily
possible but that's out of scope of this patch.

CC: ocfs2-devel@oss.oracle.com
CC: Joel Becker <joel.becker@oracle.com>
Signed-off-by: Jan Kara <jack@suse.cz>
---
 fs/ocfs2/file.c |   22 ++++++++++++----------
 1 files changed, 12 insertions(+), 10 deletions(-)

diff --git a/fs/ocfs2/file.c b/fs/ocfs2/file.c
index 62442e4..1c71f0a 100644
--- a/fs/ocfs2/file.c
+++ b/fs/ocfs2/file.c
@@ -1870,8 +1870,7 @@ relock:
 			goto out_dio;
 		}
 	} else {
-		written = generic_file_aio_write_nolock(iocb, iov, nr_segs,
-							*ppos);
+		written = __generic_file_aio_write(iocb, iov, nr_segs, ppos);
 	}
 
 out_dio:
@@ -1879,18 +1878,21 @@ out_dio:
 	BUG_ON(ret == -EIOCBQUEUED && !(file->f_flags & O_DIRECT));
 
 	if ((file->f_flags & O_SYNC && !direct_io) || IS_SYNC(inode)) {
-		/*
-		 * The generic write paths have handled getting data
-		 * to disk, but since we don't make use of the dirty
-		 * inode list, a manual journal commit is necessary
-		 * here.
-		 */
-		if (old_size != i_size_read(inode) ||
-		    old_clusters != OCFS2_I(inode)->ip_clusters) {
+		ret = filemap_fdatawrite_range(file->f_mapping, pos,
+					       pos + count - 1);
+		if (ret < 0)
+			written = ret;
+
+		if (!ret && (old_size != i_size_read(inode) ||
+		    old_clusters != OCFS2_I(inode)->ip_clusters)) {
 			ret = jbd2_journal_force_commit(osb->journal->j_journal);
 			if (ret < 0)
 				written = ret;
 		}
+
+		if (!ret)
+			ret = filemap_fdatawait_range(file->f_mapping, pos,
+						      pos + count - 1);
 	}
 
 	/* 
-- 
1.6.0.2


^ permalink raw reply related	[flat|nested] 66+ messages in thread

* [PATCH 06/17] vfs: Remove sync_page_range_nolock
  2009-08-19 16:04 [PATCH 0/17] Make O_SYNC handling use standard syncing path Jan Kara
                   ` (4 preceding siblings ...)
  2009-08-19 16:04 ` [PATCH 05/17] ocfs2: " Jan Kara
@ 2009-08-19 16:04 ` Jan Kara
  2009-08-19 16:21   ` Christoph Hellwig
  2009-08-19 16:04 ` [PATCH 07/17] vfs: Introduce new helpers for syncing after writing to O_SYNC file or IS_SYNC inode Jan Kara
                   ` (11 subsequent siblings)
  17 siblings, 1 reply; 66+ messages in thread
From: Jan Kara @ 2009-08-19 16:04 UTC (permalink / raw)
  To: LKML; +Cc: hch, Jan Kara

The last user of sync_page_range_nolock() is generic_file_aio_write_nolock().
Now we have converted all its callers to not hold i_mutex and so we can
afford to call sync_page_range() instead of sync_page_range_nolock() from
there. This non-necessarily acquires i_mutex when syncing block devices but
that's happening in fsync() path as well and block_fsync() may just drop +
reacquire i_mutex if we care enough.

Signed-off-by: Jan Kara <jack@suse.cz>
---
 include/linux/writeback.h |    2 --
 mm/filemap.c              |   31 +------------------------------
 2 files changed, 1 insertions(+), 32 deletions(-)

diff --git a/include/linux/writeback.h b/include/linux/writeback.h
index 3224820..46d6064 100644
--- a/include/linux/writeback.h
+++ b/include/linux/writeback.h
@@ -159,8 +159,6 @@ int write_cache_pages(struct address_space *mapping,
 int do_writepages(struct address_space *mapping, struct writeback_control *wbc);
 int sync_page_range(struct inode *inode, struct address_space *mapping,
 			loff_t pos, loff_t count);
-int sync_page_range_nolock(struct inode *inode, struct address_space *mapping,
-			   loff_t pos, loff_t count);
 void set_page_dirty_balance(struct page *page, int page_mkwrite);
 void writeback_set_ratelimit(void);
 
diff --git a/mm/filemap.c b/mm/filemap.c
index 3bee198..dffdc60 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -362,35 +362,6 @@ int sync_page_range(struct inode *inode, struct address_space *mapping,
 EXPORT_SYMBOL(sync_page_range);
 
 /**
- * sync_page_range_nolock - write & wait on all pages in the passed range without locking
- * @inode:	target inode
- * @mapping:	target address_space
- * @pos:	beginning offset in pages to write
- * @count:	number of bytes to write
- *
- * Note: Holding i_mutex across sync_page_range_nolock() is not a good idea
- * as it forces O_SYNC writers to different parts of the same file
- * to be serialised right until io completion.
- */
-int sync_page_range_nolock(struct inode *inode, struct address_space *mapping,
-			   loff_t pos, loff_t count)
-{
-	pgoff_t start = pos >> PAGE_CACHE_SHIFT;
-	pgoff_t end = (pos + count - 1) >> PAGE_CACHE_SHIFT;
-	int ret;
-
-	if (!mapping_cap_writeback_dirty(mapping) || !count)
-		return 0;
-	ret = filemap_fdatawrite_range(mapping, pos, pos + count - 1);
-	if (ret == 0)
-		ret = generic_osync_inode(inode, mapping, OSYNC_METADATA);
-	if (ret == 0)
-		ret = wait_on_page_writeback_range(mapping, start, end);
-	return ret;
-}
-EXPORT_SYMBOL(sync_page_range_nolock);
-
-/**
  * filemap_fdatawait - wait for all under-writeback pages to complete
  * @mapping: address space structure to wait for
  *
@@ -2492,7 +2463,7 @@ ssize_t generic_file_aio_write_nolock(struct kiocb *iocb,
 	if (ret > 0 && ((file->f_flags & O_SYNC) || IS_SYNC(inode))) {
 		ssize_t err;
 
-		err = sync_page_range_nolock(inode, mapping, pos, ret);
+		err = sync_page_range(inode, mapping, pos, ret);
 		if (err < 0)
 			ret = err;
 	}
-- 
1.6.0.2


^ permalink raw reply related	[flat|nested] 66+ messages in thread

* [PATCH 07/17] vfs: Introduce new helpers for syncing after writing to O_SYNC file or IS_SYNC inode
  2009-08-19 16:04 [PATCH 0/17] Make O_SYNC handling use standard syncing path Jan Kara
                   ` (5 preceding siblings ...)
  2009-08-19 16:04 ` [PATCH 06/17] vfs: Remove sync_page_range_nolock Jan Kara
@ 2009-08-19 16:04 ` Jan Kara
  2009-08-19 16:26   ` Christoph Hellwig
  2009-08-19 16:04 ` [PATCH 08/17] ext2: Update comment about generic_osync_inode Jan Kara
                   ` (10 subsequent siblings)
  17 siblings, 1 reply; 66+ messages in thread
From: Jan Kara @ 2009-08-19 16:04 UTC (permalink / raw)
  To: LKML
  Cc: hch, Jan Kara, Evgeniy Polyakov, ocfs2-devel, Joel Becker,
	Felix Blyakher, xfs, Anton Altaparmakov, linux-ntfs-dev,
	OGAWA Hirofumi, linux-ext4, tytso

Introduce new function for generic inode syncing (generic_sync_file) and use
it from fsync() path. Introduce also new helper for syncing after a sync
write (generic_write_sync) using the generic function.

Use these new helpers for syncing from generic VFS functions.

CC: Evgeniy Polyakov <zbr@ioremap.net>
CC: ocfs2-devel@oss.oracle.com
CC: Joel Becker <joel.becker@oracle.com>
CC: Felix Blyakher <felixb@sgi.com>
CC: xfs@oss.sgi.com
CC: Anton Altaparmakov <aia21@cantab.net>
CC: linux-ntfs-dev@lists.sourceforge.net
CC: OGAWA Hirofumi <hirofumi@mail.parknet.co.jp>
CC: linux-ext4@vger.kernel.org
CC: tytso@mit.edu
Signed-off-by: Jan Kara <jack@suse.cz>
---
 fs/splice.c        |   22 ++++----------
 fs/sync.c          |   81 +++++++++++++++++++++++++++++++++++++++++++---------
 include/linux/fs.h |    7 ++++
 mm/filemap.c       |   16 ++++------
 4 files changed, 86 insertions(+), 40 deletions(-)

diff --git a/fs/splice.c b/fs/splice.c
index 73766d2..8190237 100644
--- a/fs/splice.c
+++ b/fs/splice.c
@@ -976,25 +976,15 @@ generic_file_splice_write(struct pipe_inode_info *pipe, struct file *out,
 
 	if (ret > 0) {
 		unsigned long nr_pages;
+		int err;
 
-		*ppos += ret;
 		nr_pages = (ret + PAGE_CACHE_SIZE - 1) >> PAGE_CACHE_SHIFT;
 
-		/*
-		 * If file or inode is SYNC and we actually wrote some data,
-		 * sync it.
-		 */
-		if (unlikely((out->f_flags & O_SYNC) || IS_SYNC(inode))) {
-			int err;
-
-			mutex_lock(&inode->i_mutex);
-			err = generic_osync_inode(inode, mapping,
-						  OSYNC_METADATA|OSYNC_DATA);
-			mutex_unlock(&inode->i_mutex);
-
-			if (err)
-				ret = err;
-		}
+		err = generic_write_sync(out, *ppos, ret);
+		if (err)
+			ret = err;
+		else
+			*ppos += ret;
 		balance_dirty_pages_ratelimited_nr(mapping, nr_pages);
 	}
 
diff --git a/fs/sync.c b/fs/sync.c
index 3422ba6..fc320aa 100644
--- a/fs/sync.c
+++ b/fs/sync.c
@@ -176,23 +176,30 @@ int file_fsync(struct file *filp, struct dentry *dentry, int datasync)
 }
 
 /**
- * vfs_fsync - perform a fsync or fdatasync on a file
+ * generic_sync_file - helper to sync data & metadata to disk
  * @file:		file to sync
  * @dentry:		dentry of @file
- * @data:		only perform a fdatasync operation
+ * @start:		offset in bytes of the beginning of data range to sync
+ * @end:		offset in bytes of the end of data range (inclusive)
+ * @what:		what should be synced
  *
- * Write back data and metadata for @file to disk.  If @datasync is
- * set only metadata needed to access modified file data is written.
+ * What the function exactly does is controlled by the @what parameter:
  *
- * In case this function is called from nfsd @file may be %NULL and
- * only @dentry is set.  This can only happen when the filesystem
- * implements the export_operations API.
+ * If SYNC_SUBMIT_DATA is set, the function submits all pages in the given
+ * range to disk.
+ *
+ * The function always calls ->fsync() callback of the filesystem. If
+ * SYNC_INODE is not set, we pass down the fact that it is just a datasync.
+ *
+ * If SYNC_WAIT_DATA is set, the function waits for writeback to finish
+ * in the given range.
  */
-int vfs_fsync(struct file *file, struct dentry *dentry, int datasync)
+int generic_sync_file(struct file *file, struct dentry *dentry, loff_t start,
+		      loff_t end, int what)
 {
 	const struct file_operations *fop;
 	struct address_space *mapping;
-	int err, ret;
+	int err, ret = 0;
 
 	/*
 	 * Get mapping and operations from the file in case we have
@@ -212,23 +219,50 @@ int vfs_fsync(struct file *file, struct dentry *dentry, int datasync)
 		goto out;
 	}
 
-	ret = filemap_fdatawrite(mapping);
+	if (what & SYNC_SUBMIT_DATA)
+		ret = filemap_fdatawrite_range(mapping, start, end);
 
 	/*
 	 * We need to protect against concurrent writers, which could cause
 	 * livelocks in fsync_buffers_list().
 	 */
 	mutex_lock(&mapping->host->i_mutex);
-	err = fop->fsync(file, dentry, datasync);
+	err = fop->fsync(file, dentry, !(what & SYNC_INODE));
 	if (!ret)
 		ret = err;
 	mutex_unlock(&mapping->host->i_mutex);
-	err = filemap_fdatawait(mapping);
-	if (!ret)
-		ret = err;
+
+	if (what & SYNC_WAIT_DATA) {
+		err = filemap_fdatawait_range(mapping, start, end);
+		if (!ret)
+			ret = err;
+	}
 out:
 	return ret;
 }
+EXPORT_SYMBOL(generic_sync_file);
+
+/**
+ * vfs_fsync - perform a fsync or fdatasync on a file
+ * @file:		file to sync
+ * @dentry:		dentry of @file
+ * @datasync:		only perform a fdatasync operation
+ *
+ * Write back data and metadata for @file to disk.  If @datasync is
+ * set only metadata needed to access modified file data is written.
+ *
+ * In case this function is called from nfsd @file may be %NULL and
+ * only @dentry is set.  This can only happen when the filesystem
+ * implements the export_operations API.
+ */
+int vfs_fsync(struct file *file, struct dentry *dentry, int datasync)
+{
+	int what = SYNC_SUBMIT_DATA | SYNC_WAIT_DATA;
+
+	if (!datasync)
+		what |= SYNC_INODE;
+	return generic_sync_file(file, dentry, 0, LLONG_MAX, what);
+}
 EXPORT_SYMBOL(vfs_fsync);
 
 static int do_fsync(unsigned int fd, int datasync)
@@ -254,6 +288,25 @@ SYSCALL_DEFINE1(fdatasync, unsigned int, fd)
 	return do_fsync(fd, 1);
 }
 
+/**
+ * generic_write_sync - perform syncing after a write if file / inode is sync
+ * @file:	file to which the write happened
+ * @pos:	offset where the write started
+ * @count:	length of the write
+ *
+ * This is just a simple wrapper about our general syncing function.
+ * FIXME: Make it inline?
+ */
+int generic_write_sync(struct file *file, loff_t pos, loff_t count)
+{
+	if (!(file->f_flags & O_SYNC) && !IS_SYNC(file->f_mapping->host))
+		return 0;
+	return generic_sync_file(file, file->f_path.dentry, pos,
+				 pos + count - 1,
+				 SYNC_SUBMIT_DATA | SYNC_WAIT_DATA);
+}
+EXPORT_SYMBOL(generic_write_sync);
+
 /*
  * sys_sync_file_range() permits finely controlled syncing over a segment of
  * a file in the range offset .. (offset+nbytes-1) inclusive.  If nbytes is
diff --git a/include/linux/fs.h b/include/linux/fs.h
index 1edd159..0f13049 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -2088,7 +2088,14 @@ extern int __filemap_fdatawrite_range(struct address_space *mapping,
 extern int filemap_fdatawrite_range(struct address_space *mapping,
 				loff_t start, loff_t end);
 
+/* Flags for generic_sync_file */
+#define SYNC_INODE		1
+#define SYNC_SUBMIT_DATA	2
+#define SYNC_WAIT_DATA		4
+extern int generic_sync_file(struct file *file, struct dentry *dentry,
+			   loff_t start, loff_t end, int what);
 extern int vfs_fsync(struct file *file, struct dentry *dentry, int datasync);
+extern int generic_write_sync(struct file *file, loff_t pos, loff_t count);
 extern void sync_supers(void);
 extern void emergency_sync(void);
 extern void emergency_remount(void);
diff --git a/mm/filemap.c b/mm/filemap.c
index dffdc60..a29b35b 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -39,11 +39,10 @@
 /*
  * FIXME: remove all knowledge of the buffer layer from the core VM
  */
-#include <linux/buffer_head.h> /* for generic_osync_inode */
+#include <linux/buffer_head.h> /* for try_to_free_buffers */
 
 #include <asm/mman.h>
 
-
 /*
  * Shared mappings implemented 30.11.1994. It's not fully working yet,
  * though.
@@ -2452,18 +2451,16 @@ ssize_t generic_file_aio_write_nolock(struct kiocb *iocb,
 		const struct iovec *iov, unsigned long nr_segs, loff_t pos)
 {
 	struct file *file = iocb->ki_filp;
-	struct address_space *mapping = file->f_mapping;
-	struct inode *inode = mapping->host;
 	ssize_t ret;
 
 	BUG_ON(iocb->ki_pos != pos);
 
 	ret = __generic_file_aio_write(iocb, iov, nr_segs, &iocb->ki_pos);
 
-	if (ret > 0 && ((file->f_flags & O_SYNC) || IS_SYNC(inode))) {
+	if (ret > 0) {
 		ssize_t err;
 
-		err = sync_page_range(inode, mapping, pos, ret);
+		err = generic_write_sync(file, pos, ret);
 		if (err < 0)
 			ret = err;
 	}
@@ -2486,8 +2483,7 @@ ssize_t generic_file_aio_write(struct kiocb *iocb, const struct iovec *iov,
 		unsigned long nr_segs, loff_t pos)
 {
 	struct file *file = iocb->ki_filp;
-	struct address_space *mapping = file->f_mapping;
-	struct inode *inode = mapping->host;
+	struct inode *inode = file->f_mapping->host;
 	ssize_t ret;
 
 	BUG_ON(iocb->ki_pos != pos);
@@ -2496,10 +2492,10 @@ ssize_t generic_file_aio_write(struct kiocb *iocb, const struct iovec *iov,
 	ret = __generic_file_aio_write(iocb, iov, nr_segs, &iocb->ki_pos);
 	mutex_unlock(&inode->i_mutex);
 
-	if (ret > 0 && ((file->f_flags & O_SYNC) || IS_SYNC(inode))) {
+	if (ret > 0) {
 		ssize_t err;
 
-		err = sync_page_range(inode, mapping, pos, ret);
+		err = generic_write_sync(file, pos, ret);
 		if (err < 0)
 			ret = err;
 	}
-- 
1.6.0.2


^ permalink raw reply related	[flat|nested] 66+ messages in thread

* [PATCH 08/17] ext2: Update comment about generic_osync_inode
  2009-08-19 16:04 [PATCH 0/17] Make O_SYNC handling use standard syncing path Jan Kara
                   ` (6 preceding siblings ...)
  2009-08-19 16:04 ` [PATCH 07/17] vfs: Introduce new helpers for syncing after writing to O_SYNC file or IS_SYNC inode Jan Kara
@ 2009-08-19 16:04 ` Jan Kara
  2009-08-19 16:04 ` [PATCH 09/17] ext3: Remove syncing logic from ext3_file_write Jan Kara
                   ` (9 subsequent siblings)
  17 siblings, 0 replies; 66+ messages in thread
From: Jan Kara @ 2009-08-19 16:04 UTC (permalink / raw)
  To: LKML; +Cc: hch, Jan Kara, linux-ext4

We rely on generic_write_sync() now.

CC: linux-ext4@vger.kernel.org
Signed-off-by: Jan Kara <jack@suse.cz>
---
 fs/ext2/inode.c |    2 +-
 1 files changed, 1 insertions(+), 1 deletions(-)

diff --git a/fs/ext2/inode.c b/fs/ext2/inode.c
index e271303..1c1638f 100644
--- a/fs/ext2/inode.c
+++ b/fs/ext2/inode.c
@@ -482,7 +482,7 @@ static int ext2_alloc_branch(struct inode *inode,
 		unlock_buffer(bh);
 		mark_buffer_dirty_inode(bh, inode);
 		/* We used to sync bh here if IS_SYNC(inode).
-		 * But we now rely upon generic_osync_inode()
+		 * But we now rely upon generic_write_sync()
 		 * and b_inode_buffers.  But not for directories.
 		 */
 		if (S_ISDIR(inode->i_mode) && IS_DIRSYNC(inode))
-- 
1.6.0.2


^ permalink raw reply related	[flat|nested] 66+ messages in thread

* [PATCH 09/17] ext3: Remove syncing logic from ext3_file_write
  2009-08-19 16:04 [PATCH 0/17] Make O_SYNC handling use standard syncing path Jan Kara
                   ` (7 preceding siblings ...)
  2009-08-19 16:04 ` [PATCH 08/17] ext2: Update comment about generic_osync_inode Jan Kara
@ 2009-08-19 16:04 ` Jan Kara
  2009-08-19 16:04 ` [PATCH 10/17] ext4: Remove syncing logic from ext4_file_write Jan Kara
                   ` (8 subsequent siblings)
  17 siblings, 0 replies; 66+ messages in thread
From: Jan Kara @ 2009-08-19 16:04 UTC (permalink / raw)
  To: LKML; +Cc: hch, Jan Kara, linux-ext4

Syncing is now properly done by generic_file_aio_write() so no special logic is
needed in ext3.

CC: linux-ext4@vger.kernel.org
Signed-off-by: Jan Kara <jack@suse.cz>
---
 fs/ext3/file.c |   61 +-------------------------------------------------------
 1 files changed, 1 insertions(+), 60 deletions(-)

diff --git a/fs/ext3/file.c b/fs/ext3/file.c
index 5b49704..51fee5f 100644
--- a/fs/ext3/file.c
+++ b/fs/ext3/file.c
@@ -51,71 +51,12 @@ static int ext3_release_file (struct inode * inode, struct file * filp)
 	return 0;
 }
 
-static ssize_t
-ext3_file_write(struct kiocb *iocb, const struct iovec *iov,
-		unsigned long nr_segs, loff_t pos)
-{
-	struct file *file = iocb->ki_filp;
-	struct inode *inode = file->f_path.dentry->d_inode;
-	ssize_t ret;
-	int err;
-
-	ret = generic_file_aio_write(iocb, iov, nr_segs, pos);
-
-	/*
-	 * Skip flushing if there was an error, or if nothing was written.
-	 */
-	if (ret <= 0)
-		return ret;
-
-	/*
-	 * If the inode is IS_SYNC, or is O_SYNC and we are doing data
-	 * journalling then we need to make sure that we force the transaction
-	 * to disk to keep all metadata uptodate synchronously.
-	 */
-	if (file->f_flags & O_SYNC) {
-		/*
-		 * If we are non-data-journaled, then the dirty data has
-		 * already been flushed to backing store by generic_osync_inode,
-		 * and the inode has been flushed too if there have been any
-		 * modifications other than mere timestamp updates.
-		 *
-		 * Open question --- do we care about flushing timestamps too
-		 * if the inode is IS_SYNC?
-		 */
-		if (!ext3_should_journal_data(inode))
-			return ret;
-
-		goto force_commit;
-	}
-
-	/*
-	 * So we know that there has been no forced data flush.  If the inode
-	 * is marked IS_SYNC, we need to force one ourselves.
-	 */
-	if (!IS_SYNC(inode))
-		return ret;
-
-	/*
-	 * Open question #2 --- should we force data to disk here too?  If we
-	 * don't, the only impact is that data=writeback filesystems won't
-	 * flush data to disk automatically on IS_SYNC, only metadata (but
-	 * historically, that is what ext2 has done.)
-	 */
-
-force_commit:
-	err = ext3_force_commit(inode->i_sb);
-	if (err)
-		return err;
-	return ret;
-}
-
 const struct file_operations ext3_file_operations = {
 	.llseek		= generic_file_llseek,
 	.read		= do_sync_read,
 	.write		= do_sync_write,
 	.aio_read	= generic_file_aio_read,
-	.aio_write	= ext3_file_write,
+	.aio_write	= generic_file_aio_write,
 	.unlocked_ioctl	= ext3_ioctl,
 #ifdef CONFIG_COMPAT
 	.compat_ioctl	= ext3_compat_ioctl,
-- 
1.6.0.2


^ permalink raw reply related	[flat|nested] 66+ messages in thread

* [PATCH 10/17] ext4: Remove syncing logic from ext4_file_write
  2009-08-19 16:04 [PATCH 0/17] Make O_SYNC handling use standard syncing path Jan Kara
                   ` (8 preceding siblings ...)
  2009-08-19 16:04 ` [PATCH 09/17] ext3: Remove syncing logic from ext3_file_write Jan Kara
@ 2009-08-19 16:04 ` Jan Kara
  2009-08-19 16:04 ` [PATCH 11/17] fat: Opencode sync_page_range_nolock() Jan Kara
                   ` (7 subsequent siblings)
  17 siblings, 0 replies; 66+ messages in thread
From: Jan Kara @ 2009-08-19 16:04 UTC (permalink / raw)
  To: LKML; +Cc: hch, Jan Kara, linux-ext4, tytso

The syncing is now properly handled by generic_file_aio_write() so
no special ext4 code is needed.

CC: linux-ext4@vger.kernel.org
CC: tytso@mit.edu
Signed-off-by: Jan Kara <jack@suse.cz>
---
 fs/ext4/file.c |   53 ++---------------------------------------------------
 1 files changed, 2 insertions(+), 51 deletions(-)

diff --git a/fs/ext4/file.c b/fs/ext4/file.c
index 3f1873f..aafe432 100644
--- a/fs/ext4/file.c
+++ b/fs/ext4/file.c
@@ -58,10 +58,7 @@ static ssize_t
 ext4_file_write(struct kiocb *iocb, const struct iovec *iov,
 		unsigned long nr_segs, loff_t pos)
 {
-	struct file *file = iocb->ki_filp;
-	struct inode *inode = file->f_path.dentry->d_inode;
-	ssize_t ret;
-	int err;
+	struct inode *inode = iocb->ki_filp->f_path.dentry->d_inode;
 
 	/*
 	 * If we have encountered a bitmap-format file, the size limit
@@ -81,53 +78,7 @@ ext4_file_write(struct kiocb *iocb, const struct iovec *iov,
 		}
 	}
 
-	ret = generic_file_aio_write(iocb, iov, nr_segs, pos);
-	/*
-	 * Skip flushing if there was an error, or if nothing was written.
-	 */
-	if (ret <= 0)
-		return ret;
-
-	/*
-	 * If the inode is IS_SYNC, or is O_SYNC and we are doing data
-	 * journalling then we need to make sure that we force the transaction
-	 * to disk to keep all metadata uptodate synchronously.
-	 */
-	if (file->f_flags & O_SYNC) {
-		/*
-		 * If we are non-data-journaled, then the dirty data has
-		 * already been flushed to backing store by generic_osync_inode,
-		 * and the inode has been flushed too if there have been any
-		 * modifications other than mere timestamp updates.
-		 *
-		 * Open question --- do we care about flushing timestamps too
-		 * if the inode is IS_SYNC?
-		 */
-		if (!ext4_should_journal_data(inode))
-			return ret;
-
-		goto force_commit;
-	}
-
-	/*
-	 * So we know that there has been no forced data flush.  If the inode
-	 * is marked IS_SYNC, we need to force one ourselves.
-	 */
-	if (!IS_SYNC(inode))
-		return ret;
-
-	/*
-	 * Open question #2 --- should we force data to disk here too?  If we
-	 * don't, the only impact is that data=writeback filesystems won't
-	 * flush data to disk automatically on IS_SYNC, only metadata (but
-	 * historically, that is what ext2 has done.)
-	 */
-
-force_commit:
-	err = ext4_force_commit(inode->i_sb);
-	if (err)
-		return err;
-	return ret;
+	return generic_file_aio_write(iocb, iov, nr_segs, pos);
 }
 
 static struct vm_operations_struct ext4_file_vm_ops = {
-- 
1.6.0.2


^ permalink raw reply related	[flat|nested] 66+ messages in thread

* [PATCH 11/17] fat: Opencode sync_page_range_nolock()
  2009-08-19 16:04 [PATCH 0/17] Make O_SYNC handling use standard syncing path Jan Kara
                   ` (9 preceding siblings ...)
  2009-08-19 16:04 ` [PATCH 10/17] ext4: Remove syncing logic from ext4_file_write Jan Kara
@ 2009-08-19 16:04 ` Jan Kara
  2009-08-19 16:04 ` [PATCH 12/17] ntfs: Use new syncing helpers and update comments Jan Kara
                   ` (6 subsequent siblings)
  17 siblings, 0 replies; 66+ messages in thread
From: Jan Kara @ 2009-08-19 16:04 UTC (permalink / raw)
  To: LKML; +Cc: hch, Jan Kara, OGAWA Hirofumi

fat_cont_expand() is the only user of sync_page_range_nolock(). It's also the
only user of generic_osync_inode() which does not have a file open.  So
opencode needed actions for FAT so that we can convert generic_osync_inode() to
a standard syncing path.

Update a comment about generic_osync_inode().

CC: OGAWA Hirofumi <hirofumi@mail.parknet.co.jp>
Signed-off-by: Jan Kara <jack@suse.cz>
---
 fs/fat/file.c |   22 ++++++++++++++++++++--
 fs/fat/misc.c |    4 ++--
 2 files changed, 22 insertions(+), 4 deletions(-)

diff --git a/fs/fat/file.c b/fs/fat/file.c
index f042b96..e8c159d 100644
--- a/fs/fat/file.c
+++ b/fs/fat/file.c
@@ -176,8 +176,26 @@ static int fat_cont_expand(struct inode *inode, loff_t size)
 
 	inode->i_ctime = inode->i_mtime = CURRENT_TIME_SEC;
 	mark_inode_dirty(inode);
-	if (IS_SYNC(inode))
-		err = sync_page_range_nolock(inode, mapping, start, count);
+	if (IS_SYNC(inode)) {
+		int err2;
+
+		/*
+		 * Opencode syncing since we don't have a file open to use
+		 * standard fsync path.
+		 */
+		err = filemap_fdatawrite_range(mapping, start,
+					       start + count - 1);
+		err2 = sync_mapping_buffers(mapping);
+		if (!err)
+			err = err2;
+		err2 = write_inode_now(inode, 1);
+		if (!err)
+			err = err2;
+		if (!err) {
+			err =  filemap_fdatawait_range(mapping, start,
+						       start + count - 1);
+		}
+	}
 out:
 	return err;
 }
diff --git a/fs/fat/misc.c b/fs/fat/misc.c
index a6c2047..4e35be8 100644
--- a/fs/fat/misc.c
+++ b/fs/fat/misc.c
@@ -119,8 +119,8 @@ int fat_chain_add(struct inode *inode, int new_dclus, int nr_cluster)
 		MSDOS_I(inode)->i_start = new_dclus;
 		MSDOS_I(inode)->i_logstart = new_dclus;
 		/*
-		 * Since generic_osync_inode() synchronize later if
-		 * this is not directory, we don't here.
+		 * Since generic_write_sync() synchronizes regular files later,
+		 * we sync here only directories.
 		 */
 		if (S_ISDIR(inode->i_mode) && IS_DIRSYNC(inode)) {
 			ret = fat_sync_inode(inode);
-- 
1.6.0.2


^ permalink raw reply related	[flat|nested] 66+ messages in thread

* [PATCH 12/17] ntfs: Use new syncing helpers and update comments
  2009-08-19 16:04 [PATCH 0/17] Make O_SYNC handling use standard syncing path Jan Kara
                   ` (10 preceding siblings ...)
  2009-08-19 16:04 ` [PATCH 11/17] fat: Opencode sync_page_range_nolock() Jan Kara
@ 2009-08-19 16:04 ` Jan Kara
  2009-08-19 16:04 ` [PATCH 13/17] ocfs2: Update syncing after splicing to match generic version Jan Kara
                   ` (5 subsequent siblings)
  17 siblings, 0 replies; 66+ messages in thread
From: Jan Kara @ 2009-08-19 16:04 UTC (permalink / raw)
  To: LKML; +Cc: hch, Jan Kara, Anton Altaparmakov, linux-ntfs-dev

Use new syncing helpers in .write and .aio_write functions. Also
remove superfluous syncing in ntfs_file_buffered_write() and update
comments about generic_osync_inode().

CC: Anton Altaparmakov <aia21@cantab.net>
CC: linux-ntfs-dev@lists.sourceforge.net
Signed-off-by: Jan Kara <jack@suse.cz>
---
 fs/ntfs/file.c |   16 ++++------------
 fs/ntfs/mft.c  |   13 ++++++-------
 2 files changed, 10 insertions(+), 19 deletions(-)

diff --git a/fs/ntfs/file.c b/fs/ntfs/file.c
index 3140a44..4350d49 100644
--- a/fs/ntfs/file.c
+++ b/fs/ntfs/file.c
@@ -2076,14 +2076,6 @@ err_out:
 	*ppos = pos;
 	if (cached_page)
 		page_cache_release(cached_page);
-	/* For now, when the user asks for O_SYNC, we actually give O_DSYNC. */
-	if (likely(!status)) {
-		if (unlikely((file->f_flags & O_SYNC) || IS_SYNC(vi))) {
-			if (!mapping->a_ops->writepage || !is_sync_kiocb(iocb))
-				status = generic_osync_inode(vi, mapping,
-						OSYNC_METADATA|OSYNC_DATA);
-		}
-  	}
 	pagevec_lru_add_file(&lru_pvec);
 	ntfs_debug("Done.  Returning %s (written 0x%lx, status %li).",
 			written ? "written" : "status", (unsigned long)written,
@@ -2145,8 +2137,8 @@ static ssize_t ntfs_file_aio_write(struct kiocb *iocb, const struct iovec *iov,
 	mutex_lock(&inode->i_mutex);
 	ret = ntfs_file_aio_write_nolock(iocb, iov, nr_segs, &iocb->ki_pos);
 	mutex_unlock(&inode->i_mutex);
-	if (ret > 0 && ((file->f_flags & O_SYNC) || IS_SYNC(inode))) {
-		int err = sync_page_range(inode, mapping, pos, ret);
+	if (ret > 0) {
+		int err = generic_write_sync(file, pos, ret);
 		if (err < 0)
 			ret = err;
 	}
@@ -2173,8 +2165,8 @@ static ssize_t ntfs_file_writev(struct file *file, const struct iovec *iov,
 	if (ret == -EIOCBQUEUED)
 		ret = wait_on_sync_kiocb(&kiocb);
 	mutex_unlock(&inode->i_mutex);
-	if (ret > 0 && ((file->f_flags & O_SYNC) || IS_SYNC(inode))) {
-		int err = sync_page_range(inode, mapping, *ppos - ret, ret);
+	if (ret > 0) {
+		int err = generic_write_sync(file, *ppos - ret, ret);
 		if (err < 0)
 			ret = err;
 	}
diff --git a/fs/ntfs/mft.c b/fs/ntfs/mft.c
index 23bf684..1caa0ef 100644
--- a/fs/ntfs/mft.c
+++ b/fs/ntfs/mft.c
@@ -384,13 +384,12 @@ unm_err_out:
  * it is dirty in the inode meta data rather than the data page cache of the
  * inode, and thus there are no data pages that need writing out.  Therefore, a
  * full mark_inode_dirty() is overkill.  A mark_inode_dirty_sync(), on the
- * other hand, is not sufficient, because I_DIRTY_DATASYNC needs to be set to
- * ensure ->write_inode is called from generic_osync_inode() and this needs to
- * happen or the file data would not necessarily hit the device synchronously,
- * even though the vfs inode has the O_SYNC flag set.  Also, I_DIRTY_DATASYNC
- * simply "feels" better than just I_DIRTY_SYNC, since the file data has not
- * actually hit the block device yet, which is not what I_DIRTY_SYNC on its own
- * would suggest.
+ * other hand, is not sufficient, because ->write_inode needs to be called even
+ * in case of fdatasync. This needs to happen or the file data would not
+ * necessarily hit the device synchronously, even though the vfs inode has the
+ * O_SYNC flag set.  Also, I_DIRTY_DATASYNC simply "feels" better than just
+ * I_DIRTY_SYNC, since the file data has not actually hit the block device yet,
+ * which is not what I_DIRTY_SYNC on its own would suggest.
  */
 void __mark_mft_record_dirty(ntfs_inode *ni)
 {
-- 
1.6.0.2


^ permalink raw reply related	[flat|nested] 66+ messages in thread

* [PATCH 13/17] ocfs2: Update syncing after splicing to match generic version
  2009-08-19 16:04 [PATCH 0/17] Make O_SYNC handling use standard syncing path Jan Kara
                   ` (11 preceding siblings ...)
  2009-08-19 16:04 ` [PATCH 12/17] ntfs: Use new syncing helpers and update comments Jan Kara
@ 2009-08-19 16:04 ` Jan Kara
  2009-08-21  1:36   ` [Ocfs2-devel] " Joel Becker
  2009-08-19 16:04 ` [PATCH 14/17] xfs: Use new syncing helper Jan Kara
                   ` (4 subsequent siblings)
  17 siblings, 1 reply; 66+ messages in thread
From: Jan Kara @ 2009-08-19 16:04 UTC (permalink / raw)
  To: LKML; +Cc: hch, Jan Kara, Joel Becker, ocfs2-devel

Update ocfs2 specific splicing code to use generic syncing helper.

CC: Joel Becker <Joel.Becker@oracle.com>
CC: ocfs2-devel@oss.oracle.com
Signed-off-by: Jan Kara <jack@suse.cz>
---
 fs/ocfs2/file.c |   27 ++++++---------------------
 1 files changed, 6 insertions(+), 21 deletions(-)

diff --git a/fs/ocfs2/file.c b/fs/ocfs2/file.c
index 1c71f0a..bd7fdf8 100644
--- a/fs/ocfs2/file.c
+++ b/fs/ocfs2/file.c
@@ -1990,31 +1990,16 @@ static ssize_t ocfs2_file_splice_write(struct pipe_inode_info *pipe,
 
 	if (ret > 0) {
 		unsigned long nr_pages;
+		int err;
 
-		*ppos += ret;
 		nr_pages = (ret + PAGE_CACHE_SIZE - 1) >> PAGE_CACHE_SHIFT;
 
-		/*
-		 * If file or inode is SYNC and we actually wrote some data,
-		 * sync it.
-		 */
-		if (unlikely((out->f_flags & O_SYNC) || IS_SYNC(inode))) {
-			int err;
-
-			mutex_lock(&inode->i_mutex);
-			err = ocfs2_rw_lock(inode, 1);
-			if (err < 0) {
-				mlog_errno(err);
-			} else {
-				err = generic_osync_inode(inode, mapping,
-						  OSYNC_METADATA|OSYNC_DATA);
-				ocfs2_rw_unlock(inode, 1);
-			}
-			mutex_unlock(&inode->i_mutex);
+		err = generic_write_sync(out, *ppos, ret);
+		if (err)
+			ret = err;
+		else
+			*ppos += ret;
 
-			if (err)
-				ret = err;
-		}
 		balance_dirty_pages_ratelimited_nr(mapping, nr_pages);
 	}
 
-- 
1.6.0.2


^ permalink raw reply related	[flat|nested] 66+ messages in thread

* [PATCH 14/17] xfs: Use new syncing helper
  2009-08-19 16:04 [PATCH 0/17] Make O_SYNC handling use standard syncing path Jan Kara
                   ` (12 preceding siblings ...)
  2009-08-19 16:04 ` [PATCH 13/17] ocfs2: Update syncing after splicing to match generic version Jan Kara
@ 2009-08-19 16:04 ` Jan Kara
  2009-08-19 16:33   ` Christoph Hellwig
  2009-08-19 16:04 ` [PATCH 15/17] pohmelfs: " Jan Kara
                   ` (3 subsequent siblings)
  17 siblings, 1 reply; 66+ messages in thread
From: Jan Kara @ 2009-08-19 16:04 UTC (permalink / raw)
  To: LKML; +Cc: hch, Jan Kara, Felix Blyakher, xfs

Use new generic_write_sync() helper from xfs_write().

CC: Felix Blyakher <felixb@sgi.com>
CC: xfs@oss.sgi.com
Signed-off-by: Jan Kara <jack@suse.cz>
---
 fs/xfs/linux-2.6/xfs_lrw.c |    2 +-
 1 files changed, 1 insertions(+), 1 deletions(-)

diff --git a/fs/xfs/linux-2.6/xfs_lrw.c b/fs/xfs/linux-2.6/xfs_lrw.c
index 7078974..aeb5a39 100644
--- a/fs/xfs/linux-2.6/xfs_lrw.c
+++ b/fs/xfs/linux-2.6/xfs_lrw.c
@@ -817,7 +817,7 @@ write_retry:
 		xfs_iunlock(xip, iolock);
 		if (need_i_mutex)
 			mutex_unlock(&inode->i_mutex);
-		error2 = sync_page_range(inode, mapping, pos, ret);
+		error2 = generic_write_sync(file, pos, ret);
 		if (!error)
 			error = error2;
 		if (need_i_mutex)
-- 
1.6.0.2


^ permalink raw reply related	[flat|nested] 66+ messages in thread

* [PATCH 15/17] pohmelfs: Use new syncing helper
  2009-08-19 16:04 [PATCH 0/17] Make O_SYNC handling use standard syncing path Jan Kara
                   ` (13 preceding siblings ...)
  2009-08-19 16:04 ` [PATCH 14/17] xfs: Use new syncing helper Jan Kara
@ 2009-08-19 16:04 ` Jan Kara
  2009-08-19 16:04 ` [PATCH 16/17] nfs: Remove reference to generic_osync_inode from a comment Jan Kara
                   ` (2 subsequent siblings)
  17 siblings, 0 replies; 66+ messages in thread
From: Jan Kara @ 2009-08-19 16:04 UTC (permalink / raw)
  To: LKML; +Cc: hch, Jan Kara, Evgeniy Polyakov

Use new generic_write_sync() helper instead of sync_page_range().

CC: Evgeniy Polyakov <zbr@ioremap.net>
Signed-off-by: Jan Kara <jack@suse.cz>
---
 drivers/staging/pohmelfs/inode.c |    4 ++--
 1 files changed, 2 insertions(+), 2 deletions(-)

diff --git a/drivers/staging/pohmelfs/inode.c b/drivers/staging/pohmelfs/inode.c
index 17801a5..a2e5eed 100644
--- a/drivers/staging/pohmelfs/inode.c
+++ b/drivers/staging/pohmelfs/inode.c
@@ -927,10 +927,10 @@ ssize_t pohmelfs_write(struct file *file, const char __user *buf,
 	mutex_unlock(&inode->i_mutex);
 	WARN_ON(ret < 0);
 
-	if (ret > 0 && ((file->f_flags & O_SYNC) || IS_SYNC(inode))) {
+	if (ret > 0) {
 		ssize_t err;
 
-		err = sync_page_range(inode, mapping, pos, ret);
+		err = generic_write_sync(file, pos, ret);
 		if (err < 0)
 			ret = err;
 		WARN_ON(ret < 0);
-- 
1.6.0.2


^ permalink raw reply related	[flat|nested] 66+ messages in thread

* [PATCH 16/17] nfs: Remove reference to generic_osync_inode from a comment
  2009-08-19 16:04 [PATCH 0/17] Make O_SYNC handling use standard syncing path Jan Kara
                   ` (14 preceding siblings ...)
  2009-08-19 16:04 ` [PATCH 15/17] pohmelfs: " Jan Kara
@ 2009-08-19 16:04 ` Jan Kara
  2009-08-19 16:04 ` [PATCH 17/17] vfs: Remove generic_osync_inode() and sync_page_range() Jan Kara
       [not found] ` <20090820221221.GA14440@infradead.org>
  17 siblings, 0 replies; 66+ messages in thread
From: Jan Kara @ 2009-08-19 16:04 UTC (permalink / raw)
  To: LKML; +Cc: hch, Jan Kara, linux-nfs, Neil Brown, J. Bruce Fields

generic_file_direct_write() no longer calls generic_osync_inode() so remove the
comment.

CC: linux-nfs@vger.kernel.org
CC: Neil Brown <neilb@suse.de>
CC: "J. Bruce Fields" <bfields@fieldses.org>
Signed-off-by: Jan Kara <jack@suse.cz>
---
 fs/nfs/direct.c |    3 ---
 1 files changed, 0 insertions(+), 3 deletions(-)

diff --git a/fs/nfs/direct.c b/fs/nfs/direct.c
index 489fc01..fd5fcc9 100644
--- a/fs/nfs/direct.c
+++ b/fs/nfs/direct.c
@@ -934,9 +934,6 @@ out:
  * back into its cache.  We let the server do generic write
  * parameter checking and report problems.
  *
- * We also avoid an unnecessary invocation of generic_osync_inode(),
- * as it is fairly meaningless to sync the metadata of an NFS file.
- *
  * We eliminate local atime updates, see direct read above.
  *
  * We avoid unnecessary page cache invalidations for normal cached
-- 
1.6.0.2


^ permalink raw reply related	[flat|nested] 66+ messages in thread

* [PATCH 17/17] vfs: Remove generic_osync_inode() and sync_page_range()
  2009-08-19 16:04 [PATCH 0/17] Make O_SYNC handling use standard syncing path Jan Kara
                   ` (15 preceding siblings ...)
  2009-08-19 16:04 ` [PATCH 16/17] nfs: Remove reference to generic_osync_inode from a comment Jan Kara
@ 2009-08-19 16:04 ` Jan Kara
       [not found] ` <20090820221221.GA14440@infradead.org>
  17 siblings, 0 replies; 66+ messages in thread
From: Jan Kara @ 2009-08-19 16:04 UTC (permalink / raw)
  To: LKML; +Cc: hch, Jan Kara

Remove these two functions since nobody uses them anymore.

Signed-off-by: Jan Kara <jack@suse.cz>
---
 fs/fs-writeback.c         |   54 ---------------------------------------------
 include/linux/fs.h        |    5 ----
 include/linux/writeback.h |    2 -
 mm/filemap.c              |   35 -----------------------------
 4 files changed, 0 insertions(+), 96 deletions(-)

diff --git a/fs/fs-writeback.c b/fs/fs-writeback.c
index c54226b..d8dbef0 100644
--- a/fs/fs-writeback.c
+++ b/fs/fs-writeback.c
@@ -737,57 +737,3 @@ int sync_inode(struct inode *inode, struct writeback_control *wbc)
 	return ret;
 }
 EXPORT_SYMBOL(sync_inode);
-
-/**
- * generic_osync_inode - flush all dirty data for a given inode to disk
- * @inode: inode to write
- * @mapping: the address_space that should be flushed
- * @what:  what to write and wait upon
- *
- * This can be called by file_write functions for files which have the
- * O_SYNC flag set, to flush dirty writes to disk.
- *
- * @what is a bitmask, specifying which part of the inode's data should be
- * written and waited upon.
- *
- *    OSYNC_DATA:     i_mapping's dirty data
- *    OSYNC_METADATA: the buffers at i_mapping->private_list
- *    OSYNC_INODE:    the inode itself
- */
-
-int generic_osync_inode(struct inode *inode, struct address_space *mapping, int what)
-{
-	int err = 0;
-	int need_write_inode_now = 0;
-	int err2;
-
-	if (what & OSYNC_DATA)
-		err = filemap_fdatawrite(mapping);
-	if (what & (OSYNC_METADATA|OSYNC_DATA)) {
-		err2 = sync_mapping_buffers(mapping);
-		if (!err)
-			err = err2;
-	}
-	if (what & OSYNC_DATA) {
-		err2 = filemap_fdatawait(mapping);
-		if (!err)
-			err = err2;
-	}
-
-	spin_lock(&inode_lock);
-	if ((inode->i_state & I_DIRTY) &&
-	    ((what & OSYNC_INODE) || (inode->i_state & I_DIRTY_DATASYNC)))
-		need_write_inode_now = 1;
-	spin_unlock(&inode_lock);
-
-	if (need_write_inode_now) {
-		err2 = write_inode_now(inode, 1);
-		if (!err)
-			err = err2;
-	}
-	else
-		inode_sync_wait(inode);
-
-	return err;
-}
-EXPORT_SYMBOL(generic_osync_inode);
diff --git a/include/linux/fs.h b/include/linux/fs.h
index 0f13049..9c33da7 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -1458,11 +1458,6 @@ int fiemap_check_flags(struct fiemap_extent_info *fieinfo, u32 fs_flags);
 #define DT_SOCK		12
 #define DT_WHT		14
 
-#define OSYNC_METADATA	(1<<0)
-#define OSYNC_DATA	(1<<1)
-#define OSYNC_INODE	(1<<2)
-int generic_osync_inode(struct inode *, struct address_space *, int);
-
 /*
  * This is the "filldir" function type, used by readdir() to let
  * the kernel specify what kind of dirent layout it wants to have.
diff --git a/include/linux/writeback.h b/include/linux/writeback.h
index 46d6064..1446694 100644
--- a/include/linux/writeback.h
+++ b/include/linux/writeback.h
@@ -157,8 +157,6 @@ int write_cache_pages(struct address_space *mapping,
 		      struct writeback_control *wbc, writepage_t writepage,
 		      void *data);
 int do_writepages(struct address_space *mapping, struct writeback_control *wbc);
-int sync_page_range(struct inode *inode, struct address_space *mapping,
-			loff_t pos, loff_t count);
 void set_page_dirty_balance(struct page *page, int page_mkwrite);
 void writeback_set_ratelimit(void);
 
diff --git a/mm/filemap.c b/mm/filemap.c
index a29b35b..1e68a34 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -326,41 +326,6 @@ int filemap_fdatawait_range(struct address_space *mapping, loff_t start,
 EXPORT_SYMBOL(filemap_fdatawait_range);
 
 /**
- * sync_page_range - write and wait on all pages in the passed range
- * @inode:	target inode
- * @mapping:	target address_space
- * @pos:	beginning offset in pages to write
- * @count:	number of bytes to write
- *
- * Write and wait upon all the pages in the passed range.  This is a "data
- * integrity" operation.  It waits upon in-flight writeout before starting and
- * waiting upon new writeout.  If there was an IO error, return it.
- *
- * We need to re-take i_mutex during the generic_osync_inode list walk because
- * it is otherwise livelockable.
- */
-int sync_page_range(struct inode *inode, struct address_space *mapping,
-			loff_t pos, loff_t count)
-{
-	pgoff_t start = pos >> PAGE_CACHE_SHIFT;
-	pgoff_t end = (pos + count - 1) >> PAGE_CACHE_SHIFT;
-	int ret;
-
-	if (!mapping_cap_writeback_dirty(mapping) || !count)
-		return 0;
-	ret = filemap_fdatawrite_range(mapping, pos, pos + count - 1);
-	if (ret == 0) {
-		mutex_lock(&inode->i_mutex);
-		ret = generic_osync_inode(inode, mapping, OSYNC_METADATA);
-		mutex_unlock(&inode->i_mutex);
-	}
-	if (ret == 0)
-		ret = wait_on_page_writeback_range(mapping, start, end);
-	return ret;
-}
-EXPORT_SYMBOL(sync_page_range);
-
-/**
  * filemap_fdatawait - wait for all under-writeback pages to complete
  * @mapping: address space structure to wait for
  *
-- 
1.6.0.2


^ permalink raw reply related	[flat|nested] 66+ messages in thread

* Re: [PATCH 01/17] vfs: Introduce filemap_fdatawait_range
  2009-08-19 16:04 ` [PATCH 01/17] vfs: Introduce filemap_fdatawait_range Jan Kara
@ 2009-08-19 16:10   ` Christoph Hellwig
  0 siblings, 0 replies; 66+ messages in thread
From: Christoph Hellwig @ 2009-08-19 16:10 UTC (permalink / raw)
  To: Jan Kara; +Cc: LKML, hch

On Wed, Aug 19, 2009 at 06:04:28PM +0200, Jan Kara wrote:
> This simple helper saves some filesystems conversion from byte offset
> to page numbers and also makes the fdata* interface more complete.

Looks good to me,


Reviewed-by: Christoph Hellwig <hch@lst.de>


^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: [PATCH 02/17] vfs: Export __generic_file_aio_write() and add some comments
  2009-08-19 16:04 ` [PATCH 02/17] vfs: Export __generic_file_aio_write() and add some comments Jan Kara
@ 2009-08-19 16:11   ` Christoph Hellwig
  2009-08-20 12:04     ` Jan Kara
  2009-08-19 20:22   ` Evgeniy Polyakov
  1 sibling, 1 reply; 66+ messages in thread
From: Christoph Hellwig @ 2009-08-19 16:11 UTC (permalink / raw)
  To: Jan Kara; +Cc: LKML, hch, Evgeniy Polyakov, ocfs2-devel, Joel Becker

On Wed, Aug 19, 2009 at 06:04:29PM +0200, Jan Kara wrote:
> Rename __generic_file_aio_write_nolock() to __generic_file_aio_write(), add
> comments to write helpers explaining how they should be used and export
> __generic_file_aio_write() since it will be used by some filesystems.

The renaming and commenting looks good to me, and the export is fine
with me given that we'll actually use it later one.


Reviewed-by: Christoph Hellwig <hch@lst.de>

Can we maybe get rid of generic_file_aio_write_nolock after you patch
series?


^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: [PATCH 03/17] vfs: Remove syncing from generic_file_direct_write() and generic_file_buffered_write()
  2009-08-19 16:04 ` [PATCH 03/17] vfs: Remove syncing from generic_file_direct_write() and generic_file_buffered_write() Jan Kara
@ 2009-08-19 16:18   ` Christoph Hellwig
  2009-08-20 13:31     ` Jan Kara
  0 siblings, 1 reply; 66+ messages in thread
From: Christoph Hellwig @ 2009-08-19 16:18 UTC (permalink / raw)
  To: Jan Kara; +Cc: LKML, hch, ocfs2-devel, Joel Becker, Felix Blyakher, xfs

On Wed, Aug 19, 2009 at 06:04:30PM +0200, Jan Kara wrote:
> generic_file_direct_write() and generic_file_buffered_write() called
> generic_osync_inode() if it was called on O_SYNC file or IS_SYNC inode. But
> this is superfluous since generic_file_aio_write() does the syncing as well.
> Also XFS and OCFS2 which call these functions directly handle syncing
> themselves. So let's have a single place where syncing happens:
> generic_file_aio_write().

Yeah, this is something that never made any sense to me.

> @@ -2187,20 +2187,7 @@ generic_file_direct_write(struct kiocb *iocb, const struct iovec *iov,
>  		}
>  		*ppos = end;
>  	}
> -
> -	/*
> -	 * Sync the fs metadata but not the minor inode changes and
> -	 * of course not the data as we did direct DMA for the IO.
> -	 * i_mutex is held, which protects generic_osync_inode() from
> -	 * livelocking.  AIO O_DIRECT ops attempt to sync metadata here.
> -	 */
>  out:
> -	if ((written >= 0 || written == -EIOCBQUEUED) &&
> -	    ((file->f_flags & O_SYNC) || IS_SYNC(inode))) {
> -		int err = generic_osync_inode(inode, mapping, OSYNC_METADATA);
> -		if (err < 0)
> -			written = err;
> -	}
>  	return written;

Here we check (written >= 0 || written == -EIOCBQUEUED), but
generic_file_aio_write only cares about positive return values.  We
defintively do have a change here for partial AIO requests.  The
question is if the previous behaviour made in sense.  If do have an
O_SYNC aio+dio request we would have to flush out the metadata after the
request has completed and not here.

> @@ -2343,16 +2328,6 @@ generic_file_buffered_write(struct kiocb *iocb, const struct iovec *iov,
>  	if (likely(status >= 0)) {
>  		written += status;
>  		*ppos = pos + status;
> -
> -		/*
> -		 * For now, when the user asks for O_SYNC, we'll actually give
> -		 * O_DSYNC
> -		 */
> -		if (unlikely((file->f_flags & O_SYNC) || IS_SYNC(inode))) {
> -			if (!a_ops->writepage || !is_sync_kiocb(iocb))
> -				status = generic_osync_inode(inode, mapping,
> -						OSYNC_METADATA|OSYNC_DATA);
> -		}
>    	}

No problem with -EIOCBQUEUED here, but we change from doing
generic_osync_inode with OSYNC_DATA which does a full writeout of the
data to sync_page_range which only does the range writeout here.  That
should be fine (as we only need to sync that range), but should probably
be documented in the patch description.

^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: [PATCH 06/17] vfs: Remove sync_page_range_nolock
  2009-08-19 16:04 ` [PATCH 06/17] vfs: Remove sync_page_range_nolock Jan Kara
@ 2009-08-19 16:21   ` Christoph Hellwig
  0 siblings, 0 replies; 66+ messages in thread
From: Christoph Hellwig @ 2009-08-19 16:21 UTC (permalink / raw)
  To: Jan Kara; +Cc: LKML, hch

On Wed, Aug 19, 2009 at 06:04:33PM +0200, Jan Kara wrote:
> The last user of sync_page_range_nolock() is generic_file_aio_write_nolock().
> Now we have converted all its callers to not hold i_mutex and so we can
> afford to call sync_page_range() instead of sync_page_range_nolock() from
> there. This non-necessarily acquires i_mutex when syncing block devices but
> that's happening in fsync() path as well and block_fsync() may just drop +
> reacquire i_mutex if we care enough.

Looks good to me, but to stop people from using
generic_file_aio_write_nolock accidentally I would rename it to
blkdev_aio_write, move it to fs/block_dev.c and stop exporting it.  That
way any out of tree or in the merge queue filesystems is forced to use
the appropinquate __generic_file_aio_write.


^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: [PATCH 07/17] vfs: Introduce new helpers for syncing after writing to O_SYNC file or IS_SYNC inode
  2009-08-19 16:04 ` [PATCH 07/17] vfs: Introduce new helpers for syncing after writing to O_SYNC file or IS_SYNC inode Jan Kara
@ 2009-08-19 16:26   ` Christoph Hellwig
  2009-08-20 12:15     ` Jan Kara
  0 siblings, 1 reply; 66+ messages in thread
From: Christoph Hellwig @ 2009-08-19 16:26 UTC (permalink / raw)
  To: Jan Kara
  Cc: LKML, hch, Evgeniy Polyakov, ocfs2-devel, Joel Becker,
	Felix Blyakher, xfs, Anton Altaparmakov, linux-ntfs-dev,
	OGAWA Hirofumi, linux-ext4, tytso

Looks good to me.  Eventually we should use those SYNC_ flags also all
through the fsync codepath, but I'll see if I can incorporate that in my
planned fsync rewrite.

^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: [PATCH 14/17] xfs: Use new syncing helper
  2009-08-19 16:04 ` [PATCH 14/17] xfs: Use new syncing helper Jan Kara
@ 2009-08-19 16:33   ` Christoph Hellwig
  2009-08-20 12:22     ` Jan Kara
  0 siblings, 1 reply; 66+ messages in thread
From: Christoph Hellwig @ 2009-08-19 16:33 UTC (permalink / raw)
  To: Jan Kara; +Cc: LKML, hch, Felix Blyakher, xfs

On Wed, Aug 19, 2009 at 06:04:41PM +0200, Jan Kara wrote:
> diff --git a/fs/xfs/linux-2.6/xfs_lrw.c b/fs/xfs/linux-2.6/xfs_lrw.c
> index 7078974..aeb5a39 100644
> --- a/fs/xfs/linux-2.6/xfs_lrw.c
> +++ b/fs/xfs/linux-2.6/xfs_lrw.c
> @@ -817,7 +817,7 @@ write_retry:
>  		xfs_iunlock(xip, iolock);
>  		if (need_i_mutex)
>  			mutex_unlock(&inode->i_mutex);
> -		error2 = sync_page_range(inode, mapping, pos, ret);
> +		error2 = generic_write_sync(file, pos, ret);

I don't think this is very optimal for XFS.  This first starts an
asynchronous writeout of the range in generic_write_sync,
then calls into ->fsync which waits for all I/O on the file to finish,
then forces the log inside xfs_fsync,  then waits for the range again in
generic_write_sync, and after this code calls into
xfs_write_sync_logforce which forces to log _again_.

We should be fine just doing your new filemap_fdatawrite_range helper
here and let xfs_write_sync_logforce do the work.  Long term we should
just get rid of this stupid submit writes before syncing the iode, then
wait crap which is a pain for all modern filesystems.

^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: [PATCH 02/17] vfs: Export __generic_file_aio_write() and add some comments
  2009-08-19 16:04 ` [PATCH 02/17] vfs: Export __generic_file_aio_write() and add some comments Jan Kara
  2009-08-19 16:11   ` Christoph Hellwig
@ 2009-08-19 20:22   ` Evgeniy Polyakov
  2009-08-20 12:31     ` Jan Kara
  1 sibling, 1 reply; 66+ messages in thread
From: Evgeniy Polyakov @ 2009-08-19 20:22 UTC (permalink / raw)
  To: Jan Kara; +Cc: LKML, hch, ocfs2-devel, Joel Becker

Hi.

On Wed, Aug 19, 2009 at 06:04:29PM +0200, Jan Kara (jack@suse.cz) wrote:
> Rename __generic_file_aio_write_nolock() to __generic_file_aio_write(), add
> comments to write helpers explaining how they should be used and export
> __generic_file_aio_write() since it will be used by some filesystems.

Looks good to me. And comments are definitely good.
Ack.

-- 
	Evgeniy Polyakov

^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: [PATCH 02/17] vfs: Export __generic_file_aio_write() and add some comments
  2009-08-19 16:11   ` Christoph Hellwig
@ 2009-08-20 12:04     ` Jan Kara
  0 siblings, 0 replies; 66+ messages in thread
From: Jan Kara @ 2009-08-20 12:04 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Jan Kara, LKML, Evgeniy Polyakov, ocfs2-devel, Joel Becker

On Wed 19-08-09 12:11:45, Christoph Hellwig wrote:
> On Wed, Aug 19, 2009 at 06:04:29PM +0200, Jan Kara wrote:
> > Rename __generic_file_aio_write_nolock() to __generic_file_aio_write(), add
> > comments to write helpers explaining how they should be used and export
> > __generic_file_aio_write() since it will be used by some filesystems.
> 
> The renaming and commenting looks good to me, and the export is fine
> with me given that we'll actually use it later one.
> 
> 
> Reviewed-by: Christoph Hellwig <hch@lst.de>
> 
> Can we maybe get rid of generic_file_aio_write_nolock after you patch
> series?
  In principle we could. It's just that block devices don't need i_mutex
for serializing writes so I found it dumb to serialize writes to the block
device needlessly. But we could move generic_file_aio_write_nolock to
blockdev.c and name it like blockdev_aio_write() since noone else uses
it after my series. I guess I'll do that.

								Honza
-- 
Jan Kara <jack@suse.cz>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: [PATCH 07/17] vfs: Introduce new helpers for syncing after writing to O_SYNC file or IS_SYNC inode
  2009-08-19 16:26   ` Christoph Hellwig
@ 2009-08-20 12:15     ` Jan Kara
  2009-08-20 16:27       ` Christoph Hellwig
  0 siblings, 1 reply; 66+ messages in thread
From: Jan Kara @ 2009-08-20 12:15 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Jan Kara, LKML, Evgeniy Polyakov, ocfs2-devel, Joel Becker,
	Felix Blyakher, xfs, Anton Altaparmakov, linux-ntfs-dev,
	OGAWA Hirofumi, linux-ext4, tytso

On Wed 19-08-09 12:26:38, Christoph Hellwig wrote:
> Looks good to me.  Eventually we should use those SYNC_ flags also all
> through the fsync codepath, but I'll see if I can incorporate that in my
> planned fsync rewrite.
  Yes, I thought I'll leave that for later. BTW it should be fairly easy to
teach generic_sync_file() to do fdatawait() before calling ->fsync() if the
filesystem sets some flag in inode->i_mapping (or somewhere else) as is
needed for XFS, btrfs, etc.

								Honza
-- 
Jan Kara <jack@suse.cz>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: [PATCH 14/17] xfs: Use new syncing helper
  2009-08-19 16:33   ` Christoph Hellwig
@ 2009-08-20 12:22     ` Jan Kara
  0 siblings, 0 replies; 66+ messages in thread
From: Jan Kara @ 2009-08-20 12:22 UTC (permalink / raw)
  To: Christoph Hellwig; +Cc: Jan Kara, LKML, Felix Blyakher, xfs

On Wed 19-08-09 12:33:15, Christoph Hellwig wrote:
> On Wed, Aug 19, 2009 at 06:04:41PM +0200, Jan Kara wrote:
> > diff --git a/fs/xfs/linux-2.6/xfs_lrw.c b/fs/xfs/linux-2.6/xfs_lrw.c
> > index 7078974..aeb5a39 100644
> > --- a/fs/xfs/linux-2.6/xfs_lrw.c
> > +++ b/fs/xfs/linux-2.6/xfs_lrw.c
> > @@ -817,7 +817,7 @@ write_retry:
> >  		xfs_iunlock(xip, iolock);
> >  		if (need_i_mutex)
> >  			mutex_unlock(&inode->i_mutex);
> > -		error2 = sync_page_range(inode, mapping, pos, ret);
> > +		error2 = generic_write_sync(file, pos, ret);
> 
> I don't think this is very optimal for XFS.  This first starts an
> asynchronous writeout of the range in generic_write_sync,
> then calls into ->fsync which waits for all I/O on the file to finish,
> then forces the log inside xfs_fsync,  then waits for the range again in
> generic_write_sync, and after this code calls into
> xfs_write_sync_logforce which forces to log _again_.
> 
> We should be fine just doing your new filemap_fdatawrite_range helper
> here and let xfs_write_sync_logforce do the work.  Long term we should
> just get rid of this stupid submit writes before syncing the iode, then
> wait crap which is a pain for all modern filesystems.
  OK, I'll happily change this. I was just doing the straightforward
conversion and sync_page_range() essentially did what generic_write_sync()
does since it called fdatawrite(), write_inode_now(inode, 1) (here XFS was
forced to wait for pages and force the log), fdatawait().

								Honza
-- 
Jan Kara <jack@suse.cz>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: [PATCH 02/17] vfs: Export __generic_file_aio_write() and add some comments
  2009-08-19 20:22   ` Evgeniy Polyakov
@ 2009-08-20 12:31     ` Jan Kara
  2009-08-20 13:30       ` Evgeniy Polyakov
  0 siblings, 1 reply; 66+ messages in thread
From: Jan Kara @ 2009-08-20 12:31 UTC (permalink / raw)
  To: Evgeniy Polyakov; +Cc: Jan Kara, LKML, hch, ocfs2-devel, Joel Becker

On Thu 20-08-09 00:22:08, Evgeniy Polyakov wrote:
> Hi.
> 
> On Wed, Aug 19, 2009 at 06:04:29PM +0200, Jan Kara (jack@suse.cz) wrote:
> > Rename __generic_file_aio_write_nolock() to __generic_file_aio_write(), add
> > comments to write helpers explaining how they should be used and export
> > __generic_file_aio_write() since it will be used by some filesystems.
> 
> Looks good to me. And comments are definitely good.
> Ack.
  Thanks for the review. BTW: I've CCed you mainly due to pohmelfs changes.
Will you have time to look at them?

									Honza
-- 
Jan Kara <jack@suse.cz>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: [PATCH 02/17] vfs: Export __generic_file_aio_write() and add some comments
  2009-08-20 12:31     ` Jan Kara
@ 2009-08-20 13:30       ` Evgeniy Polyakov
  2009-08-20 13:52         ` Jan Kara
  0 siblings, 1 reply; 66+ messages in thread
From: Evgeniy Polyakov @ 2009-08-20 13:30 UTC (permalink / raw)
  To: Jan Kara; +Cc: LKML, hch, ocfs2-devel, Joel Becker

On Thu, Aug 20, 2009 at 02:31:37PM +0200, Jan Kara (jack@suse.cz) wrote:
>   Thanks for the review. BTW: I've CCed you mainly due to pohmelfs changes.
> Will you have time to look at them?

The one where you replaced one name with another? :)
Yes, I did.

-- 
	Evgeniy Polyakov

^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: [PATCH 03/17] vfs: Remove syncing from generic_file_direct_write() and generic_file_buffered_write()
  2009-08-19 16:18   ` Christoph Hellwig
@ 2009-08-20 13:31     ` Jan Kara
  0 siblings, 0 replies; 66+ messages in thread
From: Jan Kara @ 2009-08-20 13:31 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Jan Kara, LKML, ocfs2-devel, Joel Becker, Felix Blyakher, xfs

On Wed 19-08-09 12:18:53, Christoph Hellwig wrote:
> On Wed, Aug 19, 2009 at 06:04:30PM +0200, Jan Kara wrote:
> > generic_file_direct_write() and generic_file_buffered_write() called
> > generic_osync_inode() if it was called on O_SYNC file or IS_SYNC inode. But
> > this is superfluous since generic_file_aio_write() does the syncing as well.
> > Also XFS and OCFS2 which call these functions directly handle syncing
> > themselves. So let's have a single place where syncing happens:
> > generic_file_aio_write().
> 
> Yeah, this is something that never made any sense to me.
> 
> > @@ -2187,20 +2187,7 @@ generic_file_direct_write(struct kiocb *iocb, const struct iovec *iov,
> >  		}
> >  		*ppos = end;
> >  	}
> > -
> > -	/*
> > -	 * Sync the fs metadata but not the minor inode changes and
> > -	 * of course not the data as we did direct DMA for the IO.
> > -	 * i_mutex is held, which protects generic_osync_inode() from
> > -	 * livelocking.  AIO O_DIRECT ops attempt to sync metadata here.
> > -	 */
> >  out:
> > -	if ((written >= 0 || written == -EIOCBQUEUED) &&
> > -	    ((file->f_flags & O_SYNC) || IS_SYNC(inode))) {
> > -		int err = generic_osync_inode(inode, mapping, OSYNC_METADATA);
> > -		if (err < 0)
> > -			written = err;
> > -	}
> >  	return written;
> 
> Here we check (written >= 0 || written == -EIOCBQUEUED), but
> generic_file_aio_write only cares about positive return values.  We
> defintively do have a change here for partial AIO requests.
  Ah, that's a good point.

> The question is if the previous behaviour made in sense.  If do have an
> O_SYNC aio+dio request we would have to flush out the metadata after the
> request has completed and not here.
  Yes, that would be a correct behavior but I see no good way of doing
that. Flushing on EIOCBQUEUED mostly works for simple filesystems like
ext[23] where we don't do anything from end_io callback. The only risk
is if we crash after the transaction commits but before the direct IO is
done (we could expose stale data) but I'm not sure that is an issue with
real HW.
  Anyway, the question is what we should do about it.  For now, I'd call
generic_write_sync() even in case EIOCBQUEUED is returned to preserve old
behavior (EIOCBQUEUED does not seem to be returned from buffered write path
so we really just preserve the old behavior).

> > @@ -2343,16 +2328,6 @@ generic_file_buffered_write(struct kiocb *iocb, const struct iovec *iov,
> >  	if (likely(status >= 0)) {
> >  		written += status;
> >  		*ppos = pos + status;
> > -
> > -		/*
> > -		 * For now, when the user asks for O_SYNC, we'll actually give
> > -		 * O_DSYNC
> > -		 */
> > -		if (unlikely((file->f_flags & O_SYNC) || IS_SYNC(inode))) {
> > -			if (!a_ops->writepage || !is_sync_kiocb(iocb))
> > -				status = generic_osync_inode(inode, mapping,
> > -						OSYNC_METADATA|OSYNC_DATA);
> > -		}
> >    	}
> 
> No problem with -EIOCBQUEUED here, but we change from doing
> generic_osync_inode with OSYNC_DATA which does a full writeout of the
> data to sync_page_range which only does the range writeout here.  That
> should be fine (as we only need to sync that range), but should probably
> be documented in the patch description.
  Yes, will do.

									Honza
-- 
Jan Kara <jack@suse.cz>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: [PATCH 02/17] vfs: Export __generic_file_aio_write() and add some comments
  2009-08-20 13:30       ` Evgeniy Polyakov
@ 2009-08-20 13:52         ` Jan Kara
  2009-08-20 13:58           ` Evgeniy Polyakov
  0 siblings, 1 reply; 66+ messages in thread
From: Jan Kara @ 2009-08-20 13:52 UTC (permalink / raw)
  To: Evgeniy Polyakov; +Cc: Jan Kara, LKML, hch, ocfs2-devel, Joel Becker

On Thu 20-08-09 17:30:28, Evgeniy Polyakov wrote:
> On Thu, Aug 20, 2009 at 02:31:37PM +0200, Jan Kara (jack@suse.cz) wrote:
> >   Thanks for the review. BTW: I've CCed you mainly due to pohmelfs changes.
> > Will you have time to look at them?
> 
> The one where you replaced one name with another? :)
> Yes, I did.
  Well, there were two patches. Both were similarly simple but there could
be some subtleties I've missed :). So can I add your Acked-by also to those
two pohmelfs patches?

									Honza
-- 
Jan Kara <jack@suse.cz>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: [PATCH 02/17] vfs: Export __generic_file_aio_write() and add some comments
  2009-08-20 13:52         ` Jan Kara
@ 2009-08-20 13:58           ` Evgeniy Polyakov
  0 siblings, 0 replies; 66+ messages in thread
From: Evgeniy Polyakov @ 2009-08-20 13:58 UTC (permalink / raw)
  To: Jan Kara; +Cc: LKML, hch, ocfs2-devel, Joel Becker

On Thu, Aug 20, 2009 at 03:52:45PM +0200, Jan Kara (jack@suse.cz) wrote:
>   Well, there were two patches. Both were similarly simple but there could

As of sync behaviour - pohmelfs will follow the usual VFS behaviour, so
if you managed to break this, we still will follow the mainstream :)

> be some subtleties I've missed :). So can I add your Acked-by also to those
> two pohmelfs patches?

Sure.

-- 
	Evgeniy Polyakov

^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: [PATCH 07/17] vfs: Introduce new helpers for syncing after writing to O_SYNC file or IS_SYNC inode
  2009-08-20 12:15     ` Jan Kara
@ 2009-08-20 16:27       ` Christoph Hellwig
  2009-08-21 15:23         ` Jan Kara
  2009-08-26 18:22         ` Christoph Hellwig
  0 siblings, 2 replies; 66+ messages in thread
From: Christoph Hellwig @ 2009-08-20 16:27 UTC (permalink / raw)
  To: Jan Kara
  Cc: Christoph Hellwig, LKML, Evgeniy Polyakov, ocfs2-devel,
	Joel Becker, Felix Blyakher, xfs, Anton Altaparmakov,
	linux-ntfs-dev, OGAWA Hirofumi, linux-ext4, tytso

On Thu, Aug 20, 2009 at 02:15:31PM +0200, Jan Kara wrote:
> On Wed 19-08-09 12:26:38, Christoph Hellwig wrote:
> > Looks good to me.  Eventually we should use those SYNC_ flags also all
> > through the fsync codepath, but I'll see if I can incorporate that in my
> > planned fsync rewrite.
>   Yes, I thought I'll leave that for later. BTW it should be fairly easy to
> teach generic_sync_file() to do fdatawait() before calling ->fsync() if the
> filesystem sets some flag in inode->i_mapping (or somewhere else) as is
> needed for XFS, btrfs, etc.

Maybe you can help brain storming, but I still can't see any way in that
the

  - write data
  - write inode
  - wait for data

actually is a benefit in terms of semantics (I agree that it could be
faster in theory, but even that is debatable with todays seek latencies
in disks)

Think about a simple non-journaling filesystem like ext2:

 (1) block get allocated during ->write before putting data in
      - this dirties the inode because we update i_block/i_size/etc
 (2) we call fsync (or the O_SNC handling code for that matter)
      - we start writeout of the data, which takes forever because the
	file is very large
      - then we write out the inode, including the i_size/i_blocks
	update
      - due to some reason this gets reordered before the data writeout
	finishes (without that happening there would be no benefit to
	this ordering anyway)
 (3) no we call filemap_fdatawait to wait for data I/O to finish


Now the system crashes between (2) and (3).  After that we we do have
stale data in the inode in the area not written yet.

Is there some case between that simple filesystem and the i_size update
from I/O completion handler in XFS/ext4 where this behaviour actually
buys us anything?  Any ext3 magic maybe?

^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: [Ocfs2-devel] [PATCH 13/17] ocfs2: Update syncing after splicing to match generic version
  2009-08-19 16:04 ` [PATCH 13/17] ocfs2: Update syncing after splicing to match generic version Jan Kara
@ 2009-08-21  1:36   ` Joel Becker
  2009-08-21 14:30     ` Jan Kara
  0 siblings, 1 reply; 66+ messages in thread
From: Joel Becker @ 2009-08-21  1:36 UTC (permalink / raw)
  To: Jan Kara; +Cc: LKML, hch, ocfs2-devel, mfasheh

On Wed, Aug 19, 2009 at 06:04:40PM +0200, Jan Kara wrote:
> Update ocfs2 specific splicing code to use generic syncing helper.
> 
> CC: Joel Becker <Joel.Becker@oracle.com>
> CC: ocfs2-devel@oss.oracle.com
> Signed-off-by: Jan Kara <jack@suse.cz>
> ---
>  fs/ocfs2/file.c |   27 ++++++---------------------
>  1 files changed, 6 insertions(+), 21 deletions(-)
> 
> diff --git a/fs/ocfs2/file.c b/fs/ocfs2/file.c
> index 1c71f0a..bd7fdf8 100644
> --- a/fs/ocfs2/file.c
> +++ b/fs/ocfs2/file.c
> @@ -1990,31 +1990,16 @@ static ssize_t ocfs2_file_splice_write(struct pipe_inode_info *pipe,
>  
>  	if (ret > 0) {
>  		unsigned long nr_pages;
> +		int err;
>  
> -		*ppos += ret;
>  		nr_pages = (ret + PAGE_CACHE_SIZE - 1) >> PAGE_CACHE_SHIFT;
>  
> -		/*
> -		 * If file or inode is SYNC and we actually wrote some data,
> -		 * sync it.
> -		 */
> -		if (unlikely((out->f_flags & O_SYNC) || IS_SYNC(inode))) {
> -			int err;
> -
> -			mutex_lock(&inode->i_mutex);
> -			err = ocfs2_rw_lock(inode, 1);
> -			if (err < 0) {
> -				mlog_errno(err);
> -			} else {
> -				err = generic_osync_inode(inode, mapping,
> -						  OSYNC_METADATA|OSYNC_DATA);
> -				ocfs2_rw_unlock(inode, 1);
> -			}
> -			mutex_unlock(&inode->i_mutex);
> +		err = generic_write_sync(out, *ppos, ret);
> +		if (err)
> +			ret = err;
> +		else
> +			*ppos += ret;

	You've removed the rw_lock around the sync.  Any reason why?

Joel

-- 

"A narcissist is someone better looking than you are."  
         - Gore Vidal

Joel Becker
Principal Software Developer
Oracle
E-mail: joel.becker@oracle.com
Phone: (650) 506-8127

^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: [Ocfs2-devel] [PATCH 13/17] ocfs2: Update syncing after splicing to match generic version
  2009-08-21  1:36   ` [Ocfs2-devel] " Joel Becker
@ 2009-08-21 14:30     ` Jan Kara
  0 siblings, 0 replies; 66+ messages in thread
From: Jan Kara @ 2009-08-21 14:30 UTC (permalink / raw)
  To: Joel Becker; +Cc: Jan Kara, LKML, hch, ocfs2-devel, mfasheh

On Thu 20-08-09 18:36:17, Joel Becker wrote:
> On Wed, Aug 19, 2009 at 06:04:40PM +0200, Jan Kara wrote:
> > Update ocfs2 specific splicing code to use generic syncing helper.
> > 
> > CC: Joel Becker <Joel.Becker@oracle.com>
> > CC: ocfs2-devel@oss.oracle.com
> > Signed-off-by: Jan Kara <jack@suse.cz>
> > ---
> >  fs/ocfs2/file.c |   27 ++++++---------------------
> >  1 files changed, 6 insertions(+), 21 deletions(-)
> > 
> > diff --git a/fs/ocfs2/file.c b/fs/ocfs2/file.c
> > index 1c71f0a..bd7fdf8 100644
> > --- a/fs/ocfs2/file.c
> > +++ b/fs/ocfs2/file.c
> > @@ -1990,31 +1990,16 @@ static ssize_t ocfs2_file_splice_write(struct pipe_inode_info *pipe,
> >  
> >  	if (ret > 0) {
> >  		unsigned long nr_pages;
> > +		int err;
> >  
> > -		*ppos += ret;
> >  		nr_pages = (ret + PAGE_CACHE_SIZE - 1) >> PAGE_CACHE_SHIFT;
> >  
> > -		/*
> > -		 * If file or inode is SYNC and we actually wrote some data,
> > -		 * sync it.
> > -		 */
> > -		if (unlikely((out->f_flags & O_SYNC) || IS_SYNC(inode))) {
> > -			int err;
> > -
> > -			mutex_lock(&inode->i_mutex);
> > -			err = ocfs2_rw_lock(inode, 1);
> > -			if (err < 0) {
> > -				mlog_errno(err);
> > -			} else {
> > -				err = generic_osync_inode(inode, mapping,
> > -						  OSYNC_METADATA|OSYNC_DATA);
> > -				ocfs2_rw_unlock(inode, 1);
> > -			}
> > -			mutex_unlock(&inode->i_mutex);
> > +		err = generic_write_sync(out, *ppos, ret);
> > +		if (err)
> > +			ret = err;
> > +		else
> > +			*ppos += ret;
> 
> 	You've removed the rw_lock around the sync.  Any reason why?
  Ah, I should have written in the changelog: generic_write_sync() will
acquire i_mutex so to preserve lock ordering (i_mutex -> rw_lock) we cannot
hold it while calling generic_write_sync(). Furthermore, I didn't see point
for holding it while calling generic_write_sync() since we don't hold it
in fsync() path either.
  I'll add something like this to the changelog.

									Honza

-- 
Jan Kara <jack@suse.cz>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: [PATCH 07/17] vfs: Introduce new helpers for syncing after writing to O_SYNC file or IS_SYNC inode
  2009-08-20 16:27       ` Christoph Hellwig
@ 2009-08-21 15:23         ` Jan Kara
  2009-08-21 15:32           ` Christoph Hellwig
  2009-08-26 18:22         ` Christoph Hellwig
  1 sibling, 1 reply; 66+ messages in thread
From: Jan Kara @ 2009-08-21 15:23 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Jan Kara, LKML, Evgeniy Polyakov, ocfs2-devel, Joel Becker,
	Felix Blyakher, xfs, Anton Altaparmakov, linux-ntfs-dev,
	OGAWA Hirofumi, linux-ext4, tytso

On Thu 20-08-09 12:27:29, Christoph Hellwig wrote:
> On Thu, Aug 20, 2009 at 02:15:31PM +0200, Jan Kara wrote:
> > On Wed 19-08-09 12:26:38, Christoph Hellwig wrote:
> > > Looks good to me.  Eventually we should use those SYNC_ flags also all
> > > through the fsync codepath, but I'll see if I can incorporate that in my
> > > planned fsync rewrite.
> >   Yes, I thought I'll leave that for later. BTW it should be fairly easy to
> > teach generic_sync_file() to do fdatawait() before calling ->fsync() if the
> > filesystem sets some flag in inode->i_mapping (or somewhere else) as is
> > needed for XFS, btrfs, etc.
> 
> Maybe you can help brain storming, but I still can't see any way in that
> the
> 
>   - write data
>   - write inode
>   - wait for data
> 
> actually is a benefit in terms of semantics (I agree that it could be
> faster in theory, but even that is debatable with todays seek latencies
> in disks)
> 
> Think about a simple non-journaling filesystem like ext2:
> 
>  (1) block get allocated during ->write before putting data in
>       - this dirties the inode because we update i_block/i_size/etc
>  (2) we call fsync (or the O_SNC handling code for that matter)
>       - we start writeout of the data, which takes forever because the
> 	file is very large
>       - then we write out the inode, including the i_size/i_blocks
> 	update
>       - due to some reason this gets reordered before the data writeout
> 	finishes (without that happening there would be no benefit to
> 	this ordering anyway)
>  (3) no we call filemap_fdatawait to wait for data I/O to finish
> 
> Now the system crashes between (2) and (3).  After that we we do have
> stale data in the inode in the area not written yet.
  Yes, that's true.

> Is there some case between that simple filesystem and the i_size update
> from I/O completion handler in XFS/ext4 where this behaviour actually
> buys us anything?  Any ext3 magic maybe?
  Hmm, I can imagine it would buy us something in two cases (but looking at
the code, neither is implemented in such a way that it would really help
us in any way):
1) when an inode and it's data are stored in one block (e.g. OCFS2 or UDF) do
   this.
2) when we journal data
  In the first case we would wait for block with data to be written only to
submit it again because inode was still dirty.
  In the second case, it would make sence if we waited for transaction
commit in fdatawait() because only then data is really on disk. But I don't
know about a fs which would do it - ext3 in data=journal mode just adds
page buffers to the current transaction in writepage and never sets
PageWriteback so fdatawait() is nop for it. The page is pinned in memory
only by the fact that its buffer heads are part of a transaction and thus
cannot be freed.
  So currently I don't know about real cases where fdatawait after ->fsync()
would buy us anything...

									Honza
-- 
Jan Kara <jack@suse.cz>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: [PATCH 07/17] vfs: Introduce new helpers for syncing after writing to O_SYNC file or IS_SYNC inode
  2009-08-21 15:23         ` Jan Kara
@ 2009-08-21 15:32           ` Christoph Hellwig
  2009-08-21 15:48             ` Jan Kara
  0 siblings, 1 reply; 66+ messages in thread
From: Christoph Hellwig @ 2009-08-21 15:32 UTC (permalink / raw)
  To: Jan Kara
  Cc: Christoph Hellwig, LKML, Evgeniy Polyakov, ocfs2-devel,
	Joel Becker, Felix Blyakher, xfs, Anton Altaparmakov,
	linux-ntfs-dev, OGAWA Hirofumi, linux-ext4, tytso

On Fri, Aug 21, 2009 at 05:23:39PM +0200, Jan Kara wrote:
>   Hmm, I can imagine it would buy us something in two cases (but looking at
> the code, neither is implemented in such a way that it would really help
> us in any way):
> 1) when an inode and it's data are stored in one block (e.g. OCFS2 or UDF) do
>    this.

>   In the first case we would wait for block with data to be written only to
> submit it again because inode was still dirty.

But we would not actually store in the pagecache or at least never
writeback it via the pacache in that case, but always just write it as
part of the inode.


^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: [PATCH 07/17] vfs: Introduce new helpers for syncing after writing to O_SYNC file or IS_SYNC inode
  2009-08-21 15:32           ` Christoph Hellwig
@ 2009-08-21 15:48             ` Jan Kara
  0 siblings, 0 replies; 66+ messages in thread
From: Jan Kara @ 2009-08-21 15:48 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Jan Kara, LKML, Evgeniy Polyakov, ocfs2-devel, Joel Becker,
	Felix Blyakher, xfs, Anton Altaparmakov, linux-ntfs-dev,
	OGAWA Hirofumi, linux-ext4, tytso

On Fri 21-08-09 11:32:35, Christoph Hellwig wrote:
> On Fri, Aug 21, 2009 at 05:23:39PM +0200, Jan Kara wrote:
> >   Hmm, I can imagine it would buy us something in two cases (but looking at
> > the code, neither is implemented in such a way that it would really help
> > us in any way):
> > 1) when an inode and it's data are stored in one block (e.g. OCFS2 or UDF) do
> >    this.
> 
> >   In the first case we would wait for block with data to be written only to
> > submit it again because inode was still dirty.
> 
> But we would not actually store in the pagecache or at least never
> writeback it via the pacache in that case, but always just write it as
> part of the inode.
  Yes, probably. Actually, that's what UDF does. It's just that then the
sequence fdatawrite(), fdatawait() does not really send the data to disk.
But that probably does not matter.

									Honza
-- 
Jan Kara <jack@suse.cz>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 66+ messages in thread

* [PATCH 02/17] vfs: Export __generic_file_aio_write() and add some comments
  2009-08-21 16:59 [PATCH 01/17] vfs: Introduce filemap_fdatawait_range Jan Kara
@ 2009-08-21 16:59 ` Jan Kara
  0 siblings, 0 replies; 66+ messages in thread
From: Jan Kara @ 2009-08-21 16:59 UTC (permalink / raw)
  To: LKML; +Cc: hch, linux-fsdevel, Jan Kara, ocfs2-devel, Joel Becker

Rename __generic_file_aio_write_nolock() to __generic_file_aio_write(), add
comments to write helpers explaining how they should be used and export
__generic_file_aio_write() since it will be used by some filesystems.

CC: ocfs2-devel@oss.oracle.com
CC: Joel Becker <joel.becker@oracle.com>
Acked-by: Evgeniy Polyakov <zbr@ioremap.net>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jan Kara <jack@suse.cz>
---
 include/linux/fs.h |    2 +
 mm/filemap.c       |   57 +++++++++++++++++++++++++++++++++++++++++++++------
 2 files changed, 52 insertions(+), 7 deletions(-)

diff --git a/include/linux/fs.h b/include/linux/fs.h
index cb365ad..1edd159 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -2195,6 +2195,8 @@ extern int generic_file_readonly_mmap(struct file *, struct vm_area_struct *);
 extern int file_read_actor(read_descriptor_t * desc, struct page *page, unsigned long offset, unsigned long size);
 int generic_write_checks(struct file *file, loff_t *pos, size_t *count, int isblk);
 extern ssize_t generic_file_aio_read(struct kiocb *, const struct iovec *, unsigned long, loff_t);
+extern ssize_t __generic_file_aio_write(struct kiocb *, const struct iovec *, unsigned long,
+		loff_t *);
 extern ssize_t generic_file_aio_write(struct kiocb *, const struct iovec *, unsigned long, loff_t);
 extern ssize_t generic_file_aio_write_nolock(struct kiocb *, const struct iovec *,
 		unsigned long, loff_t);
diff --git a/mm/filemap.c b/mm/filemap.c
index 65b2e50..554a396 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -2368,9 +2368,27 @@ generic_file_buffered_write(struct kiocb *iocb, const struct iovec *iov,
 }
 EXPORT_SYMBOL(generic_file_buffered_write);
 
-static ssize_t
-__generic_file_aio_write_nolock(struct kiocb *iocb, const struct iovec *iov,
-				unsigned long nr_segs, loff_t *ppos)
+/**
+ * __generic_file_aio_write - write data to a file
+ * @iocb:	IO state structure (file, offset, etc.)
+ * @iov:	vector with data to write
+ * @nr_segs:	number of segments in the vector
+ * @ppos:	position where to write
+ *
+ * This function does all the work needed for actually writing data to a
+ * file. It does all basic checks, removes SUID from the file, updates
+ * modification times and calls proper subroutines depending on whether we
+ * do direct IO or a standard buffered write.
+ *
+ * It expects i_mutex to be grabbed unless we work on a block device or similar
+ * object which does not need locking at all.
+ *
+ * This function does *not* take care of syncing data in case of O_SYNC write.
+ * A caller has to handle it. This is mainly due to the fact that we want to
+ * avoid syncing under i_mutex.
+ */
+ssize_t __generic_file_aio_write(struct kiocb *iocb, const struct iovec *iov,
+				 unsigned long nr_segs, loff_t *ppos)
 {
 	struct file *file = iocb->ki_filp;
 	struct address_space * mapping = file->f_mapping;
@@ -2467,7 +2485,23 @@ out:
 	current->backing_dev_info = NULL;
 	return written ? written : err;
 }
+EXPORT_SYMBOL(__generic_file_aio_write);
+
 
+/**
+ * generic_file_aio_write_nolock - write data, usually to a device
+ * @iocb:	IO state structure
+ * @iov:	vector with data to write
+ * @nr_segs:	number of segments in the vector
+ * @pos:	position in file where to write
+ *
+ * This is a wrapper around __generic_file_aio_write() which takes care of
+ * syncing the file in case of O_SYNC file. It does not take i_mutex for the
+ * write itself but may do so during syncing. It is meant for users like block
+ * devices which do not need i_mutex during write. If your filesystem needs to
+ * do a write but already holds i_mutex, use __generic_file_aio_write()
+ * directly and then sync the file like generic_file_aio_write().
+ */
 ssize_t generic_file_aio_write_nolock(struct kiocb *iocb,
 		const struct iovec *iov, unsigned long nr_segs, loff_t pos)
 {
@@ -2478,8 +2512,7 @@ ssize_t generic_file_aio_write_nolock(struct kiocb *iocb,
 
 	BUG_ON(iocb->ki_pos != pos);
 
-	ret = __generic_file_aio_write_nolock(iocb, iov, nr_segs,
-			&iocb->ki_pos);
+	ret = __generic_file_aio_write(iocb, iov, nr_segs, &iocb->ki_pos);
 
 	if (ret > 0 && ((file->f_flags & O_SYNC) || IS_SYNC(inode))) {
 		ssize_t err;
@@ -2492,6 +2525,17 @@ ssize_t generic_file_aio_write_nolock(struct kiocb *iocb,
 }
 EXPORT_SYMBOL(generic_file_aio_write_nolock);
 
+/**
+ * generic_file_aio_write - write data to a file
+ * @iocb:	IO state structure
+ * @iov:	vector with data to write
+ * @nr_segs:	number of segments in the vector
+ * @pos:	position in file where to write
+ *
+ * This is a wrapper around __generic_file_aio_write() to be used by most
+ * filesystems. It takes care of syncing the file in case of O_SYNC file
+ * and acquires i_mutex as needed.
+ */
 ssize_t generic_file_aio_write(struct kiocb *iocb, const struct iovec *iov,
 		unsigned long nr_segs, loff_t pos)
 {
@@ -2503,8 +2547,7 @@ ssize_t generic_file_aio_write(struct kiocb *iocb, const struct iovec *iov,
 	BUG_ON(iocb->ki_pos != pos);
 
 	mutex_lock(&inode->i_mutex);
-	ret = __generic_file_aio_write_nolock(iocb, iov, nr_segs,
-			&iocb->ki_pos);
+	ret = __generic_file_aio_write(iocb, iov, nr_segs, &iocb->ki_pos);
 	mutex_unlock(&inode->i_mutex);
 
 	if (ret > 0 && ((file->f_flags & O_SYNC) || IS_SYNC(inode))) {
-- 
1.6.0.2


^ permalink raw reply related	[flat|nested] 66+ messages in thread

* [PATCH 02/17] vfs: Export __generic_file_aio_write() and add some comments
  2009-08-21 17:23 [PATCH 0/17] Make O_SYNC handling use standard syncing path (Version 2) Jan Kara
@ 2009-08-21 17:23 ` Jan Kara
  0 siblings, 0 replies; 66+ messages in thread
From: Jan Kara @ 2009-08-21 17:23 UTC (permalink / raw)
  To: LKML; +Cc: hch, linux-fsdevel, Jan Kara, ocfs2-devel, Joel Becker

Rename __generic_file_aio_write_nolock() to __generic_file_aio_write(), add
comments to write helpers explaining how they should be used and export
__generic_file_aio_write() since it will be used by some filesystems.

CC: ocfs2-devel@oss.oracle.com
CC: Joel Becker <joel.becker@oracle.com>
Acked-by: Evgeniy Polyakov <zbr@ioremap.net>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jan Kara <jack@suse.cz>
---
 include/linux/fs.h |    2 +
 mm/filemap.c       |   57 +++++++++++++++++++++++++++++++++++++++++++++------
 2 files changed, 52 insertions(+), 7 deletions(-)

diff --git a/include/linux/fs.h b/include/linux/fs.h
index cb365ad..1edd159 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -2195,6 +2195,8 @@ extern int generic_file_readonly_mmap(struct file *, struct vm_area_struct *);
 extern int file_read_actor(read_descriptor_t * desc, struct page *page, unsigned long offset, unsigned long size);
 int generic_write_checks(struct file *file, loff_t *pos, size_t *count, int isblk);
 extern ssize_t generic_file_aio_read(struct kiocb *, const struct iovec *, unsigned long, loff_t);
+extern ssize_t __generic_file_aio_write(struct kiocb *, const struct iovec *, unsigned long,
+		loff_t *);
 extern ssize_t generic_file_aio_write(struct kiocb *, const struct iovec *, unsigned long, loff_t);
 extern ssize_t generic_file_aio_write_nolock(struct kiocb *, const struct iovec *,
 		unsigned long, loff_t);
diff --git a/mm/filemap.c b/mm/filemap.c
index 65b2e50..554a396 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -2368,9 +2368,27 @@ generic_file_buffered_write(struct kiocb *iocb, const struct iovec *iov,
 }
 EXPORT_SYMBOL(generic_file_buffered_write);
 
-static ssize_t
-__generic_file_aio_write_nolock(struct kiocb *iocb, const struct iovec *iov,
-				unsigned long nr_segs, loff_t *ppos)
+/**
+ * __generic_file_aio_write - write data to a file
+ * @iocb:	IO state structure (file, offset, etc.)
+ * @iov:	vector with data to write
+ * @nr_segs:	number of segments in the vector
+ * @ppos:	position where to write
+ *
+ * This function does all the work needed for actually writing data to a
+ * file. It does all basic checks, removes SUID from the file, updates
+ * modification times and calls proper subroutines depending on whether we
+ * do direct IO or a standard buffered write.
+ *
+ * It expects i_mutex to be grabbed unless we work on a block device or similar
+ * object which does not need locking at all.
+ *
+ * This function does *not* take care of syncing data in case of O_SYNC write.
+ * A caller has to handle it. This is mainly due to the fact that we want to
+ * avoid syncing under i_mutex.
+ */
+ssize_t __generic_file_aio_write(struct kiocb *iocb, const struct iovec *iov,
+				 unsigned long nr_segs, loff_t *ppos)
 {
 	struct file *file = iocb->ki_filp;
 	struct address_space * mapping = file->f_mapping;
@@ -2467,7 +2485,23 @@ out:
 	current->backing_dev_info = NULL;
 	return written ? written : err;
 }
+EXPORT_SYMBOL(__generic_file_aio_write);
+
 
+/**
+ * generic_file_aio_write_nolock - write data, usually to a device
+ * @iocb:	IO state structure
+ * @iov:	vector with data to write
+ * @nr_segs:	number of segments in the vector
+ * @pos:	position in file where to write
+ *
+ * This is a wrapper around __generic_file_aio_write() which takes care of
+ * syncing the file in case of O_SYNC file. It does not take i_mutex for the
+ * write itself but may do so during syncing. It is meant for users like block
+ * devices which do not need i_mutex during write. If your filesystem needs to
+ * do a write but already holds i_mutex, use __generic_file_aio_write()
+ * directly and then sync the file like generic_file_aio_write().
+ */
 ssize_t generic_file_aio_write_nolock(struct kiocb *iocb,
 		const struct iovec *iov, unsigned long nr_segs, loff_t pos)
 {
@@ -2478,8 +2512,7 @@ ssize_t generic_file_aio_write_nolock(struct kiocb *iocb,
 
 	BUG_ON(iocb->ki_pos != pos);
 
-	ret = __generic_file_aio_write_nolock(iocb, iov, nr_segs,
-			&iocb->ki_pos);
+	ret = __generic_file_aio_write(iocb, iov, nr_segs, &iocb->ki_pos);
 
 	if (ret > 0 && ((file->f_flags & O_SYNC) || IS_SYNC(inode))) {
 		ssize_t err;
@@ -2492,6 +2525,17 @@ ssize_t generic_file_aio_write_nolock(struct kiocb *iocb,
 }
 EXPORT_SYMBOL(generic_file_aio_write_nolock);
 
+/**
+ * generic_file_aio_write - write data to a file
+ * @iocb:	IO state structure
+ * @iov:	vector with data to write
+ * @nr_segs:	number of segments in the vector
+ * @pos:	position in file where to write
+ *
+ * This is a wrapper around __generic_file_aio_write() to be used by most
+ * filesystems. It takes care of syncing the file in case of O_SYNC file
+ * and acquires i_mutex as needed.
+ */
 ssize_t generic_file_aio_write(struct kiocb *iocb, const struct iovec *iov,
 		unsigned long nr_segs, loff_t pos)
 {
@@ -2503,8 +2547,7 @@ ssize_t generic_file_aio_write(struct kiocb *iocb, const struct iovec *iov,
 	BUG_ON(iocb->ki_pos != pos);
 
 	mutex_lock(&inode->i_mutex);
-	ret = __generic_file_aio_write_nolock(iocb, iov, nr_segs,
-			&iocb->ki_pos);
+	ret = __generic_file_aio_write(iocb, iov, nr_segs, &iocb->ki_pos);
 	mutex_unlock(&inode->i_mutex);
 
 	if (ret > 0 && ((file->f_flags & O_SYNC) || IS_SYNC(inode))) {
-- 
1.6.0.2


^ permalink raw reply related	[flat|nested] 66+ messages in thread

* Re: [PATCH 07/17] vfs: Introduce new helpers for syncing after writing to O_SYNC file or IS_SYNC inode
  2009-08-20 16:27       ` Christoph Hellwig
  2009-08-21 15:23         ` Jan Kara
@ 2009-08-26 18:22         ` Christoph Hellwig
  2009-08-27  0:04           ` Christoph Hellwig
  1 sibling, 1 reply; 66+ messages in thread
From: Christoph Hellwig @ 2009-08-26 18:22 UTC (permalink / raw)
  To: Jan Kara
  Cc: Christoph Hellwig, LKML, Evgeniy Polyakov, ocfs2-devel,
	Joel Becker, Felix Blyakher, xfs, Anton Altaparmakov,
	linux-ntfs-dev, OGAWA Hirofumi, linux-ext4, tytso

On Thu, Aug 20, 2009 at 12:27:29PM -0400, Christoph Hellwig wrote:
> Maybe you can help brain storming, but I still can't see any way in that
> the
> 
>   - write data
>   - write inode
>   - wait for data
> 
> actually is a benefit in terms of semantics (I agree that it could be
> faster in theory, but even that is debatable with todays seek latencies
> in disks)

Btw, another reason why our current default is actively harmful:

	barriers

With volatile write caches we do have to flush the disk write cache in
->fsync, either implicitly by a metadata operation, or explicitly if
only data changed.  Unless the filesystems waits itself for the data
to hit the disk like XFS or btrfs will be issue the cache flush
potentially before the data write has actually reached the disk cache.

^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: [PATCH 07/17] vfs: Introduce new helpers for syncing after writing to O_SYNC file or IS_SYNC inode
  2009-08-26 18:22         ` Christoph Hellwig
@ 2009-08-27  0:04           ` Christoph Hellwig
  0 siblings, 0 replies; 66+ messages in thread
From: Christoph Hellwig @ 2009-08-27  0:04 UTC (permalink / raw)
  To: Jan Kara
  Cc: Christoph Hellwig, LKML, Evgeniy Polyakov, ocfs2-devel,
	Joel Becker, Felix Blyakher, xfs, Anton Altaparmakov,
	linux-ntfs-dev, OGAWA Hirofumi, linux-ext4, tytso

On Wed, Aug 26, 2009 at 02:22:36PM -0400, Christoph Hellwig wrote:
> On Thu, Aug 20, 2009 at 12:27:29PM -0400, Christoph Hellwig wrote:
> > Maybe you can help brain storming, but I still can't see any way in that
> > the
> > 
> >   - write data
> >   - write inode
> >   - wait for data
> > 
> > actually is a benefit in terms of semantics (I agree that it could be
> > faster in theory, but even that is debatable with todays seek latencies
> > in disks)
> 
> Btw, another reason why our current default is actively harmful:
> 
> 	barriers
> 
> With volatile write caches we do have to flush the disk write cache in
> ->fsync, either implicitly by a metadata operation, or explicitly if
> only data changed.  Unless the filesystems waits itself for the data
> to hit the disk like XFS or btrfs will be issue the cache flush
> potentially before the data write has actually reached the disk cache.

Ok, this one failed the reality check - no matter how hard I tried
I could not reproduce that case in my test harness.  It turns out
cache flush request are quite sensibly treated as barriers by the block
layer and thus we drain the queue before issueing the cache flush.

^ permalink raw reply	[flat|nested] 66+ messages in thread

* adding proper O_SYNC/O_DSYNC, was Re: O_DIRECT and barriers
       [not found]                 ` <20090827143459.GB31453@shareable.org>
@ 2009-08-27 17:10                   ` Christoph Hellwig
  2009-08-27 17:24                     ` Ulrich Drepper
  0 siblings, 1 reply; 66+ messages in thread
From: Christoph Hellwig @ 2009-08-27 17:10 UTC (permalink / raw)
  To: Jamie Lokier, Ulrich Drepper; +Cc: linux-fsdevel, linux-kernel

On Thu, Aug 27, 2009 at 03:34:59PM +0100, Jamie Lokier wrote:
> Christoph Hellwig wrote:
> > Then again they already don't get what they expect and never did,
> > so if we clear document and communicate the O_SYNC (that is Linux
> > O_SYNC) requirement we might be able to go with this.
> 
> I'm thinking, while we're looking at this, that now is a really good
> time to split up O_SYNC and O_DSYNC.
> 
> We have separate fsync and fdatasync, so it should be quite tidy now.
> 
> Then we can document using O_DSYNC on Linux, which is fine for older
> versions because it has the same value as O_SYNC at the moment.

Technically we could easily make O_SYNC really mean O_SYNC and implement
a seaprate O_DSYNC at the kernel level.

The question is how to handle this at the libc level.  Currently glibc
defines O_DSYNC to be O_SYNC.  We would need to update glibc to pass
through O_DSYNC for newer kernels and make sure it falls back to O_SYNC
for olders.  I'm not sure how feasible this is, but maybe Ulrich has
some better ideas.

^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: adding proper O_SYNC/O_DSYNC, was Re: O_DIRECT and barriers
  2009-08-27 17:10                   ` adding proper O_SYNC/O_DSYNC, was Re: O_DIRECT and barriers Christoph Hellwig
@ 2009-08-27 17:24                     ` Ulrich Drepper
  2009-08-28 15:46                       ` Christoph Hellwig
  0 siblings, 1 reply; 66+ messages in thread
From: Ulrich Drepper @ 2009-08-27 17:24 UTC (permalink / raw)
  To: Christoph Hellwig; +Cc: Jamie Lokier, linux-fsdevel, linux-kernel

On 08/27/2009 10:10 AM, Christoph Hellwig wrote:
> The question is how to handle this at the libc level.  Currently glibc
> defines O_DSYNC to be O_SYNC.  We would need to update glibc to pass
> through O_DSYNC for newer kernels and make sure it falls back to O_SYNC
> for olders.  I'm not sure how feasible this is, but maybe Ulrich has
> some better ideas.

The problem with O_* extensions is that the syscall doesn't fail if the 
flag is not handled.  This is a problem in the open implementation which 
can only be fixed with a new syscall.

Why cannot just go on and say we interpret O_SYNC like O_SYNC and 
O_SYNC|O_DSYNC like O_DSYNC.  The POSIX spec explicitly requires that 
the latter handled like O_SYNC.

We could handle it by allocating two bits, only one is handled in the 
kernel.  If the O_DSYNC definition for userlevel would be different from 
the kernel definition then the kernel could interpret O_SYNC|O_DSYNC 
like O_DSYNC.  The libc would then have to translate the userlevel 
O_DSYNC into the kernel O_DSYNC.  If the libc is too old for the kernel 
and the application, the userlevel flag would be passed to the kernel 
and nothing bad happens.

The cleaner alternative is to have a sys_newopen which checks for 
unknown flags and fails in that case.

-- 
➧ Ulrich Drepper ➧ Red Hat, Inc. ➧ 444 Castro St ➧ Mountain View, CA ❖

^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: adding proper O_SYNC/O_DSYNC, was Re: O_DIRECT and barriers
  2009-08-27 17:24                     ` Ulrich Drepper
@ 2009-08-28 15:46                       ` Christoph Hellwig
  2009-08-28 16:06                         ` Ulrich Drepper
                                           ` (2 more replies)
  0 siblings, 3 replies; 66+ messages in thread
From: Christoph Hellwig @ 2009-08-28 15:46 UTC (permalink / raw)
  To: Ulrich Drepper
  Cc: Christoph Hellwig, Jamie Lokier, linux-fsdevel, linux-kernel

On Thu, Aug 27, 2009 at 10:24:28AM -0700, Ulrich Drepper wrote:
> The problem with O_* extensions is that the syscall doesn't fail if the  
> flag is not handled.  This is a problem in the open implementation which  
> can only be fixed with a new syscall.
>
> Why cannot just go on and say we interpret O_SYNC like O_SYNC and  
> O_SYNC|O_DSYNC like O_DSYNC.  The POSIX spec explicitly requires that  
> the latter handled like O_SYNC.
>
> We could handle it by allocating two bits, only one is handled in the  
> kernel.  If the O_DSYNC definition for userlevel would be different from  
> the kernel definition then the kernel could interpret O_SYNC|O_DSYNC  
> like O_DSYNC.  The libc would then have to translate the userlevel  
> O_DSYNC into the kernel O_DSYNC.  If the libc is too old for the kernel  
> and the application, the userlevel flag would be passed to the kernel  
> and nothing bad happens.

What about hte following variant:

 - given that our current O_SYNC really is and always has been actuall
   Posix O_DSYNC keep the numerical value and rename it to O_DSYNC in
   the headers.
 - Add a new O_SYNC definition:

	#define O_SYNC		(O_DSYNC|O_REALLY_SYNC)

   and do full O_SYNC handling in new kernels if O_REALLY_SYNC is
   present.

^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: adding proper O_SYNC/O_DSYNC, was Re: O_DIRECT and barriers
  2009-08-28 15:46                       ` Christoph Hellwig
@ 2009-08-28 16:06                         ` Ulrich Drepper
  2009-08-28 16:17                           ` Christoph Hellwig
  2009-08-28 16:44                         ` Jamie Lokier
  2009-08-28 23:06                         ` Jamie Lokier
  2 siblings, 1 reply; 66+ messages in thread
From: Ulrich Drepper @ 2009-08-28 16:06 UTC (permalink / raw)
  To: Christoph Hellwig; +Cc: Jamie Lokier, linux-fsdevel, linux-kernel

On 08/28/2009 08:46 AM, Christoph Hellwig wrote:
>   - given that our current O_SYNC really is and always has been actuall
>     Posix O_DSYNC

If this is true, then this proposal would work, yes.

-- 
➧ Ulrich Drepper ➧ Red Hat, Inc. ➧ 444 Castro St ➧ Mountain View, CA ❖

^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: adding proper O_SYNC/O_DSYNC, was Re: O_DIRECT and barriers
  2009-08-28 16:06                         ` Ulrich Drepper
@ 2009-08-28 16:17                           ` Christoph Hellwig
  2009-08-28 16:33                             ` Ulrich Drepper
  0 siblings, 1 reply; 66+ messages in thread
From: Christoph Hellwig @ 2009-08-28 16:17 UTC (permalink / raw)
  To: Ulrich Drepper
  Cc: Christoph Hellwig, Jamie Lokier, linux-fsdevel, linux-kernel

On Fri, Aug 28, 2009 at 09:06:35AM -0700, Ulrich Drepper wrote:
> On 08/28/2009 08:46 AM, Christoph Hellwig wrote:
>>   - given that our current O_SYNC really is and always has been actuall
>>     Posix O_DSYNC
>
> If this is true, then this proposal would work, yes.

I'll put it on my todo list.  While reading through the Posix specs
I came up with some questions that you might be able to answer:

 - O_RSYNC basically means we need to commit atime updates before a
   read returns, right?  It would be easy to implement 
   it in a slightly suboptimal fashion, but is there any point?


^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: adding proper O_SYNC/O_DSYNC, was Re: O_DIRECT and barriers
  2009-08-28 16:17                           ` Christoph Hellwig
@ 2009-08-28 16:33                             ` Ulrich Drepper
  2009-08-28 16:41                               ` Christoph Hellwig
  2009-08-28 16:46                               ` Jamie Lokier
  0 siblings, 2 replies; 66+ messages in thread
From: Ulrich Drepper @ 2009-08-28 16:33 UTC (permalink / raw)
  To: Christoph Hellwig; +Cc: Jamie Lokier, linux-fsdevel, linux-kernel

On 08/28/2009 09:17 AM, Christoph Hellwig wrote:
> I'll put it on my todo list.

Any ABI change like this takes a long time to trickle down.

If this is agreed to be the correct approach then adding the O_* 
definitions earlier is better.  Even if it isn't yet implemented.  Then, 
once the kernel side is implemented, programs are ready to use it.  I 
cannot jump the gun and define the flags myself first.

>   - O_RSYNC basically means we need to commit atime updates before a
>     read returns, right?

No, that's not it.

O_RSYNC on its own just means the data is successfully transferred to 
the calling process (always the case).

O_RSYNC|O_DSYNC means that if a read request hits data that is currently 
in a cache and not yet on the medium, then the write to medium is 
successful before the read succeeds.

O_RSYNC|O_SYNC means the same plus the integrity of file meta 
information (access time etc).

-- 
➧ Ulrich Drepper ➧ Red Hat, Inc. ➧ 444 Castro St ➧ Mountain View, CA ❖

^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: adding proper O_SYNC/O_DSYNC, was Re: O_DIRECT and barriers
  2009-08-28 16:33                             ` Ulrich Drepper
@ 2009-08-28 16:41                               ` Christoph Hellwig
  2009-08-28 20:51                                 ` Ulrich Drepper
  2009-08-28 16:46                               ` Jamie Lokier
  1 sibling, 1 reply; 66+ messages in thread
From: Christoph Hellwig @ 2009-08-28 16:41 UTC (permalink / raw)
  To: Ulrich Drepper
  Cc: Christoph Hellwig, Jamie Lokier, linux-fsdevel, linux-kernel

On Fri, Aug 28, 2009 at 09:33:29AM -0700, Ulrich Drepper wrote:
> On 08/28/2009 09:17 AM, Christoph Hellwig wrote:
>> I'll put it on my todo list.
>
> Any ABI change like this takes a long time to trickle down.
>
> If this is agreed to be the correct approach then adding the O_*  
> definitions earlier is better.  Even if it isn't yet implemented.  Then,  
> once the kernel side is implemented, programs are ready to use it.  I  
> cannot jump the gun and define the flags myself first.

Yeah.  The implementation really is trivial in 2.6.32 - we basically
just need to change one function to check the new O_REALLY_SYNC flag
and pass down a 0 instead of a 1 to another routine in the generic
fs code, plus doing the same in a few filesystems opencoding it instead
of using the generic helpers.

So the logistics of doing the flags really is the biggest work here.
And I'm not entirely sure how to do it correctly.  Can we just switch
the current O_SYNC defintion in the kernel headers to O_DSYNC while
adding the new O_SYNC and everything will continue to work? 

>>   - O_RSYNC basically means we need to commit atime updates before a
>>     read returns, right?
>
> No, that's not it.
>
> O_RSYNC on its own just means the data is successfully transferred to  
> the calling process (always the case).
>
> O_RSYNC|O_DSYNC means that if a read request hits data that is currently  
> in a cache and not yet on the medium, then the write to medium is  
> successful before the read succeeds.

That includes a write from another process?  So O_RSYNC basically means
doing an range-fdatasync before the actual read request?

Again, we could implement this easily if we care enough.

^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: adding proper O_SYNC/O_DSYNC, was Re: O_DIRECT and barriers
  2009-08-28 15:46                       ` Christoph Hellwig
  2009-08-28 16:06                         ` Ulrich Drepper
@ 2009-08-28 16:44                         ` Jamie Lokier
  2009-08-28 16:50                           ` Jamie Lokier
  2009-08-28 21:08                           ` Ulrich Drepper
  2009-08-28 23:06                         ` Jamie Lokier
  2 siblings, 2 replies; 66+ messages in thread
From: Jamie Lokier @ 2009-08-28 16:44 UTC (permalink / raw)
  To: Christoph Hellwig; +Cc: Ulrich Drepper, linux-fsdevel, linux-kernel

Christoph Hellwig wrote:
> On Thu, Aug 27, 2009 at 10:24:28AM -0700, Ulrich Drepper wrote:
> > The problem with O_* extensions is that the syscall doesn't fail if the  
> > flag is not handled.  This is a problem in the open implementation which  
> > can only be fixed with a new syscall.
> >
> > Why cannot just go on and say we interpret O_SYNC like O_SYNC and  
> > O_SYNC|O_DSYNC like O_DSYNC.  The POSIX spec explicitly requires that  
> > the latter handled like O_SYNC.
> >
> > We could handle it by allocating two bits, only one is handled in the  
> > kernel.  If the O_DSYNC definition for userlevel would be different from  
> > the kernel definition then the kernel could interpret O_SYNC|O_DSYNC  
> > like O_DSYNC.  The libc would then have to translate the userlevel  
> > O_DSYNC into the kernel O_DSYNC.  If the libc is too old for the kernel  
> > and the application, the userlevel flag would be passed to the kernel  
> > and nothing bad happens.
> 
> What about hte following variant:
> 
>  - given that our current O_SYNC really is and always has been actuall
>    Posix O_DSYNC keep the numerical value and rename it to O_DSYNC in
>    the headers.
>  - Add a new O_SYNC definition:
> 
> 	#define O_SYNC		(O_DSYNC|O_REALLY_SYNC)
> 
>    and do full O_SYNC handling in new kernels if O_REALLY_SYNC is
>    present.

That looks good for the kernel.

However, for userspace, there's an issue with applications which were
compiled with an old libc and used O_SYNC.  Most of them probably
expected O_SYNC behaviour but all they got was O_DSYNC, because Linux
didn't do it right.

But they *didn't know* that.

When using a newer kernel which actually implements O_SYNC behaviour,
I'm thinking those applications which asked for O_SYNC should get it,
even though they're still linked with an old libc.

That's because this thread is the first time I've heard that Linux
O_SYNC was really the weaker O_DSYNC in disguise, and judging from the
many Googlings I've done about O_SYNC in applications and on different
OS, it'll be news to other people too.

(I always thought the "#define O_DSYNC O_SYNC" was because Linux
didn't implement the weaker O_DSYNC).

(Oh, and Ulrich: Why is there a "#define O_RSYNC O_SYNC" in the Glibc
headers?  That doesn't make sense: O_RSYNC has nothing to do with
writing.)

To achieve that, libc could implement two versions of open() at the
same time as it updates header files.  The new libc's __old_open() would
do:

    /* Only O_DSYNC is set for apps built against old libc which
       were compiled
    if (flags & O_DSYNC)
        flags |= O_SYNC;

I'm not exactly sure how symbol versioning works, but perhaps the
header file in the new libc would need __REDIRECT_NTH to map open() to
__new_open(), which just calls the kernel.  This is to ensure .o and
.a files built with an old libc's headers but then linked to a new
libc will get __old_open().

Although libc's __new_open() could have this:

    /* Old kernels only look at O_DSYNC.  It's better than nothing. */
    if (flags & O_SYNC)
        flags |= O_DSYNC;

Imho, it's better to not do that, and instead have

    #define O_SYNC          (O_DSYNC|__O_SYNC_KERNEL)

as Chris suggests, in the libc header the same as the kernel header,
because that way applications which use the syscall() function or have
to invoke a syscall directly (I've seen clone-using code doing it),
won't spontaneously start losing their O_SYNCness on older kernels.
Unless there is some reason why "flags &= ~O_SYNC" is not permitted to
clear the O_DSYNC flag, or other reason why they must be separate flags.

-- Jamie

^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: adding proper O_SYNC/O_DSYNC, was Re: O_DIRECT and barriers
  2009-08-28 16:33                             ` Ulrich Drepper
  2009-08-28 16:41                               ` Christoph Hellwig
@ 2009-08-28 16:46                               ` Jamie Lokier
  2009-08-29  0:59                                 ` Jamie Lokier
  1 sibling, 1 reply; 66+ messages in thread
From: Jamie Lokier @ 2009-08-28 16:46 UTC (permalink / raw)
  To: Ulrich Drepper; +Cc: Christoph Hellwig, linux-fsdevel, linux-kernel

Ulrich Drepper wrote:
> >  - O_RSYNC basically means we need to commit atime updates before a
> >    read returns, right?
> 
> No, that's not it.
> 
> O_RSYNC on its own just means the data is successfully transferred to 
> the calling process (always the case).
> 
> O_RSYNC|O_DSYNC means that if a read request hits data that is currently 
> in a cache and not yet on the medium, then the write to medium is 
> successful before the read succeeds.
> 
> O_RSYNC|O_SYNC means the same plus the integrity of file meta 
> information (access time etc).

On several unixes, O_RSYNC means it will send the read to the
hardware, not relying on the cache.  This can be used to verify the
data which was written earlier, whether by O_DSYNC or fdatasync.

-- Jamie

^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: adding proper O_SYNC/O_DSYNC, was Re: O_DIRECT and barriers
  2009-08-28 16:44                         ` Jamie Lokier
@ 2009-08-28 16:50                           ` Jamie Lokier
  2009-08-28 21:08                           ` Ulrich Drepper
  1 sibling, 0 replies; 66+ messages in thread
From: Jamie Lokier @ 2009-08-28 16:50 UTC (permalink / raw)
  To: Christoph Hellwig; +Cc: Ulrich Drepper, linux-fsdevel, linux-kernel

Jamie Lokier wrote:
> That's because this thread is the first time I've heard that Linux
> O_SYNC was really the weaker O_DSYNC in disguise, and judging from the
> many Googlings I've done about O_SYNC in applications and on different
> OS, it'll be news to other people too.
> 
> (I always thought the "#define O_DSYNC O_SYNC" was because Linux
> didn't implement the weaker O_DSYNC).

It looks like we're not the only ones.  AIX has:

http://publib.boulder.ibm.com/infocenter/systems/index.jsp?topic=/com.ibm.aix.genprogc/doc/genprogc/fileio.htm

    Before the O_DSYNC open mode existed, AIX applied O_DSYNC semantics to
    O_SYNC. For binary compatibility reasons, this behavior still
    exists. If true O_SYNC behavior is required, then both O_DSYNC and
    O_SYNC open flags must be specified. Exporting the XPG_SUS_ENV=ON
    environment variable also enables true O_SYNC behavior.

-- Jamie

^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: adding proper O_SYNC/O_DSYNC, was Re: O_DIRECT and barriers
  2009-08-28 16:41                               ` Christoph Hellwig
@ 2009-08-28 20:51                                 ` Ulrich Drepper
  2009-08-28 21:08                                   ` Christoph Hellwig
  0 siblings, 1 reply; 66+ messages in thread
From: Ulrich Drepper @ 2009-08-28 20:51 UTC (permalink / raw)
  To: Christoph Hellwig; +Cc: Jamie Lokier, linux-fsdevel, linux-kernel

On 08/28/2009 09:41 AM, Christoph Hellwig wrote:
> Yeah.  The implementation really is trivial in 2.6.32 - we basically
> just need to change one function to check the new O_REALLY_SYNC flag
> and pass down a 0 instead of a 1 to another routine in the generic
> fs code, plus doing the same in a few filesystems opencoding it instead
> of using the generic helpers.

I don't think you have to change anything.  As I wrote before, the 
kernel ignores unknown O_* flags.  It's usually a problem.  Here it is a 
positive thing.


> So the logistics of doing the flags really is the biggest work here.
> And I'm not entirely sure how to do it correctly.  Can we just switch
> the current O_SYNC defintion in the kernel headers to O_DSYNC while
> adding the new O_SYNC and everything will continue to work?

No, that's not a good idea.  This would mean a program compiled with 
newer headers is using O_SYNC which isn't known to old kernels and 
ignored.  Such programs will then not even get the current O_DSYNC benefits.


> That includes a write from another process?  So O_RSYNC basically means
> doing an range-fdatasync before the actual read request?

Yes.  You can easily see how this can be useful.


> Again, we could implement this easily if we care enough.

I think it can be useful at times.

-- 
➧ Ulrich Drepper ➧ Red Hat, Inc. ➧ 444 Castro St ➧ Mountain View, CA ❖

^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: adding proper O_SYNC/O_DSYNC, was Re: O_DIRECT and barriers
  2009-08-28 20:51                                 ` Ulrich Drepper
@ 2009-08-28 21:08                                   ` Christoph Hellwig
  2009-08-28 21:16                                     ` Trond Myklebust
  2009-08-30 16:44                                     ` Jamie Lokier
  0 siblings, 2 replies; 66+ messages in thread
From: Christoph Hellwig @ 2009-08-28 21:08 UTC (permalink / raw)
  To: Ulrich Drepper
  Cc: Christoph Hellwig, Jamie Lokier, linux-fsdevel, linux-kernel,
	torvalds

On Fri, Aug 28, 2009 at 01:51:03PM -0700, Ulrich Drepper wrote:
> No, that's not a good idea.  This would mean a program compiled with  
> newer headers is using O_SYNC which isn't known to old kernels and  
> ignored.  Such programs will then not even get the current O_DSYNC 
> benefits.

Ok, let's agree on how to proceed:


once 2.6.31 is out we will do the following

 - do a global s/O_SYNC/O_DSYNC/g over the whole kernel tree
 - add a this to include/asm-generic/fcntl.h and in modified form
   to arch headers not using it:

#ifndef O_FULLSYNC
#define O_FULLSYNC	02000000
#endif

#ifndef O_RSYNC
#define O_RSYNC		04000000
#endif

#define O_SYNC	(O_FULLSYNC|O_DSYNC)

 - during the normal merge window I will add a real implementation for
   for O_FULLSYNC and O_RSYNC

P.S. better naming suggestions for O_FULLSYNC welcome

^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: adding proper O_SYNC/O_DSYNC, was Re: O_DIRECT and barriers
  2009-08-28 16:44                         ` Jamie Lokier
  2009-08-28 16:50                           ` Jamie Lokier
@ 2009-08-28 21:08                           ` Ulrich Drepper
  2009-08-30 16:58                             ` Jamie Lokier
  2009-08-30 17:48                             ` Jamie Lokier
  1 sibling, 2 replies; 66+ messages in thread
From: Ulrich Drepper @ 2009-08-28 21:08 UTC (permalink / raw)
  To: Jamie Lokier; +Cc: Christoph Hellwig, linux-fsdevel, linux-kernel

On 08/28/2009 09:44 AM, Jamie Lokier wrote:
> However, for userspace, there's an issue with applications which were
> compiled with an old libc and used O_SYNC.  Most of them probably
> expected O_SYNC behaviour but all they got was O_DSYNC, because Linux
> didn't do it right.

Right.  But these programs apparently can live with the broken 
semantics.  I don't worry too much about this.  If people really need 
the fixed O_SYNC semantics then let them recompile their code.


> When using a newer kernel which actually implements O_SYNC behaviour,
> I'm thinking those applications which asked for O_SYNC should get it,
> even though they're still linked with an old libc.

In general yes, but it's too expensive.  Again, existing programs expect 
the current behavior and can live with it.


> (Oh, and Ulrich: Why is there a "#define O_RSYNC O_SYNC" in the Glibc
> headers?  That doesn't make sense: O_RSYNC has nothing to do with
> writing.)

O_SYNC is a superset of O_RSYNC.  In the absence of a true O_RSYNC 
that's the next best thing.  Of course I didn't know the Linux O_SYNC is 
really O_DSYNC.  In that context the definition doesn't make sense.


> Although libc's __new_open() could have this:
>
>      /* Old kernels only look at O_DSYNC.  It's better than nothing. */
>      if (flags&  O_SYNC)
>          flags |= O_DSYNC;
>
> Imho, it's better to not do that, and instead have
>
>      #define O_SYNC          (O_DSYNC|__O_SYNC_KERNEL)

Why should it be better?  You're replacing something the compiler can do 
with zero cost with active code.


Again, these O_* constant changes are sufficient.

-- 
➧ Ulrich Drepper ➧ Red Hat, Inc. ➧ 444 Castro St ➧ Mountain View, CA ❖

^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: adding proper O_SYNC/O_DSYNC, was Re: O_DIRECT and barriers
  2009-08-28 21:08                                   ` Christoph Hellwig
@ 2009-08-28 21:16                                     ` Trond Myklebust
  2009-08-28 21:29                                       ` Christoph Hellwig
  2009-08-30 16:44                                     ` Jamie Lokier
  1 sibling, 1 reply; 66+ messages in thread
From: Trond Myklebust @ 2009-08-28 21:16 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Ulrich Drepper, Jamie Lokier, linux-fsdevel, linux-kernel,
	torvalds

On Fri, 2009-08-28 at 17:08 -0400, Christoph Hellwig wrote:
> #define O_SYNC	(O_FULLSYNC|O_DSYNC)
> 
>  - during the normal merge window I will add a real implementation for
>    for O_FULLSYNC and O_RSYNC
> 
> P.S. better naming suggestions for O_FULLSYNC welcome

Basically you are just ensuring that the metadata changes are being
synced together with the data changes, so how about O_ISYNC (inode
sync)?



^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: adding proper O_SYNC/O_DSYNC, was Re: O_DIRECT and barriers
  2009-08-28 21:16                                     ` Trond Myklebust
@ 2009-08-28 21:29                                       ` Christoph Hellwig
  2009-08-28 21:43                                         ` Trond Myklebust
  0 siblings, 1 reply; 66+ messages in thread
From: Christoph Hellwig @ 2009-08-28 21:29 UTC (permalink / raw)
  To: Trond Myklebust
  Cc: Christoph Hellwig, Ulrich Drepper, Jamie Lokier, linux-fsdevel,
	linux-kernel, torvalds

On Fri, Aug 28, 2009 at 05:16:14PM -0400, Trond Myklebust wrote:
> On Fri, 2009-08-28 at 17:08 -0400, Christoph Hellwig wrote:
> > #define O_SYNC	(O_FULLSYNC|O_DSYNC)
> > 
> >  - during the normal merge window I will add a real implementation for
> >    for O_FULLSYNC and O_RSYNC
> > 
> > P.S. better naming suggestions for O_FULLSYNC welcome
> 
> Basically you are just ensuring that the metadata changes are being
> synced together with the data changes, so how about O_ISYNC (inode
> sync)?

Yeah.  Thinking about this a bit more we should define this flag
much more clearly.  In the obvious implementation it would not actually
do anything if it's set on it's own.  We would only check it if O_DSYNC
is already set to decided if we want to set the datasync argument to
->fsync to 0 or 1 for the generic filesystems (and similar things for
filesystems not using the generic helper).

If we deem that this is too unsafe we could make sure O_DSYNC always
gets set on this fag in ->open, but if we make sure O_SYNC is defined
like the one above in the kernel headers and glibc we should be fine.

Although in that case a name that doesn't suggest that it actually does
something useful would be better.

^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: adding proper O_SYNC/O_DSYNC, was Re: O_DIRECT and barriers
  2009-08-28 21:29                                       ` Christoph Hellwig
@ 2009-08-28 21:43                                         ` Trond Myklebust
  2009-08-28 22:39                                           ` Christoph Hellwig
  0 siblings, 1 reply; 66+ messages in thread
From: Trond Myklebust @ 2009-08-28 21:43 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Ulrich Drepper, Jamie Lokier, linux-fsdevel, linux-kernel,
	torvalds

On Fri, 2009-08-28 at 17:29 -0400, Christoph Hellwig wrote:
> On Fri, Aug 28, 2009 at 05:16:14PM -0400, Trond Myklebust wrote:
> > On Fri, 2009-08-28 at 17:08 -0400, Christoph Hellwig wrote:
> > > #define O_SYNC	(O_FULLSYNC|O_DSYNC)
> > > 
> > >  - during the normal merge window I will add a real implementation for
> > >    for O_FULLSYNC and O_RSYNC
> > > 
> > > P.S. better naming suggestions for O_FULLSYNC welcome
> > 
> > Basically you are just ensuring that the metadata changes are being
> > synced together with the data changes, so how about O_ISYNC (inode
> > sync)?
> 
> Yeah.  Thinking about this a bit more we should define this flag
> much more clearly.  In the obvious implementation it would not actually
> do anything if it's set on it's own.  We would only check it if O_DSYNC
> is already set to decided if we want to set the datasync argument to
> ->fsync to 0 or 1 for the generic filesystems (and similar things for
> filesystems not using the generic helper).
> 
> If we deem that this is too unsafe we could make sure O_DSYNC always
> gets set on this fag in ->open, but if we make sure O_SYNC is defined
> like the one above in the kernel headers and glibc we should be fine.
> 
> Although in that case a name that doesn't suggest that it actually does
> something useful would be better.

If you are going to automatically set O_DSYNC in open(), then
fcntl(F_SETFL) might get a bit nasty.

Imagine using it after the open in order to clear the O_ISYNC flag;
you'll still be left with the O_DSYNC (which you never set in the first
place). That would be confusing...

Cheers
  Trond


^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: adding proper O_SYNC/O_DSYNC, was Re: O_DIRECT and barriers
  2009-08-28 21:43                                         ` Trond Myklebust
@ 2009-08-28 22:39                                           ` Christoph Hellwig
  0 siblings, 0 replies; 66+ messages in thread
From: Christoph Hellwig @ 2009-08-28 22:39 UTC (permalink / raw)
  To: Trond Myklebust
  Cc: Christoph Hellwig, Ulrich Drepper, Jamie Lokier, linux-fsdevel,
	linux-kernel, torvalds

On Fri, Aug 28, 2009 at 05:43:05PM -0400, Trond Myklebust wrote:
> If you are going to automatically set O_DSYNC in open(), then
> fcntl(F_SETFL) might get a bit nasty.
> 
> Imagine using it after the open in order to clear the O_ISYNC flag;
> you'll still be left with the O_DSYNC (which you never set in the first
> place). That would be confusing...

Indeed, that's a killer argument for the first variant.  We just need
to make it extremly clear (manpage _and_ comments) that only O_SYNC is
an exposed user interface and that O_WHATEVER_SYNC is an implementation
detail.

^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: adding proper O_SYNC/O_DSYNC, was Re: O_DIRECT and barriers
  2009-08-28 15:46                       ` Christoph Hellwig
  2009-08-28 16:06                         ` Ulrich Drepper
  2009-08-28 16:44                         ` Jamie Lokier
@ 2009-08-28 23:06                         ` Jamie Lokier
  2009-08-28 23:46                           ` Christoph Hellwig
  2 siblings, 1 reply; 66+ messages in thread
From: Jamie Lokier @ 2009-08-28 23:06 UTC (permalink / raw)
  To: Christoph Hellwig; +Cc: Ulrich Drepper, linux-fsdevel, linux-kernel

Christoph Hellwig wrote:
>  - given that our current O_SYNC really is and always has been actuall
>    Posix O_DSYNC

Are you sure about this?

>From http://www-01.ibm.com/support/docview.wss?uid=isg1IZ01704 :

    Error description

       LINUX O_DIRECT/O_SYNC TAKES TOO MANY IOS

    Problem summary

       On AIX, the O_SYNC and O_DSYNC are different values and
       performance improvement are available because the inode does
       not need to be flushed for mtime changes only.
       On Linux the flags are the same, so performance is lost.
       when databases open files with O_DIRECT and O_SYNC.

-- Jamie

^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: adding proper O_SYNC/O_DSYNC, was Re: O_DIRECT and barriers
  2009-08-28 23:06                         ` Jamie Lokier
@ 2009-08-28 23:46                           ` Christoph Hellwig
  0 siblings, 0 replies; 66+ messages in thread
From: Christoph Hellwig @ 2009-08-28 23:46 UTC (permalink / raw)
  To: Jamie Lokier
  Cc: Christoph Hellwig, Ulrich Drepper, linux-fsdevel, linux-kernel

On Sat, Aug 29, 2009 at 12:06:23AM +0100, Jamie Lokier wrote:
> Christoph Hellwig wrote:
> >  - given that our current O_SYNC really is and always has been actuall
> >    Posix O_DSYNC
> 
> Are you sure about this?
> 
> >From http://www-01.ibm.com/support/docview.wss?uid=isg1IZ01704 :
> 
>     Error description
> 
>        LINUX O_DIRECT/O_SYNC TAKES TOO MANY IOS

That is for GPFS, and out of tree filesystem with binary components.
It could be that they took linux O_SYNC for real O_SYNC.  Any filesystem
using the generic helpers in Linux has gotten the O_DSYNC semantics at
least as long as I have worked on Linux filesystems, which is getting
close to 10 years now.  I'll do some code archaelogy before we'll move
with this to be sure.

^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: adding proper O_SYNC/O_DSYNC, was Re: O_DIRECT and barriers
  2009-08-28 16:46                               ` Jamie Lokier
@ 2009-08-29  0:59                                 ` Jamie Lokier
  0 siblings, 0 replies; 66+ messages in thread
From: Jamie Lokier @ 2009-08-29  0:59 UTC (permalink / raw)
  To: Ulrich Drepper; +Cc: Christoph Hellwig, linux-fsdevel, linux-kernel

Jamie Lokier wrote:
> Ulrich Drepper wrote:
> > >  - O_RSYNC basically means we need to commit atime updates before a
> > >    read returns, right?
> > 
> > No, that's not it.
> > 
> > O_RSYNC on its own just means the data is successfully transferred to 
> > the calling process (always the case).
> > 
> > O_RSYNC|O_DSYNC means that if a read request hits data that is currently 
> > in a cache and not yet on the medium, then the write to medium is 
> > successful before the read succeeds.
> > 
> > O_RSYNC|O_SYNC means the same plus the integrity of file meta 
> > information (access time etc).
> 
> On several unixes, O_RSYNC means it will send the read to the
> hardware, not relying on the cache.  This can be used to verify the
> data which was written earlier, whether by O_DSYNC or fdatasync.

I'm sure I read that in a couple of OS man pages, but I can't find it
again.  Maybe it was something more obscure than the mainstream
unices; maybe I imagined it.  Ho hum.  For now, forget I said anythng.

-- Jamie

^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: adding proper O_SYNC/O_DSYNC, was Re: O_DIRECT and barriers
  2009-08-28 21:08                                   ` Christoph Hellwig
  2009-08-28 21:16                                     ` Trond Myklebust
@ 2009-08-30 16:44                                     ` Jamie Lokier
  1 sibling, 0 replies; 66+ messages in thread
From: Jamie Lokier @ 2009-08-30 16:44 UTC (permalink / raw)
  To: Christoph Hellwig; +Cc: Ulrich Drepper, linux-fsdevel, linux-kernel, torvalds

Christoph Hellwig wrote:
> P.S. better naming suggestions for O_FULLSYNC welcome

O_FULLSYNC might get confused with MacOS X's F_FULLSYNC, which means
something else: fsync through hardware volatile write caches.

(Might we even want to provide O_FULLSYNC and O_FULLDATASYNC to mean
that, eventually?)

O_ISYNC is a bit misleading if we don't really offer "flush just the
inode state" by itself.

So it should at least start with underscores: __O_ISYNC.

How about __O_SYNC_NEW with

    #define O_SYNC     (O_DSYNC|__O_SYNC_NEW)

I think that tells people reading the headers a bit about what to
expect on older kernels too.

-- Jamie

^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: adding proper O_SYNC/O_DSYNC, was Re: O_DIRECT and barriers
  2009-08-28 21:08                           ` Ulrich Drepper
@ 2009-08-30 16:58                             ` Jamie Lokier
  2009-08-30 17:48                             ` Jamie Lokier
  1 sibling, 0 replies; 66+ messages in thread
From: Jamie Lokier @ 2009-08-30 16:58 UTC (permalink / raw)
  To: Ulrich Drepper; +Cc: Christoph Hellwig, linux-fsdevel, linux-kernel

Ulrich Drepper wrote:
> On 08/28/2009 09:44 AM, Jamie Lokier wrote:
> >(Oh, and Ulrich: Why is there a "#define O_RSYNC O_SYNC" in the Glibc
> >headers?  That doesn't make sense: O_RSYNC has nothing to do with
> >writing.)
> 
> O_SYNC is a superset of O_RSYNC.  In the absence of a true O_RSYNC 
> that's the next best thing.

That's an error - O_SYNC is not a superset of O_RSYNC.

O_SYNC (by itself) only affects writes.

O_RSYNC only affect reads.

In the absence of O_RSYNC support in the kernel, it's better to not
define O_RSYNC at all in userspace.  That tells applications they can
call fsync/fdatasync themselves before reading to get an equivalent
effect.

In fact O_RSYNC, when implemented correctly, can be used by
applications to get the effect of range-fsync/fdatasync when such
system calls aren't available (by reading a range), but not as
efficiently of course.  Defining O_RSYNC as O_SYNC fails to do that.

-- Jamie

^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: adding proper O_SYNC/O_DSYNC, was Re: O_DIRECT and barriers
  2009-08-28 21:08                           ` Ulrich Drepper
  2009-08-30 16:58                             ` Jamie Lokier
@ 2009-08-30 17:48                             ` Jamie Lokier
  1 sibling, 0 replies; 66+ messages in thread
From: Jamie Lokier @ 2009-08-30 17:48 UTC (permalink / raw)
  To: Ulrich Drepper; +Cc: Christoph Hellwig, linux-fsdevel, linux-kernel

Ulrich Drepper wrote:
> On 08/28/2009 09:44 AM, Jamie Lokier wrote:
> >Although libc's __new_open() could have this:
> >
> >     /* Old kernels only look at O_DSYNC.  It's better than nothing. */
> >     if (flags&  O_SYNC)
> >         flags |= O_DSYNC;
> >
> >Imho, it's better to not do that, and instead have
> >
> >     #define O_SYNC          (O_DSYNC|__O_SYNC_KERNEL)
> 
> Why should it be better?  You're replacing something the compiler can do 
> with zero cost with active code.

You misread; I said the zero cost thing is better.

The only reason you might use the active code is this:

    /* Upgrade O_DSYNC to O_SYNC. */

    flags = fcntl(fd, F_GETFL, 0);
    flags = (flags | O_SYNC) & ~O_DSYNC;
    fcntl(fd, F_SETFL, flags);

I'm not sure if that should work in POSIX.

-- Jamie

^ permalink raw reply	[flat|nested] 66+ messages in thread

end of thread, other threads:[~2009-08-30 17:49 UTC | newest]

Thread overview: 66+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2009-08-19 16:04 [PATCH 0/17] Make O_SYNC handling use standard syncing path Jan Kara
2009-08-19 16:04 ` [PATCH 01/17] vfs: Introduce filemap_fdatawait_range Jan Kara
2009-08-19 16:10   ` Christoph Hellwig
2009-08-19 16:04 ` [PATCH 02/17] vfs: Export __generic_file_aio_write() and add some comments Jan Kara
2009-08-19 16:11   ` Christoph Hellwig
2009-08-20 12:04     ` Jan Kara
2009-08-19 20:22   ` Evgeniy Polyakov
2009-08-20 12:31     ` Jan Kara
2009-08-20 13:30       ` Evgeniy Polyakov
2009-08-20 13:52         ` Jan Kara
2009-08-20 13:58           ` Evgeniy Polyakov
2009-08-19 16:04 ` [PATCH 03/17] vfs: Remove syncing from generic_file_direct_write() and generic_file_buffered_write() Jan Kara
2009-08-19 16:18   ` Christoph Hellwig
2009-08-20 13:31     ` Jan Kara
2009-08-19 16:04 ` [PATCH 04/17] pohmelfs: Use __generic_file_aio_write instead of generic_file_aio_write_nolock Jan Kara
2009-08-19 16:04 ` [PATCH 05/17] ocfs2: " Jan Kara
2009-08-19 16:04 ` [PATCH 06/17] vfs: Remove sync_page_range_nolock Jan Kara
2009-08-19 16:21   ` Christoph Hellwig
2009-08-19 16:04 ` [PATCH 07/17] vfs: Introduce new helpers for syncing after writing to O_SYNC file or IS_SYNC inode Jan Kara
2009-08-19 16:26   ` Christoph Hellwig
2009-08-20 12:15     ` Jan Kara
2009-08-20 16:27       ` Christoph Hellwig
2009-08-21 15:23         ` Jan Kara
2009-08-21 15:32           ` Christoph Hellwig
2009-08-21 15:48             ` Jan Kara
2009-08-26 18:22         ` Christoph Hellwig
2009-08-27  0:04           ` Christoph Hellwig
2009-08-19 16:04 ` [PATCH 08/17] ext2: Update comment about generic_osync_inode Jan Kara
2009-08-19 16:04 ` [PATCH 09/17] ext3: Remove syncing logic from ext3_file_write Jan Kara
2009-08-19 16:04 ` [PATCH 10/17] ext4: Remove syncing logic from ext4_file_write Jan Kara
2009-08-19 16:04 ` [PATCH 11/17] fat: Opencode sync_page_range_nolock() Jan Kara
2009-08-19 16:04 ` [PATCH 12/17] ntfs: Use new syncing helpers and update comments Jan Kara
2009-08-19 16:04 ` [PATCH 13/17] ocfs2: Update syncing after splicing to match generic version Jan Kara
2009-08-21  1:36   ` [Ocfs2-devel] " Joel Becker
2009-08-21 14:30     ` Jan Kara
2009-08-19 16:04 ` [PATCH 14/17] xfs: Use new syncing helper Jan Kara
2009-08-19 16:33   ` Christoph Hellwig
2009-08-20 12:22     ` Jan Kara
2009-08-19 16:04 ` [PATCH 15/17] pohmelfs: " Jan Kara
2009-08-19 16:04 ` [PATCH 16/17] nfs: Remove reference to generic_osync_inode from a comment Jan Kara
2009-08-19 16:04 ` [PATCH 17/17] vfs: Remove generic_osync_inode() and sync_page_range() Jan Kara
     [not found] ` <20090820221221.GA14440@infradead.org>
     [not found]   ` <20090821114010.GG12579@kernel.dk>
     [not found]     ` <20090821135403.GA6208@shareable.org>
     [not found]       ` <20090821142635.GB30617@infradead.org>
     [not found]         ` <20090821152459.GC6929@shareable.org>
     [not found]           ` <20090821174525.GA28861@infradead.org>
     [not found]             ` <20090822005006.GA22530@shareable.org>
     [not found]               ` <20090824023422.GA775@infradead.org>
     [not found]                 ` <20090827143459.GB31453@shareable.org>
2009-08-27 17:10                   ` adding proper O_SYNC/O_DSYNC, was Re: O_DIRECT and barriers Christoph Hellwig
2009-08-27 17:24                     ` Ulrich Drepper
2009-08-28 15:46                       ` Christoph Hellwig
2009-08-28 16:06                         ` Ulrich Drepper
2009-08-28 16:17                           ` Christoph Hellwig
2009-08-28 16:33                             ` Ulrich Drepper
2009-08-28 16:41                               ` Christoph Hellwig
2009-08-28 20:51                                 ` Ulrich Drepper
2009-08-28 21:08                                   ` Christoph Hellwig
2009-08-28 21:16                                     ` Trond Myklebust
2009-08-28 21:29                                       ` Christoph Hellwig
2009-08-28 21:43                                         ` Trond Myklebust
2009-08-28 22:39                                           ` Christoph Hellwig
2009-08-30 16:44                                     ` Jamie Lokier
2009-08-28 16:46                               ` Jamie Lokier
2009-08-29  0:59                                 ` Jamie Lokier
2009-08-28 16:44                         ` Jamie Lokier
2009-08-28 16:50                           ` Jamie Lokier
2009-08-28 21:08                           ` Ulrich Drepper
2009-08-30 16:58                             ` Jamie Lokier
2009-08-30 17:48                             ` Jamie Lokier
2009-08-28 23:06                         ` Jamie Lokier
2009-08-28 23:46                           ` Christoph Hellwig
  -- strict thread matches above, loose matches on Subject: below --
2009-08-21 16:59 [PATCH 01/17] vfs: Introduce filemap_fdatawait_range Jan Kara
2009-08-21 16:59 ` [PATCH 02/17] vfs: Export __generic_file_aio_write() and add some comments Jan Kara
2009-08-21 17:23 [PATCH 0/17] Make O_SYNC handling use standard syncing path (Version 2) Jan Kara
2009-08-21 17:23 ` [PATCH 02/17] vfs: Export __generic_file_aio_write() and add some comments Jan Kara

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).