public inbox for linux-kernel@vger.kernel.org
 help / color / mirror / Atom feed
From: Jan Kara <jack@suse.cz>
To: Christoph Hellwig <hch@lst.de>
Cc: Jan Kara <jack@suse.cz>,
	linux-fsdevel@vger.kernel.org,
	LKML <linux-kernel@vger.kernel.org>
Subject: Re: [PATCH] fsync: wait for data writeout completion before calling ->fsync
Date: Thu, 3 Sep 2009 12:47:04 +0200	[thread overview]
Message-ID: <20090903104704.GD5060@duck.novell.com> (raw)
In-Reply-To: <20090902221838.GA5311@lst.de>

On Thu 03-09-09 00:18:38, Christoph Hellwig wrote:
> I think we should add this one ontop:
  Agreed. Added to the series. I'll now push the whole series to linux-next
via my fs tree.

								Honza

> 
> -- 
> Subject: [PATCH] fsync: wait for data writeout completion before calling ->fsync
> From: Christoph Hellwig <hch@lst.de>
> 
> Currenly vfs_fsync(_range) first calls filemap_fdatawrite to write out
> the data, the calls into ->fsync to write out the metadata and then finally
> calls filemap_fdatawait to wait for the data I/O to complete.  What sounds
> like a clever micro-optimization actually is nast trap for many filesystems.
> 
> For many modern filesystems i_size or other inode information is only
> updated on I/O completion and we need to wait for I/O to finish before
> we can write out the metadata.  For old fashionen filesystems that
> instanciate blocks during the actual write and also update the metadata
> at that point it opens up a large window were we could expose uninitialized
> blocks after a crash.  While a few filesystems that need it already wait
> for the I/O to finish inside their ->fsync methods it is rather suboptimal
> as it is done under the i_mutex and also always for the whole file instead
> of just a part as we could do for O_SYNC handling.
> 
> Here is a small audit of all fsync instances in the tree:
> 
>  - spufs_mfc_fsync:
>  - ps3flash_fsync: 
>  - vol_cdev_fsync:
>  - printer_fsync:
>  - fb_deferred_io_fsync:
>  - bad_file_fsync:
>  - simple_sync_file:
> 
> 	don't care - filesystems/drivers do't use the page cache or are
> 	purely in-memory.
> 
>  - simple_fsync:
>  - file_fsync:
>  - affs_file_fsync:
>  - fat_file_fsync:
>  - jfs_fsync:
>  - ubifs_fsync:
>  - reiserfs_dir_fsync:
>  - reiserfs_sync_file:
> 
> 	never touch pagecache themselves.  We need to wait before if we do
> 	not want to expose stale data after an allocation.
> 
>  - afs_fsync:
>  - fuse_fsync_common:
> 
> 	do the waiting writeback itself in awkward ways, would benefit from
> 	proper semantics
> 
>  - block_fsync:
> 
> 	Does a filemap_write_and_wait on the block device inode.  Because we
> 	now have f_mapping that is the same inode we call it on in vfs_fsync.
> 	So just removing it and letting the VFS do the work in one go would
> 	be an improvement.
> 
>  - btrfs_sync_file:
>  - cifs_fsync:
>  - xfs_file_fsync:
> 
> 	need the wait first and currently do it themselves. would benefit from
> 	doing it outside i_mutex.
> 
>  - coda_fsync:
>  - ecryptfs_fsync:
>  - exofs_file_fsync:
>  - shm_fsync:
> 
> 	only passes the fsync through to the lower layer
> 
>  - ext3_sync_file:
> 
> 	doesn't seem to care, comments are confusing.
> 
>  - ext4_sync_file:
> 
> 	would need the wait to work correctly for delalloc mode with late
> 	i_size updates.  Otherwise the ext3 comment applies.
> 
> 
> 	currently implemens it's own writeback and wait in an odd way,
> 	could benefit from doing it properly.
> 
>  - gfs2_fsync:
> 
> 	not needed for journaled data mode, but probably harmless there.
> 	Currently writes back data asynchronously itself.  Needs some
> 	major audit.
> 
>  - hostfs_fsync:
> 
> 	just calls fsync/datasync on the host FD.  Without the wait before
> 	data might not even be inflight yet if we're unlucky.
> 
>  - hpfs_file_fsync:
>  - ncp_fsync:
> 
> 	no-ops.  Dangerous before and after.
> 
>  - jffs2_fsync:
> 
> 	just calls jffs2_flush_wbuf_gc, not sure how this relates to data.
> 
>  - nfs_fsync_dir:
> 
> 	just increments stats, claims all directory operations are synchronous
> 
>  - nfs_file_fsync:
> 
> 	only writes out data???  Looks very odd.
> 
>  - nilfs_sync_file:
> 
> 	looks like it expects all data done, but not sure from the code
> 
>  - ntfs_dir_fsync:
>  - ntfs_file_fsync:
> 
> 	appear to do their own data writeback.  Very convoluted code.
> 
>  - ocfs2_sync_file:
> 
> 	does it's own data writeback, but no wait.  probably needs the wait.
> 
>  - smb_fsync:
> 
> 	according to a comment expects all pages written already, probably needs
> 	the wait before.
> 
> 
> This patch only changes vfs_fsync_range, removal of the wait in the methods
> that have it is left to the filesystem maintainers.  Note that most
> filesystems really do need an audit for their fsync methods given the
> gems found in this very brief audit.
> 
> Signed-off-by: Christoph Hellwig <hch@lst.de>
> 
> Index: linux-2.6/fs/sync.c
> ===================================================================
> --- linux-2.6.orig/fs/sync.c	2009-09-02 15:03:41.073271287 -0300
> +++ linux-2.6/fs/sync.c	2009-09-02 15:04:34.401269249 -0300
> @@ -216,7 +216,7 @@ int vfs_fsync_range(struct file *file, s
>  		goto out;
>  	}
>  
> -	ret = filemap_fdatawrite_range(mapping, start, end);
> +	ret = filemap_write_and_wait_range(mapping, start, end);
>  
>  	/*
>  	 * We need to protect against concurrent writers, which could cause
> @@ -228,9 +228,6 @@ int vfs_fsync_range(struct file *file, s
>  		ret = err;
>  	mutex_unlock(&mapping->host->i_mutex);
>  
> -	err = filemap_fdatawait_range(mapping, start, end);
> -	if (!ret)
> -		ret = err;
>  out:
>  	return ret;
>  }
-- 
Jan Kara <jack@suse.cz>
SUSE Labs, CR

  parent reply	other threads:[~2009-09-03 10:47 UTC|newest]

Thread overview: 40+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2009-09-02 13:59 [PATCH 0/16] Make O_SYNC handling use standard syncing path (version 4) Jan Kara
2009-09-02 13:59 ` [PATCH 01/16] vfs: Introduce filemap_fdatawait_range Jan Kara
2009-09-02 13:59 ` [PATCH 02/16] vfs: Export __generic_file_aio_write() and add some comments Jan Kara
2009-09-02 13:59 ` [PATCH 03/16] vfs: Remove syncing from generic_file_direct_write() and generic_file_buffered_write() Jan Kara
2009-09-02 13:59 ` [PATCH 04/16] pohmelfs: Use __generic_file_aio_write instead of generic_file_aio_write_nolock Jan Kara
2009-09-02 13:59 ` [PATCH 05/16] ocfs2: " Jan Kara
2009-09-02 13:59 ` [PATCH 06/16] vfs: Rename generic_file_aio_write_nolock Jan Kara
2009-09-02 21:47   ` Christoph Hellwig
2009-09-03 10:24     ` Jan Kara
2009-09-03 15:37       ` Christoph Hellwig
2009-09-02 13:59 ` [PATCH 07/16] vfs: Introduce new helpers for syncing after writing to O_SYNC file or IS_SYNC inode Jan Kara
2009-09-02 13:59 ` [PATCH 08/16] ext2: Update comment about generic_osync_inode Jan Kara
2009-09-02 13:59 ` [PATCH 09/16] ext3: Remove syncing logic from ext3_file_write Jan Kara
2009-09-02 13:59 ` [PATCH 10/16] ext4: Remove syncing logic from ext4_file_write Jan Kara
2009-09-02 13:59 ` [PATCH 11/16] ntfs: Use new syncing helpers and update comments Jan Kara
2009-09-02 13:59 ` [PATCH 12/16] ocfs2: Update syncing after splicing to match generic version Jan Kara
2009-09-02 13:59 ` [PATCH 13/16] xfs: Convert sync_page_range() to simple filemap_write_and_wait_range() Jan Kara
2009-09-02 13:59 ` [PATCH 14/16] pohmelfs: Use new syncing helper Jan Kara
2009-09-02 13:59 ` [PATCH 15/16] fat: Opencode sync_page_range_nolock() Jan Kara
2009-09-02 13:59 ` [PATCH 16/16] vfs: Remove generic_osync_inode() and sync_page_range{_nolock}() Jan Kara
2009-09-02 14:16 ` [PATCH 0/16] Make O_SYNC handling use standard syncing path (version 4) Christoph Hellwig
2009-09-02 22:18 ` [PATCH] fsync: wait for data writeout completion before calling ->fsync Christoph Hellwig
2009-09-02 22:37   ` Joel Becker
2009-09-03 10:47   ` Jan Kara [this message]
2009-09-03 15:39     ` Christoph Hellwig
2009-09-10 20:25 ` [PATCH 18/16] implement posix O_SYNC and O_DSYNC semantics Christoph Hellwig
2009-09-10 20:38   ` Trond Myklebust
2009-09-10 20:40     ` Christoph Hellwig
2009-09-10 20:43       ` Trond Myklebust
2009-09-10 20:44         ` Christoph Hellwig
2009-09-10 23:07   ` Andreas Dilger
2009-09-10 23:18     ` Christoph Hellwig
2009-09-11 19:16   ` [PATCHv2 " Christoph Hellwig
2009-09-14 16:54     ` Jan Kara
2009-09-14 17:02       ` Christoph Hellwig
2009-09-15 13:12       ` [PATCH] " Christoph Hellwig
2009-09-15 14:10         ` Jan Kara
2009-09-15 14:50         ` Ulrich Drepper
2009-09-17 17:16           ` Christoph Hellwig
2009-09-17 21:03         ` Kyle McMartin

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20090903104704.GD5060@duck.novell.com \
    --to=jack@suse.cz \
    --cc=hch@lst.de \
    --cc=linux-fsdevel@vger.kernel.org \
    --cc=linux-kernel@vger.kernel.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox