* [PATCH 0/4] Fix fsync on ext3 and ext4 (v2)
@ 2009-10-27 12:48 Jan Kara
2009-10-27 12:48 ` [PATCH 1/4] ext3: Wait for proper transaction commit on fsync Jan Kara
` (4 more replies)
0 siblings, 5 replies; 22+ messages in thread
From: Jan Kara @ 2009-10-27 12:48 UTC (permalink / raw)
To: linux-ext4; +Cc: tytso
Hi,
this is a second try for a patchset which makes ext3 and ext4 properly force
a transaction commit when needed. We now do not rely on buffer dirty bits
(which does not work as pdflush can just clear them without forcing a
transaction commit) but rather keep transaction ids that need to be committed
for each inode.
Since last version, I've implemented Aneesh's and Curt's comments and also
fixed a missing initialization of the fields. I've tested that now the patch
works correctly for uninitialized extents as well as for standard writes. If
noone objects, would you merge the ext4 part Ted? I'll take care of the ext3
patch.
Honza
^ permalink raw reply [flat|nested] 22+ messages in thread* [PATCH 1/4] ext3: Wait for proper transaction commit on fsync 2009-10-27 12:48 [PATCH 0/4] Fix fsync on ext3 and ext4 (v2) Jan Kara @ 2009-10-27 12:48 ` Jan Kara 2009-10-27 12:48 ` [PATCH 2/4] ext4: Avoid issuing barriers on error recovery path Jan Kara ` (3 subsequent siblings) 4 siblings, 0 replies; 22+ messages in thread From: Jan Kara @ 2009-10-27 12:48 UTC (permalink / raw) To: linux-ext4; +Cc: tytso, Jan Kara We cannot rely on buffer dirty bits during fsync because pdflush can come before fsync is called and clear dirty bits without forcing a transaction commit. What we do is that we track which transaction has last changed the inode and which transaction last changed allocation and force it to disk on fsync. Signed-off-by: Jan Kara <jack@suse.cz> --- fs/ext3/fsync.c | 36 ++++++++++++++++-------------------- fs/ext3/inode.c | 32 +++++++++++++++++++++++++++++++- fs/ext3/super.c | 2 ++ include/linux/ext3_fs_i.h | 8 ++++++++ 4 files changed, 57 insertions(+), 21 deletions(-) diff --git a/fs/ext3/fsync.c b/fs/ext3/fsync.c index 451d166..8209f26 100644 --- a/fs/ext3/fsync.c +++ b/fs/ext3/fsync.c @@ -46,19 +46,21 @@ int ext3_sync_file(struct file * file, struct dentry *dentry, int datasync) { struct inode *inode = dentry->d_inode; + struct ext3_inode_info *ei = EXT3_I(inode); + journal_t *journal = EXT3_SB(inode->i_sb)->s_journal; int ret = 0; + tid_t commit_tid; + + if (inode->i_sb->s_flags & MS_RDONLY) + return 0; J_ASSERT(ext3_journal_current_handle() == NULL); /* - * data=writeback: + * data=writeback,ordered: * The caller's filemap_fdatawrite()/wait will sync the data. - * sync_inode() will sync the metadata - * - * data=ordered: - * The caller's filemap_fdatawrite() will write the data and - * sync_inode() will write the inode if it is dirty. Then the caller's - * filemap_fdatawait() will wait on the pages. + * Metadata is in the journal, we wait for a proper transaction + * to commit here. * * data=journal: * filemap_fdatawrite won't do anything (the buffers are clean). @@ -73,22 +75,16 @@ int ext3_sync_file(struct file * file, struct dentry *dentry, int datasync) goto out; } - if (datasync && !(inode->i_state & I_DIRTY_DATASYNC)) - goto flush; + if (datasync) + commit_tid = atomic_read(&ei->i_datasync_tid); + else + commit_tid = atomic_read(&ei->i_sync_tid); - /* - * The VFS has written the file data. If the inode is unaltered - * then we need not start a commit. - */ - if (inode->i_state & (I_DIRTY_SYNC|I_DIRTY_DATASYNC)) { - struct writeback_control wbc = { - .sync_mode = WB_SYNC_ALL, - .nr_to_write = 0, /* sys_fsync did this */ - }; - ret = sync_inode(inode, &wbc); + if (log_start_commit(journal, commit_tid)) { + log_wait_commit(journal, commit_tid); goto out; } -flush: + /* * In case we didn't commit a transaction, we have to flush * disk caches manually so that data really is on persistent diff --git a/fs/ext3/inode.c b/fs/ext3/inode.c index acf1b14..74b6146 100644 --- a/fs/ext3/inode.c +++ b/fs/ext3/inode.c @@ -699,8 +699,9 @@ static int ext3_splice_branch(handle_t *handle, struct inode *inode, int err = 0; struct ext3_block_alloc_info *block_i; ext3_fsblk_t current_block; + struct ext3_inode_info *ei = EXT3_I(inode); - block_i = EXT3_I(inode)->i_block_alloc_info; + block_i = ei->i_block_alloc_info; /* * If we're splicing into a [td]indirect block (as opposed to the * inode) then we need to get write access to the [td]indirect block @@ -741,6 +742,8 @@ static int ext3_splice_branch(handle_t *handle, struct inode *inode, inode->i_ctime = CURRENT_TIME_SEC; ext3_mark_inode_dirty(handle, inode); + /* ext3_mark_inode_dirty already updated i_sync_tid */ + atomic_set(&ei->i_datasync_tid, handle->h_transaction->t_tid); /* had we spliced it onto indirect block? */ if (where->bh) { @@ -2750,6 +2753,8 @@ struct inode *ext3_iget(struct super_block *sb, unsigned long ino) struct ext3_inode_info *ei; struct buffer_head *bh; struct inode *inode; + journal_t *journal = EXT3_SB(sb)->s_journal; + transaction_t *transaction; long ret; int block; @@ -2827,6 +2832,30 @@ struct inode *ext3_iget(struct super_block *sb, unsigned long ino) ei->i_data[block] = raw_inode->i_block[block]; INIT_LIST_HEAD(&ei->i_orphan); + /* + * Set transaction id's of transactions that have to be committed + * to finish f[data]sync. We set them to currently running transaction + * as we cannot be sure that the inode or some of its metadata isn't + * part of the transaction - the inode could have been reclaimed and + * now it is reread from disk. + */ + if (journal) { + tid_t tid; + + spin_lock(&journal->j_state_lock); + if (journal->j_running_transaction) + transaction = journal->j_running_transaction; + else + transaction = journal->j_committing_transaction; + if (transaction) + tid = transaction->t_tid; + else + tid = journal->j_commit_sequence; + spin_unlock(&journal->j_state_lock); + atomic_set(&ei->i_sync_tid, tid); + atomic_set(&ei->i_datasync_tid, tid); + } + if (inode->i_ino >= EXT3_FIRST_INO(inode->i_sb) + 1 && EXT3_INODE_SIZE(inode->i_sb) > EXT3_GOOD_OLD_INODE_SIZE) { /* @@ -3011,6 +3040,7 @@ again: err = rc; ei->i_state &= ~EXT3_STATE_NEW; + atomic_set(&ei->i_sync_tid, handle->h_transaction->t_tid); out_brelse: brelse (bh); ext3_std_error(inode->i_sb, err); diff --git a/fs/ext3/super.c b/fs/ext3/super.c index 7a520a8..427496c 100644 --- a/fs/ext3/super.c +++ b/fs/ext3/super.c @@ -466,6 +466,8 @@ static struct inode *ext3_alloc_inode(struct super_block *sb) return NULL; ei->i_block_alloc_info = NULL; ei->vfs_inode.i_version = 1; + atomic_set(&ei->i_datasync_tid, 0); + atomic_set(&ei->i_sync_tid, 0); return &ei->vfs_inode; } diff --git a/include/linux/ext3_fs_i.h b/include/linux/ext3_fs_i.h index ca1bfe9..93e7428 100644 --- a/include/linux/ext3_fs_i.h +++ b/include/linux/ext3_fs_i.h @@ -137,6 +137,14 @@ struct ext3_inode_info { * by other means, so we have truncate_mutex. */ struct mutex truncate_mutex; + + /* + * Transactions that contain inode's metadata needed to complete + * fsync and fdatasync, respectively. + */ + atomic_t i_sync_tid; + atomic_t i_datasync_tid; + struct inode vfs_inode; }; -- 1.6.0.2 ^ permalink raw reply related [flat|nested] 22+ messages in thread
* [PATCH 2/4] ext4: Avoid issuing barriers on error recovery path 2009-10-27 12:48 [PATCH 0/4] Fix fsync on ext3 and ext4 (v2) Jan Kara 2009-10-27 12:48 ` [PATCH 1/4] ext3: Wait for proper transaction commit on fsync Jan Kara @ 2009-10-27 12:48 ` Jan Kara 2009-11-16 0:47 ` Theodore Tso 2009-10-27 12:48 ` [PATCH 3/4] ext4: Fix error handling ext4_ind_get_blocks Jan Kara ` (2 subsequent siblings) 4 siblings, 1 reply; 22+ messages in thread From: Jan Kara @ 2009-10-27 12:48 UTC (permalink / raw) To: linux-ext4; +Cc: tytso, Jan Kara There is no reason why we should issue IO barriers when we just hit an error in ext4_sync_file. So move out: label to just return and introduce flush: label to those places which really need it. Signed-off-by: Jan Kara <jack@suse.cz> --- fs/ext4/fsync.c | 5 +++-- 1 files changed, 3 insertions(+), 2 deletions(-) diff --git a/fs/ext4/fsync.c b/fs/ext4/fsync.c index 2b15312..2bf9413 100644 --- a/fs/ext4/fsync.c +++ b/fs/ext4/fsync.c @@ -88,7 +88,7 @@ int ext4_sync_file(struct file *file, struct dentry *dentry, int datasync) ret = sync_mapping_buffers(inode->i_mapping); if (datasync && !(inode->i_state & I_DIRTY_DATASYNC)) - goto out; + goto flush; /* * The VFS has written the file data. If the inode is unaltered @@ -103,8 +103,9 @@ int ext4_sync_file(struct file *file, struct dentry *dentry, int datasync) if (ret == 0) ret = err; } -out: +flush: if (journal && (journal->j_flags & JBD2_BARRIER)) blkdev_issue_flush(inode->i_sb->s_bdev, NULL); +out: return ret; } -- 1.6.0.2 ^ permalink raw reply related [flat|nested] 22+ messages in thread
* Re: [PATCH 2/4] ext4: Avoid issuing barriers on error recovery path 2009-10-27 12:48 ` [PATCH 2/4] ext4: Avoid issuing barriers on error recovery path Jan Kara @ 2009-11-16 0:47 ` Theodore Tso 2009-11-16 2:40 ` [PATCH] ext4: Avoid issuing unnecessary barriers Theodore Ts'o 0 siblings, 1 reply; 22+ messages in thread From: Theodore Tso @ 2009-11-16 0:47 UTC (permalink / raw) To: Jan Kara; +Cc: linux-ext4 On Tue, Oct 27, 2009 at 01:48:47PM +0100, Jan Kara wrote: > There is no reason why we should issue IO barriers when we > just hit an error in ext4_sync_file. So move out: label to > just return and introduce flush: label to those places which > really need it. > > Signed-off-by: Jan Kara <jack@suse.cz> Once the out: label is moved to just before the "return ret;", it's actually cleaner to simply replace all of the "goto out" with "return ret;" statements..... - Ted ^ permalink raw reply [flat|nested] 22+ messages in thread
* [PATCH] ext4: Avoid issuing unnecessary barriers 2009-11-16 0:47 ` Theodore Tso @ 2009-11-16 2:40 ` Theodore Ts'o 2009-11-16 10:29 ` Jan Kara 0 siblings, 1 reply; 22+ messages in thread From: Theodore Ts'o @ 2009-11-16 2:40 UTC (permalink / raw) To: Ext4 Developers List; +Cc: Theodore Ts'o, Jan Kara We don't to issue an I/O barrier on an error or if we force commit because we are doing data journaling. Signed-off-by: "Theodore Ts'o" <tytso@mit.edu> Cc: Jan Kara <jack@suse.cz> --- This patch should be equivalent to Jan's "ext4: Avoid issuing barriers on error recovery path", but it removes more lines than it adds. :-) fs/ext4/fsync.c | 8 +++----- 1 files changed, 3 insertions(+), 5 deletions(-) diff --git a/fs/ext4/fsync.c b/fs/ext4/fsync.c index 2b15312..a3c2507 100644 --- a/fs/ext4/fsync.c +++ b/fs/ext4/fsync.c @@ -60,7 +60,7 @@ int ext4_sync_file(struct file *file, struct dentry *dentry, int datasync) ret = flush_aio_dio_completed_IO(inode); if (ret < 0) - goto out; + return ret; /* * data=writeback: * The caller's filemap_fdatawrite()/wait will sync the data. @@ -79,10 +79,8 @@ int ext4_sync_file(struct file *file, struct dentry *dentry, int datasync) * (they were dirtied by commit). But that's OK - the blocks are * safe in-journal, which is all fsync() needs to ensure. */ - if (ext4_should_journal_data(inode)) { - ret = ext4_force_commit(inode->i_sb); - goto out; - } + if (ext4_should_journal_data(inode)) + return ext4_force_commit(inode->i_sb); if (!journal) ret = sync_mapping_buffers(inode->i_mapping); -- 1.6.5.216.g5288a.dirty ^ permalink raw reply related [flat|nested] 22+ messages in thread
* Re: [PATCH] ext4: Avoid issuing unnecessary barriers 2009-11-16 2:40 ` [PATCH] ext4: Avoid issuing unnecessary barriers Theodore Ts'o @ 2009-11-16 10:29 ` Jan Kara 0 siblings, 0 replies; 22+ messages in thread From: Jan Kara @ 2009-11-16 10:29 UTC (permalink / raw) To: Theodore Ts'o; +Cc: Ext4 Developers List, Jan Kara On Sun 15-11-09 21:40:42, Theodore Ts'o wrote: > We don't to issue an I/O barrier on an error or if we force commit > because we are doing data journaling. > > Signed-off-by: "Theodore Ts'o" <tytso@mit.edu> > Cc: Jan Kara <jack@suse.cz> > --- > This patch should be equivalent to Jan's "ext4: Avoid issuing barriers > on error recovery path", but it removes more lines than it adds. :-) Yeah, it looks fine. Acked-by: Jan Kara <jack@suse.cz> Honza > > fs/ext4/fsync.c | 8 +++----- > 1 files changed, 3 insertions(+), 5 deletions(-) > > diff --git a/fs/ext4/fsync.c b/fs/ext4/fsync.c > index 2b15312..a3c2507 100644 > --- a/fs/ext4/fsync.c > +++ b/fs/ext4/fsync.c > @@ -60,7 +60,7 @@ int ext4_sync_file(struct file *file, struct dentry *dentry, int datasync) > > ret = flush_aio_dio_completed_IO(inode); > if (ret < 0) > - goto out; > + return ret; > /* > * data=writeback: > * The caller's filemap_fdatawrite()/wait will sync the data. > @@ -79,10 +79,8 @@ int ext4_sync_file(struct file *file, struct dentry *dentry, int datasync) > * (they were dirtied by commit). But that's OK - the blocks are > * safe in-journal, which is all fsync() needs to ensure. > */ > - if (ext4_should_journal_data(inode)) { > - ret = ext4_force_commit(inode->i_sb); > - goto out; > - } > + if (ext4_should_journal_data(inode)) > + return ext4_force_commit(inode->i_sb); > > if (!journal) > ret = sync_mapping_buffers(inode->i_mapping); > -- > 1.6.5.216.g5288a.dirty > -- Jan Kara <jack@suse.cz> SUSE Labs, CR ^ permalink raw reply [flat|nested] 22+ messages in thread
* [PATCH 3/4] ext4: Fix error handling ext4_ind_get_blocks 2009-10-27 12:48 [PATCH 0/4] Fix fsync on ext3 and ext4 (v2) Jan Kara 2009-10-27 12:48 ` [PATCH 1/4] ext3: Wait for proper transaction commit on fsync Jan Kara 2009-10-27 12:48 ` [PATCH 2/4] ext4: Avoid issuing barriers on error recovery path Jan Kara @ 2009-10-27 12:48 ` Jan Kara 2009-11-16 2:57 ` Theodore Tso 2009-10-27 12:48 ` [PATCH 4/4] ext4: Wait for proper transaction commit on fsync Jan Kara 2009-11-05 13:02 ` [PATCH 0/4] Fix fsync on ext3 and ext4 (v2) Jan Kara 4 siblings, 1 reply; 22+ messages in thread From: Jan Kara @ 2009-10-27 12:48 UTC (permalink / raw) To: linux-ext4; +Cc: tytso, Jan Kara When an error happened in ext4_splice_branch we failed to notice that in ext4_ind_get_blocks and mapped the buffer anyway. Fix the problem by checking for error properly. Signed-off-by: Jan Kara <jack@suse.cz> --- fs/ext4/inode.c | 2 +- 1 files changed, 1 insertions(+), 1 deletions(-) diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c index 5c5bc5d..9105f40 100644 --- a/fs/ext4/inode.c +++ b/fs/ext4/inode.c @@ -1020,7 +1020,7 @@ static int ext4_ind_get_blocks(handle_t *handle, struct inode *inode, if (!err) err = ext4_splice_branch(handle, inode, iblock, partial, indirect_blks, count); - else + if (err) goto cleanup; set_buffer_new(bh_result); -- 1.6.0.2 ^ permalink raw reply related [flat|nested] 22+ messages in thread
* Re: [PATCH 3/4] ext4: Fix error handling ext4_ind_get_blocks 2009-10-27 12:48 ` [PATCH 3/4] ext4: Fix error handling ext4_ind_get_blocks Jan Kara @ 2009-11-16 2:57 ` Theodore Tso 0 siblings, 0 replies; 22+ messages in thread From: Theodore Tso @ 2009-11-16 2:57 UTC (permalink / raw) To: Jan Kara; +Cc: linux-ext4 On Tue, Oct 27, 2009 at 01:48:48PM +0100, Jan Kara wrote: > When an error happened in ext4_splice_branch we failed to notice that > in ext4_ind_get_blocks and mapped the buffer anyway. Fix the problem > by checking for error properly. > > Signed-off-by: Jan Kara <jack@suse.cz> Added to the ext4 patch queue, thanks. - Ted ^ permalink raw reply [flat|nested] 22+ messages in thread
* [PATCH 4/4] ext4: Wait for proper transaction commit on fsync 2009-10-27 12:48 [PATCH 0/4] Fix fsync on ext3 and ext4 (v2) Jan Kara ` (2 preceding siblings ...) 2009-10-27 12:48 ` [PATCH 3/4] ext4: Fix error handling ext4_ind_get_blocks Jan Kara @ 2009-10-27 12:48 ` Jan Kara 2009-11-04 14:57 ` Andi Kleen 2009-11-16 0:46 ` Theodore Tso 2009-11-05 13:02 ` [PATCH 0/4] Fix fsync on ext3 and ext4 (v2) Jan Kara 4 siblings, 2 replies; 22+ messages in thread From: Jan Kara @ 2009-10-27 12:48 UTC (permalink / raw) To: linux-ext4; +Cc: tytso, Jan Kara We cannot rely on buffer dirty bits during fsync because pdflush can come before fsync is called and clear dirty bits without forcing a transaction commit. What we do is that we track which transaction has last changed the inode and which transaction last changed allocation and force it to disk on fsync. Signed-off-by: Jan Kara <jack@suse.cz> --- fs/ext4/ext4.h | 7 +++++++ fs/ext4/ext4_jbd2.h | 14 ++++++++++++++ fs/ext4/extents.c | 14 ++++++++++++-- fs/ext4/fsync.c | 40 +++++++++++++++++----------------------- fs/ext4/inode.c | 29 +++++++++++++++++++++++++++++ fs/ext4/super.c | 2 ++ 6 files changed, 81 insertions(+), 25 deletions(-) diff --git a/fs/ext4/ext4.h b/fs/ext4/ext4.h index 984ca0c..5639f30 100644 --- a/fs/ext4/ext4.h +++ b/fs/ext4/ext4.h @@ -702,6 +702,13 @@ struct ext4_inode_info { struct list_head i_aio_dio_complete_list; /* current io_end structure for async DIO write*/ ext4_io_end_t *cur_aio_dio; + + /* + * Transactions that contain inode's metadata needed to complete + * fsync and fdatasync, respectively. + */ + atomic_t i_sync_tid; + atomic_t i_datasync_tid; }; /* diff --git a/fs/ext4/ext4_jbd2.h b/fs/ext4/ext4_jbd2.h index a286598..4389941 100644 --- a/fs/ext4/ext4_jbd2.h +++ b/fs/ext4/ext4_jbd2.h @@ -254,6 +254,20 @@ static inline int ext4_jbd2_file_inode(handle_t *handle, struct inode *inode) return 0; } +static inline void ext4_update_inode_fsync_trans(handle_t *handle, + struct inode *inode, + int datasync) +{ + if (ext4_handle_valid(handle)) { + atomic_set(&EXT4_I(inode)->i_sync_tid, + handle->h_transaction->t_tid); + if (datasync) { + atomic_set(&EXT4_I(inode)->i_datasync_tid, + handle->h_transaction->t_tid); + } + } +} + /* super.c */ int ext4_force_commit(struct super_block *sb); diff --git a/fs/ext4/extents.c b/fs/ext4/extents.c index 10539e3..d3ba4a9 100644 --- a/fs/ext4/extents.c +++ b/fs/ext4/extents.c @@ -3057,6 +3057,8 @@ ext4_ext_handle_uninitialized_extents(handle_t *handle, struct inode *inode, if (flags == EXT4_GET_BLOCKS_DIO_CONVERT_EXT) { ret = ext4_convert_unwritten_extents_dio(handle, inode, path); + if (ret >= 0) + ext4_update_inode_fsync_trans(handle, inode, 1); goto out2; } /* buffered IO case */ @@ -3084,6 +3086,8 @@ ext4_ext_handle_uninitialized_extents(handle_t *handle, struct inode *inode, ret = ext4_ext_convert_to_initialized(handle, inode, path, iblock, max_blocks); + if (ret >= 0) + ext4_update_inode_fsync_trans(handle, inode, 1); out: if (ret <= 0) { err = ret; @@ -3316,10 +3320,16 @@ int ext4_ext_get_blocks(handle_t *handle, struct inode *inode, allocated = ext4_ext_get_actual_len(&newex); set_buffer_new(bh_result); - /* Cache only when it is _not_ an uninitialized extent */ - if ((flags & EXT4_GET_BLOCKS_UNINIT_EXT) == 0) + /* + * Cache the extent and update transaction to commit on fdatasync only + * when it is _not_ an uninitialized extent. + */ + if ((flags & EXT4_GET_BLOCKS_UNINIT_EXT) == 0) { ext4_ext_put_in_cache(inode, iblock, allocated, newblock, EXT4_EXT_CACHE_EXTENT); + ext4_update_inode_fsync_trans(handle, inode, 1); + } else + ext4_update_inode_fsync_trans(handle, inode, 0); out: if (allocated > max_blocks) allocated = max_blocks; diff --git a/fs/ext4/fsync.c b/fs/ext4/fsync.c index 2bf9413..03e0fe9 100644 --- a/fs/ext4/fsync.c +++ b/fs/ext4/fsync.c @@ -51,25 +51,26 @@ int ext4_sync_file(struct file *file, struct dentry *dentry, int datasync) { struct inode *inode = dentry->d_inode; + struct ext4_inode_info *ei = EXT4_I(inode); journal_t *journal = EXT4_SB(inode->i_sb)->s_journal; - int err, ret = 0; + int ret = 0; + tid_t commit_tid; J_ASSERT(ext4_journal_current_handle() == NULL); trace_ext4_sync_file(file, dentry, datasync); + if (inode->i_sb->s_flags & MS_RDONLY) + goto out; + ret = flush_aio_dio_completed_IO(inode); if (ret < 0) goto out; /* - * data=writeback: + * data=writeback,ordered: * The caller's filemap_fdatawrite()/wait will sync the data. - * sync_inode() will sync the metadata - * - * data=ordered: - * The caller's filemap_fdatawrite() will write the data and - * sync_inode() will write the inode if it is dirty. Then the caller's - * filemap_fdatawait() will wait on the pages. + * Metadata is in the journal, we wait for proper transaction to + * commit here. * * data=journal: * filemap_fdatawrite won't do anything (the buffers are clean). @@ -87,23 +88,16 @@ int ext4_sync_file(struct file *file, struct dentry *dentry, int datasync) if (!journal) ret = sync_mapping_buffers(inode->i_mapping); - if (datasync && !(inode->i_state & I_DIRTY_DATASYNC)) - goto flush; + if (datasync) + commit_tid = atomic_read(&ei->i_datasync_tid); + else + commit_tid = atomic_read(&ei->i_sync_tid); - /* - * The VFS has written the file data. If the inode is unaltered - * then we need not start a commit. - */ - if (inode->i_state & (I_DIRTY_SYNC|I_DIRTY_DATASYNC)) { - struct writeback_control wbc = { - .sync_mode = WB_SYNC_ALL, - .nr_to_write = 0, /* sys_fsync did this */ - }; - err = sync_inode(inode, &wbc); - if (ret == 0) - ret = err; + if (jbd2_log_start_commit(journal, commit_tid)) { + jbd2_log_wait_commit(journal, commit_tid); + goto out; } -flush: + if (journal && (journal->j_flags & JBD2_BARRIER)) blkdev_issue_flush(inode->i_sb->s_bdev, NULL); out: diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c index 9105f40..546f175 100644 --- a/fs/ext4/inode.c +++ b/fs/ext4/inode.c @@ -1024,6 +1024,8 @@ static int ext4_ind_get_blocks(handle_t *handle, struct inode *inode, goto cleanup; set_buffer_new(bh_result); + + ext4_update_inode_fsync_trans(handle, inode, 1); got_it: map_bh(bh_result, inode->i_sb, le32_to_cpu(chain[depth-1].key)); if (count > blocks_to_boundary) @@ -4777,6 +4779,7 @@ struct inode *ext4_iget(struct super_block *sb, unsigned long ino) struct ext4_inode_info *ei; struct buffer_head *bh; struct inode *inode; + journal_t *journal = EXT4_SB(sb)->s_journal; long ret; int block; @@ -4842,6 +4845,31 @@ struct inode *ext4_iget(struct super_block *sb, unsigned long ino) ei->i_data[block] = raw_inode->i_block[block]; INIT_LIST_HEAD(&ei->i_orphan); + /* + * Set transaction id's of transactions that have to be committed + * to finish f[data]sync. We set them to currently running transaction + * as we cannot be sure that the inode or some of its metadata isn't + * part of the transaction - the inode could have been reclaimed and + * now it is reread from disk. + */ + if (journal) { + transaction_t *transaction; + tid_t tid; + + spin_lock(&journal->j_state_lock); + if (journal->j_running_transaction) + transaction = journal->j_running_transaction; + else + transaction = journal->j_committing_transaction; + if (transaction) + tid = transaction->t_tid; + else + tid = journal->j_commit_sequence; + spin_unlock(&journal->j_state_lock); + atomic_set(&ei->i_sync_tid, tid); + atomic_set(&ei->i_datasync_tid, tid); + } + if (EXT4_INODE_SIZE(inode->i_sb) > EXT4_GOOD_OLD_INODE_SIZE) { ei->i_extra_isize = le16_to_cpu(raw_inode->i_extra_isize); if (EXT4_GOOD_OLD_INODE_SIZE + ei->i_extra_isize > @@ -5102,6 +5130,7 @@ static int ext4_do_update_inode(handle_t *handle, err = rc; ei->i_state &= ~EXT4_STATE_NEW; + ext4_update_inode_fsync_trans(handle, inode, 0); out_brelse: brelse(bh); ext4_std_error(inode->i_sb, err); diff --git a/fs/ext4/super.c b/fs/ext4/super.c index 312211e..9a6716f 100644 --- a/fs/ext4/super.c +++ b/fs/ext4/super.c @@ -704,6 +704,8 @@ static struct inode *ext4_alloc_inode(struct super_block *sb) spin_lock_init(&(ei->i_block_reservation_lock)); INIT_LIST_HEAD(&ei->i_aio_dio_complete_list); ei->cur_aio_dio = NULL; + atomic_set(&ei->i_datasync_tid, 0); + atomic_set(&ei->i_sync_tid, 0); return &ei->vfs_inode; } -- 1.6.0.2 ^ permalink raw reply related [flat|nested] 22+ messages in thread
* Re: [PATCH 4/4] ext4: Wait for proper transaction commit on fsync 2009-10-27 12:48 ` [PATCH 4/4] ext4: Wait for proper transaction commit on fsync Jan Kara @ 2009-11-04 14:57 ` Andi Kleen 2009-11-04 15:53 ` Jan Kara 2009-11-16 0:46 ` Theodore Tso 1 sibling, 1 reply; 22+ messages in thread From: Andi Kleen @ 2009-11-04 14:57 UTC (permalink / raw) To: Jan Kara; +Cc: linux-ext4, tytso Jan Kara <jack@suse.cz> writes: > ext4_io_end_t *cur_aio_dio; > + > + /* > + * Transactions that contain inode's metadata needed to complete > + * fsync and fdatasync, respectively. > + */ > + atomic_t i_sync_tid; > + atomic_t i_datasync_tid; This might be a stupid question, but the atomic implies you don't hold any kind of reference to the transaction. So what prevents these IDs from wrapping while in there? Given it would be probably take a long time today, but might be not fully future proof to very fast future IO devices. -Andi -- ak@linux.intel.com -- Speaking for myself only. ^ permalink raw reply [flat|nested] 22+ messages in thread
* Re: [PATCH 4/4] ext4: Wait for proper transaction commit on fsync 2009-11-04 14:57 ` Andi Kleen @ 2009-11-04 15:53 ` Jan Kara 0 siblings, 0 replies; 22+ messages in thread From: Jan Kara @ 2009-11-04 15:53 UTC (permalink / raw) To: Andi Kleen; +Cc: Jan Kara, linux-ext4, tytso On Wed 04-11-09 15:57:22, Andi Kleen wrote: > Jan Kara <jack@suse.cz> writes: > > ext4_io_end_t *cur_aio_dio; > > + > > + /* > > + * Transactions that contain inode's metadata needed to complete > > + * fsync and fdatasync, respectively. > > + */ > > + atomic_t i_sync_tid; > > + atomic_t i_datasync_tid; > > This might be a stupid question, but the atomic implies you don't hold > any kind of reference to the transaction. So what prevents these IDs > from wrapping while in there? Given it would be probably take a long > time today, but might be not fully future proof to very fast future IO > devices. Yes, IDs can wrap. But journalling layer actually compares them so that wrapping does not matter unless you manage to do 2^31 transactions inbetween. So we are still a few orders of magnitude safe with current hw. Honza -- Jan Kara <jack@suse.cz> SUSE Labs, CR ^ permalink raw reply [flat|nested] 22+ messages in thread
* Re: [PATCH 4/4] ext4: Wait for proper transaction commit on fsync 2009-10-27 12:48 ` [PATCH 4/4] ext4: Wait for proper transaction commit on fsync Jan Kara 2009-11-04 14:57 ` Andi Kleen @ 2009-11-16 0:46 ` Theodore Tso 2009-11-16 10:43 ` Jan Kara 1 sibling, 1 reply; 22+ messages in thread From: Theodore Tso @ 2009-11-16 0:46 UTC (permalink / raw) To: Jan Kara; +Cc: linux-ext4 On Tue, Oct 27, 2009 at 01:48:49PM +0100, Jan Kara wrote: > + /* > + * Transactions that contain inode's metadata needed to complete > + * fsync and fdatasync, respectively. > + */ > + atomic_t i_sync_tid; > + atomic_t i_datasync_tid; Why do we need an atomic_t here at all? It's not clear it's necessary. What specific race are you worried about? - Ted ^ permalink raw reply [flat|nested] 22+ messages in thread
* Re: [PATCH 4/4] ext4: Wait for proper transaction commit on fsync 2009-11-16 0:46 ` Theodore Tso @ 2009-11-16 10:43 ` Jan Kara 2009-12-09 3:51 ` tytso 0 siblings, 1 reply; 22+ messages in thread From: Jan Kara @ 2009-11-16 10:43 UTC (permalink / raw) To: Theodore Tso; +Cc: Jan Kara, linux-ext4 On Sun 15-11-09 19:46:41, Theodore Tso wrote: > On Tue, Oct 27, 2009 at 01:48:49PM +0100, Jan Kara wrote: > > + /* > > + * Transactions that contain inode's metadata needed to complete > > + * fsync and fdatasync, respectively. > > + */ > > + atomic_t i_sync_tid; > > + atomic_t i_datasync_tid; > > Why do we need an atomic_t here at all? It's not clear it's > necessary. What specific race are you worried about? Well, i_[data]sync_tid are set and read from several places which aren't currently synchronized against each other and I din't want to introduce any synchronization. So atomic_t seemed like a clean way of making clear that loads / stores from those fields are atomic, regardless of architecture, memory alignment or whatever insanity that might make us see just half overwritten field. On all archs where this means just plain stores / loads (such as x86), there's no performance hit since the operations are implemented as such. Honza -- Jan Kara <jack@suse.cz> SUSE Labs, CR ^ permalink raw reply [flat|nested] 22+ messages in thread
* Re: [PATCH 4/4] ext4: Wait for proper transaction commit on fsync 2009-11-16 10:43 ` Jan Kara @ 2009-12-09 3:51 ` tytso 2009-12-09 4:54 ` tytso 2009-12-09 11:30 ` Jan Kara 0 siblings, 2 replies; 22+ messages in thread From: tytso @ 2009-12-09 3:51 UTC (permalink / raw) To: Jan Kara; +Cc: linux-ext4 On Mon, Nov 16, 2009 at 11:43:31AM +0100, Jan Kara wrote: > > Why do we need an atomic_t here at all? It's not clear it's > > necessary. What specific race are you worried about? > Well, i_[data]sync_tid are set and read from several places which aren't > currently synchronized against each other and I din't want to introduce any > synchronization. So atomic_t seemed like a clean way of making clear that > loads / stores from those fields are atomic, regardless of architecture, > memory alignment or whatever insanity that might make us see just half > overwritten field. On all archs where this means just plain stores / loads > (such as x86), there's no performance hit since the operations are > implemented as such. Sorry for not responding to this one sooner, but see this URL: http://digitalvampire.org/blog/index.php/2007/05/13/atomic-cargo-cults/ - Ted ^ permalink raw reply [flat|nested] 22+ messages in thread
* Re: [PATCH 4/4] ext4: Wait for proper transaction commit on fsync 2009-12-09 3:51 ` tytso @ 2009-12-09 4:54 ` tytso 2009-12-09 11:29 ` Jan Kara 2009-12-09 11:30 ` Jan Kara 1 sibling, 1 reply; 22+ messages in thread From: tytso @ 2009-12-09 4:54 UTC (permalink / raw) To: Jan Kara; +Cc: linux-ext4 Here's a revised version of your patch that I've included in the ext4 patch queue. I've removed the use of the atomic_t data type. I've also cleaned up the fsync() function some more to make it easier to follow, and I've fixed the handling in no-journal mode (we can't depend on the forcing a commit if there is no journal, so we can't drop the use of sync_inode() --- we can use simple_fsync(), though, which does everything we need if there is no journal). - Ted commit b436b9bef84de6893e86346d8fbf7104bc520645 Author: Jan Kara <jack@suse.cz> Date: Tue Dec 8 23:51:10 2009 -0500 ext4: Wait for proper transaction commit on fsync We cannot rely on buffer dirty bits during fsync because pdflush can come before fsync is called and clear dirty bits without forcing a transaction commit. What we do is that we track which transaction has last changed the inode and which transaction last changed allocation and force it to disk on fsync. Signed-off-by: Jan Kara <jack@suse.cz> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu> diff --git a/fs/ext4/ext4.h b/fs/ext4/ext4.h index 4cfc2f0..ab31e65 100644 --- a/fs/ext4/ext4.h +++ b/fs/ext4/ext4.h @@ -709,6 +709,13 @@ struct ext4_inode_info { struct list_head i_aio_dio_complete_list; /* current io_end structure for async DIO write*/ ext4_io_end_t *cur_aio_dio; + + /* + * Transactions that contain inode's metadata needed to complete + * fsync and fdatasync, respectively. + */ + tid_t i_sync_tid; + tid_t i_datasync_tid; }; /* diff --git a/fs/ext4/ext4_jbd2.h b/fs/ext4/ext4_jbd2.h index 2c2b262..05eca81 100644 --- a/fs/ext4/ext4_jbd2.h +++ b/fs/ext4/ext4_jbd2.h @@ -249,6 +249,19 @@ static inline int ext4_jbd2_file_inode(handle_t *handle, struct inode *inode) return 0; } +static inline void ext4_update_inode_fsync_trans(handle_t *handle, + struct inode *inode, + int datasync) +{ + struct ext4_inode_info *ei = EXT4_I(inode); + + if (ext4_handle_valid(handle)) { + ei->i_sync_tid = handle->h_transaction->t_tid; + if (datasync) + ei->i_datasync_tid = handle->h_transaction->t_tid; + } +} + /* super.c */ int ext4_force_commit(struct super_block *sb); diff --git a/fs/ext4/extents.c b/fs/ext4/extents.c index 5967f18..700206e 100644 --- a/fs/ext4/extents.c +++ b/fs/ext4/extents.c @@ -3058,6 +3058,8 @@ ext4_ext_handle_uninitialized_extents(handle_t *handle, struct inode *inode, if (flags == EXT4_GET_BLOCKS_DIO_CONVERT_EXT) { ret = ext4_convert_unwritten_extents_dio(handle, inode, path); + if (ret >= 0) + ext4_update_inode_fsync_trans(handle, inode, 1); goto out2; } /* buffered IO case */ @@ -3085,6 +3087,8 @@ ext4_ext_handle_uninitialized_extents(handle_t *handle, struct inode *inode, ret = ext4_ext_convert_to_initialized(handle, inode, path, iblock, max_blocks); + if (ret >= 0) + ext4_update_inode_fsync_trans(handle, inode, 1); out: if (ret <= 0) { err = ret; @@ -3323,10 +3327,16 @@ int ext4_ext_get_blocks(handle_t *handle, struct inode *inode, allocated = ext4_ext_get_actual_len(&newex); set_buffer_new(bh_result); - /* Cache only when it is _not_ an uninitialized extent */ - if ((flags & EXT4_GET_BLOCKS_UNINIT_EXT) == 0) + /* + * Cache the extent and update transaction to commit on fdatasync only + * when it is _not_ an uninitialized extent. + */ + if ((flags & EXT4_GET_BLOCKS_UNINIT_EXT) == 0) { ext4_ext_put_in_cache(inode, iblock, allocated, newblock, EXT4_EXT_CACHE_EXTENT); + ext4_update_inode_fsync_trans(handle, inode, 1); + } else + ext4_update_inode_fsync_trans(handle, inode, 0); out: if (allocated > max_blocks) allocated = max_blocks; diff --git a/fs/ext4/fsync.c b/fs/ext4/fsync.c index a3c2507..0b22497 100644 --- a/fs/ext4/fsync.c +++ b/fs/ext4/fsync.c @@ -51,25 +51,30 @@ int ext4_sync_file(struct file *file, struct dentry *dentry, int datasync) { struct inode *inode = dentry->d_inode; + struct ext4_inode_info *ei = EXT4_I(inode); journal_t *journal = EXT4_SB(inode->i_sb)->s_journal; - int err, ret = 0; + int ret; + tid_t commit_tid; J_ASSERT(ext4_journal_current_handle() == NULL); trace_ext4_sync_file(file, dentry, datasync); + if (inode->i_sb->s_flags & MS_RDONLY) + return 0; + ret = flush_aio_dio_completed_IO(inode); if (ret < 0) return ret; + + if (!journal) + return simple_fsync(file, dentry, datasync); + /* - * data=writeback: + * data=writeback,ordered: * The caller's filemap_fdatawrite()/wait will sync the data. - * sync_inode() will sync the metadata - * - * data=ordered: - * The caller's filemap_fdatawrite() will write the data and - * sync_inode() will write the inode if it is dirty. Then the caller's - * filemap_fdatawait() will wait on the pages. + * Metadata is in the journal, we wait for proper transaction to + * commit here. * * data=journal: * filemap_fdatawrite won't do anything (the buffers are clean). @@ -82,27 +87,10 @@ int ext4_sync_file(struct file *file, struct dentry *dentry, int datasync) if (ext4_should_journal_data(inode)) return ext4_force_commit(inode->i_sb); - if (!journal) - ret = sync_mapping_buffers(inode->i_mapping); - - if (datasync && !(inode->i_state & I_DIRTY_DATASYNC)) - goto out; - - /* - * The VFS has written the file data. If the inode is unaltered - * then we need not start a commit. - */ - if (inode->i_state & (I_DIRTY_SYNC|I_DIRTY_DATASYNC)) { - struct writeback_control wbc = { - .sync_mode = WB_SYNC_ALL, - .nr_to_write = 0, /* sys_fsync did this */ - }; - err = sync_inode(inode, &wbc); - if (ret == 0) - ret = err; - } -out: - if (journal && (journal->j_flags & JBD2_BARRIER)) + commit_tid = datasync ? ei->i_datasync_tid : ei->i_sync_tid; + if (jbd2_log_start_commit(journal, commit_tid)) + jbd2_log_wait_commit(journal, commit_tid); + else if (journal->j_flags & JBD2_BARRIER) blkdev_issue_flush(inode->i_sb->s_bdev, NULL); return ret; } diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c index 958c3ff..f1bc1e3 100644 --- a/fs/ext4/inode.c +++ b/fs/ext4/inode.c @@ -983,6 +983,8 @@ static int ext4_ind_get_blocks(handle_t *handle, struct inode *inode, goto cleanup; set_buffer_new(bh_result); + + ext4_update_inode_fsync_trans(handle, inode, 1); got_it: map_bh(bh_result, inode->i_sb, le32_to_cpu(chain[depth-1].key)); if (count > blocks_to_boundary) @@ -4738,6 +4740,7 @@ struct inode *ext4_iget(struct super_block *sb, unsigned long ino) struct ext4_inode *raw_inode; struct ext4_inode_info *ei; struct inode *inode; + journal_t *journal = EXT4_SB(sb)->s_journal; long ret; int block; @@ -4802,6 +4805,31 @@ struct inode *ext4_iget(struct super_block *sb, unsigned long ino) ei->i_data[block] = raw_inode->i_block[block]; INIT_LIST_HEAD(&ei->i_orphan); + /* + * Set transaction id's of transactions that have to be committed + * to finish f[data]sync. We set them to currently running transaction + * as we cannot be sure that the inode or some of its metadata isn't + * part of the transaction - the inode could have been reclaimed and + * now it is reread from disk. + */ + if (journal) { + transaction_t *transaction; + tid_t tid; + + spin_lock(&journal->j_state_lock); + if (journal->j_running_transaction) + transaction = journal->j_running_transaction; + else + transaction = journal->j_committing_transaction; + if (transaction) + tid = transaction->t_tid; + else + tid = journal->j_commit_sequence; + spin_unlock(&journal->j_state_lock); + ei->i_sync_tid = tid; + ei->i_datasync_tid = tid; + } + if (EXT4_INODE_SIZE(inode->i_sb) > EXT4_GOOD_OLD_INODE_SIZE) { ei->i_extra_isize = le16_to_cpu(raw_inode->i_extra_isize); if (EXT4_GOOD_OLD_INODE_SIZE + ei->i_extra_isize > @@ -5056,6 +5084,7 @@ static int ext4_do_update_inode(handle_t *handle, err = rc; ei->i_state &= ~EXT4_STATE_NEW; + ext4_update_inode_fsync_trans(handle, inode, 0); out_brelse: brelse(bh); ext4_std_error(inode->i_sb, err); diff --git a/fs/ext4/super.c b/fs/ext4/super.c index 8ab0c95..2b13dcf 100644 --- a/fs/ext4/super.c +++ b/fs/ext4/super.c @@ -706,6 +706,8 @@ static struct inode *ext4_alloc_inode(struct super_block *sb) spin_lock_init(&(ei->i_block_reservation_lock)); INIT_LIST_HEAD(&ei->i_aio_dio_complete_list); ei->cur_aio_dio = NULL; + ei->i_sync_tid = 0; + ei->i_datasync_tid = 0; return &ei->vfs_inode; } ^ permalink raw reply related [flat|nested] 22+ messages in thread
* Re: [PATCH 4/4] ext4: Wait for proper transaction commit on fsync 2009-12-09 4:54 ` tytso @ 2009-12-09 11:29 ` Jan Kara 2009-12-09 14:32 ` tytso 0 siblings, 1 reply; 22+ messages in thread From: Jan Kara @ 2009-12-09 11:29 UTC (permalink / raw) To: tytso; +Cc: Jan Kara, linux-ext4 On Tue 08-12-09 23:54:24, tytso@mit.edu wrote: > Here's a revised version of your patch that I've included in the ext4 > patch queue. I've removed the use of the atomic_t data type. I've OK, I'm just unsure: Isn't even 32-bit read non-atomic if it is not properly aligned e.g. on powerPC? So shouldn't we make sure those inode fields are aligned to 32-bit boundary (or at least add a comment about that)? > also cleaned up the fsync() function some more to make it easier to > follow, and I've fixed the handling in no-journal mode (we can't > depend on the forcing a commit if there is no journal, so we can't > drop the use of sync_inode() --- we can use simple_fsync(), though, > which does everything we need if there is no journal). Good catch. Thanks. Honza > commit b436b9bef84de6893e86346d8fbf7104bc520645 > Author: Jan Kara <jack@suse.cz> > Date: Tue Dec 8 23:51:10 2009 -0500 > > ext4: Wait for proper transaction commit on fsync > > We cannot rely on buffer dirty bits during fsync because pdflush can come > before fsync is called and clear dirty bits without forcing a transaction > commit. What we do is that we track which transaction has last changed > the inode and which transaction last changed allocation and force it to > disk on fsync. > > Signed-off-by: Jan Kara <jack@suse.cz> > Signed-off-by: "Theodore Ts'o" <tytso@mit.edu> > > diff --git a/fs/ext4/ext4.h b/fs/ext4/ext4.h > index 4cfc2f0..ab31e65 100644 > --- a/fs/ext4/ext4.h > +++ b/fs/ext4/ext4.h > @@ -709,6 +709,13 @@ struct ext4_inode_info { > struct list_head i_aio_dio_complete_list; > /* current io_end structure for async DIO write*/ > ext4_io_end_t *cur_aio_dio; > + > + /* > + * Transactions that contain inode's metadata needed to complete > + * fsync and fdatasync, respectively. > + */ > + tid_t i_sync_tid; > + tid_t i_datasync_tid; > }; > > /* > diff --git a/fs/ext4/ext4_jbd2.h b/fs/ext4/ext4_jbd2.h > index 2c2b262..05eca81 100644 > --- a/fs/ext4/ext4_jbd2.h > +++ b/fs/ext4/ext4_jbd2.h > @@ -249,6 +249,19 @@ static inline int ext4_jbd2_file_inode(handle_t *handle, struct inode *inode) > return 0; > } > > +static inline void ext4_update_inode_fsync_trans(handle_t *handle, > + struct inode *inode, > + int datasync) > +{ > + struct ext4_inode_info *ei = EXT4_I(inode); > + > + if (ext4_handle_valid(handle)) { > + ei->i_sync_tid = handle->h_transaction->t_tid; > + if (datasync) > + ei->i_datasync_tid = handle->h_transaction->t_tid; > + } > +} > + > /* super.c */ > int ext4_force_commit(struct super_block *sb); > > diff --git a/fs/ext4/extents.c b/fs/ext4/extents.c > index 5967f18..700206e 100644 > --- a/fs/ext4/extents.c > +++ b/fs/ext4/extents.c > @@ -3058,6 +3058,8 @@ ext4_ext_handle_uninitialized_extents(handle_t *handle, struct inode *inode, > if (flags == EXT4_GET_BLOCKS_DIO_CONVERT_EXT) { > ret = ext4_convert_unwritten_extents_dio(handle, inode, > path); > + if (ret >= 0) > + ext4_update_inode_fsync_trans(handle, inode, 1); > goto out2; > } > /* buffered IO case */ > @@ -3085,6 +3087,8 @@ ext4_ext_handle_uninitialized_extents(handle_t *handle, struct inode *inode, > ret = ext4_ext_convert_to_initialized(handle, inode, > path, iblock, > max_blocks); > + if (ret >= 0) > + ext4_update_inode_fsync_trans(handle, inode, 1); > out: > if (ret <= 0) { > err = ret; > @@ -3323,10 +3327,16 @@ int ext4_ext_get_blocks(handle_t *handle, struct inode *inode, > allocated = ext4_ext_get_actual_len(&newex); > set_buffer_new(bh_result); > > - /* Cache only when it is _not_ an uninitialized extent */ > - if ((flags & EXT4_GET_BLOCKS_UNINIT_EXT) == 0) > + /* > + * Cache the extent and update transaction to commit on fdatasync only > + * when it is _not_ an uninitialized extent. > + */ > + if ((flags & EXT4_GET_BLOCKS_UNINIT_EXT) == 0) { > ext4_ext_put_in_cache(inode, iblock, allocated, newblock, > EXT4_EXT_CACHE_EXTENT); > + ext4_update_inode_fsync_trans(handle, inode, 1); > + } else > + ext4_update_inode_fsync_trans(handle, inode, 0); > out: > if (allocated > max_blocks) > allocated = max_blocks; > diff --git a/fs/ext4/fsync.c b/fs/ext4/fsync.c > index a3c2507..0b22497 100644 > --- a/fs/ext4/fsync.c > +++ b/fs/ext4/fsync.c > @@ -51,25 +51,30 @@ > int ext4_sync_file(struct file *file, struct dentry *dentry, int datasync) > { > struct inode *inode = dentry->d_inode; > + struct ext4_inode_info *ei = EXT4_I(inode); > journal_t *journal = EXT4_SB(inode->i_sb)->s_journal; > - int err, ret = 0; > + int ret; > + tid_t commit_tid; > > J_ASSERT(ext4_journal_current_handle() == NULL); > > trace_ext4_sync_file(file, dentry, datasync); > > + if (inode->i_sb->s_flags & MS_RDONLY) > + return 0; > + > ret = flush_aio_dio_completed_IO(inode); > if (ret < 0) > return ret; > + > + if (!journal) > + return simple_fsync(file, dentry, datasync); > + > /* > - * data=writeback: > + * data=writeback,ordered: > * The caller's filemap_fdatawrite()/wait will sync the data. > - * sync_inode() will sync the metadata > - * > - * data=ordered: > - * The caller's filemap_fdatawrite() will write the data and > - * sync_inode() will write the inode if it is dirty. Then the caller's > - * filemap_fdatawait() will wait on the pages. > + * Metadata is in the journal, we wait for proper transaction to > + * commit here. > * > * data=journal: > * filemap_fdatawrite won't do anything (the buffers are clean). > @@ -82,27 +87,10 @@ int ext4_sync_file(struct file *file, struct dentry *dentry, int datasync) > if (ext4_should_journal_data(inode)) > return ext4_force_commit(inode->i_sb); > > - if (!journal) > - ret = sync_mapping_buffers(inode->i_mapping); > - > - if (datasync && !(inode->i_state & I_DIRTY_DATASYNC)) > - goto out; > - > - /* > - * The VFS has written the file data. If the inode is unaltered > - * then we need not start a commit. > - */ > - if (inode->i_state & (I_DIRTY_SYNC|I_DIRTY_DATASYNC)) { > - struct writeback_control wbc = { > - .sync_mode = WB_SYNC_ALL, > - .nr_to_write = 0, /* sys_fsync did this */ > - }; > - err = sync_inode(inode, &wbc); > - if (ret == 0) > - ret = err; > - } > -out: > - if (journal && (journal->j_flags & JBD2_BARRIER)) > + commit_tid = datasync ? ei->i_datasync_tid : ei->i_sync_tid; > + if (jbd2_log_start_commit(journal, commit_tid)) > + jbd2_log_wait_commit(journal, commit_tid); > + else if (journal->j_flags & JBD2_BARRIER) > blkdev_issue_flush(inode->i_sb->s_bdev, NULL); > return ret; > } > diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c > index 958c3ff..f1bc1e3 100644 > --- a/fs/ext4/inode.c > +++ b/fs/ext4/inode.c > @@ -983,6 +983,8 @@ static int ext4_ind_get_blocks(handle_t *handle, struct inode *inode, > goto cleanup; > > set_buffer_new(bh_result); > + > + ext4_update_inode_fsync_trans(handle, inode, 1); > got_it: > map_bh(bh_result, inode->i_sb, le32_to_cpu(chain[depth-1].key)); > if (count > blocks_to_boundary) > @@ -4738,6 +4740,7 @@ struct inode *ext4_iget(struct super_block *sb, unsigned long ino) > struct ext4_inode *raw_inode; > struct ext4_inode_info *ei; > struct inode *inode; > + journal_t *journal = EXT4_SB(sb)->s_journal; > long ret; > int block; > > @@ -4802,6 +4805,31 @@ struct inode *ext4_iget(struct super_block *sb, unsigned long ino) > ei->i_data[block] = raw_inode->i_block[block]; > INIT_LIST_HEAD(&ei->i_orphan); > > + /* > + * Set transaction id's of transactions that have to be committed > + * to finish f[data]sync. We set them to currently running transaction > + * as we cannot be sure that the inode or some of its metadata isn't > + * part of the transaction - the inode could have been reclaimed and > + * now it is reread from disk. > + */ > + if (journal) { > + transaction_t *transaction; > + tid_t tid; > + > + spin_lock(&journal->j_state_lock); > + if (journal->j_running_transaction) > + transaction = journal->j_running_transaction; > + else > + transaction = journal->j_committing_transaction; > + if (transaction) > + tid = transaction->t_tid; > + else > + tid = journal->j_commit_sequence; > + spin_unlock(&journal->j_state_lock); > + ei->i_sync_tid = tid; > + ei->i_datasync_tid = tid; > + } > + > if (EXT4_INODE_SIZE(inode->i_sb) > EXT4_GOOD_OLD_INODE_SIZE) { > ei->i_extra_isize = le16_to_cpu(raw_inode->i_extra_isize); > if (EXT4_GOOD_OLD_INODE_SIZE + ei->i_extra_isize > > @@ -5056,6 +5084,7 @@ static int ext4_do_update_inode(handle_t *handle, > err = rc; > ei->i_state &= ~EXT4_STATE_NEW; > > + ext4_update_inode_fsync_trans(handle, inode, 0); > out_brelse: > brelse(bh); > ext4_std_error(inode->i_sb, err); > diff --git a/fs/ext4/super.c b/fs/ext4/super.c > index 8ab0c95..2b13dcf 100644 > --- a/fs/ext4/super.c > +++ b/fs/ext4/super.c > @@ -706,6 +706,8 @@ static struct inode *ext4_alloc_inode(struct super_block *sb) > spin_lock_init(&(ei->i_block_reservation_lock)); > INIT_LIST_HEAD(&ei->i_aio_dio_complete_list); > ei->cur_aio_dio = NULL; > + ei->i_sync_tid = 0; > + ei->i_datasync_tid = 0; > > return &ei->vfs_inode; > } -- Jan Kara <jack@suse.cz> SUSE Labs, CR ^ permalink raw reply [flat|nested] 22+ messages in thread
* Re: [PATCH 4/4] ext4: Wait for proper transaction commit on fsync 2009-12-09 11:29 ` Jan Kara @ 2009-12-09 14:32 ` tytso 0 siblings, 0 replies; 22+ messages in thread From: tytso @ 2009-12-09 14:32 UTC (permalink / raw) To: Jan Kara; +Cc: linux-ext4 On Wed, Dec 09, 2009 at 12:29:40PM +0100, Jan Kara wrote: > On Tue 08-12-09 23:54:24, tytso@mit.edu wrote: > > Here's a revised version of your patch that I've included in the ext4 > > patch queue. I've removed the use of the atomic_t data type. I've > OK, I'm just unsure: Isn't even 32-bit read non-atomic if it is not > properly aligned e.g. on powerPC? So shouldn't we make sure those inode > fields are aligned to 32-bit boundary (or at least add a comment about > that)? But gcc guarantees such alignments, unless you do something horrible from a performance point of view, such as using __attribute__((packed)). And in fact, if things are not aligned, on some platforms unaligned access will trigger a software trap, and the kernel trap handler has to fix up the unaligned access. (I believe Power is one of these platforms if memory serves correctly.) Even on platforms that don't, the performance hit where the hardware has to do the unaligned access so horrible that gcc avoids the misalignment automatically. That's why sometimes I'll go around and try to adjust structure member orders, since something like this: struct foo { short int a; int b; short int c; } foo_array[2]; Will consume 24 bytes.... while this: struct foo { short int a; short int c; int b; } foo_array[2]; ... will consume 16 bytes. For structures that are used in huge amounts (like inode slab caches), it can mean a measurable memory savings. - Ted ^ permalink raw reply [flat|nested] 22+ messages in thread
* Re: [PATCH 4/4] ext4: Wait for proper transaction commit on fsync 2009-12-09 3:51 ` tytso 2009-12-09 4:54 ` tytso @ 2009-12-09 11:30 ` Jan Kara 1 sibling, 0 replies; 22+ messages in thread From: Jan Kara @ 2009-12-09 11:30 UTC (permalink / raw) To: tytso; +Cc: linux-ext4 On Tue 08-12-09 22:51:20, tytso@mit.edu wrote: > On Mon, Nov 16, 2009 at 11:43:31AM +0100, Jan Kara wrote: > > > Why do we need an atomic_t here at all? It's not clear it's > > > necessary. What specific race are you worried about? > > Well, i_[data]sync_tid are set and read from several places which aren't > > currently synchronized against each other and I din't want to introduce any > > synchronization. So atomic_t seemed like a clean way of making clear that > > loads / stores from those fields are atomic, regardless of architecture, > > memory alignment or whatever insanity that might make us see just half > > overwritten field. On all archs where this means just plain stores / loads > > (such as x86), there's no performance hit since the operations are > > implemented as such. > > Sorry for not responding to this one sooner, but see this URL: > > http://digitalvampire.org/blog/index.php/2007/05/13/atomic-cargo-cults/ Thanks for the pointer :) It's a good reading... Honza -- Jan Kara <jack@suse.cz> SUSE Labs, CR ^ permalink raw reply [flat|nested] 22+ messages in thread
* Re: [PATCH 0/4] Fix fsync on ext3 and ext4 (v2) 2009-10-27 12:48 [PATCH 0/4] Fix fsync on ext3 and ext4 (v2) Jan Kara ` (3 preceding siblings ...) 2009-10-27 12:48 ` [PATCH 4/4] ext4: Wait for proper transaction commit on fsync Jan Kara @ 2009-11-05 13:02 ` Jan Kara 2009-11-05 16:35 ` Aneesh Kumar K.V 4 siblings, 1 reply; 22+ messages in thread From: Jan Kara @ 2009-11-05 13:02 UTC (permalink / raw) To: linux-ext4; +Cc: tytso, Aneesh Kumar K.V Hi, > this is a second try for a patchset which makes ext3 and ext4 properly force > a transaction commit when needed. We now do not rely on buffer dirty bits > (which does not work as pdflush can just clear them without forcing a > transaction commit) but rather keep transaction ids that need to be committed > for each inode. > Since last version, I've implemented Aneesh's and Curt's comments and also > fixed a missing initialization of the fields. I've tested that now the patch > works correctly for uninitialized extents as well as for standard writes. If > noone objects, would you merge the ext4 part Ted? I'll take care of the ext3 > patch. Aneesh, Ted, is the second version of the patchset fine with you? Honza -- Jan Kara <jack@suse.cz> SuSE CR Labs ^ permalink raw reply [flat|nested] 22+ messages in thread
* Re: [PATCH 0/4] Fix fsync on ext3 and ext4 (v2) 2009-11-05 13:02 ` [PATCH 0/4] Fix fsync on ext3 and ext4 (v2) Jan Kara @ 2009-11-05 16:35 ` Aneesh Kumar K.V 2009-11-11 14:17 ` Jan Kara 0 siblings, 1 reply; 22+ messages in thread From: Aneesh Kumar K.V @ 2009-11-05 16:35 UTC (permalink / raw) To: Jan Kara; +Cc: linux-ext4, tytso On Thu, Nov 05, 2009 at 02:02:50PM +0100, Jan Kara wrote: > Hi, > > > this is a second try for a patchset which makes ext3 and ext4 properly force > > a transaction commit when needed. We now do not rely on buffer dirty bits > > (which does not work as pdflush can just clear them without forcing a > > transaction commit) but rather keep transaction ids that need to be committed > > for each inode. > > Since last version, I've implemented Aneesh's and Curt's comments and also > > fixed a missing initialization of the fields. I've tested that now the patch > > works correctly for uninitialized extents as well as for standard writes. If > > noone objects, would you merge the ext4 part Ted? I'll take care of the ext3 > > patch. > Aneesh, Ted, is the second version of the patchset fine with you? The patches looks good. Reviewed-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com> -aneesh ^ permalink raw reply [flat|nested] 22+ messages in thread
* Re: [PATCH 0/4] Fix fsync on ext3 and ext4 (v2) 2009-11-05 16:35 ` Aneesh Kumar K.V @ 2009-11-11 14:17 ` Jan Kara 0 siblings, 0 replies; 22+ messages in thread From: Jan Kara @ 2009-11-11 14:17 UTC (permalink / raw) To: Aneesh Kumar K.V; +Cc: Jan Kara, linux-ext4, tytso On Thu 05-11-09 22:05:10, Aneesh Kumar K.V wrote: > On Thu, Nov 05, 2009 at 02:02:50PM +0100, Jan Kara wrote: > > Hi, > > > > > this is a second try for a patchset which makes ext3 and ext4 properly force > > > a transaction commit when needed. We now do not rely on buffer dirty bits > > > (which does not work as pdflush can just clear them without forcing a > > > transaction commit) but rather keep transaction ids that need to be committed > > > for each inode. > > > Since last version, I've implemented Aneesh's and Curt's comments and also > > > fixed a missing initialization of the fields. I've tested that now the patch > > > works correctly for uninitialized extents as well as for standard writes. If > > > noone objects, would you merge the ext4 part Ted? I'll take care of the ext3 > > > patch. > > Aneesh, Ted, is the second version of the patchset fine with you? > > The patches looks good. > > Reviewed-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com> Thanks for the review! I've merged the ext3 patch and will push it to Linus soon. Ted, will you take care of ext4 changes? Thanks. Honza -- Jan Kara <jack@suse.cz> SUSE Labs, CR ^ permalink raw reply [flat|nested] 22+ messages in thread
* [PATCH 0/4] Fix fsync bug in ext3 and ext4
@ 2009-10-20 7:24 Jan Kara
2009-10-20 7:24 ` [PATCH 2/4] ext4: Avoid issuing barriers on error recovery path Jan Kara
0 siblings, 1 reply; 22+ messages in thread
From: Jan Kara @ 2009-10-20 7:24 UTC (permalink / raw)
To: linux-ext4; +Cc: tytso, chris.mason
Hi,
the patch series below fixes the problem ext3 and ext4 has that we
rely on inode dirty bits for correct fsync. This has two problems:
a) ext3 and ext4 often does not set them (as they call ext?_mark_inode_dirty)
and thus only quota code sets the dirty bit as a sideeffect.
b) pdflush can come in and clear the dirty bit any time (thanks to Chris
for pointing out this).
The second and third patch in the series are just minor bug fixes to ext4.
I did some basic testing to verify that the code does what I'd expect it to
do but an extra pair of eyes checking the new code would be helpful...
Honza
^ permalink raw reply [flat|nested] 22+ messages in thread* [PATCH 2/4] ext4: Avoid issuing barriers on error recovery path 2009-10-20 7:24 [PATCH 0/4] Fix fsync bug in ext3 and ext4 Jan Kara @ 2009-10-20 7:24 ` Jan Kara 0 siblings, 0 replies; 22+ messages in thread From: Jan Kara @ 2009-10-20 7:24 UTC (permalink / raw) To: linux-ext4; +Cc: tytso, chris.mason, Jan Kara There is no reason why we should issue IO barriers when we just hit an error in ext4_sync_file. So move out: label to just return and introduce flush: label to those places which really need it. Signed-off-by: Jan Kara <jack@suse.cz> --- fs/ext4/fsync.c | 5 +++-- 1 files changed, 3 insertions(+), 2 deletions(-) diff --git a/fs/ext4/fsync.c b/fs/ext4/fsync.c index 2b15312..2bf9413 100644 --- a/fs/ext4/fsync.c +++ b/fs/ext4/fsync.c @@ -88,7 +88,7 @@ int ext4_sync_file(struct file *file, struct dentry *dentry, int datasync) ret = sync_mapping_buffers(inode->i_mapping); if (datasync && !(inode->i_state & I_DIRTY_DATASYNC)) - goto out; + goto flush; /* * The VFS has written the file data. If the inode is unaltered @@ -103,8 +103,9 @@ int ext4_sync_file(struct file *file, struct dentry *dentry, int datasync) if (ret == 0) ret = err; } -out: +flush: if (journal && (journal->j_flags & JBD2_BARRIER)) blkdev_issue_flush(inode->i_sb->s_bdev, NULL); +out: return ret; } -- 1.6.0.2 ^ permalink raw reply related [flat|nested] 22+ messages in thread
end of thread, other threads:[~2009-12-09 14:32 UTC | newest] Thread overview: 22+ messages (download: mbox.gz follow: Atom feed -- links below jump to the message on this page -- 2009-10-27 12:48 [PATCH 0/4] Fix fsync on ext3 and ext4 (v2) Jan Kara 2009-10-27 12:48 ` [PATCH 1/4] ext3: Wait for proper transaction commit on fsync Jan Kara 2009-10-27 12:48 ` [PATCH 2/4] ext4: Avoid issuing barriers on error recovery path Jan Kara 2009-11-16 0:47 ` Theodore Tso 2009-11-16 2:40 ` [PATCH] ext4: Avoid issuing unnecessary barriers Theodore Ts'o 2009-11-16 10:29 ` Jan Kara 2009-10-27 12:48 ` [PATCH 3/4] ext4: Fix error handling ext4_ind_get_blocks Jan Kara 2009-11-16 2:57 ` Theodore Tso 2009-10-27 12:48 ` [PATCH 4/4] ext4: Wait for proper transaction commit on fsync Jan Kara 2009-11-04 14:57 ` Andi Kleen 2009-11-04 15:53 ` Jan Kara 2009-11-16 0:46 ` Theodore Tso 2009-11-16 10:43 ` Jan Kara 2009-12-09 3:51 ` tytso 2009-12-09 4:54 ` tytso 2009-12-09 11:29 ` Jan Kara 2009-12-09 14:32 ` tytso 2009-12-09 11:30 ` Jan Kara 2009-11-05 13:02 ` [PATCH 0/4] Fix fsync on ext3 and ext4 (v2) Jan Kara 2009-11-05 16:35 ` Aneesh Kumar K.V 2009-11-11 14:17 ` Jan Kara -- strict thread matches above, loose matches on Subject: below -- 2009-10-20 7:24 [PATCH 0/4] Fix fsync bug in ext3 and ext4 Jan Kara 2009-10-20 7:24 ` [PATCH 2/4] ext4: Avoid issuing barriers on error recovery path Jan Kara
This is a public inbox, see mirroring instructions for how to clone and mirror all data and code used for this inbox; as well as URLs for NNTP newsgroup(s).