[PATCH] ext4: fix io-barrier logic for external journal case

linux-ext4.vger.kernel.org archive mirror

* [PATCH] ext4: fix io-barrier logic for external journal case
@ 2010-03-11 15:40 Dmitry Monakhov
  2010-03-11 16:27 ` Jan Kara
  0 siblings, 1 reply; 8+ messages in thread
From: Dmitry Monakhov @ 2010-03-11 15:40 UTC (permalink / raw)
  To: linux-ext4; +Cc: Theodore Ts'o, Jan Kara

We have to submit the barrier before we start the journal commit process;
otherwise the transaction may be committed before the data is flushed to disk.
There is no difference from a performance point of view, but fsync
definitely becomes more correct.

If jbd2_log_start_commit() returns 0, it means that the transaction
was already committed. So we don't have to issue a barrier for
ordered mode, because it was already done during commit.

For an unknown reason we ignored the return value of jbd2_log_wait_commit(),
so fsync succeeds even in case of EIO.

Signed-off-by: Dmitry Monakhov <dmonakhov@openvz.org>
---
 fs/ext4/fsync.c |   28 +++++++++++++---------------
 1 files changed, 13 insertions(+), 15 deletions(-)

diff --git a/fs/ext4/fsync.c b/fs/ext4/fsync.c
index 0d0c323..621a8ed 100644
--- a/fs/ext4/fsync.c
+++ b/fs/ext4/fsync.c
@@ -88,21 +88,19 @@ int ext4_sync_file(struct file *file, struct dentry *dentry, int datasync)
 		return ext4_force_commit(inode->i_sb);
 
 	commit_tid = datasync ? ei->i_datasync_tid : ei->i_sync_tid;
-	if (jbd2_log_start_commit(journal, commit_tid)) {
-		/*
-		 * When the journal is on a different device than the
-		 * fs data disk, we need to issue the barrier in
-		 * writeback mode.  (In ordered mode, the jbd2 layer
-		 * will take care of issuing the barrier.  In
-		 * data=journal, all of the data blocks are written to
-		 * the journal device.)
-		 */
-		if (ext4_should_writeback_data(inode) &&
-		    (journal->j_fs_dev != journal->j_dev) &&
-		    (journal->j_flags & JBD2_BARRIER))
-			blkdev_issue_flush(inode->i_sb->s_bdev, NULL);
-		jbd2_log_wait_commit(journal, commit_tid);
-	} else if (journal->j_flags & JBD2_BARRIER)
+	/*
+	 * When the journal is on a different device than the
+	 * fs data disk, we need to issue the barrier in
+	 * writeback mode.  (In ordered mode, the jbd2 layer
+	 * will take care of issuing the barrier.  In
+	 * data=journal, all of the data blocks are written to
+	 * the journal device.)
+	 */
+	if (ext4_should_writeback_data(inode) &&
+		(journal->j_fs_dev != journal->j_dev) &&
+		(journal->j_flags & JBD2_BARRIER))
 		blkdev_issue_flush(inode->i_sb->s_bdev, NULL);
+	if (jbd2_log_start_commit(journal, commit_tid))
+		ret = jbd2_log_wait_commit(journal, commit_tid);
 	return ret;
 }
-- 
1.6.6



* Re: [PATCH] ext4: fix io-barrier logic for external journal case
  2010-03-11 15:40 [PATCH] ext4: fix io-barrier logic for external journal case Dmitry Monakhov
@ 2010-03-11 16:27 ` Jan Kara
  2010-03-12  8:37   ` [PATCH] ext4: check missed return value ext4_sync_file Dmitry Monakhov
  0 siblings, 1 reply; 8+ messages in thread
From: Jan Kara @ 2010-03-11 16:27 UTC (permalink / raw)
  To: Dmitry Monakhov; +Cc: linux-ext4, Theodore Ts'o, Jan Kara

> We have to submit barrier before we start journal commit process.
> otherwise transaction may be committed before data flushed to disk.
> There is no difference from performance of view, but definitely
> fsync becomes more correct.
> 
> If jbd2_log_start_commit return 0 then it means that transaction
> was already committed. So we don't have to issue barrier for
> ordered mode, because it was already done during commit.
  Umm, we have to - when a file has just been rewritten (i.e. no block
allocation), then i_datasync_tid is not updated and thus we won't commit
any transaction as a part of fdatasync (and that is correct because there
are no metadata that need to be written for that fdatasync). But we still
have to flush disk caches with data submitted by filemap_fdatawrite_and_wait.
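
For readers following along, the branch structure of the pre-patch code (the lines removed in the diff above) can be modelled in userspace roughly as below. This is only a sketch: struct sync_state and needs_explicit_flush() are hypothetical names, and each field merely stands in for the kernel expression named in its comment.

```c
#include <stdbool.h>

/* Hypothetical, simplified inputs; these are not the kernel's real types. */
struct sync_state {
	bool commit_started;   /* jbd2_log_start_commit() returned nonzero */
	bool writeback_mode;   /* ext4_should_writeback_data(inode) */
	bool external_journal; /* journal->j_fs_dev != journal->j_dev */
	bool barriers_enabled; /* journal->j_flags & JBD2_BARRIER */
};

/* Returns true when fsync itself has to flush the cache of the data device. */
static bool needs_explicit_flush(const struct sync_state *s)
{
	if (s->commit_started) {
		/* A commit will run; only writeback mode with an external
		 * journal needs an extra flush of the fs data disk. */
		return s->writeback_mode && s->external_journal &&
		       s->barriers_enabled;
	}
	/* No commit was started, e.g. a plain overwrite without block
	 * allocation: the written data still has to reach stable storage. */
	return s->barriers_enabled;
}
```

The second branch is the case described above: nothing is committed for the fdatasync, yet the flush is still required.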

> By unknown reason we ignored ret val from jbd2_log_wait_commit()
> so even in case of EIO fsync will succeed.
  I just forgot jbd2_log_wait_commit can return a failure...

								Honza
-- 
Jan Kara <jack@suse.cz>
SuSE CR Labs


* [PATCH] ext4: check missed return value ext4_sync_file
  2010-03-11 16:27 ` Jan Kara
@ 2010-03-12  8:37   ` Dmitry Monakhov
  2010-03-12 17:20     ` Dmitry Monakhov
                       ` (2 more replies)
  0 siblings, 3 replies; 8+ messages in thread
From: Dmitry Monakhov @ 2010-03-12  8:37 UTC (permalink / raw)
  To: Jan Kara; +Cc: linux-ext4, Theodore Ts'o

[-- Attachment #1: Type: text/plain, Size: 2226 bytes --]

Jan Kara <jack@suse.cz> writes:

>> We have to submit barrier before we start journal commit process.
>> otherwise transaction may be committed before data flushed to disk.
>> There is no difference from performance of view, but definitely
>> fsync becomes more correct.
Unfortunately this change does affect performance: latency
will increase because we have to wait for the barrier before we start
the journal commit.
>> 
>> If jbd2_log_start_commit return 0 then it means that transaction
>> was already committed. So we don't have to issue barrier for
>> ordered mode, because it was already done during commit.
>   Umm, we have to - when a file has just been rewritten (i.e. no block
> allocation), then i_datasync_tid is not updated and thus we won't commit
> any transaction as a part of fdatasync (and that is correct because there
> are no metadata that need to be written for that fdatasync). But we still
> have to flush disk caches with data submitted by filemap_fdatawrite_and_wait.
Yep, I missed that. I thought that the transaction id was updated
even in that case.
The most unpleasant part of the ext4_sync_file implementation is that a
barrier is issued on each fsync() call.  So some bad user may perform:
while(1) fsync(fd);
which results in bad system performance. And since the barrier request is
empty, it is hard to detect the cause of the trouble.
Of course we may solve it by introducing some sort of dirty flag
which is set in write_page and cleared in fsync. But it looks like an
ugly workaround.
>
>> By unknown reason we ignored ret val from jbd2_log_wait_commit()
>> so even in case of EIO fsync will succeed.
>   I just forgot jbd2_log_wait_commit can return a failure...
With respect to the previous comments, the patch is reduced to a simple
fix for the missed error check.
BTW: While investigating similar code in ext3 I've found that
fsync is broken in the case of an external journal. JBD itself does not
send a barrier to j_fs_dev. So if fsync goes via the
log_start_commit/log_wait_commit path, data loss is still possible.
I'm able to reproduce this via a simple write test
while (1) {
 write(fd, buf, 1024*1024);
 fsync(fd);
}
and then rebooting in the middle of the operation.
A later check of the file content spotted data inconsistency.
Will send a fix ASAP.
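
A bounded, error-checked variant of the write test could look roughly like the sketch below. The function name write_fsync_loop(), the round limit and the per-round fill byte are assumptions added for illustration (the fill byte only makes a later content check easier); the 1 MiB write followed by fsync() is the part taken from the test described above.

```c
#define _POSIX_C_SOURCE 200809L
#include <fcntl.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

#define CHUNK (1024 * 1024)

/*
 * Appends `rounds` chunks of 1 MiB to `path` and calls fsync() after each.
 * Returns the number of completed rounds, or -1 if any call failed.
 */
static long write_fsync_loop(const char *path, long rounds)
{
	char *buf = malloc(CHUNK);
	if (buf == NULL)
		return -1;

	int fd = open(path, O_WRONLY | O_CREAT | O_TRUNC, 0644);
	if (fd < 0) {
		free(buf);
		return -1;
	}

	long done = 0;
	int failed = 0;
	while (done < rounds && !failed) {
		/* Hypothetical pattern: one distinct byte value per round. */
		memset(buf, 'a' + (int)(done % 26), CHUNK);

		size_t off = 0;
		while (off < CHUNK) {
			ssize_t n = write(fd, buf + off, CHUNK - off);
			if (n <= 0) {
				failed = 1;
				break;
			}
			off += (size_t)n;
		}
		/* fsync() errors (e.g. EIO) must not be ignored. */
		if (!failed && fsync(fd) != 0)
			failed = 1;
		if (!failed)
			done++;
	}

	if (close(fd) != 0)
		failed = 1;
	free(buf);
	return failed ? -1 : done;
}
```

To use it for the scenario above, call it from a small main() with a large round count on a filesystem with an external journal, cut power mid-run, and compare the file content after reboot with the last round that was reported as complete.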


[-- Attachment #2: 0001-ext4-check-missed-return-value-ext4_sync_file.patch --]
[-- Type: text/plain, Size: 933 bytes --]

From 1f7382ea4a8b8e3880e1938d161f924ea572a1e1 Mon Sep 17 00:00:00 2001
From: Dmitry Monakhov <dmonakhov@openvz.org>
Date: Thu, 11 Mar 2010 20:14:13 +0300
Subject: [PATCH] ext4: check missed return value ext4_sync_file


Signed-off-by: Dmitry Monakhov <dmonakhov@openvz.org>
---
 fs/ext4/fsync.c |    2 +-
 1 files changed, 1 insertions(+), 1 deletions(-)

diff --git a/fs/ext4/fsync.c b/fs/ext4/fsync.c
index 0d0c323..42bd94a 100644
--- a/fs/ext4/fsync.c
+++ b/fs/ext4/fsync.c
@@ -101,7 +101,7 @@ int ext4_sync_file(struct file *file, struct dentry *dentry, int datasync)
 		    (journal->j_fs_dev != journal->j_dev) &&
 		    (journal->j_flags & JBD2_BARRIER))
 			blkdev_issue_flush(inode->i_sb->s_bdev, NULL);
-		jbd2_log_wait_commit(journal, commit_tid);
+		ret = jbd2_log_wait_commit(journal, commit_tid);
 	} else if (journal->j_flags & JBD2_BARRIER)
 		blkdev_issue_flush(inode->i_sb->s_bdev, NULL);
 	return ret;
-- 
1.6.6



* Re: [PATCH] ext4: check missed return value ext4_sync_file
  2010-03-12  8:37   ` [PATCH] ext4: check missed return value ext4_sync_file Dmitry Monakhov
@ 2010-03-12 17:20     ` Dmitry Monakhov
  2010-03-17 11:23     ` Jan Kara
  2010-03-22  0:50     ` tytso
  2 siblings, 0 replies; 8+ messages in thread
From: Dmitry Monakhov @ 2010-03-12 17:20 UTC (permalink / raw)
  To: Jan Kara; +Cc: linux-ext4, Theodore Ts'o

Dmitry Monakhov <dmonakhov@openvz.org> writes:

> Jan Kara <jack@suse.cz> writes:
>
>>> We have to submit barrier before we start journal commit process.
>>> otherwise transaction may be committed before data flushed to disk.
>>> There is no difference from performance of view, but definitely
>>> fsync becomes more correct.
> Unfortunately this change does affect performance because latency
> will be increased since we have to wait barrier before we start
> journal commit. 
>>> 
>>> If jbd2_log_start_commit return 0 then it means that transaction
>>> was already committed. So we don't have to issue barrier for
>>> ordered mode, because it was already done during commit.
>>   Umm, we have to - when a file has just been rewritten (i.e. no block
>> allocation), then i_datasync_tid is not updated and thus we won't commit
>> any transaction as a part of fdatasync (and that is correct because there
>> are no metadata that need to be written for that fdatasync). But we still
>> have to flush disk caches with data submitted by filemap_fdatawrite_and_wait.
> Yep, I missed that. I thought that the transaction id was updated
> even in that case.
> The most unpleasant part of the ext4_sync_file implementation is that a
> barrier is issued on each fsync() call.  So some bad user may perform:
> while(1) fsync(fd);
> which results in bad system performance. And since the barrier request is
> empty, it is hard to detect the cause of the trouble.
> Of course we may solve it by introducing some sort of dirty flag
> which is set in write_page and cleared in fsync. But it looks like an
> ugly workaround.
>>
>>> By unknown reason we ignored ret val from jbd2_log_wait_commit()
>>> so even in case of EIO fsync will succeed.
>>   I just forgot jbd2_log_wait_commit can return a failure...
> With respect to the previous comments, the patch is reduced to a simple
> fix for the missed error check.
It is funny, but I've found that journalled mode is still broken in ext4
in the case of an external journal. We forget to issue an io-barrier to
j_fs_dev if the transaction has only metadata and no data blocks :)
This affects all data modes.
It is easy to reproduce with the classic test case with data=journal
for(i=0; i < 3; i++) {
  memset(buf, 'a'+i, 1024*1024);
  pwrite(fd, buf, 1024*1024, 0);
  fsync(fd);
}
/* At this time transaction was committed so journal is empty */
<<POWER_OFF
Later I found old data ('b' chars) at the end of the file.
So I've prepared another patch which supersedes the previous one.
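
The test case above, plus the content check after reboot, might be written out as in the following sketch. overwrite_rounds(), count_stale_bytes() and pwrite_all() are hypothetical helper names; the power cut between the two steps has to be done by hand and is not part of the code.

```c
#define _XOPEN_SOURCE 700
#include <fcntl.h>
#include <string.h>
#include <unistd.h>

#define CHUNK (1024 * 1024)

static char buf[CHUNK];

/* Writes all `len` bytes at offset `off`, retrying on short writes. */
static int pwrite_all(int fd, const char *p, size_t len, off_t off)
{
	while (len > 0) {
		ssize_t n = pwrite(fd, p, len, off);
		if (n <= 0)
			return -1;
		p += n;
		len -= (size_t)n;
		off += n;
	}
	return 0;
}

/*
 * Overwrites the first 1 MiB of `path` with 'a', then 'b', then 'c',
 * calling fsync() after each pass. Returns 0 on success, -1 on error.
 */
static int overwrite_rounds(const char *path)
{
	int fd = open(path, O_WRONLY | O_CREAT, 0644);
	if (fd < 0)
		return -1;

	int rc = 0;
	for (int i = 0; i < 3 && rc == 0; i++) {
		memset(buf, 'a' + i, CHUNK);
		if (pwrite_all(fd, buf, CHUNK, 0) != 0 || fsync(fd) != 0)
			rc = -1;
	}
	if (close(fd) != 0)
		rc = -1;
	return rc;
}

/*
 * Counts bytes in the first 1 MiB of `path` that differ from `expected`.
 * Returns -1 if the file cannot be read or is shorter than 1 MiB.
 */
static long count_stale_bytes(const char *path, char expected)
{
	int fd = open(path, O_RDONLY);
	if (fd < 0)
		return -1;

	size_t got = 0;
	while (got < CHUNK) {
		ssize_t n = read(fd, buf + got, CHUNK - got);
		if (n <= 0) {
			close(fd);
			return -1;
		}
		got += (size_t)n;
	}
	close(fd);

	long stale = 0;
	for (size_t i = 0; i < CHUNK; i++)
		if (buf[i] != expected)
			stale++;
	return stale;
}
```

After the last fsync() has returned and the machine has been power-cycled, count_stale_bytes(path, 'c') should be 0; a nonzero result corresponds to the old 'b' data mentioned above.
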
> BTW: While investigating similar code in ext3 I've found that
> fsync is broken in the case of an external journal. JBD itself does not
> send a barrier to j_fs_dev. So if fsync goes via the
> log_start_commit/log_wait_commit path, data loss is still possible.
> I'm able to reproduce this via a simple write test
> while (1) {
>  write(fd, buf, 1024*1024);
>  fsync(fd);
> }
> and then rebooting in the middle of the operation.
> A later check of the file content spotted data inconsistency.
> Will send a fix ASAP.
>
> From 1f7382ea4a8b8e3880e1938d161f924ea572a1e1 Mon Sep 17 00:00:00 2001
> From: Dmitry Monakhov <dmonakhov@openvz.org>
> Date: Thu, 11 Mar 2010 20:14:13 +0300
> Subject: [PATCH] ext4: check missed return value ext4_sync_file
>
>
> Signed-off-by: Dmitry Monakhov <dmonakhov@openvz.org>
> ---
>  fs/ext4/fsync.c |    2 +-
>  1 files changed, 1 insertions(+), 1 deletions(-)
>
> diff --git a/fs/ext4/fsync.c b/fs/ext4/fsync.c
> index 0d0c323..42bd94a 100644
> --- a/fs/ext4/fsync.c
> +++ b/fs/ext4/fsync.c
> @@ -101,7 +101,7 @@ int ext4_sync_file(struct file *file, struct dentry *dentry, int datasync)
>  		    (journal->j_fs_dev != journal->j_dev) &&
>  		    (journal->j_flags & JBD2_BARRIER))
>  			blkdev_issue_flush(inode->i_sb->s_bdev, NULL);
> -		jbd2_log_wait_commit(journal, commit_tid);
> +		ret = jbd2_log_wait_commit(journal, commit_tid);
>  	} else if (journal->j_flags & JBD2_BARRIER)
>  		blkdev_issue_flush(inode->i_sb->s_bdev, NULL);
>  	return ret;


* Re: [PATCH] ext4: check missed return value ext4_sync_file
  2010-03-12  8:37   ` [PATCH] ext4: check missed return value ext4_sync_file Dmitry Monakhov
  2010-03-12 17:20     ` Dmitry Monakhov
@ 2010-03-17 11:23     ` Jan Kara
  2010-03-17 11:24       ` Jan Kara
  2010-03-17 11:38       ` Dmitry Monakhov
  2010-03-22  0:50     ` tytso
  2 siblings, 2 replies; 8+ messages in thread
From: Jan Kara @ 2010-03-17 11:23 UTC (permalink / raw)
  To: Dmitry Monakhov; +Cc: Jan Kara, linux-ext4, Theodore Ts'o

> Jan Kara <jack@suse.cz> writes:
> >> 
> >> If jbd2_log_start_commit return 0 then it means that transaction
> >> was already committed. So we don't have to issue barrier for
> >> ordered mode, because it was already done during commit.
> >   Umm, we have to - when a file has just been rewritten (i.e. no block
> > allocation), then i_datasync_tid is not updated and thus we won't commit
> > any transaction as a part of fdatasync (and that is correct because there
> > are no metadata that need to be written for that fdatasync). But we still
> > have to flush disk caches with data submitted by filemap_fdatawrite_and_wait.
> Yep, I missed that. I thought that the transaction id was updated even in that
> case.  The most unpleasant part of the ext4_sync_file implementation is that a
> barrier is issued on each fsync() call.  So some bad user may perform:
> while(1) fsync(fd); which results in bad system performance. And since the
> barrier request is empty, it is hard to detect the cause of the trouble.
  Actually, you'll be able to see the barrier requests in the blktrace dump
so it won't be that hard to detect.

> Of course we may solve it by introducing some sort of dirty flag which is
> set in write_page and cleared in fsync. But it looks like an ugly workaround.
  I agree that sending a barrier request on each fsync isn't very nice, but
in the common case I'd assume that an application calls fsync only if it has
written something to the file previously. So I wouldn't invest much into
solving this until I see a realistic use case where it matters...

> >> By unknown reason we ignored ret val from jbd2_log_wait_commit()
> >> so even in case of EIO fsync will succeed.
> >   I just forgot jbd2_log_wait_commit can return a failure...
> With respect to the previous comments, the patch is reduced to a simple
> fix for the missed error check.
  I guess you can resend the fix to Ted directly to catch his attention.

> BTW: While investigating similar code in ext3 I've found that fsync is broken
> in the case of an external journal.
  Yes, I've noticed this recently as well. So will you send a fix or should
I go and backport ext4 fixes of this?

								Honza
-- 
Jan Kara <jack@suse.cz>
SuSE CR Labs


* Re: [PATCH] ext4: check missed return value ext4_sync_file
  2010-03-17 11:23     ` Jan Kara
@ 2010-03-17 11:24       ` Jan Kara
  2010-03-17 11:38       ` Dmitry Monakhov
  1 sibling, 0 replies; 8+ messages in thread
From: Jan Kara @ 2010-03-17 11:24 UTC (permalink / raw)
  To: Dmitry Monakhov; +Cc: Jan Kara, linux-ext4, Theodore Ts'o

> > Jan Kara <jack@suse.cz> writes:
> > >> 
> > >> If jbd2_log_start_commit return 0 then it means that transaction
> > >> was already committed. So we don't have to issue barrier for
> > >> ordered mode, because it was already done during commit.
> > >   Umm, we have to - when a file has just been rewritten (i.e. no block
> > > allocation), then i_datasync_tid is not updated and thus we won't commit
> > > any transaction as a part of fdatasync (and that is correct because there
> > > are no metadata that need to be written for that fdatasync). But we still
> > > have to flush disk caches with data submitted by filemap_fdatawrite_and_wait.
> > Yepp. I've missed that. i thought that transaction id updated even in that
> > case.  The most unpleasant part in ext4_sync_file implementation is that
> > barrier is issued on each fsync() call.  So some bad user may perform:
> > while(1) fsync(fd); which result in bad system performance. And since barrier
> > request is empty it is hard to detect the reason of troubles.
>   Actually, you'll be able to see the barrier requests in the blktrace dump
> so it won't be that hard to detect.
> 
> > Of course we may solve it by introducing some sort of dirty flag which is
> > set in write_page and cleared in fsync. But it looks like an ugly workaround.
>   I agree that sending barrier request on each fsync isn't very nice but
> in common case, I'd assume that an application calls fsync only if it has
> written something to the file previously. So I wouldn't invest much into
> solving this until I see a realistic use case where it matters...
> 
> > >> By unknown reason we ignored ret val from jbd2_log_wait_commit()
> > >> so even in case of EIO fsync will succeed.
> > >   I just forgot jbd2_log_wait_commit can return a failure...
> > In respect to previous comments the patch reduced to simple missed
> > error check fix.
>   I guess you can resend the fix to Ted directly to catch his attention.
> 
> > BTW: While investigating similar code in ext3 i've found what fsync is broken
> > in case of external journal.
>   Yes, I've noticed this recently as well. So will you send a fix or should
> I go and backport ext4 fixes of this?
  Oops, sorry, I've noticed you sent the patches to the list already...

								Honza
-- 
Jan Kara <jack@suse.cz>
SuSE CR Labs


* Re: [PATCH] ext4: check missed return value ext4_sync_file
  2010-03-17 11:23     ` Jan Kara
  2010-03-17 11:24       ` Jan Kara
@ 2010-03-17 11:38       ` Dmitry Monakhov
  1 sibling, 0 replies; 8+ messages in thread
From: Dmitry Monakhov @ 2010-03-17 11:38 UTC (permalink / raw)
  To: Jan Kara; +Cc: linux-ext4, Theodore Ts'o

Jan Kara <jack@suse.cz> writes:

>> Jan Kara <jack@suse.cz> writes:
>> >> 
>> >> If jbd2_log_start_commit return 0 then it means that transaction
>> >> was already committed. So we don't have to issue barrier for
>> >> ordered mode, because it was already done during commit.
>> >   Umm, we have to - when a file has just been rewritten (i.e. no block
>> > allocation), then i_datasync_tid is not updated and thus we won't commit
>> > any transaction as a part of fdatasync (and that is correct because there
>> > are no metadata that need to be written for that fdatasync). But we still
>> > have to flush disk caches with data submitted by filemap_fdatawrite_and_wait.
>> Yepp. I've missed that. i thought that transaction id updated even in that
>> case.  The most unpleasant part in ext4_sync_file implementation is that
>> barrier is issued on each fsync() call.  So some bad user may perform:
>> while(1) fsync(fd); which result in bad system performance. And since barrier
>> request is empty it is hard to detect the reason of troubles.
>   Actually, you'll be able to see the barrier requests in the blktrace dump
> so it won't be that hard to detect.
>
>> Of course we may solve it by introducing some sort of dirty flag which is
>> set in write_page and cleared in fsync. But it looks like an ugly workaround.
>   I agree that sending barrier request on each fsync isn't very nice but
> in common case, I'd assume that an application calls fsync only if it has
> written something to the file previously. So I wouldn't invest much into
> solving this until I see a realistic use case where it matters...
>
>> >> By unknown reason we ignored ret val from jbd2_log_wait_commit()
>> >> so even in case of EIO fsync will succeed.
>> >   I just forgot jbd2_log_wait_commit can return a failure...
>> In respect to previous comments the patch reduced to simple missed
>> error check fix.
>   I guess you can resend the fix to Ted directly to catch his attention.
Ohh.. After this letter I found new issues with metadata; as a result,
a new patch version was sent.
http://marc.info/?l=linux-ext4&m=126841481923132&w=2
>
>> BTW: While investigating similar code in ext3 i've found what fsync is broken
>> in case of external journal.
>   Yes, I've noticed this recently as well. So will you send a fix or should
> I go and backport ext4 fixes of this?
I've already done that
http://marc.info/?l=linux-ext4&m=126841482023138&w=2
It already contains the fix for the metadata handling logic.


>
> 								Honza


* Re: [PATCH] ext4: check missed return value ext4_sync_file
  2010-03-12  8:37   ` [PATCH] ext4: check missed return value ext4_sync_file Dmitry Monakhov
  2010-03-12 17:20     ` Dmitry Monakhov
  2010-03-17 11:23     ` Jan Kara
@ 2010-03-22  0:50     ` tytso
  2 siblings, 0 replies; 8+ messages in thread
From: tytso @ 2010-03-22  0:50 UTC (permalink / raw)
  To: Dmitry Monakhov; +Cc: Jan Kara, linux-ext4

On Fri, Mar 12, 2010 at 11:37:43AM +0300, Dmitry Monakhov wrote:
> The most unpleasant part of the ext4_sync_file implementation is that a
> barrier is issued on each fsync() call.  So some bad user may perform:
> while(1) fsync(fd);
> which results in bad system performance. And since the barrier request is
> empty, it is hard to detect the cause of the trouble.
> Of course we may solve it by introducing some sort of dirty flag
> which is set in write_page and cleared in fsync. But it looks like an
> ugly workaround.

We could potentially put the dirty flag in the inode instead, and set
it in the write_prepare() and writepages() code paths.  I'm not entirely sure
it's worth it, though.
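
To make the idea concrete, here is a small userspace model of such a per-inode dirty flag. It is only a sketch under stated assumptions: struct model_inode and the model_*() functions are hypothetical and not ext4 code, and the counter merely stands in for a real cache flush such as blkdev_issue_flush().

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical stand-in for the part of an inode that tracks dirty data. */
struct model_inode {
	atomic_bool data_dirty; /* set by write paths, cleared by fsync */
	unsigned long flushes;  /* counts simulated cache flushes */
};

static void model_inode_init(struct model_inode *ino)
{
	atomic_init(&ino->data_dirty, false);
	ino->flushes = 0;
}

/* Called from every path that submits data for this inode. */
static void model_write(struct model_inode *ino)
{
	atomic_store(&ino->data_dirty, true);
}

/*
 * Returns true if a flush was issued. The flag is cleared before the
 * flush, so a write racing with fsync sets it again and the next fsync
 * flushes once more.
 */
static bool model_fsync(struct model_inode *ino)
{
	if (!atomic_exchange(&ino->data_dirty, false))
		return false; /* nothing written since the last fsync */
	ino->flushes++;       /* stands in for the real barrier */
	return true;
}
```

With this model a while(1) fsync(fd); loop issues at most one flush per preceding write instead of one per call.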

> With respect to the previous comments, the patch is reduced to a simple
> fix for the missed error check.

I've added this to the ext4 patch queue, and I will ignore your
earlier version of the patch.

					- Ted

