* [PATCH 1/2] ext4/jbd2: fix io-barrier logic in case of external journal
@ 2010-03-12 17:26 Dmitry Monakhov
2010-03-12 17:26 ` [PATCH 2/2] ext3, jbd: Add barriers for file systems with exernal journals Dmitry Monakhov
2010-03-22 1:20 ` [PATCH 1/2] ext4/jbd2: fix io-barrier logic in case of external journal tytso
0 siblings, 2 replies; 9+ messages in thread
From: Dmitry Monakhov @ 2010-03-12 17:26 UTC (permalink / raw)
To: linux-ext4; +Cc: Dmitry Monakhov
Currently journalled mode is still broken if fs use external
journal. This is because all data handled as metadata.
We must to issue io-barrier to fs_device even if transaction
has only metadata blocks.
Also add missed return value check.
Signed-off-by: Dmitry Monakhov <dmonakhov@openvz.org>
---
fs/ext4/fsync.c | 2 +-
fs/jbd2/commit.c | 3 +++
2 files changed, 4 insertions(+), 1 deletions(-)
diff --git a/fs/ext4/fsync.c b/fs/ext4/fsync.c
index 98bd140..eb52837 100644
--- a/fs/ext4/fsync.c
+++ b/fs/ext4/fsync.c
@@ -101,7 +101,7 @@ int ext4_sync_file(struct file *file, struct dentry *dentry, int datasync)
(journal->j_fs_dev != journal->j_dev) &&
(journal->j_flags & JBD2_BARRIER))
blkdev_issue_flush(inode->i_sb->s_bdev, NULL);
- jbd2_log_wait_commit(journal, commit_tid);
+ ret = jbd2_log_wait_commit(journal, commit_tid);
} else if (journal->j_flags & JBD2_BARRIER)
blkdev_issue_flush(inode->i_sb->s_bdev, NULL);
return ret;
diff --git a/fs/jbd2/commit.c b/fs/jbd2/commit.c
index 1bc74b6..2f62f1b 100644
--- a/fs/jbd2/commit.c
+++ b/fs/jbd2/commit.c
@@ -682,6 +682,9 @@ void jbd2_journal_commit_transaction(journal_t *journal)
tag->t_flags |= cpu_to_be32(JBD2_FLAG_LAST_TAG);
start_journal_io:
+ if (bufs)
+ commit_transaction->t_flushed_data_blocks = 1;
+
for (i = 0; i < bufs; i++) {
struct buffer_head *bh = wbuf[i];
/*
--
1.6.6
^ permalink raw reply related [flat|nested] 9+ messages in thread
* [PATCH 2/2] ext3, jbd: Add barriers for file systems with exernal journals
2010-03-12 17:26 [PATCH 1/2] ext4/jbd2: fix io-barrier logic in case of external journal Dmitry Monakhov
@ 2010-03-12 17:26 ` Dmitry Monakhov
2010-03-22 1:20 ` [PATCH 1/2] ext4/jbd2: fix io-barrier logic in case of external journal tytso
1 sibling, 0 replies; 9+ messages in thread
From: Dmitry Monakhov @ 2010-03-12 17:26 UTC (permalink / raw)
To: linux-ext4; +Cc: Dmitry Monakhov
This is a bit complicated because we are trying to optimize when we
send barriers to the fs data disk. We could just throw in an extra
barrier to the data disk whenever we send a barrier to the journal
disk, but that's not always strictly necessary.
Send barrier only if transaction has data or metadata.
The patch is mostly backported from ext4.
Signed-off-by: Dmitry Monakhov <dmonakhov@openvz.org>
---
fs/ext3/fsync.c | 39 ++++++++++++++++++++++++---------------
fs/jbd/commit.c | 16 ++++++++++++++++
include/linux/jbd.h | 1 +
3 files changed, 41 insertions(+), 15 deletions(-)
diff --git a/fs/ext3/fsync.c b/fs/ext3/fsync.c
index 8209f26..983a3bc 100644
--- a/fs/ext3/fsync.c
+++ b/fs/ext3/fsync.c
@@ -70,10 +70,8 @@ int ext3_sync_file(struct file * file, struct dentry *dentry, int datasync)
* (they were dirtied by commit). But that's OK - the blocks are
* safe in-journal, which is all fsync() needs to ensure.
*/
- if (ext3_should_journal_data(inode)) {
- ret = ext3_force_commit(inode->i_sb);
- goto out;
- }
+ if (ext3_should_journal_data(inode))
+ return ext3_force_commit(inode->i_sb);
if (datasync)
commit_tid = atomic_read(&ei->i_datasync_tid);
@@ -81,17 +79,28 @@ int ext3_sync_file(struct file * file, struct dentry *dentry, int datasync)
commit_tid = atomic_read(&ei->i_sync_tid);
if (log_start_commit(journal, commit_tid)) {
- log_wait_commit(journal, commit_tid);
- goto out;
- }
+ /*
+ * When the journal is on a different device than the
+ * fs data disk, we need to issue the barrier in
+ * writeback mode. (In ordered mode, the jbd layer
+ * will take care of issuing the barrier. In
+ * data=journal, all of the data blocks are written to
+ * the journal device.)
+ */
+ if (ext3_should_writeback_data(inode) &&
+ (journal->j_fs_dev != journal->j_dev) &&
+ (journal->j_flags & JFS_BARRIER))
+ blkdev_issue_flush(inode->i_sb->s_bdev, NULL);
- /*
- * In case we didn't commit a transaction, we have to flush
- * disk caches manually so that data really is on persistent
- * storage
- */
- if (test_opt(inode->i_sb, BARRIER))
- blkdev_issue_flush(inode->i_sb->s_bdev, NULL);
-out:
+ ret = log_wait_commit(journal, commit_tid);
+ } else {
+ /*
+ * In case we didn't commit a transaction, we have to flush
+ * disk caches manually so that data really is on persistent
+ * storage
+ */
+ if (test_opt(inode->i_sb, BARRIER))
+ blkdev_issue_flush(inode->i_sb->s_bdev, NULL);
+ }
return ret;
}
diff --git a/fs/jbd/commit.c b/fs/jbd/commit.c
index 2c90e3e..027e02b 100644
--- a/fs/jbd/commit.c
+++ b/fs/jbd/commit.c
@@ -21,6 +21,7 @@
#include <linux/mm.h>
#include <linux/pagemap.h>
#include <linux/bio.h>
+#include <linux/blkdev.h>
/*
* Default IO end handler for temporary BJ_IO buffer_heads.
@@ -194,6 +195,7 @@ static int journal_submit_data_buffers(journal_t *journal,
struct journal_head *jh;
struct buffer_head *bh;
int locked;
+ int sync_data = 0;
int bufs = 0;
struct buffer_head **wbuf = journal->j_wbuf;
int err = 0;
@@ -211,6 +213,7 @@ write_out_data:
spin_lock(&journal->j_list_lock);
while (commit_transaction->t_sync_datalist) {
+ sync_data = 1;
jh = commit_transaction->t_sync_datalist;
bh = jh2bh(jh);
locked = 0;
@@ -288,6 +291,7 @@ write_out_data:
goto write_out_data;
}
}
+ commit_transaction->t_flushed_data_blocks |= sync_data;
spin_unlock(&journal->j_list_lock);
journal_do_submit_data(wbuf, bufs, write_op);
@@ -668,6 +672,8 @@ void journal_commit_transaction(journal_t *journal)
tag->t_flags |= cpu_to_be32(JFS_FLAG_LAST_TAG);
start_journal_io:
+ if (bufs)
+ commit_transaction->t_flushed_data_blocks = 1;
for (i = 0; i < bufs; i++) {
struct buffer_head *bh = wbuf[i];
lock_buffer(bh);
@@ -685,6 +691,16 @@ start_journal_io:
}
}
+ /*
+ * If the journal is not located on the file system device,
+ * then we must flush the file system device before we issue
+ * the commit record
+ */
+ if (commit_transaction->t_flushed_data_blocks &&
+ (journal->j_fs_dev != journal->j_dev) &&
+ (journal->j_flags & JFS_BARRIER))
+ blkdev_issue_flush(journal->j_fs_dev, NULL);
+
/* Lo and behold: we have just managed to send a transaction to
the log. Before we can commit it, wait for the IO so far to
complete. Control buffers being written are on the
diff --git a/include/linux/jbd.h b/include/linux/jbd.h
index f3aa59c..3ea2807 100644
--- a/include/linux/jbd.h
+++ b/include/linux/jbd.h
@@ -546,6 +546,7 @@ struct transaction_s
* waiting for it to finish.
*/
unsigned int t_synchronous_commit:1;
+ unsigned int t_flushed_data_blocks:1;
};
/**
--
1.6.6
^ permalink raw reply related [flat|nested] 9+ messages in thread
* Re: [PATCH 1/2] ext4/jbd2: fix io-barrier logic in case of external journal
2010-03-12 17:26 [PATCH 1/2] ext4/jbd2: fix io-barrier logic in case of external journal Dmitry Monakhov
2010-03-12 17:26 ` [PATCH 2/2] ext3, jbd: Add barriers for file systems with exernal journals Dmitry Monakhov
@ 2010-03-22 1:20 ` tytso
2010-03-22 14:04 ` Dmitry Monakhov
1 sibling, 1 reply; 9+ messages in thread
From: tytso @ 2010-03-22 1:20 UTC (permalink / raw)
To: Dmitry Monakhov; +Cc: linux-ext4
On Fri, Mar 12, 2010 at 08:26:49PM +0300, Dmitry Monakhov wrote:
> start_journal_io:
> + if (bufs)
> + commit_transaction->t_flushed_data_blocks = 1;
> +
I'm not convinced this is right.
>From your test case, the problem isn't because we have journaled
metadata blocks (which is what bufs) counts, but because fsync()
depends on data blocks also getting flushed out to disks.
However, if we aren't closing the transaction because of fsync(), I
don't think we need to do a barrier in the case of an external
journal. So instead of effectively unconditionally setting
t_flushed_data_blocks (since bufs is nearly always going to be
non-zero), I think the better fix is to test to see if the journal
device != to the fs data device in fsync(), and if so, start the
barrier operation there.
Do you agree?
Best regards,
- Ted
^ permalink raw reply [flat|nested] 9+ messages in thread
* Re: [PATCH 1/2] ext4/jbd2: fix io-barrier logic in case of external journal
2010-03-22 1:20 ` [PATCH 1/2] ext4/jbd2: fix io-barrier logic in case of external journal tytso
@ 2010-03-22 14:04 ` Dmitry Monakhov
2010-03-22 15:03 ` tytso
0 siblings, 1 reply; 9+ messages in thread
From: Dmitry Monakhov @ 2010-03-22 14:04 UTC (permalink / raw)
To: tytso; +Cc: linux-ext4
tytso@mit.edu writes:
> On Fri, Mar 12, 2010 at 08:26:49PM +0300, Dmitry Monakhov wrote:
>> start_journal_io:
>> + if (bufs)
>> + commit_transaction->t_flushed_data_blocks = 1;
>> +
>
> I'm not convinced this is right.
>
> From your test case, the problem isn't because we have journaled
> metadata blocks (which is what bufs) counts, but because fsync()
> depends on data blocks also getting flushed out to disks.
>
> However, if we aren't closing the transaction because of fsync(), I
> don't think we need to do a barrier in the case of an external
> journal. So instead of effectively unconditionally setting
> t_flushed_data_blocks (since bufs is nearly always going to be
> non-zero), I think the better fix is to test to see if the journal
> device != to the fs data device in fsync(), and if so, start the
> barrier operation there.
>
> Do you agree?
Yes.
BTW Would it be correct to update j_tail in
jbd2_journal_commit_transaction() to something more recent
if we have issued an io-barrier to j_fs_dev?
This will helps to reduce journal_recovery time which may be really
painful in some slow devices.
I've take a look at async commit logic: fs/jbd2/commit.c
void jbd2_journal_commit_transaction(journal_t *journal)
{
725: /* Done it all: now write the commit record asynchronously. */
if (JBD2_HAS_INCOMPAT_FEATURE(journal,
JBD2_FEATURE_INCOMPAT_ASYNC_COMMIT))
{
err = journal_submit_commit_record(journal,
commit_transaction,
&cbh, crc32_sum);
if (err)
__jbd2_journal_abort_hard(journal);
if (journal->j_flags & JBD2_BARRIER)
blkdev_issue_flush(journal->j_dev, NULL);
<<< blkdev_issue_flush is wait for barrier to complete by default, but
<<< in fact we don't have to wait for barrier here. I've prepared a
<<< patch wich add flags to control blkdev_issue_flush() wait
<<< behavior, and this is the place for no-wait variant.
...
855: if (!err && !is_journal_aborted(journal))
err = journal_wait_on_commit_record(journal, cbh);
}
^ permalink raw reply [flat|nested] 9+ messages in thread
* Re: [PATCH 1/2] ext4/jbd2: fix io-barrier logic in case of external journal
2010-03-22 14:04 ` Dmitry Monakhov
@ 2010-03-22 15:03 ` tytso
2010-03-22 16:14 ` Dmitry Monakhov
0 siblings, 1 reply; 9+ messages in thread
From: tytso @ 2010-03-22 15:03 UTC (permalink / raw)
To: Dmitry Monakhov; +Cc: linux-ext4
On Mon, Mar 22, 2010 at 05:04:19PM +0300, Dmitry Monakhov wrote:
> tytso@mit.edu writes:
>
> > On Fri, Mar 12, 2010 at 08:26:49PM +0300, Dmitry Monakhov wrote:
> >> start_journal_io:
> >> + if (bufs)
> >> + commit_transaction->t_flushed_data_blocks = 1;
> >> +
> >
> > I'm not convinced this is right.
> >
> > From your test case, the problem isn't because we have journaled
> > metadata blocks (which is what bufs) counts, but because fsync()
> > depends on data blocks also getting flushed out to disks.
> >
> > However, if we aren't closing the transaction because of fsync(), I
> > don't think we need to do a barrier in the case of an external
> > journal. So instead of effectively unconditionally setting
> > t_flushed_data_blocks (since bufs is nearly always going to be
> > non-zero), I think the better fix is to test to see if the journal
> > device != to the fs data device in fsync(), and if so, start the
> > barrier operation there.
> >
> > Do you agree?
> Yes.
Just to be clear, since I realized I wrote fsync() when I should have
written fs/ext4/fsync.c, my proposal was to put this check in
ext4_sync_file().
> BTW Would it be correct to update j_tail in
> jbd2_journal_commit_transaction() to something more recent
> if we have issued an io-barrier to j_fs_dev?
> This will helps to reduce journal_recovery time which may be really
> painful in some slow devices.
Um, maybe. We are already calling __jbd2_journal_clean_checkpoint_list(),
and the barrier operation *is* expensive.
On the other hand, updating the journal superblock on every sync is
another seek that would have to made before the barrier operation, and
I'm a bit concerned that this seek would be noticeable. If it is
noticeable, is it worth it to optimize for the uncommon case (a power
failure requiring a journal replay) when it might cost us something,
however, small, on every single journal update?
Do we really think the journal replay time is really something that is
a major pain point. I can think of optimizations that involve
skipping writes that will get updated later in future transactions,
but it means complicating the replay code, which has been stable for a
long time, and it's not clear to me that the costs are worth the
benefits.
> I've take a look at async commit logic: fs/jbd2/commit.c
> void jbd2_journal_commit_transaction(journal_t *journal)
> {
> 725: /* Done it all: now write the commit record asynchronously. */
> if (JBD2_HAS_INCOMPAT_FEATURE(journal,
> JBD2_FEATURE_INCOMPAT_ASYNC_COMMIT))
> {
> err = journal_submit_commit_record(journal,
> commit_transaction,
> &cbh, crc32_sum);
> if (err)
> __jbd2_journal_abort_hard(journal);
> if (journal->j_flags & JBD2_BARRIER)
> blkdev_issue_flush(journal->j_dev, NULL);
> <<< blkdev_issue_flush is wait for barrier to complete by default, but
> <<< in fact we don't have to wait for barrier here. I've prepared a
> <<< patch wich add flags to control blkdev_issue_flush() wait
> <<< behavior, and this is the place for no-wait variant.
I think that's right, as long as we're confident that subsequent
writes won't get scheduled before the no-wait barrier. If it did, it
would be a bug in the block I/O layer, so it should be OK.
- Ted
^ permalink raw reply [flat|nested] 9+ messages in thread
* Re: [PATCH 1/2] ext4/jbd2: fix io-barrier logic in case of external journal
2010-03-22 15:03 ` tytso
@ 2010-03-22 16:14 ` Dmitry Monakhov
2010-03-22 20:22 ` tytso
2010-03-30 1:47 ` Jan Kara
0 siblings, 2 replies; 9+ messages in thread
From: Dmitry Monakhov @ 2010-03-22 16:14 UTC (permalink / raw)
To: tytso; +Cc: linux-ext4
tytso@mit.edu writes:
> On Mon, Mar 22, 2010 at 05:04:19PM +0300, Dmitry Monakhov wrote:
>> tytso@mit.edu writes:
>>
>> > On Fri, Mar 12, 2010 at 08:26:49PM +0300, Dmitry Monakhov wrote:
>> >> start_journal_io:
>> >> + if (bufs)
>> >> + commit_transaction->t_flushed_data_blocks = 1;
>> >> +
>> >
>> > I'm not convinced this is right.
>> >
>> > From your test case, the problem isn't because we have journaled
>> > metadata blocks (which is what bufs) counts, but because fsync()
>> > depends on data blocks also getting flushed out to disks.
>> >
>> > However, if we aren't closing the transaction because of fsync(), I
>> > don't think we need to do a barrier in the case of an external
>> > journal. So instead of effectively unconditionally setting
>> > t_flushed_data_blocks (since bufs is nearly always going to be
>> > non-zero), I think the better fix is to test to see if the journal
>> > device != to the fs data device in fsync(), and if so, start the
>> > barrier operation there.
>> >
>> > Do you agree?
>> Yes.
>
> Just to be clear, since I realized I wrote fsync() when I should have
> written fs/ext4/fsync.c, my proposal was to put this check in
> ext4_sync_file().
This means that we will end up with two io-barriers in a row
for external journal case:
ext4_sync_file()
->jbd2_log_start_commit()
->blkdev_issue_flush(j_fs_dev)
/* seems following flush is mandatory to guarantee the metadata
* consistency */
->blkdev_issue_flush(j_fs_dev)
may be it will be better to pass some sort of barrier flag to
jbd2_log_start_commit() this function mark journal->j_commit_request
with ->t_flushed_data_blocks = 1, so io-carrier will be issued
even if transaction has only metadata
Advantages:
1) approach will works for all data modes
2) only one io-barrier is needed to guarantee the data and
metadata consiscency.
3) changes not so complicated.
I've already started to work with the patch but was surprised with
commit logic.
->__jbd2_log_start_commit(target)
journal->j_commit_request = target
->wake_up(&journal->j_wait_commit)
->kjournald2
transaction = journal->j_running_transaction
->jbd2_journal_commit_transaction(journal);
commit_transaction = journal->j_running_transaction
...
/* So j_commit_request is used only for comparison. But actually
committing journal->j_running_transaction->t_tid transaction
instead of j_commit_request. Why?
*/
>
>> BTW Would it be correct to update j_tail in
>> jbd2_journal_commit_transaction() to something more recent
>> if we have issued an io-barrier to j_fs_dev?
>> This will helps to reduce journal_recovery time which may be really
>> painful in some slow devices.
>
> Um, maybe. We are already calling __jbd2_journal_clean_checkpoint_list(),
> and the barrier operation *is* expensive.
>
> On the other hand, updating the journal superblock on every sync is
> another seek that would have to made before the barrier operation, and
> I'm a bit concerned that this seek would be noticeable. If it is
> noticeable, is it worth it to optimize for the uncommon case (a power
> failure requiring a journal replay) when it might cost us something,
> however, small, on every single journal update?
>
> Do we really think the journal replay time is really something that is
> a major pain point. I can think of optimizations that involve
> skipping writes that will get updated later in future transactions,
> but it means complicating the replay code, which has been stable for a
> long time, and it's not clear to me that the costs are worth the
> benefits.
Never mind. It was just my thoughts. The price of broken recovery stage
is to expansive to fix such rare cases.
>
>> I've take a look at async commit logic: fs/jbd2/commit.c
>> void jbd2_journal_commit_transaction(journal_t *journal)
>> {
>> 725: /* Done it all: now write the commit record asynchronously. */
>> if (JBD2_HAS_INCOMPAT_FEATURE(journal,
>> JBD2_FEATURE_INCOMPAT_ASYNC_COMMIT))
>> {
>> err = journal_submit_commit_record(journal,
>> commit_transaction,
>> &cbh, crc32_sum);
>> if (err)
>> __jbd2_journal_abort_hard(journal);
>> if (journal->j_flags & JBD2_BARRIER)
>> blkdev_issue_flush(journal->j_dev, NULL);
>> <<< blkdev_issue_flush is wait for barrier to complete by default, but
>> <<< in fact we don't have to wait for barrier here. I've prepared a
>> <<< patch wich add flags to control blkdev_issue_flush() wait
>> <<< behavior, and this is the place for no-wait variant.
>
> I think that's right, as long as we're confident that subsequent
> writes won't get scheduled before the no-wait barrier. If it did, it
> would be a bug in the block I/O layer, so it should be OK.
>
> - Ted
>
^ permalink raw reply [flat|nested] 9+ messages in thread
* Re: [PATCH 1/2] ext4/jbd2: fix io-barrier logic in case of external journal
2010-03-22 16:14 ` Dmitry Monakhov
@ 2010-03-22 20:22 ` tytso
2010-03-30 2:14 ` Jan Kara
2010-03-30 1:47 ` Jan Kara
1 sibling, 1 reply; 9+ messages in thread
From: tytso @ 2010-03-22 20:22 UTC (permalink / raw)
To: Dmitry Monakhov; +Cc: linux-ext4
On Mon, Mar 22, 2010 at 07:14:13PM +0300, Dmitry Monakhov wrote:
> >
> > Just to be clear, since I realized I wrote fsync() when I should have
> > written fs/ext4/fsync.c, my proposal was to put this check in
> > ext4_sync_file().
> This means that we will end up with two io-barriers in a row
> for external journal case:
> ext4_sync_file()
> ->jbd2_log_start_commit()
> ->blkdev_issue_flush(j_fs_dev)
> /* seems following flush is mandatory to guarantee the metadata
> * consistency */
> ->blkdev_issue_flush(j_fs_dev)
Well, not if we only issue a barrier in the external barrier case....
but I agree the logic may start getting ugly.
> I've already started to work with the patch but was surprised with
> commit logic.
> ->__jbd2_log_start_commit(target)
> journal->j_commit_request = target
> ->wake_up(&journal->j_wait_commit)
> ->kjournald2
> transaction = journal->j_running_transaction
> ->jbd2_journal_commit_transaction(journal);
> commit_transaction = journal->j_running_transaction
> ...
> /* So j_commit_request is used only for comparison. But actually
> committing journal->j_running_transaction->t_tid transaction
> instead of j_commit_request. Why?
> */
Yeah, I don't think jbd2_log_start_commit() has the semantics that are
currently documented. We will return 0 if we wake up the commit
thread, yes --- but since we only check j_commit_request, and it's a
geq test, it might very well be the case that the transaction was
committed long ago, and so we end up kicking off a transaction commit
when one is not needed. Or, it may be that a transaction is just
already being committed (and in fact is just about to be completed),
and so the wake_up() call is a no-op, and so we don't get a barrier
when one is needed (and expected) by ext4_sync_file().
We need to look at this very closely, but I think
jbd2_log_start_commit() needs to check to see whether there is a
current committing transaction, and whether it is the one which is
that has been requested as a target. If so, and a barrier is
requested, I think we need to have jbd2_log_start_commit() do the
barrier. There is a risk of having a double barrier in that case, but
modifying the flag on the currently committing transaction has the
danger of being lost by kjournald --- or I guess alternatively we
could have kjournald check that flag while holding j_state_lock, which
might be better.
If the currently running transaction is the requested target, then
that's easy; we can set the flag, and then wake up the j_wait_commit
wait queue.
In any case, log_start_commit() doesn't currently distinguish between
whether the targetted commit is the running or the committing
transaction when it returns 0, and I think that's a problem.
- Ted
^ permalink raw reply [flat|nested] 9+ messages in thread
* Re: [PATCH 1/2] ext4/jbd2: fix io-barrier logic in case of external journal
2010-03-22 16:14 ` Dmitry Monakhov
2010-03-22 20:22 ` tytso
@ 2010-03-30 1:47 ` Jan Kara
1 sibling, 0 replies; 9+ messages in thread
From: Jan Kara @ 2010-03-30 1:47 UTC (permalink / raw)
To: Dmitry Monakhov; +Cc: tytso, linux-ext4
> tytso@mit.edu writes:
>
> > On Mon, Mar 22, 2010 at 05:04:19PM +0300, Dmitry Monakhov wrote:
> >> tytso@mit.edu writes:
> >>
> >> > On Fri, Mar 12, 2010 at 08:26:49PM +0300, Dmitry Monakhov wrote:
> >> >> start_journal_io:
> >> >> + if (bufs)
> >> >> + commit_transaction->t_flushed_data_blocks = 1;
> >> >> +
> >> >
> >> > I'm not convinced this is right.
> >> >
> >> > From your test case, the problem isn't because we have journaled
> >> > metadata blocks (which is what bufs) counts, but because fsync()
> >> > depends on data blocks also getting flushed out to disks.
> >> >
> >> > However, if we aren't closing the transaction because of fsync(), I
> >> > don't think we need to do a barrier in the case of an external
> >> > journal. So instead of effectively unconditionally setting
> >> > t_flushed_data_blocks (since bufs is nearly always going to be
> >> > non-zero), I think the better fix is to test to see if the journal
> >> > device != to the fs data device in fsync(), and if so, start the
> >> > barrier operation there.
> >> >
> >> > Do you agree?
> >> Yes.
> >
> > Just to be clear, since I realized I wrote fsync() when I should have
> > written fs/ext4/fsync.c, my proposal was to put this check in
> > ext4_sync_file().
> This means that we will end up with two io-barriers in a row
> for external journal case:
> ext4_sync_file()
> ->jbd2_log_start_commit()
> ->blkdev_issue_flush(j_fs_dev)
> /* seems following flush is mandatory to guarantee the metadata
> * consistency */
> ->blkdev_issue_flush(j_fs_dev)
>
> may be it will be better to pass some sort of barrier flag to
> jbd2_log_start_commit() this function mark journal->j_commit_request
> with ->t_flushed_data_blocks = 1, so io-carrier will be issued
> even if transaction has only metadata
> Advantages:
> 1) approach will works for all data modes
> 2) only one io-barrier is needed to guarantee the data and
> metadata consiscency.
> 3) changes not so complicated.
>
> I've already started to work with the patch but was surprised with
> commit logic.
> ->__jbd2_log_start_commit(target)
> journal->j_commit_request = target
> ->wake_up(&journal->j_wait_commit)
> ->kjournald2
> transaction = journal->j_running_transaction
> ->jbd2_journal_commit_transaction(journal);
> commit_transaction = journal->j_running_transaction
> ...
> /* So j_commit_request is used only for comparison. But actually
> committing journal->j_running_transaction->t_tid transaction
> instead of j_commit_request. Why?
> */
Remember: There is always at most one running transaction and at most one
transaction in the committing state - all the JBD2 and ext4 code heavily relies
on this. So in this case, because kjournald isn't committing any transaction,
we know that there is no committing transaction and at most one running
transaction whose tid = journal->j_commit_sequence + 1. So actually the only
trasaction we can commit is the running one and it does not even make sence to
try committing another one (the check in log_start_commit actually makes sure
that j_commit_request cannot be set to a lower value than it is and the code in
kjournald2 makes sure that j_commit_request >= j_commit_sequence. So I think
the code works right (although I agree it's hard to read - it's an old legacy
;).
Honza
--
Jan Kara <jack@suse.cz>
SuSE CR Labs
^ permalink raw reply [flat|nested] 9+ messages in thread
* Re: [PATCH 1/2] ext4/jbd2: fix io-barrier logic in case of external journal
2010-03-22 20:22 ` tytso
@ 2010-03-30 2:14 ` Jan Kara
0 siblings, 0 replies; 9+ messages in thread
From: Jan Kara @ 2010-03-30 2:14 UTC (permalink / raw)
To: tytso; +Cc: Dmitry Monakhov, linux-ext4
> On Mon, Mar 22, 2010 at 07:14:13PM +0300, Dmitry Monakhov wrote:
> > >
> > > Just to be clear, since I realized I wrote fsync() when I should have
> > > written fs/ext4/fsync.c, my proposal was to put this check in
> > > ext4_sync_file().
> > This means that we will end up with two io-barriers in a row
> > for external journal case:
> > ext4_sync_file()
> > ->jbd2_log_start_commit()
> > ->blkdev_issue_flush(j_fs_dev)
> > /* seems following flush is mandatory to guarantee the metadata
> > * consistency */
> > ->blkdev_issue_flush(j_fs_dev)
>
> Well, not if we only issue a barrier in the external barrier case....
> but I agree the logic may start getting ugly.
>
> > I've already started to work with the patch but was surprised with
> > commit logic.
> > ->__jbd2_log_start_commit(target)
> > journal->j_commit_request = target
> > ->wake_up(&journal->j_wait_commit)
> > ->kjournald2
> > transaction = journal->j_running_transaction
> > ->jbd2_journal_commit_transaction(journal);
> > commit_transaction = journal->j_running_transaction
> > ...
> > /* So j_commit_request is used only for comparison. But actually
> > committing journal->j_running_transaction->t_tid transaction
> > instead of j_commit_request. Why?
> > */
>
> Yeah, I don't think jbd2_log_start_commit() has the semantics that are
> currently documented. We will return 0 if we wake up the commit
> thread, yes --- but since we only check j_commit_request, and it's a
> geq test, it might very well be the case that the transaction was
> committed long ago, and so we end up kicking off a transaction commit
> when one is not needed. Or, it may be that a transaction is just
> already being committed (and in fact is just about to be completed),
> and so the wake_up() call is a no-op, and so we don't get a barrier
> when one is needed (and expected) by ext4_sync_file().
>
> We need to look at this very closely, but I think
> jbd2_log_start_commit() needs to check to see whether there is a
> current committing transaction, and whether it is the one which is
> that has been requested as a target. If so, and a barrier is
> requested, I think we need to have jbd2_log_start_commit() do the
> barrier. There is a risk of having a double barrier in that case, but
> modifying the flag on the currently committing transaction has the
> danger of being lost by kjournald --- or I guess alternatively we
> could have kjournald check that flag while holding j_state_lock, which
> might be better.
But still if you happen to set the flag after commit code has checked
it, you have lost the race. Given the bug you describe below, I think
we should provide a new call from JBD2 that will do all the necessary
magic for given TID - something like:
data_barrier = 0;
if (journal->j_commit_sequence > tid)
data_barrier = 1;
else if (journal->j_committing_transaction &&
journal->j_committing_transaction->t_tid == tid) {
/* Here we could be more clever and set the barrier
* flag of the transaction if according to its state
* it has not yet issued data barrier */
data_barrier = 1;
} else {
journal->j_running_transaction->has_data = 1;
jbd2_log_start_commit(journal, tid);
}
jbd2_log_wait_commit(journal, tid);
if (data_barrier)
blk_dev_issue_flush(journal->j_fs_dev);
What do you think?
> If the currently running transaction is the requested target, then
> that's easy; we can set the flag, and then wake up the j_wait_commit
> wait queue.
>
> In any case, log_start_commit() doesn't currently distinguish between
> whether the targetted commit is the running or the committing
> transaction when it returns 0, and I think that's a problem.
Yeah, I agree this is a problem and it could cause ext4_sync_file not
properly wait. The simplest fix is to call jbd2_log_wait_commit
unconditionally. But I'd maybe go with the above approach.
Honza
--
Jan Kara <jack@suse.cz>
SuSE CR Labs
^ permalink raw reply [flat|nested] 9+ messages in thread
end of thread, other threads:[~2010-03-30 2:14 UTC | newest]
Thread overview: 9+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2010-03-12 17:26 [PATCH 1/2] ext4/jbd2: fix io-barrier logic in case of external journal Dmitry Monakhov
2010-03-12 17:26 ` [PATCH 2/2] ext3, jbd: Add barriers for file systems with exernal journals Dmitry Monakhov
2010-03-22 1:20 ` [PATCH 1/2] ext4/jbd2: fix io-barrier logic in case of external journal tytso
2010-03-22 14:04 ` Dmitry Monakhov
2010-03-22 15:03 ` tytso
2010-03-22 16:14 ` Dmitry Monakhov
2010-03-22 20:22 ` tytso
2010-03-30 2:14 ` Jan Kara
2010-03-30 1:47 ` Jan Kara
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).