* [PATCH 0/4] jbd: possible filesystem corruption fixes (rebased)
@ 2008-05-14 4:43 Hidehiro Kawai
2008-05-14 4:47 ` [PATCH 1/4] jbd: strictly check for write errors on data buffers (rebased) Hidehiro Kawai
` (3 more replies)
0 siblings, 4 replies; 23+ messages in thread
From: Hidehiro Kawai @ 2008-05-14 4:43 UTC (permalink / raw)
To: Andrew Morton, sct, adilger
Cc: linux-kernel, linux-ext4, Jan Kara, Josef Bacik, Mingming Cao,
Hidehiro Kawai, Satoshi OSHIMA, sugita
Subject: [PATCH 0/4] jbd: possible filesystem corruption fixes (rebased)
This is the rebased version against 2.6.26-rc2; there is no
essential difference from the previous post.
The previous post can be found at: http://lkml.org/lkml/2008/4/18/154
(The previous post may have been filtered out as spam
due to a problem with the mail submission.)
This patch set fixes several error handling problems. As a
result, we can protect the filesystem from file data and
structural corruption, especially corruption caused by transient
I/O errors. Do transient I/O errors occur that often? They are
at least not uncommon with iSCSI storage.
These fixes cover only the ext3/JBD parts. The ext4/JBD2
version has not been prepared yet, but merging this patch set
is still worthwhile because it removes possible filesystem
corruption.
[PATCH 1/4] jbd: strictly check for write errors on data buffers
    Without this patch, some file data buffers in ordered mode
    are not checked for errors. This means user processes can
    continue to update the filesystem without noticing the write
    failure. Furthermore, a page cache page whose write failed
    becomes reclaimable. If such a page is reclaimed and we later
    succeed in reading its data back from the disk, data
    corruption occurs because the on-disk data is stale.
    Jan's ordered mode rewrite patch also fixes this problem, but
    this patch is still needed, at least for the current kernel.
[PATCH 2/4] jbd: ordered data integrity fix
    This patch fixes the ordered mode violation caused by a
    write error. Jan's ordered mode rewrite patch will also
    fix this problem.
[PATCH 3/4] jbd: abort when failed to log metadata buffers
    Without this patch, the filesystem can be corrupted by the
    following scenario:
    1. a metadata buffer fails to be written to block B in the journal
    2. the commit record is written successfully
    3. the system crashes, reboots, and mounts the filesystem
    4. in the recovery phase, reading block B succeeds
    5. the read data is written back to the filesystem, but it is
       stale metadata
    6. some files and directories are lost!
    This problem would not happen if JBD had JBD2's journal
    checksumming feature and it were always enabled.
[PATCH 4/4] ext3/jbd: fix error handling for checkpoint io
    Without this patch, the filesystem can lose metadata
    updates even though their transactions have been committed.
Regards,
--
Hidehiro Kawai
Hitachi, Systems Development Laboratory
Linux Technology Center
^ permalink raw reply [flat|nested] 23+ messages in thread* [PATCH 1/4] jbd: strictly check for write errors on data buffers (rebased) 2008-05-14 4:43 [PATCH 0/4] jbd: possible filesystem corruption fixes (rebased) Hidehiro Kawai @ 2008-05-14 4:47 ` Hidehiro Kawai 2008-05-14 12:56 ` Jan Kara 2008-05-14 4:48 ` [PATCH 2/4] jbd: ordered data integrity fix (rebased) Hidehiro Kawai ` (2 subsequent siblings) 3 siblings, 1 reply; 23+ messages in thread From: Hidehiro Kawai @ 2008-05-14 4:47 UTC (permalink / raw) To: Andrew Morton, sct, adilger Cc: Hidehiro Kawai, linux-kernel, linux-ext4, Jan Kara, Josef Bacik, Mingming Cao, Satoshi OSHIMA, sugita Subject: [PATCH 1/4] jbd: strictly check for write errors on data buffers In ordered mode, we should abort journaling when an I/O error has occurred on a file data buffer in the committing transaction. But there can be data buffers which are not checked for error: (a) the buffer which has already been written out by pdflush (b) the buffer which has been unlocked before scanned in the t_locked_list loop This patch adds missing error checks and aborts journaling appropriately. 
Signed-off-by: Hidehiro Kawai <hidehiro.kawai.ez@hitacih.com> --- fs/jbd/commit.c | 14 +++++++++----- 1 file changed, 9 insertions(+), 5 deletions(-) Index: linux-2.6.26-rc2/fs/jbd/commit.c =================================================================== --- linux-2.6.26-rc2.orig/fs/jbd/commit.c +++ linux-2.6.26-rc2/fs/jbd/commit.c @@ -172,7 +172,7 @@ static void journal_do_submit_data(struc /* * Submit all the data buffers to disk */ -static void journal_submit_data_buffers(journal_t *journal, +static int journal_submit_data_buffers(journal_t *journal, transaction_t *commit_transaction) { struct journal_head *jh; @@ -180,6 +180,7 @@ static void journal_submit_data_buffers( int locked; int bufs = 0; struct buffer_head **wbuf = journal->j_wbuf; + int err = 0; /* * Whenever we unlock the journal and sleep, things can get added @@ -253,6 +254,8 @@ write_out_data: put_bh(bh); } else { BUFFER_TRACE(bh, "writeout complete: unfile"); + if (unlikely(!buffer_uptodate(bh))) + err = -EIO; __journal_unfile_buffer(jh); jbd_unlock_bh_state(bh); if (locked) @@ -271,6 +274,8 @@ write_out_data: } spin_unlock(&journal->j_list_lock); journal_do_submit_data(wbuf, bufs); + + return err; } /* @@ -410,8 +415,7 @@ void journal_commit_transaction(journal_ * Now start flushing things to disk, in the order they appear * on the transaction lists. Data blocks go first. */ - err = 0; - journal_submit_data_buffers(journal, commit_transaction); + err = journal_submit_data_buffers(journal, commit_transaction); /* * Wait for all previously submitted IO to complete. @@ -426,10 +430,10 @@ void journal_commit_transaction(journal_ if (buffer_locked(bh)) { spin_unlock(&journal->j_list_lock); wait_on_buffer(bh); - if (unlikely(!buffer_uptodate(bh))) - err = -EIO; spin_lock(&journal->j_list_lock); } + if (unlikely(!buffer_uptodate(bh))) + err = -EIO; if (!inverted_lock(journal, bh)) { put_bh(bh); spin_lock(&journal->j_list_lock); ^ permalink raw reply [flat|nested] 23+ messages in thread
* Re: [PATCH 1/4] jbd: strictly check for write errors on data buffers (rebased) 2008-05-14 4:47 ` [PATCH 1/4] jbd: strictly check for write errors on data buffers (rebased) Hidehiro Kawai @ 2008-05-14 12:56 ` Jan Kara 0 siblings, 0 replies; 23+ messages in thread From: Jan Kara @ 2008-05-14 12:56 UTC (permalink / raw) To: Hidehiro Kawai Cc: Andrew Morton, sct, adilger, linux-kernel, linux-ext4, Jan Kara, Josef Bacik, Mingming Cao, Satoshi OSHIMA, sugita Hi, On Wed 14-05-08 13:47:31, Hidehiro Kawai wrote: > Subject: [PATCH 1/4] jbd: strictly check for write errors on data buffers > > In ordered mode, we should abort journaling when an I/O error has > occurred on a file data buffer in the committing transaction. But > there can be data buffers which are not checked for error: > > (a) the buffer which has already been written out by pdflush > (b) the buffer which has been unlocked before scanned in the > t_locked_list loop > > This patch adds missing error checks and aborts journaling > appropriately. 
> > Signed-off-by: Hidehiro Kawai <hidehiro.kawai.ez@hitacih.com> Acked-by: Jan Kara <jack@suse.cz> Honza > --- > fs/jbd/commit.c | 14 +++++++++----- > 1 file changed, 9 insertions(+), 5 deletions(-) > > Index: linux-2.6.26-rc2/fs/jbd/commit.c > =================================================================== > --- linux-2.6.26-rc2.orig/fs/jbd/commit.c > +++ linux-2.6.26-rc2/fs/jbd/commit.c > @@ -172,7 +172,7 @@ static void journal_do_submit_data(struc > /* > * Submit all the data buffers to disk > */ > -static void journal_submit_data_buffers(journal_t *journal, > +static int journal_submit_data_buffers(journal_t *journal, > transaction_t *commit_transaction) > { > struct journal_head *jh; > @@ -180,6 +180,7 @@ static void journal_submit_data_buffers( > int locked; > int bufs = 0; > struct buffer_head **wbuf = journal->j_wbuf; > + int err = 0; > > /* > * Whenever we unlock the journal and sleep, things can get added > @@ -253,6 +254,8 @@ write_out_data: > put_bh(bh); > } else { > BUFFER_TRACE(bh, "writeout complete: unfile"); > + if (unlikely(!buffer_uptodate(bh))) > + err = -EIO; > __journal_unfile_buffer(jh); > jbd_unlock_bh_state(bh); > if (locked) > @@ -271,6 +274,8 @@ write_out_data: > } > spin_unlock(&journal->j_list_lock); > journal_do_submit_data(wbuf, bufs); > + > + return err; > } > > /* > @@ -410,8 +415,7 @@ void journal_commit_transaction(journal_ > * Now start flushing things to disk, in the order they appear > * on the transaction lists. Data blocks go first. > */ > - err = 0; > - journal_submit_data_buffers(journal, commit_transaction); > + err = journal_submit_data_buffers(journal, commit_transaction); > > /* > * Wait for all previously submitted IO to complete. 
> @@ -426,10 +430,10 @@ void journal_commit_transaction(journal_ > if (buffer_locked(bh)) { > spin_unlock(&journal->j_list_lock); > wait_on_buffer(bh); > - if (unlikely(!buffer_uptodate(bh))) > - err = -EIO; > spin_lock(&journal->j_list_lock); > } > + if (unlikely(!buffer_uptodate(bh))) > + err = -EIO; > if (!inverted_lock(journal, bh)) { > put_bh(bh); > spin_lock(&journal->j_list_lock); > > -- Jan Kara <jack@suse.cz> SUSE Labs, CR ^ permalink raw reply [flat|nested] 23+ messages in thread
* [PATCH 2/4] jbd: ordered data integrity fix (rebased) 2008-05-14 4:43 [PATCH 0/4] jbd: possible filesystem corruption fixes (rebased) Hidehiro Kawai 2008-05-14 4:47 ` [PATCH 1/4] jbd: strictly check for write errors on data buffers (rebased) Hidehiro Kawai @ 2008-05-14 4:48 ` Hidehiro Kawai 2008-05-14 13:10 ` Jan Kara 2008-05-14 4:49 ` [PATCH 3/4] jbd: abort when failed to log metadata buffers (rebased) Hidehiro Kawai 2008-05-14 4:50 ` [PATCH 4/4] jbd: fix error handling for checkpoint io (rebased) Hidehiro Kawai 3 siblings, 1 reply; 23+ messages in thread From: Hidehiro Kawai @ 2008-05-14 4:48 UTC (permalink / raw) To: Andrew Morton, sct, adilger Cc: Hidehiro Kawai, linux-kernel, linux-ext4, Jan Kara, Josef Bacik, Mingming Cao, Satoshi OSHIMA, sugita Subject: [PATCH 2/4] jbd: ordered data integrity fix In ordered mode, if a buffer being dirtied exists in the committing transaction, we write the buffer to the disk, move it from the committing transaction to the running transaction, then dirty it. But we don't have to remove the buffer from the committing transaction when the buffer couldn't be written out, otherwise it breaks the ordered mode rule. 
Signed-off-by: Hidehiro Kawai <hidehiro.kawai.ez@hitachi.com> --- fs/jbd/transaction.c | 16 +++++++++++++--- 1 file changed, 13 insertions(+), 3 deletions(-) Index: linux-2.6.26-rc2/fs/jbd/transaction.c =================================================================== --- linux-2.6.26-rc2.orig/fs/jbd/transaction.c +++ linux-2.6.26-rc2/fs/jbd/transaction.c @@ -954,9 +954,10 @@ int journal_dirty_data(handle_t *handle, journal_t *journal = handle->h_transaction->t_journal; int need_brelse = 0; struct journal_head *jh; + int ret = 0; if (is_handle_aborted(handle)) - return 0; + return ret; jh = journal_add_journal_head(bh); JBUFFER_TRACE(jh, "entry"); @@ -1067,7 +1068,16 @@ int journal_dirty_data(handle_t *handle, time if it is redirtied */ } - /* journal_clean_data_list() may have got there first */ + /* + * We shouldn't remove the buffer from the committing + * transaction if it has failed to be written. + * Otherwise, it breaks the ordered mode rule. + */ + if (unlikely(!buffer_uptodate(bh))) { + ret = -EIO; + goto no_journal; + } + if (jh->b_transaction != NULL) { JBUFFER_TRACE(jh, "unfile from commit"); __journal_temp_unlink_buffer(jh); @@ -1108,7 +1118,7 @@ no_journal: } JBUFFER_TRACE(jh, "exit"); journal_put_journal_head(jh); - return 0; + return ret; } /** ^ permalink raw reply [flat|nested] 23+ messages in thread
* Re: [PATCH 2/4] jbd: ordered data integrity fix (rebased) 2008-05-14 4:48 ` [PATCH 2/4] jbd: ordered data integrity fix (rebased) Hidehiro Kawai @ 2008-05-14 13:10 ` Jan Kara 2008-05-16 10:25 ` Hidehiro Kawai 0 siblings, 1 reply; 23+ messages in thread From: Jan Kara @ 2008-05-14 13:10 UTC (permalink / raw) To: Hidehiro Kawai Cc: Andrew Morton, sct, adilger, linux-kernel, linux-ext4, Jan Kara, Josef Bacik, Mingming Cao, Satoshi OSHIMA, sugita On Wed 14-05-08 13:48:43, Hidehiro Kawai wrote: > Subject: [PATCH 2/4] jbd: ordered data integrity fix > > In ordered mode, if a buffer being dirtied exists in the committing > transaction, we write the buffer to the disk, move it from the > committing transaction to the running transaction, then dirty it. > But we don't have to remove the buffer from the committing > transaction when the buffer couldn't be written out, otherwise it > breaks the ordered mode rule. Hmm, could you elaborate a bit more what exactly is broken and how does this help to fix it? Because even if we find EIO happened on data buffer, we currently don't do anything else than just remove the buffer from the transaction and abort the journal. And even if we later managed to write the data buffer from other process before the journal is aborted, ordered mode guarantees are satisfied - we only guarantee that too old data cannot be seen, newer can be seen easily... Thanks. 
Honza > > Signed-off-by: Hidehiro Kawai <hidehiro.kawai.ez@hitachi.com> > --- > fs/jbd/transaction.c | 16 +++++++++++++--- > 1 file changed, 13 insertions(+), 3 deletions(-) > > Index: linux-2.6.26-rc2/fs/jbd/transaction.c > =================================================================== > --- linux-2.6.26-rc2.orig/fs/jbd/transaction.c > +++ linux-2.6.26-rc2/fs/jbd/transaction.c > @@ -954,9 +954,10 @@ int journal_dirty_data(handle_t *handle, > journal_t *journal = handle->h_transaction->t_journal; > int need_brelse = 0; > struct journal_head *jh; > + int ret = 0; > > if (is_handle_aborted(handle)) > - return 0; > + return ret; > > jh = journal_add_journal_head(bh); > JBUFFER_TRACE(jh, "entry"); > @@ -1067,7 +1068,16 @@ int journal_dirty_data(handle_t *handle, > time if it is redirtied */ > } > > - /* journal_clean_data_list() may have got there first */ > + /* > + * We shouldn't remove the buffer from the committing > + * transaction if it has failed to be written. > + * Otherwise, it breaks the ordered mode rule. > + */ > + if (unlikely(!buffer_uptodate(bh))) { > + ret = -EIO; > + goto no_journal; > + } > + > if (jh->b_transaction != NULL) { > JBUFFER_TRACE(jh, "unfile from commit"); > __journal_temp_unlink_buffer(jh); > @@ -1108,7 +1118,7 @@ no_journal: > } > JBUFFER_TRACE(jh, "exit"); > journal_put_journal_head(jh); > - return 0; > + return ret; > } > > /** > > -- Jan Kara <jack@suse.cz> SUSE Labs, CR ^ permalink raw reply [flat|nested] 23+ messages in thread
* Re: [PATCH 2/4] jbd: ordered data integrity fix (rebased) 2008-05-14 13:10 ` Jan Kara @ 2008-05-16 10:25 ` Hidehiro Kawai 2008-05-19 3:11 ` Jan Kara 0 siblings, 1 reply; 23+ messages in thread From: Hidehiro Kawai @ 2008-05-16 10:25 UTC (permalink / raw) To: Jan Kara Cc: Andrew Morton, sct, adilger, linux-kernel, linux-ext4, Josef Bacik, Mingming Cao, Satoshi OSHIMA, sugita Hi, Thanks for your comment. Jan Kara wrote: > On Wed 14-05-08 13:48:43, Hidehiro Kawai wrote: > >>Subject: [PATCH 2/4] jbd: ordered data integrity fix >> >>In ordered mode, if a buffer being dirtied exists in the committing >>transaction, we write the buffer to the disk, move it from the >>committing transaction to the running transaction, then dirty it. >>But we don't have to remove the buffer from the committing >>transaction when the buffer couldn't be written out, otherwise it >>breaks the ordered mode rule. > > Hmm, could you elaborate a bit more what exactly is broken and how does > this help to fix it? Because even if we find EIO happened on data buffer, > we currently don't do anything else than just remove the buffer from the > transaction and abort the journal. And even if we later managed to write > the data buffer from other process before the journal is aborted, ordered > mode guarantees are satisfied - we only guarantee that too old data cannot > be seen, newer can be seen easily... Thanks. In the case where I stated the above, error checking is postponed to the next (currently running) transaction because the buffer is removed from the committing transaction before checked for an error. This can happen repeatedly, then the error won't be detected "for a long time". However, finally the error is detected by, for example, journal_commit_transaction(), we can abort the journal. So this problem is not so serious than the other patches which I sent. 
Thanks, -- Hidehiro Kawai Hitachi, Systems Development Laboratory Linux Technology Center ^ permalink raw reply [flat|nested] 23+ messages in thread
* Re: [PATCH 2/4] jbd: ordered data integrity fix (rebased) 2008-05-16 10:25 ` Hidehiro Kawai @ 2008-05-19 3:11 ` Jan Kara 0 siblings, 0 replies; 23+ messages in thread From: Jan Kara @ 2008-05-19 3:11 UTC (permalink / raw) To: Hidehiro Kawai Cc: Andrew Morton, sct, adilger, linux-kernel, linux-ext4, Josef Bacik, Mingming Cao, Satoshi OSHIMA, sugita Hello, On Fri 16-05-08 19:25:44, Hidehiro Kawai wrote: > Jan Kara wrote: > > > On Wed 14-05-08 13:48:43, Hidehiro Kawai wrote: > > > >>Subject: [PATCH 2/4] jbd: ordered data integrity fix > >> > >>In ordered mode, if a buffer being dirtied exists in the committing > >>transaction, we write the buffer to the disk, move it from the > >>committing transaction to the running transaction, then dirty it. > >>But we don't have to remove the buffer from the committing > >>transaction when the buffer couldn't be written out, otherwise it > >>breaks the ordered mode rule. > > > > Hmm, could you elaborate a bit more what exactly is broken and how does > > this help to fix it? Because even if we find EIO happened on data buffer, > > we currently don't do anything else than just remove the buffer from the > > transaction and abort the journal. And even if we later managed to write > > the data buffer from other process before the journal is aborted, ordered > > mode guarantees are satisfied - we only guarantee that too old data cannot > > be seen, newer can be seen easily... Thanks. > > In the case where I stated the above, error checking is postponed to > the next (currently running) transaction because the buffer is removed > from the committing transaction before checked for an error. This can > happen repeatedly, then the error won't be detected "for a long time". > However, finally the error is detected by, for example, > journal_commit_transaction(), we can abort the journal. So this > problem is not so serious than the other patches which I sent. OK, I see. 
So I agree with the change but please add this explanation (like: cannot remove buffer with io error from the committing transaction because otherwise it would miss the error and commit would not abort) to the comment in journal_dirty_data(). Thanks. Honza -- Jan Kara <jack@suse.cz> SUSE Labs, CR ^ permalink raw reply [flat|nested] 23+ messages in thread
* [PATCH 3/4] jbd: abort when failed to log metadata buffers (rebased) 2008-05-14 4:43 [PATCH 0/4] jbd: possible filesystem corruption fixes (rebased) Hidehiro Kawai 2008-05-14 4:47 ` [PATCH 1/4] jbd: strictly check for write errors on data buffers (rebased) Hidehiro Kawai 2008-05-14 4:48 ` [PATCH 2/4] jbd: ordered data integrity fix (rebased) Hidehiro Kawai @ 2008-05-14 4:49 ` Hidehiro Kawai 2008-05-14 13:15 ` Jan Kara 2008-05-14 4:50 ` [PATCH 4/4] jbd: fix error handling for checkpoint io (rebased) Hidehiro Kawai 3 siblings, 1 reply; 23+ messages in thread From: Hidehiro Kawai @ 2008-05-14 4:49 UTC (permalink / raw) To: Andrew Morton, sct, adilger Cc: Hidehiro Kawai, linux-kernel, linux-ext4, Jan Kara, Josef Bacik, Mingming Cao, Satoshi OSHIMA, sugita Subject: [PATCH 3/4] jbd: abort when failed to log metadata buffers If we failed to write metadata buffers to the journal space and succeeded to write the commit record, stale data can be written back to the filesystem as metadata in the recovery phase. To avoid this, when we failed to write out metadata buffers, abort the journal before writing the commit record. Signed-off-by: Hidehiro Kawai <hidehiro.kawai.ez@hitachi.com> --- fs/jbd/commit.c | 3 +++ 1 file changed, 3 insertions(+) Index: linux-2.6.26-rc2/fs/jbd/commit.c =================================================================== --- linux-2.6.26-rc2.orig/fs/jbd/commit.c +++ linux-2.6.26-rc2/fs/jbd/commit.c @@ -703,6 +703,9 @@ wait_for_iobuf: __brelse(bh); } + if (err) + journal_abort(journal, err); + J_ASSERT (commit_transaction->t_shadow_list == NULL); jbd_debug(3, "JBD: commit phase 5\n"); ^ permalink raw reply [flat|nested] 23+ messages in thread
* Re: [PATCH 3/4] jbd: abort when failed to log metadata buffers (rebased) 2008-05-14 4:49 ` [PATCH 3/4] jbd: abort when failed to log metadata buffers (rebased) Hidehiro Kawai @ 2008-05-14 13:15 ` Jan Kara 2008-05-16 10:26 ` Hidehiro Kawai 0 siblings, 1 reply; 23+ messages in thread From: Jan Kara @ 2008-05-14 13:15 UTC (permalink / raw) To: Hidehiro Kawai Cc: Andrew Morton, sct, adilger, linux-kernel, linux-ext4, Jan Kara, Josef Bacik, Mingming Cao, Satoshi OSHIMA, sugita On Wed 14-05-08 13:49:51, Hidehiro Kawai wrote: > Subject: [PATCH 3/4] jbd: abort when failed to log metadata buffers > > If we failed to write metadata buffers to the journal space and > succeeded to write the commit record, stale data can be written > back to the filesystem as metadata in the recovery phase. > > To avoid this, when we failed to write out metadata buffers, > abort the journal before writing the commit record. > > Signed-off-by: Hidehiro Kawai <hidehiro.kawai.ez@hitachi.com> > --- > fs/jbd/commit.c | 3 +++ > 1 file changed, 3 insertions(+) > > Index: linux-2.6.26-rc2/fs/jbd/commit.c > =================================================================== > --- linux-2.6.26-rc2.orig/fs/jbd/commit.c > +++ linux-2.6.26-rc2/fs/jbd/commit.c > @@ -703,6 +703,9 @@ wait_for_iobuf: > __brelse(bh); > } > > + if (err) > + journal_abort(journal, err); > + > J_ASSERT (commit_transaction->t_shadow_list == NULL); Shouldn't this rather be further just before journal_write_commit_record()? We should abort also if writing revoke records etc. failed, shouldn't we? > jbd_debug(3, "JBD: commit phase 5\n"); > Honza -- Jan Kara <jack@suse.cz> SUSE Labs, CR ^ permalink raw reply [flat|nested] 23+ messages in thread
* Re: [PATCH 3/4] jbd: abort when failed to log metadata buffers (rebased) 2008-05-14 13:15 ` Jan Kara @ 2008-05-16 10:26 ` Hidehiro Kawai 2008-05-19 3:14 ` Jan Kara 0 siblings, 1 reply; 23+ messages in thread From: Hidehiro Kawai @ 2008-05-16 10:26 UTC (permalink / raw) To: Jan Kara Cc: Andrew Morton, sct, adilger, linux-kernel, linux-ext4, Josef Bacik, Mingming Cao, Satoshi OSHIMA, sugita Hi, Thank you for review. Jan Kara wrote: > On Wed 14-05-08 13:49:51, Hidehiro Kawai wrote: > >>Subject: [PATCH 3/4] jbd: abort when failed to log metadata buffers >> >>If we failed to write metadata buffers to the journal space and >>succeeded to write the commit record, stale data can be written >>back to the filesystem as metadata in the recovery phase. >> >>To avoid this, when we failed to write out metadata buffers, >>abort the journal before writing the commit record. >> >>Signed-off-by: Hidehiro Kawai <hidehiro.kawai.ez@hitachi.com> >>--- >> fs/jbd/commit.c | 3 +++ >> 1 file changed, 3 insertions(+) >> >>Index: linux-2.6.26-rc2/fs/jbd/commit.c >>=================================================================== >>--- linux-2.6.26-rc2.orig/fs/jbd/commit.c >>+++ linux-2.6.26-rc2/fs/jbd/commit.c >>@@ -703,6 +703,9 @@ wait_for_iobuf: >> __brelse(bh); >> } >> >>+ if (err) >>+ journal_abort(journal, err); >>+ >> J_ASSERT (commit_transaction->t_shadow_list == NULL); > > Shouldn't this rather be further just before > journal_write_commit_record()? We should abort also if writing revoke > records etc. failed, shouldn't we? Unlike metadata blocks, each revoke block has a descriptor with the sequence number of the commiting transaction. If we failed to write a revoke block, there should be an old control block, metadata block, or zero-filled block where we tried to write the revoke block. In the recovery process, this old invalid block is detected by checking its magic number and sequence number, then the transaction is ignored even if we have succeeded to write the commit record. 
So I think we don't need to check for errors just after writing revoke records. Thanks, >> jbd_debug(3, "JBD: commit phase 5\n"); >> -- Hidehiro Kawai Hitachi, Systems Development Laboratory Linux Technology Center ^ permalink raw reply [flat|nested] 23+ messages in thread
* Re: [PATCH 3/4] jbd: abort when failed to log metadata buffers (rebased) 2008-05-16 10:26 ` Hidehiro Kawai @ 2008-05-19 3:14 ` Jan Kara 2008-05-21 1:33 ` Hidehiro Kawai 0 siblings, 1 reply; 23+ messages in thread From: Jan Kara @ 2008-05-19 3:14 UTC (permalink / raw) To: Hidehiro Kawai Cc: Andrew Morton, sct, adilger, linux-kernel, linux-ext4, Josef Bacik, Mingming Cao, Satoshi OSHIMA, sugita Hello, On Fri 16-05-08 19:26:57, Hidehiro Kawai wrote: > Jan Kara wrote: > > > On Wed 14-05-08 13:49:51, Hidehiro Kawai wrote: > > > >>Subject: [PATCH 3/4] jbd: abort when failed to log metadata buffers > >> > >>If we failed to write metadata buffers to the journal space and > >>succeeded to write the commit record, stale data can be written > >>back to the filesystem as metadata in the recovery phase. > >> > >>To avoid this, when we failed to write out metadata buffers, > >>abort the journal before writing the commit record. > >> > >>Signed-off-by: Hidehiro Kawai <hidehiro.kawai.ez@hitachi.com> > >>--- > >> fs/jbd/commit.c | 3 +++ > >> 1 file changed, 3 insertions(+) > >> > >>Index: linux-2.6.26-rc2/fs/jbd/commit.c > >>=================================================================== > >>--- linux-2.6.26-rc2.orig/fs/jbd/commit.c > >>+++ linux-2.6.26-rc2/fs/jbd/commit.c > >>@@ -703,6 +703,9 @@ wait_for_iobuf: > >> __brelse(bh); > >> } > >> > >>+ if (err) > >>+ journal_abort(journal, err); > >>+ > >> J_ASSERT (commit_transaction->t_shadow_list == NULL); > > > > Shouldn't this rather be further just before > > journal_write_commit_record()? We should abort also if writing revoke > > records etc. failed, shouldn't we? > > Unlike metadata blocks, each revoke block has a descriptor with the > sequence number of the commiting transaction. If we failed to write > a revoke block, there should be an old control block, metadata block, > or zero-filled block where we tried to write the revoke block. 
> In the recovery process, this old invalid block is detected by > checking its magic number and sequence number, then the transaction > is ignored even if we have succeeded to write the commit record. > So I think we don't need to check for errors just after writing > revoke records. Yes, I agree that not doing such check will not cause data corruption but still I think that in case we fail to properly commit a transaction, we should detect the error and abort the journal... Honza -- Jan Kara <jack@suse.cz> SUSE Labs, CR ^ permalink raw reply [flat|nested] 23+ messages in thread
* Re: [PATCH 3/4] jbd: abort when failed to log metadata buffers (rebased) 2008-05-19 3:14 ` Jan Kara @ 2008-05-21 1:33 ` Hidehiro Kawai 0 siblings, 0 replies; 23+ messages in thread From: Hidehiro Kawai @ 2008-05-21 1:33 UTC (permalink / raw) To: Jan Kara Cc: Andrew Morton, sct, adilger, linux-kernel, linux-ext4, Josef Bacik, Mingming Cao, Satoshi OSHIMA, sugita Hi, Jan Kara wrote: > > On Fri 16-05-08 19:26:57, Hidehiro Kawai wrote: > >>Jan Kara wrote: >> >> >>>On Wed 14-05-08 13:49:51, Hidehiro Kawai wrote: >>> >>> >>>>Subject: [PATCH 3/4] jbd: abort when failed to log metadata buffers >>>> >>>>If we failed to write metadata buffers to the journal space and >>>>succeeded to write the commit record, stale data can be written >>>>back to the filesystem as metadata in the recovery phase. >>>> >>>>To avoid this, when we failed to write out metadata buffers, >>>>abort the journal before writing the commit record. >>>> >>>>Signed-off-by: Hidehiro Kawai <hidehiro.kawai.ez@hitachi.com> >>>>--- >>>>fs/jbd/commit.c | 3 +++ >>>>1 file changed, 3 insertions(+) >>>> >>>>Index: linux-2.6.26-rc2/fs/jbd/commit.c >>>>=================================================================== >>>>--- linux-2.6.26-rc2.orig/fs/jbd/commit.c >>>>+++ linux-2.6.26-rc2/fs/jbd/commit.c >>>>@@ -703,6 +703,9 @@ wait_for_iobuf: >>>> __brelse(bh); >>>> } >>>> >>>>+ if (err) >>>>+ journal_abort(journal, err); >>>>+ >>>> J_ASSERT (commit_transaction->t_shadow_list == NULL); >>> >>> Shouldn't this rather be further just before >>>journal_write_commit_record()? We should abort also if writing revoke >>>records etc. failed, shouldn't we? >> >>Unlike metadata blocks, each revoke block has a descriptor with the >>sequence number of the commiting transaction. If we failed to write >>a revoke block, there should be an old control block, metadata block, >>or zero-filled block where we tried to write the revoke block. 
>>In the recovery process, this old invalid block is detected by >>checking its magic number and sequence number, then the transaction >>is ignored even if we have succeeded to write the commit record. >>So I think we don't need to check for errors just after writing >>revoke records. > > Yes, I agree that not doing such check will not cause data corruption but > still I think that in case we fail to properly commit a transaction, we > should detect the error and abort the journal... I see. I'll move the aborting point to just before journal_write_commit_record() in the next version. Thanks, -- Hidehiro Kawai Hitachi, Systems Development Laboratory Linux Technology Center ^ permalink raw reply [flat|nested] 23+ messages in thread
* [PATCH 4/4] jbd: fix error handling for checkpoint io (rebased) 2008-05-14 4:43 [PATCH 0/4] jbd: possible filesystem corruption fixes (rebased) Hidehiro Kawai ` (2 preceding siblings ...) 2008-05-14 4:49 ` [PATCH 3/4] jbd: abort when failed to log metadata buffers (rebased) Hidehiro Kawai @ 2008-05-14 4:50 ` Hidehiro Kawai 2008-05-14 13:16 ` Josef Bacik 2008-05-14 14:32 ` Jan Kara 3 siblings, 2 replies; 23+ messages in thread From: Hidehiro Kawai @ 2008-05-14 4:50 UTC (permalink / raw) To: Andrew Morton, sct, adilger Cc: Hidehiro Kawai, linux-kernel, linux-ext4, Jan Kara, Josef Bacik, Mingming Cao, Satoshi OSHIMA, sugita Subject: [PATCH 4/4] jbd: fix error handling for checkpoint io When a checkpointing IO fails, current JBD code doesn't check the error and continue journaling. This means latest metadata can be lost from both the journal and filesystem. This patch leaves the failed metadata blocks in the journal space and aborts journaling in the case of log_do_checkpoint(). To achieve this, we need to do: 1. don't remove the failed buffer from the checkpoint list where in the case of __try_to_free_cp_buf() because it may be released or overwritten by a later transaction 2. log_do_checkpoint() is the last chance, remove the failed buffer from the checkpoint list and abort journaling 3. when checkpointing fails, don't update the journal super block to prevent the journalled contents from being cleaned 4. when checkpointing fails, don't clear the needs_recovery flag, otherwise the journalled contents are ignored and cleaned in the recovery phase 5. 
if the recovery fails, keep the needs_recovery flag

Signed-off-by: Hidehiro Kawai <hidehiro.kawai.ez@hitachi.com>
---
 fs/ext3/ioctl.c     |   12 ++++----
 fs/ext3/super.c     |   13 +++++++--
 fs/jbd/checkpoint.c |   59 ++++++++++++++++++++++++++++++++++--------
 fs/jbd/journal.c    |   21 +++++++++-----
 fs/jbd/recovery.c   |    7 +++-
 include/linux/jbd.h |    7 ++++
 6 files changed, 91 insertions(+), 28 deletions(-)

Index: linux-2.6.26-rc2/fs/jbd/checkpoint.c
===================================================================
--- linux-2.6.26-rc2.orig/fs/jbd/checkpoint.c
+++ linux-2.6.26-rc2/fs/jbd/checkpoint.c
@@ -93,7 +93,8 @@ static int __try_to_free_cp_buf(struct j
 	int ret = 0;
 	struct buffer_head *bh = jh2bh(jh);
 
-	if (jh->b_jlist == BJ_None && !buffer_locked(bh) && !buffer_dirty(bh)) {
+	if (jh->b_jlist == BJ_None && !buffer_locked(bh) &&
+	    !buffer_dirty(bh) && buffer_uptodate(bh)) {
 		JBUFFER_TRACE(jh, "remove from checkpoint list");
 		ret = __journal_remove_checkpoint(jh) + 1;
 		jbd_unlock_bh_state(bh);
@@ -140,6 +141,25 @@ void __log_wait_for_space(journal_t *jou
 }
 
 /*
+ * We couldn't write back some metadata buffers to the filesystem.
+ * To avoid unwritten-back metadata buffers from losing, set
+ * JFS_CP_ABORT flag and make sure that the journal space isn't
+ * cleaned.  This function also aborts journaling.
+ */
+static void __journal_abort_checkpoint(journal_t *journal, int errno)
+{
+	if (is_checkpoint_aborted(journal))
+		return;
+
+	spin_lock(&journal->j_state_lock);
+	journal->j_flags |= JFS_CP_ABORT;
+	spin_unlock(&journal->j_state_lock);
+	printk(KERN_WARNING "JBD: Checkpointing failed.  Some metadata blocks "
+		"are still old.\n");
+	journal_abort(journal, errno);
+}
+
+/*
  * We were unable to perform jbd_trylock_bh_state() inside j_list_lock.
  * The caller must restart a list walk.  Wait for someone else to run
  * jbd_unlock_bh_state().
@@ -160,21 +180,25 @@ static void jbd_sync_bh(journal_t *journ
  * buffers.  Note that we take the buffers in the opposite ordering
  * from the one in which they were submitted for IO.
  *
+ * Return 0 on success, and return <0 if some buffers have failed
+ * to be written out.
+ *
  * Called with j_list_lock held.
  */
-static void __wait_cp_io(journal_t *journal, transaction_t *transaction)
+static int __wait_cp_io(journal_t *journal, transaction_t *transaction)
 {
 	struct journal_head *jh;
 	struct buffer_head *bh;
 	tid_t this_tid;
 	int released = 0;
+	int ret = 0;
 
 	this_tid = transaction->t_tid;
 restart:
 	/* Did somebody clean up the transaction in the meanwhile? */
 	if (journal->j_checkpoint_transactions != transaction ||
 			transaction->t_tid != this_tid)
-		return;
+		return ret;
 	while (!released && transaction->t_checkpoint_io_list) {
 		jh = transaction->t_checkpoint_io_list;
 		bh = jh2bh(jh);
@@ -194,6 +218,9 @@ restart:
 			spin_lock(&journal->j_list_lock);
 			goto restart;
 		}
+		if (unlikely(!buffer_uptodate(bh)))
+			ret = -EIO;
+
 		/*
 		 * Now in whatever state the buffer currently is, we know that
 		 * it has been written out and so we can drop it from the list
@@ -203,6 +230,8 @@ restart:
 		journal_remove_journal_head(bh);
 		__brelse(bh);
 	}
+
+	return ret;
 }
 
 #define NR_BATCH	64
@@ -226,7 +255,8 @@ __flush_batch(journal_t *journal, struct
  * Try to flush one buffer from the checkpoint list to disk.
  *
  * Return 1 if something happened which requires us to abort the current
- * scan of the checkpoint list.
+ * scan of the checkpoint list.  Return <0 if the buffer has failed to
+ * be written out.
  *
  * Called with j_list_lock held and drops it if 1 is returned
  * Called under jbd_lock_bh_state(jh2bh(jh)), and drops it
@@ -256,6 +286,9 @@ static int __process_buffer(journal_t *j
 		log_wait_commit(journal, tid);
 		ret = 1;
 	} else if (!buffer_dirty(bh)) {
+		ret = 1;
+		if (unlikely(!buffer_uptodate(bh)))
+			ret = -EIO;
 		J_ASSERT_JH(jh, !buffer_jbddirty(bh));
 		BUFFER_TRACE(bh, "remove from checkpoint");
 		__journal_remove_checkpoint(jh);
@@ -263,7 +296,6 @@ static int __process_buffer(journal_t *j
 		jbd_unlock_bh_state(bh);
 		journal_remove_journal_head(bh);
 		__brelse(bh);
-		ret = 1;
 	} else {
 		/*
 		 * Important: we are about to write the buffer, and
@@ -318,6 +350,7 @@ int log_do_checkpoint(journal_t *journal
 	 * OK, we need to start writing disk blocks.  Take one transaction
 	 * and write it.
 	 */
+	result = 0;
 	spin_lock(&journal->j_list_lock);
 	if (!journal->j_checkpoint_transactions)
 		goto out;
@@ -334,7 +367,7 @@ restart:
 		int batch_count = 0;
 		struct buffer_head *bhs[NR_BATCH];
 		struct journal_head *jh;
-		int retry = 0;
+		int retry = 0, err;
 
 		while (!retry && transaction->t_checkpoint_list) {
 			struct buffer_head *bh;
@@ -347,6 +380,8 @@ restart:
 				break;
 			}
 			retry = __process_buffer(journal, jh, bhs,&batch_count);
+			if (retry < 0)
+				result = retry;
 			if (!retry && (need_resched() ||
 				spin_needbreak(&journal->j_list_lock))) {
 				spin_unlock(&journal->j_list_lock);
@@ -371,14 +406,18 @@ restart:
 		 * Now we have cleaned up the first transaction's checkpoint
 		 * list.  Let's clean up the second one
 		 */
-		__wait_cp_io(journal, transaction);
+		err = __wait_cp_io(journal, transaction);
+		if (!result)
+			result = err;
 	}
 out:
 	spin_unlock(&journal->j_list_lock);
-	result = cleanup_journal_tail(journal);
 	if (result < 0)
-		return result;
-	return 0;
+		__journal_abort_checkpoint(journal, result);
+	else
+		result = cleanup_journal_tail(journal);
+
+	return (result < 0) ? result : 0;
 }
 
 /*
Index: linux-2.6.26-rc2/include/linux/jbd.h
===================================================================
--- linux-2.6.26-rc2.orig/include/linux/jbd.h
+++ linux-2.6.26-rc2/include/linux/jbd.h
@@ -816,6 +816,8 @@ struct journal_s
 #define JFS_FLUSHED	0x008	/* The journal superblock has been flushed */
 #define JFS_LOADED	0x010	/* The journal superblock has been loaded */
 #define JFS_BARRIER	0x020	/* Use IDE barriers */
+#define JFS_CP_ABORT	0x040	/* Checkpointing has failed.  We don't have to
+				 * clean the journal space. */
 
 /*
  * Function declarations for the journaling transaction and buffer
@@ -1004,6 +1006,11 @@ static inline int is_journal_aborted(jou
 	return journal->j_flags & JFS_ABORT;
 }
 
+static inline int is_checkpoint_aborted(journal_t *journal)
+{
+	return journal->j_flags & JFS_CP_ABORT;
+}
+
 static inline int is_handle_aborted(handle_t *handle)
 {
 	if (handle->h_aborted)
Index: linux-2.6.26-rc2/fs/ext3/ioctl.c
===================================================================
--- linux-2.6.26-rc2.orig/fs/ext3/ioctl.c
+++ linux-2.6.26-rc2/fs/ext3/ioctl.c
@@ -239,7 +239,7 @@ setrsvsz_out:
 	case EXT3_IOC_GROUP_EXTEND: {
 		ext3_fsblk_t n_blocks_count;
 		struct super_block *sb = inode->i_sb;
-		int err;
+		int err, err2;
 
 		if (!capable(CAP_SYS_RESOURCE))
 			return -EPERM;
@@ -254,16 +254,16 @@ setrsvsz_out:
 		}
 		err = ext3_group_extend(sb, EXT3_SB(sb)->s_es, n_blocks_count);
 		journal_lock_updates(EXT3_SB(sb)->s_journal);
-		journal_flush(EXT3_SB(sb)->s_journal);
+		err2 = journal_flush(EXT3_SB(sb)->s_journal);
 		journal_unlock_updates(EXT3_SB(sb)->s_journal);
group_extend_out:
 		mnt_drop_write(filp->f_path.mnt);
-		return err;
+		return (err) ? err : err2;
 	}
 	case EXT3_IOC_GROUP_ADD: {
 		struct ext3_new_group_data input;
 		struct super_block *sb = inode->i_sb;
-		int err;
+		int err, err2;
 
 		if (!capable(CAP_SYS_RESOURCE))
 			return -EPERM;
@@ -280,11 +280,11 @@ group_extend_out:
 
 		err = ext3_group_add(sb, &input);
 		journal_lock_updates(EXT3_SB(sb)->s_journal);
-		journal_flush(EXT3_SB(sb)->s_journal);
+		err2 = journal_flush(EXT3_SB(sb)->s_journal);
 		journal_unlock_updates(EXT3_SB(sb)->s_journal);
group_add_out:
 		mnt_drop_write(filp->f_path.mnt);
-		return err;
+		return (err) ? err : err2;
 	}

Index: linux-2.6.26-rc2/fs/ext3/super.c
===================================================================
--- linux-2.6.26-rc2.orig/fs/ext3/super.c
+++ linux-2.6.26-rc2/fs/ext3/super.c
@@ -395,7 +395,10 @@ static void ext3_put_super (struct super
 	ext3_xattr_put_super(sb);
 	journal_destroy(sbi->s_journal);
 	if (!(sb->s_flags & MS_RDONLY)) {
-		EXT3_CLEAR_INCOMPAT_FEATURE(sb, EXT3_FEATURE_INCOMPAT_RECOVER);
+		if (!is_checkpoint_aborted(sbi->s_journal)) {
+			EXT3_CLEAR_INCOMPAT_FEATURE(sb,
+					EXT3_FEATURE_INCOMPAT_RECOVER);
+		}
 		es->s_state = cpu_to_le16(sbi->s_mount_state);
 		BUFFER_TRACE(sbi->s_sbh, "marking dirty");
 		mark_buffer_dirty(sbi->s_sbh);
@@ -2373,7 +2376,13 @@ static void ext3_write_super_lockfs(stru
 
 	/* Now we set up the journal barrier. */
 	journal_lock_updates(journal);
-	journal_flush(journal);
+
+	/*
+	 * We don't want to clear needs_recovery flag when we failed
+	 * to flush the journal.
+	 */
+	if (journal_flush(journal) < 0)
+		return;
 
 	/* Journal blocked and flushed, clear needs_recovery flag. */
 	EXT3_CLEAR_INCOMPAT_FEATURE(sb, EXT3_FEATURE_INCOMPAT_RECOVER);
Index: linux-2.6.26-rc2/fs/jbd/journal.c
===================================================================
--- linux-2.6.26-rc2.orig/fs/jbd/journal.c
+++ linux-2.6.26-rc2/fs/jbd/journal.c
@@ -674,7 +674,7 @@ static journal_t * journal_init_common (
 	journal->j_commit_interval = (HZ * JBD_DEFAULT_MAX_COMMIT_AGE);
 
 	/* The journal is marked for error until we succeed with recovery! */
-	journal->j_flags = JFS_ABORT;
+	journal->j_flags = JFS_ABORT | JFS_CP_ABORT;
 
 	/* Set up a default-sized revoke table for the new mount. */
 	err = journal_init_revoke(journal, JOURNAL_REVOKE_DEFAULT_HASH);
@@ -1107,7 +1107,7 @@ int journal_load(journal_t *journal)
 	if (journal_reset(journal))
 		goto recovery_error;
 
-	journal->j_flags &= ~JFS_ABORT;
+	journal->j_flags &= ~(JFS_ABORT | JFS_CP_ABORT);
 	journal->j_flags |= JFS_LOADED;
 	return 0;
 
@@ -1151,7 +1151,8 @@ void journal_destroy(journal_t *journal)
 	journal->j_tail = 0;
 	journal->j_tail_sequence = ++journal->j_transaction_sequence;
 	if (journal->j_sb_buffer) {
-		journal_update_superblock(journal, 1);
+		if (!is_checkpoint_aborted(journal))
+			journal_update_superblock(journal, 1);
 		brelse(journal->j_sb_buffer);
 	}
 
@@ -1333,7 +1334,6 @@ static int journal_convert_superblock_v1
 
 int journal_flush(journal_t *journal)
 {
-	int err = 0;
 	transaction_t *transaction = NULL;
 	unsigned long old_tail;
 
@@ -1356,14 +1356,19 @@ int journal_flush(journal_t *journal)
 		spin_unlock(&journal->j_state_lock);
 	}
 
-	/* ...and flush everything in the log out to disk. */
+	/* ...and flush everything in the log out to disk.
+	 * Even if an error occurs, we don't stop this loop.
+	 * We do checkpoint as much as possible. */
 	spin_lock(&journal->j_list_lock);
-	while (!err && journal->j_checkpoint_transactions != NULL) {
+	while (journal->j_checkpoint_transactions != NULL) {
 		spin_unlock(&journal->j_list_lock);
-		err = log_do_checkpoint(journal);
+		log_do_checkpoint(journal);
 		spin_lock(&journal->j_list_lock);
 	}
 	spin_unlock(&journal->j_list_lock);
+	if (is_checkpoint_aborted(journal))
+		return -EIO;
+
 	cleanup_journal_tail(journal);
 
 	/* Finally, mark the journal as really needing no recovery.
@@ -1385,7 +1390,7 @@ int journal_flush(journal_t *journal)
 	J_ASSERT(journal->j_head == journal->j_tail);
 	J_ASSERT(journal->j_tail_sequence == journal->j_transaction_sequence);
 	spin_unlock(&journal->j_state_lock);
-	return err;
+	return 0;
 }
 
 /**
Index: linux-2.6.26-rc2/fs/jbd/recovery.c
===================================================================
--- linux-2.6.26-rc2.orig/fs/jbd/recovery.c
+++ linux-2.6.26-rc2/fs/jbd/recovery.c
@@ -223,7 +223,7 @@ do {					\
 */
 int journal_recover(journal_t *journal)
 {
-	int err;
+	int err, err2;
 	journal_superblock_t *	sb;
 
 	struct recovery_info	info;
@@ -261,7 +261,10 @@ int journal_recover(journal_t *journal)
 	journal->j_transaction_sequence = ++info.end_transaction;
 
 	journal_clear_revoke(journal);
-	sync_blockdev(journal->j_fs_dev);
+	err2 = sync_blockdev(journal->j_fs_dev);
+	if (!err)
+		err = err2;
+
 	return err;
 }

^ permalink raw reply	[flat|nested] 23+ messages in thread
* Re: [PATCH 4/4] jbd: fix error handling for checkpoint io (rebased) 2008-05-14 4:50 ` [PATCH 4/4] jbd: fix error handling for checkpoint io (rebased) Hidehiro Kawai @ 2008-05-14 13:16 ` Josef Bacik 2008-05-14 14:44 ` Jan Kara 2008-05-14 14:32 ` Jan Kara 1 sibling, 1 reply; 23+ messages in thread From: Josef Bacik @ 2008-05-14 13:16 UTC (permalink / raw) To: Hidehiro Kawai Cc: Andrew Morton, sct, adilger, linux-kernel, linux-ext4, Jan Kara, Josef Bacik, Mingming Cao, Satoshi OSHIMA, sugita On Wed, May 14, 2008 at 01:50:43PM +0900, Hidehiro Kawai wrote: > Subject: [PATCH 4/4] jbd: fix error handling for checkpoint io > > When a checkpointing IO fails, current JBD code doesn't check the > error and continue journaling. This means latest metadata can be > lost from both the journal and filesystem. > > This patch leaves the failed metadata blocks in the journal space > and aborts journaling in the case of log_do_checkpoint(). > To achieve this, we need to do: > > 1. don't remove the failed buffer from the checkpoint list where in > the case of __try_to_free_cp_buf() because it may be released or > overwritten by a later transaction > 2. log_do_checkpoint() is the last chance, remove the failed buffer > from the checkpoint list and abort journaling > 3. when checkpointing fails, don't update the journal super block to > prevent the journalled contents from being cleaned > 4. when checkpointing fails, don't clear the needs_recovery flag, > otherwise the journalled contents are ignored and cleaned in the > recovery phase > 5. 
if the recovery fails, keep the needs_recovery flag > > Signed-off-by: Hidehiro Kawai <hidehiro.kawai.ez@hitachi.com> > --- > fs/ext3/ioctl.c | 12 ++++---- > fs/ext3/super.c | 13 +++++++-- > fs/jbd/checkpoint.c | 59 ++++++++++++++++++++++++++++++++++-------- > fs/jbd/journal.c | 21 +++++++++----- > fs/jbd/recovery.c | 7 +++- > include/linux/jbd.h | 7 ++++ > 6 files changed, 91 insertions(+), 28 deletions(-) > > Index: linux-2.6.26-rc2/fs/jbd/checkpoint.c > =================================================================== > --- linux-2.6.26-rc2.orig/fs/jbd/checkpoint.c > +++ linux-2.6.26-rc2/fs/jbd/checkpoint.c > @@ -93,7 +93,8 @@ static int __try_to_free_cp_buf(struct j > int ret = 0; > struct buffer_head *bh = jh2bh(jh); > > - if (jh->b_jlist == BJ_None && !buffer_locked(bh) && !buffer_dirty(bh)) { > + if (jh->b_jlist == BJ_None && !buffer_locked(bh) && > + !buffer_dirty(bh) && buffer_uptodate(bh)) { > JBUFFER_TRACE(jh, "remove from checkpoint list"); > ret = __journal_remove_checkpoint(jh) + 1; > jbd_unlock_bh_state(bh); > @@ -140,6 +141,25 @@ void __log_wait_for_space(journal_t *jou > } > > /* > + * We couldn't write back some metadata buffers to the filesystem. > + * To avoid unwritten-back metadata buffers from losing, set > + * JFS_CP_ABORT flag and make sure that the journal space isn't > + * cleaned. This function also aborts journaling. > + */ > +static void __journal_abort_checkpoint(journal_t *journal, int errno) > +{ > + if (is_checkpoint_aborted(journal)) > + return; > + > + spin_lock(&journal->j_state_lock); > + journal->j_flags |= JFS_CP_ABORT; > + spin_unlock(&journal->j_state_lock); > + printk(KERN_WARNING "JBD: Checkpointing failed. Some metadata blocks " > + "are still old.\n"); > + journal_abort(journal, errno); > +} > + > +/* > * We were unable to perform jbd_trylock_bh_state() inside j_list_lock. > * The caller must restart a list walk. Wait for someone else to run > * jbd_unlock_bh_state(). 
> @@ -160,21 +180,25 @@ static void jbd_sync_bh(journal_t *journ > * buffers. Note that we take the buffers in the opposite ordering > * from the one in which they were submitted for IO. > * > + * Return 0 on success, and return <0 if some buffers have failed > + * to be written out. > + * > * Called with j_list_lock held. > */ > -static void __wait_cp_io(journal_t *journal, transaction_t *transaction) > +static int __wait_cp_io(journal_t *journal, transaction_t *transaction) > { > struct journal_head *jh; > struct buffer_head *bh; > tid_t this_tid; > int released = 0; > + int ret = 0; > > this_tid = transaction->t_tid; > restart: > /* Did somebody clean up the transaction in the meanwhile? */ > if (journal->j_checkpoint_transactions != transaction || > transaction->t_tid != this_tid) > - return; > + return ret; > while (!released && transaction->t_checkpoint_io_list) { > jh = transaction->t_checkpoint_io_list; > bh = jh2bh(jh); > @@ -194,6 +218,9 @@ restart: > spin_lock(&journal->j_list_lock); > goto restart; > } > + if (unlikely(!buffer_uptodate(bh))) > + ret = -EIO; > + > /* > * Now in whatever state the buffer currently is, we know that > * it has been written out and so we can drop it from the list > @@ -203,6 +230,8 @@ restart: > journal_remove_journal_head(bh); > __brelse(bh); > } > + > + return ret; > } > > #define NR_BATCH 64 > @@ -226,7 +255,8 @@ __flush_batch(journal_t *journal, struct > * Try to flush one buffer from the checkpoint list to disk. > * > * Return 1 if something happened which requires us to abort the current > - * scan of the checkpoint list. > + * scan of the checkpoint list. Return <0 if the buffer has failed to > + * be written out. 
> * > * Called with j_list_lock held and drops it if 1 is returned > * Called under jbd_lock_bh_state(jh2bh(jh)), and drops it > @@ -256,6 +286,9 @@ static int __process_buffer(journal_t *j > log_wait_commit(journal, tid); > ret = 1; > } else if (!buffer_dirty(bh)) { > + ret = 1; > + if (unlikely(!buffer_uptodate(bh))) > + ret = -EIO; > J_ASSERT_JH(jh, !buffer_jbddirty(bh)); > BUFFER_TRACE(bh, "remove from checkpoint"); > __journal_remove_checkpoint(jh); > @@ -263,7 +296,6 @@ static int __process_buffer(journal_t *j > jbd_unlock_bh_state(bh); > journal_remove_journal_head(bh); > __brelse(bh); > - ret = 1; > } else { > /* > * Important: we are about to write the buffer, and > @@ -318,6 +350,7 @@ int log_do_checkpoint(journal_t *journal > * OK, we need to start writing disk blocks. Take one transaction > * and write it. > */ > + result = 0; > spin_lock(&journal->j_list_lock); > if (!journal->j_checkpoint_transactions) > goto out; > @@ -334,7 +367,7 @@ restart: > int batch_count = 0; > struct buffer_head *bhs[NR_BATCH]; > struct journal_head *jh; > - int retry = 0; > + int retry = 0, err; > > while (!retry && transaction->t_checkpoint_list) { > struct buffer_head *bh; > @@ -347,6 +380,8 @@ restart: > break; > } > retry = __process_buffer(journal, jh, bhs,&batch_count); > + if (retry < 0) > + result = retry; > if (!retry && (need_resched() || > spin_needbreak(&journal->j_list_lock))) { > spin_unlock(&journal->j_list_lock); > @@ -371,14 +406,18 @@ restart: > * Now we have cleaned up the first transaction's checkpoint > * list. Let's clean up the second one > */ > - __wait_cp_io(journal, transaction); > + err = __wait_cp_io(journal, transaction); > + if (!result) > + result = err; > } > out: > spin_unlock(&journal->j_list_lock); > - result = cleanup_journal_tail(journal); > if (result < 0) > - return result; > - return 0; > + __journal_abort_checkpoint(journal, result); > + else > + result = cleanup_journal_tail(journal); > + > + return (result < 0) ? 
result : 0; > } > > /* > Index: linux-2.6.26-rc2/include/linux/jbd.h > =================================================================== > --- linux-2.6.26-rc2.orig/include/linux/jbd.h > +++ linux-2.6.26-rc2/include/linux/jbd.h > @@ -816,6 +816,8 @@ struct journal_s > #define JFS_FLUSHED 0x008 /* The journal superblock has been flushed */ > #define JFS_LOADED 0x010 /* The journal superblock has been loaded */ > #define JFS_BARRIER 0x020 /* Use IDE barriers */ > +#define JFS_CP_ABORT 0x040 /* Checkpointing has failed. We don't have to > + * clean the journal space. */ > > /* > * Function declarations for the journaling transaction and buffer > @@ -1004,6 +1006,11 @@ static inline int is_journal_aborted(jou > return journal->j_flags & JFS_ABORT; > } > > +static inline int is_checkpoint_aborted(journal_t *journal) > +{ > + return journal->j_flags & JFS_CP_ABORT; > +} > + > static inline int is_handle_aborted(handle_t *handle) > { > if (handle->h_aborted) > Index: linux-2.6.26-rc2/fs/ext3/ioctl.c > =================================================================== > --- linux-2.6.26-rc2.orig/fs/ext3/ioctl.c > +++ linux-2.6.26-rc2/fs/ext3/ioctl.c > @@ -239,7 +239,7 @@ setrsvsz_out: > case EXT3_IOC_GROUP_EXTEND: { > ext3_fsblk_t n_blocks_count; > struct super_block *sb = inode->i_sb; > - int err; > + int err, err2; > > if (!capable(CAP_SYS_RESOURCE)) > return -EPERM; > @@ -254,16 +254,16 @@ setrsvsz_out: > } > err = ext3_group_extend(sb, EXT3_SB(sb)->s_es, n_blocks_count); > journal_lock_updates(EXT3_SB(sb)->s_journal); > - journal_flush(EXT3_SB(sb)->s_journal); > + err2 = journal_flush(EXT3_SB(sb)->s_journal); > journal_unlock_updates(EXT3_SB(sb)->s_journal); > group_extend_out: > mnt_drop_write(filp->f_path.mnt); > - return err; > + return (err) ? 
err : err2; > } > case EXT3_IOC_GROUP_ADD: { > struct ext3_new_group_data input; > struct super_block *sb = inode->i_sb; > - int err; > + int err, err2; > > if (!capable(CAP_SYS_RESOURCE)) > return -EPERM; > @@ -280,11 +280,11 @@ group_extend_out: > > err = ext3_group_add(sb, &input); > journal_lock_updates(EXT3_SB(sb)->s_journal); > - journal_flush(EXT3_SB(sb)->s_journal); > + err2 = journal_flush(EXT3_SB(sb)->s_journal); > journal_unlock_updates(EXT3_SB(sb)->s_journal); > group_add_out: > mnt_drop_write(filp->f_path.mnt); > - return err; > + return (err) ? err : err2; > } > > > Index: linux-2.6.26-rc2/fs/ext3/super.c > =================================================================== > --- linux-2.6.26-rc2.orig/fs/ext3/super.c > +++ linux-2.6.26-rc2/fs/ext3/super.c > @@ -395,7 +395,10 @@ static void ext3_put_super (struct super > ext3_xattr_put_super(sb); > journal_destroy(sbi->s_journal); > if (!(sb->s_flags & MS_RDONLY)) { > - EXT3_CLEAR_INCOMPAT_FEATURE(sb, EXT3_FEATURE_INCOMPAT_RECOVER); > + if (!is_checkpoint_aborted(sbi->s_journal)) { > + EXT3_CLEAR_INCOMPAT_FEATURE(sb, > + EXT3_FEATURE_INCOMPAT_RECOVER); > + } > es->s_state = cpu_to_le16(sbi->s_mount_state); > BUFFER_TRACE(sbi->s_sbh, "marking dirty"); > mark_buffer_dirty(sbi->s_sbh); Is this bit here really needed? If we abort the journal the fs will be mounted read only and we should never get in here. Is there a case where we could abort the journal and not be flipped to being read-only? If there is such a case I would think that we should fix that by making the fs read-only, and not have this check. > @@ -2373,7 +2376,13 @@ static void ext3_write_super_lockfs(stru > > /* Now we set up the journal barrier. */ > journal_lock_updates(journal); > - journal_flush(journal); > + > + /* > + * We don't want to clear needs_recovery flag when we failed > + * to flush the journal. > + */ > + if (journal_flush(journal) < 0) > + return; > > /* Journal blocked and flushed, clear needs_recovery flag. 
*/ > EXT3_CLEAR_INCOMPAT_FEATURE(sb, EXT3_FEATURE_INCOMPAT_RECOVER); > Index: linux-2.6.26-rc2/fs/jbd/journal.c > =================================================================== > --- linux-2.6.26-rc2.orig/fs/jbd/journal.c > +++ linux-2.6.26-rc2/fs/jbd/journal.c > @@ -674,7 +674,7 @@ static journal_t * journal_init_common ( > journal->j_commit_interval = (HZ * JBD_DEFAULT_MAX_COMMIT_AGE); > > /* The journal is marked for error until we succeed with recovery! */ > - journal->j_flags = JFS_ABORT; > + journal->j_flags = JFS_ABORT | JFS_CP_ABORT; > > /* Set up a default-sized revoke table for the new mount. */ > err = journal_init_revoke(journal, JOURNAL_REVOKE_DEFAULT_HASH); > @@ -1107,7 +1107,7 @@ int journal_load(journal_t *journal) > if (journal_reset(journal)) > goto recovery_error; > > - journal->j_flags &= ~JFS_ABORT; > + journal->j_flags &= ~(JFS_ABORT | JFS_CP_ABORT); > journal->j_flags |= JFS_LOADED; > return 0; > > @@ -1151,7 +1151,8 @@ void journal_destroy(journal_t *journal) > journal->j_tail = 0; > journal->j_tail_sequence = ++journal->j_transaction_sequence; > if (journal->j_sb_buffer) { > - journal_update_superblock(journal, 1); > + if (!is_checkpoint_aborted(journal)) > + journal_update_superblock(journal, 1); > brelse(journal->j_sb_buffer); > } > > @@ -1333,7 +1334,6 @@ static int journal_convert_superblock_v1 > > int journal_flush(journal_t *journal) > { > - int err = 0; > transaction_t *transaction = NULL; > unsigned long old_tail; > > @@ -1356,14 +1356,19 @@ int journal_flush(journal_t *journal) > spin_unlock(&journal->j_state_lock); > } > > - /* ...and flush everything in the log out to disk. */ > + /* ...and flush everything in the log out to disk. > + * Even if an error occurs, we don't stop this loop. > + * We do checkpoint as much as possible. 
*/ > spin_lock(&journal->j_list_lock); > - while (!err && journal->j_checkpoint_transactions != NULL) { > + while (journal->j_checkpoint_transactions != NULL) { > spin_unlock(&journal->j_list_lock); > - err = log_do_checkpoint(journal); > + log_do_checkpoint(journal); > spin_lock(&journal->j_list_lock); > } > spin_unlock(&journal->j_list_lock); > + if (is_checkpoint_aborted(journal)) > + return -EIO; > + > cleanup_journal_tail(journal); > > /* Finally, mark the journal as really needing no recovery. > @@ -1385,7 +1390,7 @@ int journal_flush(journal_t *journal) > J_ASSERT(journal->j_head == journal->j_tail); > J_ASSERT(journal->j_tail_sequence == journal->j_transaction_sequence); > spin_unlock(&journal->j_state_lock); > - return err; > + return 0; > } > > /** > Index: linux-2.6.26-rc2/fs/jbd/recovery.c > =================================================================== > --- linux-2.6.26-rc2.orig/fs/jbd/recovery.c > +++ linux-2.6.26-rc2/fs/jbd/recovery.c > @@ -223,7 +223,7 @@ do { \ > */ > int journal_recover(journal_t *journal) > { > - int err; > + int err, err2; > journal_superblock_t * sb; > > struct recovery_info info; > @@ -261,7 +261,10 @@ int journal_recover(journal_t *journal) > journal->j_transaction_sequence = ++info.end_transaction; > > journal_clear_revoke(journal); > - sync_blockdev(journal->j_fs_dev); > + err2 = sync_blockdev(journal->j_fs_dev); > + if (!err) > + err = err2; > + > return err; > } > > > Other than that one question it looks good to me. Thanks much, Josef ^ permalink raw reply [flat|nested] 23+ messages in thread
* Re: [PATCH 4/4] jbd: fix error handling for checkpoint io (rebased)
  2008-05-14 13:16 ` Josef Bacik
@ 2008-05-14 14:44   ` Jan Kara
  2008-05-14 14:37     ` Josef Bacik
  0 siblings, 1 reply; 23+ messages in thread
From: Jan Kara @ 2008-05-14 14:44 UTC (permalink / raw)
To: Josef Bacik
Cc: Hidehiro Kawai, Andrew Morton, sct, adilger, linux-kernel,
    linux-ext4, Jan Kara, Mingming Cao, Satoshi OSHIMA, sugita

> >
> > Index: linux-2.6.26-rc2/fs/ext3/super.c
> > ===================================================================
> > --- linux-2.6.26-rc2.orig/fs/ext3/super.c
> > +++ linux-2.6.26-rc2/fs/ext3/super.c
> > @@ -395,7 +395,10 @@ static void ext3_put_super (struct super
> >  	ext3_xattr_put_super(sb);
> >  	journal_destroy(sbi->s_journal);
> >  	if (!(sb->s_flags & MS_RDONLY)) {
> > -		EXT3_CLEAR_INCOMPAT_FEATURE(sb, EXT3_FEATURE_INCOMPAT_RECOVER);
> > +		if (!is_checkpoint_aborted(sbi->s_journal)) {
> > +			EXT3_CLEAR_INCOMPAT_FEATURE(sb,
> > +					EXT3_FEATURE_INCOMPAT_RECOVER);
> > +		}
> >  		es->s_state = cpu_to_le16(sbi->s_mount_state);
> >  		BUFFER_TRACE(sbi->s_sbh, "marking dirty");
> >  		mark_buffer_dirty(sbi->s_sbh);
> 
> Is this bit here really needed?  If we abort the journal the fs will be
> mounted read only and we should never get in here.  Is there a case where
> we could abort the journal and not be flipped to being read-only?  If
> there is such a case I would think that we should fix that by making the
> fs read-only, and not have this check.
  Actually, journal_abort() (which could be called from journal_destroy())
does nothing to the filesystem as such.  So at this moment, ext3 can still
happily think everything is OK.  We only detect aborted journal in
ext3_journal_start_sb() and call ext3_abort() in that case, which does all
that is needed...

								Honza
-- 
Jan Kara <jack@suse.cz>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 23+ messages in thread
* Re: [PATCH 4/4] jbd: fix error handling for checkpoint io (rebased) 2008-05-14 14:44 ` Jan Kara @ 2008-05-14 14:37 ` Josef Bacik 2008-05-16 10:28 ` Hidehiro Kawai 0 siblings, 1 reply; 23+ messages in thread From: Josef Bacik @ 2008-05-14 14:37 UTC (permalink / raw) To: Jan Kara Cc: Josef Bacik, Hidehiro Kawai, Andrew Morton, sct, adilger, linux-kernel, linux-ext4, Mingming Cao, Satoshi OSHIMA, sugita On Wed, May 14, 2008 at 04:44:10PM +0200, Jan Kara wrote: > > > > > > Index: linux-2.6.26-rc2/fs/ext3/super.c > > > =================================================================== > > > --- linux-2.6.26-rc2.orig/fs/ext3/super.c > > > +++ linux-2.6.26-rc2/fs/ext3/super.c > > > @@ -395,7 +395,10 @@ static void ext3_put_super (struct super > > > ext3_xattr_put_super(sb); > > > journal_destroy(sbi->s_journal); > > > if (!(sb->s_flags & MS_RDONLY)) { > > > - EXT3_CLEAR_INCOMPAT_FEATURE(sb, EXT3_FEATURE_INCOMPAT_RECOVER); > > > + if (!is_checkpoint_aborted(sbi->s_journal)) { > > > + EXT3_CLEAR_INCOMPAT_FEATURE(sb, > > > + EXT3_FEATURE_INCOMPAT_RECOVER); > > > + } > > > es->s_state = cpu_to_le16(sbi->s_mount_state); > > > BUFFER_TRACE(sbi->s_sbh, "marking dirty"); > > > mark_buffer_dirty(sbi->s_sbh); > > > > Is this bit here really needed? If we abort the journal the fs will be mounted > > read only and we should never get in here. Is there a case where we could abort > > the journal and not be flipped to being read-only? If there is such a case I > > would think that we should fix that by making the fs read-only, and not have > > this check. > Actually, journal_abort() (which could be called from journal_destroy()) > does nothing to the filesystem as such. So at this moment, ext3 can still > happily think everything is OK. We only detect aborted journal in > ext3_journal_start_sb() and call ext3_abort() in that case, which does all > that is needed... 
> 
Hmm you're right, I was thinking we did some other stuff before put_super
which would have caught a journal abort, but it looks like that's not the
case.  Still, we shouldn't do is_checkpoint_aborted(sbi->s_journal) since
journal_destroy() kfree's the journal.  We should probably move the
is_journal_aborted() check above that or something.

Thanks,

Josef

^ permalink raw reply	[flat|nested] 23+ messages in thread
* Re: [PATCH 4/4] jbd: fix error handling for checkpoint io (rebased) 2008-05-14 14:37 ` Josef Bacik @ 2008-05-16 10:28 ` Hidehiro Kawai 0 siblings, 0 replies; 23+ messages in thread From: Hidehiro Kawai @ 2008-05-16 10:28 UTC (permalink / raw) To: Josef Bacik Cc: Jan Kara, Andrew Morton, sct, adilger, linux-kernel, linux-ext4, Mingming Cao, Satoshi OSHIMA, sugita Hi, Thank you for review. Josef Bacik wrote: > On Wed, May 14, 2008 at 04:44:10PM +0200, Jan Kara wrote: > >>>> >>>>Index: linux-2.6.26-rc2/fs/ext3/super.c >>>>=================================================================== >>>>--- linux-2.6.26-rc2.orig/fs/ext3/super.c >>>>+++ linux-2.6.26-rc2/fs/ext3/super.c >>>>@@ -395,7 +395,10 @@ static void ext3_put_super (struct super >>>> ext3_xattr_put_super(sb); >>>> journal_destroy(sbi->s_journal); >>>> if (!(sb->s_flags & MS_RDONLY)) { >>>>- EXT3_CLEAR_INCOMPAT_FEATURE(sb, EXT3_FEATURE_INCOMPAT_RECOVER); >>>>+ if (!is_checkpoint_aborted(sbi->s_journal)) { >>>>+ EXT3_CLEAR_INCOMPAT_FEATURE(sb, >>>>+ EXT3_FEATURE_INCOMPAT_RECOVER); >>>>+ } >>>> es->s_state = cpu_to_le16(sbi->s_mount_state); >>>> BUFFER_TRACE(sbi->s_sbh, "marking dirty"); >>>> mark_buffer_dirty(sbi->s_sbh); >>> >>>Is this bit here really needed? If we abort the journal the fs will be mounted >>>read only and we should never get in here. Is there a case where we could abort >>>the journal and not be flipped to being read-only? If there is such a case I >>>would think that we should fix that by making the fs read-only, and not have >>>this check. >> >> Actually, journal_abort() (which could be called from journal_destroy()) >>does nothing to the filesystem as such. So at this moment, ext3 can still >>happily think everything is OK. We only detect aborted journal in >>ext3_journal_start_sb() and call ext3_abort() in that case, which does all >>that is needed... Yes, that is why I added this check. 
> Hmm you're right, I was thinking we did some other stuff before put_super
> which would have caught a journal abort but it looks like thats not the
> case.  Still shouldn't do is_checkpoint_aborted(sbi->s_journal) since
> journal_destroy() kfree's the journal.  Should probably move the
> is_journal_aborted() check above that or something.  Thanks,

Good catch, I will fix it.
Thanks!
-- 
Hidehiro Kawai
Hitachi, Systems Development Laboratory
Linux Technology Center

^ permalink raw reply	[flat|nested] 23+ messages in thread
* Re: [PATCH 4/4] jbd: fix error handling for checkpoint io (rebased) 2008-05-14 4:50 ` [PATCH 4/4] jbd: fix error handling for checkpoint io (rebased) Hidehiro Kawai 2008-05-14 13:16 ` Josef Bacik @ 2008-05-14 14:32 ` Jan Kara 2008-05-16 10:29 ` Hidehiro Kawai 1 sibling, 1 reply; 23+ messages in thread From: Jan Kara @ 2008-05-14 14:32 UTC (permalink / raw) To: Hidehiro Kawai Cc: Andrew Morton, sct, adilger, linux-kernel, linux-ext4, Jan Kara, Josef Bacik, Mingming Cao, Satoshi OSHIMA, sugita Hello, On Wed 14-05-08 13:50:43, Hidehiro Kawai wrote: > Subject: [PATCH 4/4] jbd: fix error handling for checkpoint io > > When a checkpointing IO fails, current JBD code doesn't check the > error and continue journaling. This means latest metadata can be > lost from both the journal and filesystem. > > This patch leaves the failed metadata blocks in the journal space > and aborts journaling in the case of log_do_checkpoint(). > To achieve this, we need to do: > > 1. don't remove the failed buffer from the checkpoint list where in > the case of __try_to_free_cp_buf() because it may be released or > overwritten by a later transaction > 2. log_do_checkpoint() is the last chance, remove the failed buffer > from the checkpoint list and abort journaling > 3. when checkpointing fails, don't update the journal super block to > prevent the journalled contents from being cleaned > 4. when checkpointing fails, don't clear the needs_recovery flag, > otherwise the journalled contents are ignored and cleaned in the > recovery phase > 5. if the recovery fails, keep the needs_recovery flag I have a few comments below.. 
> Signed-off-by: Hidehiro Kawai <hidehiro.kawai.ez@hitachi.com> > --- > fs/ext3/ioctl.c | 12 ++++---- > fs/ext3/super.c | 13 +++++++-- > fs/jbd/checkpoint.c | 59 ++++++++++++++++++++++++++++++++++-------- > fs/jbd/journal.c | 21 +++++++++----- > fs/jbd/recovery.c | 7 +++- > include/linux/jbd.h | 7 ++++ > 6 files changed, 91 insertions(+), 28 deletions(-) > > Index: linux-2.6.26-rc2/fs/jbd/checkpoint.c > =================================================================== > --- linux-2.6.26-rc2.orig/fs/jbd/checkpoint.c > +++ linux-2.6.26-rc2/fs/jbd/checkpoint.c > @@ -93,7 +93,8 @@ static int __try_to_free_cp_buf(struct j > int ret = 0; > struct buffer_head *bh = jh2bh(jh); > > - if (jh->b_jlist == BJ_None && !buffer_locked(bh) && !buffer_dirty(bh)) { > + if (jh->b_jlist == BJ_None && !buffer_locked(bh) && > + !buffer_dirty(bh) && buffer_uptodate(bh)) { > JBUFFER_TRACE(jh, "remove from checkpoint list"); > ret = __journal_remove_checkpoint(jh) + 1; > jbd_unlock_bh_state(bh); > @@ -140,6 +141,25 @@ void __log_wait_for_space(journal_t *jou > } > > /* > + * We couldn't write back some metadata buffers to the filesystem. > + * To avoid unwritten-back metadata buffers from losing, set > + * JFS_CP_ABORT flag and make sure that the journal space isn't > + * cleaned. This function also aborts journaling. > + */ > +static void __journal_abort_checkpoint(journal_t *journal, int errno) > +{ > + if (is_checkpoint_aborted(journal)) > + return; > + > + spin_lock(&journal->j_state_lock); > + journal->j_flags |= JFS_CP_ABORT; > + spin_unlock(&journal->j_state_lock); > + printk(KERN_WARNING "JBD: Checkpointing failed. Some metadata blocks " > + "are still old.\n"); > + journal_abort(journal, errno); > +} > + > +/* > * We were unable to perform jbd_trylock_bh_state() inside j_list_lock. > * The caller must restart a list walk. Wait for someone else to run > * jbd_unlock_bh_state(). > @@ -160,21 +180,25 @@ static void jbd_sync_bh(journal_t *journ > * buffers. 
Note that we take the buffers in the opposite ordering > * from the one in which they were submitted for IO. > * > + * Return 0 on success, and return <0 if some buffers have failed > + * to be written out. > + * > * Called with j_list_lock held. > */ > -static void __wait_cp_io(journal_t *journal, transaction_t *transaction) > +static int __wait_cp_io(journal_t *journal, transaction_t *transaction) > { > struct journal_head *jh; > struct buffer_head *bh; > tid_t this_tid; > int released = 0; > + int ret = 0; > > this_tid = transaction->t_tid; > restart: > /* Did somebody clean up the transaction in the meanwhile? */ > if (journal->j_checkpoint_transactions != transaction || > transaction->t_tid != this_tid) > - return; > + return ret; > while (!released && transaction->t_checkpoint_io_list) { > jh = transaction->t_checkpoint_io_list; > bh = jh2bh(jh); > @@ -194,6 +218,9 @@ restart: > spin_lock(&journal->j_list_lock); > goto restart; > } > + if (unlikely(!buffer_uptodate(bh))) > + ret = -EIO; > + > /* > * Now in whatever state the buffer currently is, we know that > * it has been written out and so we can drop it from the list > @@ -203,6 +230,8 @@ restart: > journal_remove_journal_head(bh); > __brelse(bh); > } > + > + return ret; > } > > #define NR_BATCH 64 > @@ -226,7 +255,8 @@ __flush_batch(journal_t *journal, struct > * Try to flush one buffer from the checkpoint list to disk. > * > * Return 1 if something happened which requires us to abort the current > - * scan of the checkpoint list. > + * scan of the checkpoint list. Return <0 if the buffer has failed to > + * be written out. 
> * > * Called with j_list_lock held and drops it if 1 is returned > * Called under jbd_lock_bh_state(jh2bh(jh)), and drops it > @@ -256,6 +286,9 @@ static int __process_buffer(journal_t *j > log_wait_commit(journal, tid); > ret = 1; > } else if (!buffer_dirty(bh)) { > + ret = 1; > + if (unlikely(!buffer_uptodate(bh))) > + ret = -EIO; > J_ASSERT_JH(jh, !buffer_jbddirty(bh)); > BUFFER_TRACE(bh, "remove from checkpoint"); > __journal_remove_checkpoint(jh); > @@ -263,7 +296,6 @@ static int __process_buffer(journal_t *j > jbd_unlock_bh_state(bh); > journal_remove_journal_head(bh); > __brelse(bh); > - ret = 1; > } else { > /* > * Important: we are about to write the buffer, and > @@ -318,6 +350,7 @@ int log_do_checkpoint(journal_t *journal > * OK, we need to start writing disk blocks. Take one transaction > * and write it. > */ > + result = 0; > spin_lock(&journal->j_list_lock); > if (!journal->j_checkpoint_transactions) > goto out; > @@ -334,7 +367,7 @@ restart: > int batch_count = 0; > struct buffer_head *bhs[NR_BATCH]; > struct journal_head *jh; > - int retry = 0; > + int retry = 0, err; > > while (!retry && transaction->t_checkpoint_list) { > struct buffer_head *bh; > @@ -347,6 +380,8 @@ restart: > break; > } > retry = __process_buffer(journal, jh, bhs,&batch_count); > + if (retry < 0) > + result = retry; > if (!retry && (need_resched() || > spin_needbreak(&journal->j_list_lock))) { > spin_unlock(&journal->j_list_lock); > @@ -371,14 +406,18 @@ restart: > * Now we have cleaned up the first transaction's checkpoint > * list. Let's clean up the second one > */ > - __wait_cp_io(journal, transaction); > + err = __wait_cp_io(journal, transaction); > + if (!result) > + result = err; > } > out: > spin_unlock(&journal->j_list_lock); > - result = cleanup_journal_tail(journal); > if (result < 0) > - return result; > - return 0; > + __journal_abort_checkpoint(journal, result); > + else > + result = cleanup_journal_tail(journal); > + > + return (result < 0) ? 
result : 0; > } > > /* > Index: linux-2.6.26-rc2/include/linux/jbd.h > =================================================================== > --- linux-2.6.26-rc2.orig/include/linux/jbd.h > +++ linux-2.6.26-rc2/include/linux/jbd.h > @@ -816,6 +816,8 @@ struct journal_s > #define JFS_FLUSHED 0x008 /* The journal superblock has been flushed */ > #define JFS_LOADED 0x010 /* The journal superblock has been loaded */ > #define JFS_BARRIER 0x020 /* Use IDE barriers */ > +#define JFS_CP_ABORT 0x040 /* Checkpointing has failed. We don't have to > + * clean the journal space. */ > > /* > * Function declarations for the journaling transaction and buffer > @@ -1004,6 +1006,11 @@ static inline int is_journal_aborted(jou > return journal->j_flags & JFS_ABORT; > } > > +static inline int is_checkpoint_aborted(journal_t *journal) > +{ > + return journal->j_flags & JFS_CP_ABORT; > +} > + Actually, I don't think this new flag is needed (not that it would really harm anything). At least at the places where you check it you can as well check whether the journal is aborted (maybe except for journal_destroy() but see my comment there). 
> static inline int is_handle_aborted(handle_t *handle) > { > if (handle->h_aborted) > Index: linux-2.6.26-rc2/fs/ext3/ioctl.c > =================================================================== > --- linux-2.6.26-rc2.orig/fs/ext3/ioctl.c > +++ linux-2.6.26-rc2/fs/ext3/ioctl.c > @@ -239,7 +239,7 @@ setrsvsz_out: > case EXT3_IOC_GROUP_EXTEND: { > ext3_fsblk_t n_blocks_count; > struct super_block *sb = inode->i_sb; > - int err; > + int err, err2; > > if (!capable(CAP_SYS_RESOURCE)) > return -EPERM; > @@ -254,16 +254,16 @@ setrsvsz_out: > } > err = ext3_group_extend(sb, EXT3_SB(sb)->s_es, n_blocks_count); > journal_lock_updates(EXT3_SB(sb)->s_journal); > - journal_flush(EXT3_SB(sb)->s_journal); > + err2 = journal_flush(EXT3_SB(sb)->s_journal); > journal_unlock_updates(EXT3_SB(sb)->s_journal); > group_extend_out: > mnt_drop_write(filp->f_path.mnt); > - return err; > + return (err) ? err : err2; > } > case EXT3_IOC_GROUP_ADD: { > struct ext3_new_group_data input; > struct super_block *sb = inode->i_sb; > - int err; > + int err, err2; > > if (!capable(CAP_SYS_RESOURCE)) > return -EPERM; > @@ -280,11 +280,11 @@ group_extend_out: > > err = ext3_group_add(sb, &input); > journal_lock_updates(EXT3_SB(sb)->s_journal); > - journal_flush(EXT3_SB(sb)->s_journal); > + err2 = journal_flush(EXT3_SB(sb)->s_journal); > journal_unlock_updates(EXT3_SB(sb)->s_journal); > group_add_out: > mnt_drop_write(filp->f_path.mnt); > - return err; > + return (err) ? err : err2; > } Please split ext3 changes into a separate patch. There's no reason why they should be together with JBD fixes (i.e. JBD fixes don't depend on ext3 being fixed). Thanks. 
> Index: linux-2.6.26-rc2/fs/ext3/super.c > =================================================================== > --- linux-2.6.26-rc2.orig/fs/ext3/super.c > +++ linux-2.6.26-rc2/fs/ext3/super.c > @@ -395,7 +395,10 @@ static void ext3_put_super (struct super > ext3_xattr_put_super(sb); > journal_destroy(sbi->s_journal); > if (!(sb->s_flags & MS_RDONLY)) { > - EXT3_CLEAR_INCOMPAT_FEATURE(sb, EXT3_FEATURE_INCOMPAT_RECOVER); > + if (!is_checkpoint_aborted(sbi->s_journal)) { > + EXT3_CLEAR_INCOMPAT_FEATURE(sb, > + EXT3_FEATURE_INCOMPAT_RECOVER); > + } I think you should test here whether the journal is aborted (and EXT3_MOUNT_ABORT isn't set) and if so, call ext3_abort() and just completely avoid updating super block... > es->s_state = cpu_to_le16(sbi->s_mount_state); > BUFFER_TRACE(sbi->s_sbh, "marking dirty"); > mark_buffer_dirty(sbi->s_sbh); > @@ -2373,7 +2376,13 @@ static void ext3_write_super_lockfs(stru > > /* Now we set up the journal barrier. */ > journal_lock_updates(journal); > - journal_flush(journal); > + > + /* > + * We don't want to clear needs_recovery flag when we failed > + * to flush the journal. > + */ > + if (journal_flush(journal) < 0) > + return; > > /* Journal blocked and flushed, clear needs_recovery flag. */ > EXT3_CLEAR_INCOMPAT_FEATURE(sb, EXT3_FEATURE_INCOMPAT_RECOVER); > Index: linux-2.6.26-rc2/fs/jbd/journal.c > =================================================================== > --- linux-2.6.26-rc2.orig/fs/jbd/journal.c > +++ linux-2.6.26-rc2/fs/jbd/journal.c > @@ -674,7 +674,7 @@ static journal_t * journal_init_common ( > journal->j_commit_interval = (HZ * JBD_DEFAULT_MAX_COMMIT_AGE); > > /* The journal is marked for error until we succeed with recovery! */ > - journal->j_flags = JFS_ABORT; > + journal->j_flags = JFS_ABORT | JFS_CP_ABORT; > > /* Set up a default-sized revoke table for the new mount. 
*/ > err = journal_init_revoke(journal, JOURNAL_REVOKE_DEFAULT_HASH); > @@ -1107,7 +1107,7 @@ int journal_load(journal_t *journal) > if (journal_reset(journal)) > goto recovery_error; > > - journal->j_flags &= ~JFS_ABORT; > + journal->j_flags &= ~(JFS_ABORT | JFS_CP_ABORT); > journal->j_flags |= JFS_LOADED; > return 0; > > @@ -1151,7 +1151,8 @@ void journal_destroy(journal_t *journal) > journal->j_tail = 0; > journal->j_tail_sequence = ++journal->j_transaction_sequence; > if (journal->j_sb_buffer) { > - journal_update_superblock(journal, 1); > + if (!is_checkpoint_aborted(journal)) > + journal_update_superblock(journal, 1); > brelse(journal->j_sb_buffer); > } I don't like this much. I'd rather completely avoid updating j_tail and j_tail_sequence above in case the journal is aborted but I'd write the journal superblock so that information about abortion gets written... > @@ -1333,7 +1334,6 @@ static int journal_convert_superblock_v1 > > int journal_flush(journal_t *journal) > { > - int err = 0; > transaction_t *transaction = NULL; > unsigned long old_tail; > > @@ -1356,14 +1356,19 @@ int journal_flush(journal_t *journal) > spin_unlock(&journal->j_state_lock); > } > > - /* ...and flush everything in the log out to disk. */ > + /* ...and flush everything in the log out to disk. > + * Even if an error occurs, we don't stop this loop. > + * We do checkpoint as much as possible. */ > spin_lock(&journal->j_list_lock); > - while (!err && journal->j_checkpoint_transactions != NULL) { > + while (journal->j_checkpoint_transactions != NULL) { > spin_unlock(&journal->j_list_lock); > - err = log_do_checkpoint(journal); > + log_do_checkpoint(journal); > spin_lock(&journal->j_list_lock); > } > spin_unlock(&journal->j_list_lock); > + if (is_checkpoint_aborted(journal)) > + return -EIO; > + Why do you change the loop? If we fail to checkpoint some buffer, we stop journaling anyway and so journal will be replayed when fsck is run... 
> cleanup_journal_tail(journal); > > /* Finally, mark the journal as really needing no recovery. > @@ -1385,7 +1390,7 @@ int journal_flush(journal_t *journal) > J_ASSERT(journal->j_head == journal->j_tail); > J_ASSERT(journal->j_tail_sequence == journal->j_transaction_sequence); > spin_unlock(&journal->j_state_lock); > - return err; > + return 0; > } > > /** > Index: linux-2.6.26-rc2/fs/jbd/recovery.c > =================================================================== > --- linux-2.6.26-rc2.orig/fs/jbd/recovery.c > +++ linux-2.6.26-rc2/fs/jbd/recovery.c > @@ -223,7 +223,7 @@ do { \ > */ > int journal_recover(journal_t *journal) > { > - int err; > + int err, err2; > journal_superblock_t * sb; > > struct recovery_info info; > @@ -261,7 +261,10 @@ int journal_recover(journal_t *journal) > journal->j_transaction_sequence = ++info.end_transaction; > > journal_clear_revoke(journal); > - sync_blockdev(journal->j_fs_dev); > + err2 = sync_blockdev(journal->j_fs_dev); > + if (!err) > + err = err2; > + > return err; > } > Honza -- Jan Kara <jack@suse.cz> SUSE Labs, CR ^ permalink raw reply [flat|nested] 23+ messages in thread
* Re: [PATCH 4/4] jbd: fix error handling for checkpoint io (rebased) 2008-05-14 14:32 ` Jan Kara @ 2008-05-16 10:29 ` Hidehiro Kawai 2008-05-19 3:38 ` Jan Kara 0 siblings, 1 reply; 23+ messages in thread From: Hidehiro Kawai @ 2008-05-16 10:29 UTC (permalink / raw) To: Jan Kara Cc: Andrew Morton, sct, adilger, linux-kernel, linux-ext4, Josef Bacik, Mingming Cao, Satoshi OSHIMA, sugita Hi, Thank you for review. Jan Kara wrote: >> /* >> + * We couldn't write back some metadata buffers to the filesystem. >> + * To avoid unwritten-back metadata buffers from losing, set >> + * JFS_CP_ABORT flag and make sure that the journal space isn't >> + * cleaned. This function also aborts journaling. >> + */ >> +static void __journal_abort_checkpoint(journal_t *journal, int errno) >> +{ >> + if (is_checkpoint_aborted(journal)) >> + return; >> + >> + spin_lock(&journal->j_state_lock); >> + journal->j_flags |= JFS_CP_ABORT; >> + spin_unlock(&journal->j_state_lock); >> + printk(KERN_WARNING "JBD: Checkpointing failed. Some metadata blocks " >> + "are still old.\n"); >> + journal_abort(journal, errno); >> +} [snip] >>Index: linux-2.6.26-rc2/include/linux/jbd.h >>=================================================================== >>--- linux-2.6.26-rc2.orig/include/linux/jbd.h >>+++ linux-2.6.26-rc2/include/linux/jbd.h >>@@ -816,6 +816,8 @@ struct journal_s >> #define JFS_FLUSHED 0x008 /* The journal superblock has been flushed */ >> #define JFS_LOADED 0x010 /* The journal superblock has been loaded */ >> #define JFS_BARRIER 0x020 /* Use IDE barriers */ >>+#define JFS_CP_ABORT 0x040 /* Checkpointing has failed. We don't have to >>+ * clean the journal space. 
*/ >> >> /* >> * Function declarations for the journaling transaction and buffer >>@@ -1004,6 +1006,11 @@ static inline int is_journal_aborted(jou >> return journal->j_flags & JFS_ABORT; >> } >> >>+static inline int is_checkpoint_aborted(journal_t *journal) >>+{ >>+ return journal->j_flags & JFS_CP_ABORT; >>+} >>+ > > Actually, I don't think this new flag is needed (not that it would really > harm anything). At least at the places where you check it you can as well > check whether the journal is aborted (maybe except for journal_destroy() > but see my comment there). As you say, JFS_CP_ABORT isn't necessarily needed in the most cases, but it is needed for __journal_abort_checkpoint() which report the checkpoint related error by printk only once. If we use JFS_ABORT to check whether this call is second time, the error message may be never printed out because the journal has been aborted by another reason. If we don't check for the second call, the error message will be printed out several times. Instead of using __journal_abort_checkpoint(), we'll be able to do the similar thing by adding the printk() directly in journal_destroy(), journal_flush(), and __log_wait_for_space() (but it can be more than one time). I agree that we should not add another flag as much as possible. So I'll try to revise to remove the flag if you agree to add 3 printk()s. > Please split ext3 changes into a separate patch. There's no reason > why they should be together with JBD fixes (i.e. JBD fixes don't depend on > ext3 being fixed). Thanks. OK, I'll do so. 
>>Index: linux-2.6.26-rc2/fs/ext3/super.c >>=================================================================== >>--- linux-2.6.26-rc2.orig/fs/ext3/super.c >>+++ linux-2.6.26-rc2/fs/ext3/super.c >>@@ -395,7 +395,10 @@ static void ext3_put_super (struct super >> ext3_xattr_put_super(sb); >> journal_destroy(sbi->s_journal); >> if (!(sb->s_flags & MS_RDONLY)) { >>- EXT3_CLEAR_INCOMPAT_FEATURE(sb, EXT3_FEATURE_INCOMPAT_RECOVER); >>+ if (!is_checkpoint_aborted(sbi->s_journal)) { >>+ EXT3_CLEAR_INCOMPAT_FEATURE(sb, >>+ EXT3_FEATURE_INCOMPAT_RECOVER); >>+ } > > I think you should test here whether the journal is aborted (and > EXT3_MOUNT_ABORT isn't set) and if so, call ext3_abort() and just > completely avoid updating super block... Your idea looks good. But as Josef pointed out, journal_destroy() frees sbi->s_journal, so we need to make journal_destroy() return the error status. >>@@ -1151,7 +1151,8 @@ void journal_destroy(journal_t *journal) >> journal->j_tail = 0; >> journal->j_tail_sequence = ++journal->j_transaction_sequence; >> if (journal->j_sb_buffer) { >>- journal_update_superblock(journal, 1); >>+ if (!is_checkpoint_aborted(journal)) >>+ journal_update_superblock(journal, 1); >> brelse(journal->j_sb_buffer); >> } > > I don't like this much. I'd rather completely avoid updating j_tail and > j_tail_sequence above in case the journal is aborted but I'd write the > journal superblock so that information about abortion gets written... log_do_checkpoint() calls cleanup_journal_tail(), which advances j_tail and j_tail_sequence. So if we adopt the policy that we don't modify j_tail and j_tail_sequence when checkpointing failed, we should also fix cleanup_journal_tail(). I adopted the policy that we don't update the journal super block when checkpointing failed. When checkpointing failed, journal_abort() is called before cleanup_journal_tail(). So what we want to write should be in the journal super block. 
>>@@ -1333,7 +1334,6 @@ static int journal_convert_superblock_v1 >> >> int journal_flush(journal_t *journal) >> { >>- int err = 0; >> transaction_t *transaction = NULL; >> unsigned long old_tail; >> >>@@ -1356,14 +1356,19 @@ int journal_flush(journal_t *journal) >> spin_unlock(&journal->j_state_lock); >> } >> >>- /* ...and flush everything in the log out to disk. */ >>+ /* ...and flush everything in the log out to disk. >>+ * Even if an error occurs, we don't stop this loop. >>+ * We do checkpoint as much as possible. */ >> spin_lock(&journal->j_list_lock); >>- while (!err && journal->j_checkpoint_transactions != NULL) { >>+ while (journal->j_checkpoint_transactions != NULL) { >> spin_unlock(&journal->j_list_lock); >>- err = log_do_checkpoint(journal); >>+ log_do_checkpoint(journal); >> spin_lock(&journal->j_list_lock); >> } >> spin_unlock(&journal->j_list_lock); >>+ if (is_checkpoint_aborted(journal)) >>+ return -EIO; >>+ > > Why do you change the loop? If we fail to checkpoint some buffer, we stop > journaling anyway and so journal will be replayed when fsck is run... Certainly, fsck or journal_recover() replay the unwritten-back metadata block. journal_destroy() also call log_do_checkpoint() against all un-checkpointed transactions. I just thought `flush' means writing out all data which we should write to disk. There is no problem if I don't change the loop. Thanks, -- Hidehiro Kawai Hitachi, Systems Development Laboratory Linux Technology Center ^ permalink raw reply [flat|nested] 23+ messages in thread
* Re: [PATCH 4/4] jbd: fix error handling for checkpoint io (rebased) 2008-05-16 10:29 ` Hidehiro Kawai @ 2008-05-19 3:38 ` Jan Kara 2008-05-21 1:34 ` Hidehiro Kawai 0 siblings, 1 reply; 23+ messages in thread From: Jan Kara @ 2008-05-19 3:38 UTC (permalink / raw) To: Hidehiro Kawai Cc: Jan Kara, Andrew Morton, sct, adilger, linux-kernel, linux-ext4, Josef Bacik, Mingming Cao, Satoshi OSHIMA, sugita Hello, On Fri 16-05-08 19:29:15, Hidehiro Kawai wrote: > Jan Kara wrote: > >> /* > >> + * We couldn't write back some metadata buffers to the filesystem. > >> + * To avoid unwritten-back metadata buffers from losing, set > >> + * JFS_CP_ABORT flag and make sure that the journal space isn't > >> + * cleaned. This function also aborts journaling. > >> + */ > >> +static void __journal_abort_checkpoint(journal_t *journal, int errno) > >> +{ > >> + if (is_checkpoint_aborted(journal)) > >> + return; > >> + > >> + spin_lock(&journal->j_state_lock); > >> + journal->j_flags |= JFS_CP_ABORT; > >> + spin_unlock(&journal->j_state_lock); > >> + printk(KERN_WARNING "JBD: Checkpointing failed. Some metadata blocks " > >> + "are still old.\n"); > >> + journal_abort(journal, errno); > >> +} > > [snip] > > >>Index: linux-2.6.26-rc2/include/linux/jbd.h > >>=================================================================== > >>--- linux-2.6.26-rc2.orig/include/linux/jbd.h > >>+++ linux-2.6.26-rc2/include/linux/jbd.h > >>@@ -816,6 +816,8 @@ struct journal_s > >> #define JFS_FLUSHED 0x008 /* The journal superblock has been flushed */ > >> #define JFS_LOADED 0x010 /* The journal superblock has been loaded */ > >> #define JFS_BARRIER 0x020 /* Use IDE barriers */ > >>+#define JFS_CP_ABORT 0x040 /* Checkpointing has failed. We don't have to > >>+ * clean the journal space. 
*/ > >> > >> /* > >> * Function declarations for the journaling transaction and buffer > >>@@ -1004,6 +1006,11 @@ static inline int is_journal_aborted(jou > >> return journal->j_flags & JFS_ABORT; > >> } > >> > >>+static inline int is_checkpoint_aborted(journal_t *journal) > >>+{ > >>+ return journal->j_flags & JFS_CP_ABORT; > >>+} > >>+ > > > > Actually, I don't think this new flag is needed (not that it would really > > harm anything). At least at the places where you check it you can as well > > check whether the journal is aborted (maybe except for journal_destroy() > > but see my comment there). > > As you say, JFS_CP_ABORT isn't necessarily needed in the most cases, > but it is needed for __journal_abort_checkpoint() which report the > checkpoint related error by printk only once. If we use JFS_ABORT > to check whether this call is second time, the error message may be > never printed out because the journal has been aborted by another > reason. If we don't check for the second call, the error message > will be printed out several times. Yes, I think that checking out for JFS_ABORT is the right thing to do. Once the journal has aborted for some reason, it is enough that we print some error message (and that is responsibility of the first caller of journal_abort()). Printing that checkpointing has not succeeded as well is IMO not needed. > Instead of using __journal_abort_checkpoint(), we'll be able to do > the similar thing by adding the printk() directly in journal_destroy(), > journal_flush(), and __log_wait_for_space() (but it can be more > than one time). > > I agree that we should not add another flag as much as possible. > So I'll try to revise to remove the flag if you agree to add > 3 printk()s. Well, as I said, one flag in journal is not a big deal but I found it a bit confusing why there's a special flag for checkpoint abort when standard abort would do fine as well. 
> >>Index: linux-2.6.26-rc2/fs/ext3/super.c > >>=================================================================== > >>--- linux-2.6.26-rc2.orig/fs/ext3/super.c > >>+++ linux-2.6.26-rc2/fs/ext3/super.c > >>@@ -395,7 +395,10 @@ static void ext3_put_super (struct super > >> ext3_xattr_put_super(sb); > >> journal_destroy(sbi->s_journal); > >> if (!(sb->s_flags & MS_RDONLY)) { > >>- EXT3_CLEAR_INCOMPAT_FEATURE(sb, EXT3_FEATURE_INCOMPAT_RECOVER); > >>+ if (!is_checkpoint_aborted(sbi->s_journal)) { > >>+ EXT3_CLEAR_INCOMPAT_FEATURE(sb, > >>+ EXT3_FEATURE_INCOMPAT_RECOVER); > >>+ } > > > > I think you should test here whether the journal is aborted (and > > EXT3_MOUNT_ABORT isn't set) and if so, call ext3_abort() and just > > completely avoid updating super block... > > Your idea looks good. But as Josef pointed out, journal_destroy() > frees sbi->s_journal, so we need to make journal_destroy() return the > error status. Yes. > >>@@ -1151,7 +1151,8 @@ void journal_destroy(journal_t *journal) > >> journal->j_tail = 0; > >> journal->j_tail_sequence = ++journal->j_transaction_sequence; > >> if (journal->j_sb_buffer) { > >>- journal_update_superblock(journal, 1); > >>+ if (!is_checkpoint_aborted(journal)) > >>+ journal_update_superblock(journal, 1); > >> brelse(journal->j_sb_buffer); > >> } > > > > I don't like this much. I'd rather completely avoid updating j_tail and > > j_tail_sequence above in case the journal is aborted but I'd write the > > journal superblock so that information about abortion gets written... > > log_do_checkpoint() calls cleanup_journal_tail(), which advances > j_tail and j_tail_sequence. So if we adopt the policy that we don't > modify j_tail and j_tail_sequence when checkpointing failed, we should > also fix cleanup_journal_tail(). > > I adopted the policy that we don't update the journal super block > when checkpointing failed. When checkpointing failed, journal_abort() > is called before cleanup_journal_tail(). 
So what we want to write > should be in the journal super block. I see. The thing I'm afraid of with this policy is, that when sometime later we add somewhere journal_update_superblock() and forget about checking whether journal isn't aborted, we will magically get filesystem corruption when IO error happens and that would be really hard to debug. So I'd rather refrain from updating j_tail and j_tail_sequence. > >>@@ -1333,7 +1334,6 @@ static int journal_convert_superblock_v1 > >> > >> int journal_flush(journal_t *journal) > >> { > >>- int err = 0; > >> transaction_t *transaction = NULL; > >> unsigned long old_tail; > >> > >>@@ -1356,14 +1356,19 @@ int journal_flush(journal_t *journal) > >> spin_unlock(&journal->j_state_lock); > >> } > >> > >>- /* ...and flush everything in the log out to disk. */ > >>+ /* ...and flush everything in the log out to disk. > >>+ * Even if an error occurs, we don't stop this loop. > >>+ * We do checkpoint as much as possible. */ > >> spin_lock(&journal->j_list_lock); > >>- while (!err && journal->j_checkpoint_transactions != NULL) { > >>+ while (journal->j_checkpoint_transactions != NULL) { > >> spin_unlock(&journal->j_list_lock); > >>- err = log_do_checkpoint(journal); > >>+ log_do_checkpoint(journal); > >> spin_lock(&journal->j_list_lock); > >> } > >> spin_unlock(&journal->j_list_lock); > >>+ if (is_checkpoint_aborted(journal)) > >>+ return -EIO; > >>+ > > > > Why do you change the loop? If we fail to checkpoint some buffer, we stop > > journaling anyway and so journal will be replayed when fsck is run... > > Certainly, fsck or journal_recover() replay the unwritten-back metadata > block. journal_destroy() also call log_do_checkpoint() against all > un-checkpointed transactions. I just thought `flush' means writing out > all data which we should write to disk. There is no problem if I > don't change the loop. "flush" means "make journal empty". 
If we are not able to do it all, it does not really matter how much do we manage to write out. So I wouldn't change the loop. Honza -- Jan Kara <jack@suse.cz> SUSE Labs, CR ^ permalink raw reply [flat|nested] 23+ messages in thread
* Re: [PATCH 4/4] jbd: fix error handling for checkpoint io (rebased) 2008-05-19 3:38 ` Jan Kara @ 2008-05-21 1:34 ` Hidehiro Kawai 2008-05-23 22:28 ` Jan Kara 0 siblings, 1 reply; 23+ messages in thread From: Hidehiro Kawai @ 2008-05-21 1:34 UTC (permalink / raw) To: Jan Kara Cc: Andrew Morton, sct, adilger, linux-kernel, linux-ext4, Josef Bacik, Mingming Cao, Satoshi OSHIMA, sugita Hi, Jan Kara wrote: > > On Fri 16-05-08 19:29:15, Hidehiro Kawai wrote: > >>Jan Kara wrote: >> >>>>/* >>>>+ * We couldn't write back some metadata buffers to the filesystem. >>>>+ * To avoid unwritten-back metadata buffers from losing, set >>>>+ * JFS_CP_ABORT flag and make sure that the journal space isn't >>>>+ * cleaned. This function also aborts journaling. >>>>+ */ >>>>+static void __journal_abort_checkpoint(journal_t *journal, int errno) >>>>+{ >>>>+ if (is_checkpoint_aborted(journal)) >>>>+ return; >>>>+ >>>>+ spin_lock(&journal->j_state_lock); >>>>+ journal->j_flags |= JFS_CP_ABORT; >>>>+ spin_unlock(&journal->j_state_lock); >>>>+ printk(KERN_WARNING "JBD: Checkpointing failed. Some metadata blocks " >>>>+ "are still old.\n"); >>>>+ journal_abort(journal, errno); >>>>+} >> >>[snip] >> >> >>>>Index: linux-2.6.26-rc2/include/linux/jbd.h >>>>=================================================================== >>>>--- linux-2.6.26-rc2.orig/include/linux/jbd.h >>>>+++ linux-2.6.26-rc2/include/linux/jbd.h >>>>@@ -816,6 +816,8 @@ struct journal_s >>>>#define JFS_FLUSHED 0x008 /* The journal superblock has been flushed */ >>>>#define JFS_LOADED 0x010 /* The journal superblock has been loaded */ >>>>#define JFS_BARRIER 0x020 /* Use IDE barriers */ >>>>+#define JFS_CP_ABORT 0x040 /* Checkpointing has failed. We don't have to >>>>+ * clean the journal space. 
 */
>>>>
>>>> /*
>>>>  * Function declarations for the journaling transaction and buffer
>>>> @@ -1004,6 +1006,11 @@ static inline int is_journal_aborted(jou
>>>> 	return journal->j_flags & JFS_ABORT;
>>>> }
>>>>
>>>> +static inline int is_checkpoint_aborted(journal_t *journal)
>>>> +{
>>>> +	return journal->j_flags & JFS_CP_ABORT;
>>>> +}
>>>> +
>>>
>>>  Actually, I don't think this new flag is needed (not that it would really
>>> harm anything). At least at the places where you check it you can as well
>>> check whether the journal is aborted (maybe except for journal_destroy(),
>>> but see my comment there).
>>
>> As you say, JFS_CP_ABORT isn't strictly needed in most cases,
>> but it is needed for __journal_abort_checkpoint(), which reports the
>> checkpoint-related error via printk only once. If we use JFS_ABORT
>> to check whether this is the second call, the error message may
>> never be printed, because the journal may already have been aborted
>> for another reason. If we don't check for the second call, the error
>> message will be printed several times.
>
>  Yes, I think that checking for JFS_ABORT is the right thing to do.
> Once the journal has aborted for some reason, it is enough that we print
> some error message (and that is the responsibility of the first caller of
> journal_abort()). Printing that checkpointing has not succeeded as well is
> IMO not needed.

A checkpointing failure is a bit special. If the journal is aborted
by journal_commit_transaction(), the integrity of the filesystem is
ensured, although the latest changes will be lost. However, if the
journal is aborted by log_do_checkpoint() and the replay also fails,
the filesystem may be corrupted, because some of the metadata blocks
are left in an old state. In this case, the user will have to recover
the filesystem manually and carefully, so I think printing the special
message is needed to inform users about that.

>>>> @@ -1151,7 +1151,8 @@ void journal_destroy(journal_t *journal)
>>>> 	journal->j_tail = 0;
>>>> 	journal->j_tail_sequence = ++journal->j_transaction_sequence;
>>>> 	if (journal->j_sb_buffer) {
>>>> -		journal_update_superblock(journal, 1);
>>>> +		if (!is_checkpoint_aborted(journal))
>>>> +			journal_update_superblock(journal, 1);
>>>> 		brelse(journal->j_sb_buffer);
>>>> 	}
>>>
>>>  I don't like this much. I'd rather completely avoid updating j_tail and
>>> j_tail_sequence above in case the journal is aborted, but I'd still write
>>> the journal superblock so that information about the abort gets written...
>>
>> log_do_checkpoint() calls cleanup_journal_tail(), which advances
>> j_tail and j_tail_sequence. So if we adopt the policy that we don't
>> modify j_tail and j_tail_sequence when checkpointing fails, we should
>> also fix cleanup_journal_tail().
>>
>> I adopted the policy that we don't update the journal superblock
>> when checkpointing fails. When checkpointing fails, journal_abort()
>> is called before cleanup_journal_tail(), so what we want to write
>> should already be in the journal superblock.
>
>  I see. The thing I'm afraid of with this policy is that when, sometime
> later, we add a journal_update_superblock() call somewhere and forget to
> check whether the journal is aborted, we will magically get filesystem
> corruption when an IO error happens, and that would be really hard to
> debug. So I'd rather refrain from updating j_tail and j_tail_sequence.

I can understand your concern, as the current JBD updates the journal
superblock even if a checkpoint has failed. I think this can be improved
by adding a comment to journal_update_superblock(), for example: don't
invoke this function if checkpointing has failed, otherwise journaled
data which hasn't been written back can be lost.

Or we may stop both modifying j_tail and updating the superblock when
the journal has aborted (especially for a reason of checkpoint
failure). Of course, __journal_abort_soft() still updates the
superblock once.

Thanks,
--
Hidehiro Kawai
Hitachi, Systems Development Laboratory
Linux Technology Center
* Re: [PATCH 4/4] jbd: fix error handling for checkpoint io (rebased)
  2008-05-21  1:34         ` Hidehiro Kawai
@ 2008-05-23 22:28           ` Jan Kara
  2008-05-26  4:57             ` Hidehiro Kawai
  0 siblings, 1 reply; 23+ messages in thread
From: Jan Kara @ 2008-05-23 22:28 UTC (permalink / raw)
To: Hidehiro Kawai
Cc: Andrew Morton, sct, adilger, linux-kernel, linux-ext4,
	Josef Bacik, Mingming Cao, Satoshi OSHIMA, sugita

  Hello,

> Jan Kara wrote:
> >
> > On Fri 16-05-08 19:29:15, Hidehiro Kawai wrote:
> >
> >> Jan Kara wrote:
> >>
> >>>> /*
> >>>> + * We couldn't write back some metadata buffers to the filesystem.
> >>>> + * To avoid unwritten-back metadata buffers from losing, set
> >>>> + * JFS_CP_ABORT flag and make sure that the journal space isn't
> >>>> + * cleaned.  This function also aborts journaling.
> >>>> + */
> >>>> +static void __journal_abort_checkpoint(journal_t *journal, int errno)
> >>>> +{
> >>>> +	if (is_checkpoint_aborted(journal))
> >>>> +		return;
> >>>> +
> >>>> +	spin_lock(&journal->j_state_lock);
> >>>> +	journal->j_flags |= JFS_CP_ABORT;
> >>>> +	spin_unlock(&journal->j_state_lock);
> >>>> +	printk(KERN_WARNING "JBD: Checkpointing failed.  Some metadata blocks "
> >>>> +		"are still old.\n");
> >>>> +	journal_abort(journal, errno);
> >>>> +}
> >>
> >> [snip]
> >>
> >>>> Index: linux-2.6.26-rc2/include/linux/jbd.h
> >>>> ===================================================================
> >>>> --- linux-2.6.26-rc2.orig/include/linux/jbd.h
> >>>> +++ linux-2.6.26-rc2/include/linux/jbd.h
> >>>> @@ -816,6 +816,8 @@ struct journal_s
> >>>> #define JFS_FLUSHED	0x008	/* The journal superblock has been flushed */
> >>>> #define JFS_LOADED	0x010	/* The journal superblock has been loaded */
> >>>> #define JFS_BARRIER	0x020	/* Use IDE barriers */
> >>>> +#define JFS_CP_ABORT	0x040	/* Checkpointing has failed.  We don't
> >>>> +				 * have to clean the journal space.
> >>>> +				 */
> >>>>
> >>>> /*
> >>>>  * Function declarations for the journaling transaction and buffer
> >>>> @@ -1004,6 +1006,11 @@ static inline int is_journal_aborted(jou
> >>>> 	return journal->j_flags & JFS_ABORT;
> >>>> }
> >>>>
> >>>> +static inline int is_checkpoint_aborted(journal_t *journal)
> >>>> +{
> >>>> +	return journal->j_flags & JFS_CP_ABORT;
> >>>> +}
> >>>> +
> >>>
> >>>  Actually, I don't think this new flag is needed (not that it would really
> >>> harm anything). At least at the places where you check it you can as well
> >>> check whether the journal is aborted (maybe except for journal_destroy(),
> >>> but see my comment there).
> >>
> >> As you say, JFS_CP_ABORT isn't strictly needed in most cases,
> >> but it is needed for __journal_abort_checkpoint(), which reports the
> >> checkpoint-related error via printk only once. If we use JFS_ABORT
> >> to check whether this is the second call, the error message may
> >> never be printed, because the journal may already have been aborted
> >> for another reason. If we don't check for the second call, the error
> >> message will be printed several times.
> >
> >  Yes, I think that checking for JFS_ABORT is the right thing to do.
> > Once the journal has aborted for some reason, it is enough that we print
> > some error message (and that is the responsibility of the first caller of
> > journal_abort()). Printing that checkpointing has not succeeded as well is
> > IMO not needed.
>
> A checkpointing failure is a bit special. If the journal is aborted
> by journal_commit_transaction(), the integrity of the filesystem is
> ensured, although the latest changes will be lost. However, if the
> journal is aborted by log_do_checkpoint() and the replay also fails,
> the filesystem may be corrupted, because some of the metadata blocks
> are left in an old state. In this case, the user will have to recover
> the filesystem manually and carefully, so I think printing the special
> message is needed to inform users about that.

  Is this really different? If we spot an error during checkpointing, we
abort the journal (so the data still remains in the journal, because
checkpointing has failed). On the next mount, either the replay succeeds
and everything is fine, or we spot an error during replay and we should
just refuse to mount the filesystem. So we are correct again - the user
then has to run fsck to solve the problem, and fsck will try its best to
get as much data readable as possible. So I don't see why a checkpointing
failure would need special attention.

> >>>> @@ -1151,7 +1151,8 @@ void journal_destroy(journal_t *journal)
> >>>> 	journal->j_tail = 0;
> >>>> 	journal->j_tail_sequence = ++journal->j_transaction_sequence;
> >>>> 	if (journal->j_sb_buffer) {
> >>>> -		journal_update_superblock(journal, 1);
> >>>> +		if (!is_checkpoint_aborted(journal))
> >>>> +			journal_update_superblock(journal, 1);
> >>>> 		brelse(journal->j_sb_buffer);
> >>>> 	}
> >>>
> >>>  I don't like this much. I'd rather completely avoid updating j_tail and
> >>> j_tail_sequence above in case the journal is aborted, but I'd still write
> >>> the journal superblock so that information about the abort gets written...
> >>
> >> log_do_checkpoint() calls cleanup_journal_tail(), which advances
> >> j_tail and j_tail_sequence. So if we adopt the policy that we don't
> >> modify j_tail and j_tail_sequence when checkpointing fails, we should
> >> also fix cleanup_journal_tail().
> >>
> >> I adopted the policy that we don't update the journal superblock
> >> when checkpointing fails. When checkpointing fails, journal_abort()
> >> is called before cleanup_journal_tail(), so what we want to write
> >> should already be in the journal superblock.
> >
> >  I see. The thing I'm afraid of with this policy is that when, sometime
> > later, we add a journal_update_superblock() call somewhere and forget to
> > check whether the journal is aborted, we will magically get filesystem
> > corruption when an IO error happens, and that would be really hard to
> > debug. So I'd rather refrain from updating j_tail and j_tail_sequence.
>
> I can understand your concern, as the current JBD updates the journal
> superblock even if a checkpoint has failed. I think this can be improved
> by adding a comment to journal_update_superblock(), for example: don't
> invoke this function if checkpointing has failed, otherwise journaled
> data which hasn't been written back can be lost.
>
> Or we may stop both modifying j_tail and updating the superblock
> when the journal has aborted (especially for a reason of checkpoint
> failure). Of course, __journal_abort_soft() still updates the
> superblock once.

  OK, the last option is also fine with me. Thanks for your work.

								Honza
--
Jan Kara <jack@suse.cz>
SuSE CR Labs
* Re: [PATCH 4/4] jbd: fix error handling for checkpoint io (rebased)
  2008-05-23 22:28             ` Jan Kara
@ 2008-05-26  4:57               ` Hidehiro Kawai
  0 siblings, 0 replies; 23+ messages in thread
From: Hidehiro Kawai @ 2008-05-26  4:57 UTC (permalink / raw)
To: Jan Kara
Cc: Andrew Morton, sct, adilger, linux-kernel, linux-ext4,
	Josef Bacik, Mingming Cao, Satoshi OSHIMA, sugita

Hello,

>> Jan Kara wrote:
>>
>>> On Fri 16-05-08 19:29:15, Hidehiro Kawai wrote:
>>>
>>>> Jan Kara wrote:
>>>>
>>>> [snip]
>>
>> A checkpointing failure is a bit special. If the journal is aborted
>> by journal_commit_transaction(), the integrity of the filesystem is
>> ensured, although the latest changes will be lost. However, if the
>> journal is aborted by log_do_checkpoint() and the replay also fails,
>> the filesystem may be corrupted, because some of the metadata blocks
>> are left in an old state. In this case, the user will have to recover
>> the filesystem manually and carefully, so I think printing the special
>> message is needed to inform users about that.
>
>  Is this really different? If we spot an error during checkpointing, we
> abort the journal (so the data still remains in the journal, because
> checkpointing has failed). On the next mount, either the replay succeeds
> and everything is fine, or we spot an error during replay and we should
> just refuse to mount the filesystem. So we are correct again - the user
> then has to run fsck to solve the problem, and fsck will try its best to
> get as much data readable as possible. So I don't see why a checkpointing
> failure would need special attention.

I understand. I had been thinking of the following scenario: when we
fail both to checkpoint and to replay, we copy all readable blocks from
the bad disk to a good disk, and then we can replay all the journaled
data on the good disk. But that would be overkill. Normally we run fsck
when the mount fails, and that is enough for us.

I'm going to just use JFS_ABORT instead of introducing JFS_CP_ABORT
in the next revision.

Thanks,
--
Hidehiro Kawai
Hitachi, Systems Development Laboratory
Linux Technology Center
end of thread, other threads:[~2008-05-26  4:57 UTC | newest]

Thread overview: 23+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2008-05-14  4:43 [PATCH 0/4] jbd: possible filesystem corruption fixes (rebased) Hidehiro Kawai
2008-05-14  4:47 ` [PATCH 1/4] jbd: strictly check for write errors on data buffers (rebased) Hidehiro Kawai
2008-05-14 12:56   ` Jan Kara
2008-05-14  4:48 ` [PATCH 2/4] jbd: ordered data integrity fix (rebased) Hidehiro Kawai
2008-05-14 13:10   ` Jan Kara
2008-05-16 10:25     ` Hidehiro Kawai
2008-05-19  3:11       ` Jan Kara
2008-05-14  4:49 ` [PATCH 3/4] jbd: abort when failed to log metadata buffers (rebased) Hidehiro Kawai
2008-05-14 13:15   ` Jan Kara
2008-05-16 10:26     ` Hidehiro Kawai
2008-05-19  3:14       ` Jan Kara
2008-05-21  1:33         ` Hidehiro Kawai
2008-05-14  4:50 ` [PATCH 4/4] jbd: fix error handling for checkpoint io (rebased) Hidehiro Kawai
2008-05-14 13:16   ` Josef Bacik
2008-05-14 14:44     ` Jan Kara
2008-05-14 14:37       ` Josef Bacik
2008-05-16 10:28         ` Hidehiro Kawai
2008-05-14 14:32   ` Jan Kara
2008-05-16 10:29     ` Hidehiro Kawai
2008-05-19  3:38       ` Jan Kara
2008-05-21  1:34         ` Hidehiro Kawai
2008-05-23 22:28           ` Jan Kara
2008-05-26  4:57             ` Hidehiro Kawai