From: Hidehiro Kawai <hidehiro.kawai.ez@hitachi.com>
To: akpm@linux-foundation.org, sct@redhat.com
Cc: adilger@clusterfs.com, linux-kernel@vger.kernel.org,
linux-ext4@vger.kernel.org, jack@suse.cz, jbacik@redhat.com,
cmm@us.ibm.com, tytso@mit.edu,
sugita <yumiko.sugita.yf@hitachi.com>,
Satoshi OSHIMA <satoshi.oshima.fk@hitachi.com>
Subject: [PATCH 4/5] jbd: fix error handling for checkpoint io
Date: Mon, 02 Jun 2008 19:47:25 +0900 [thread overview]
Message-ID: <4843CFBD.7040706@hitachi.com> (raw)
In-Reply-To: <4843CE15.6080506@hitachi.com>
Subject: [PATCH 4/5] jbd: fix error handling for checkpoint io
When a checkpointing IO fails, current JBD code doesn't check the
error and continue journaling. This means latest metadata can be
lost from both the journal and filesystem.
This patch leaves the failed metadata blocks in the journal space
and aborts journaling in the case of log_do_checkpoint().
To achieve this, we need to do:
1. don't remove the failed buffer from the checkpoint list where in
the case of __try_to_free_cp_buf() because it may be released or
overwritten by a later transaction
2. log_do_checkpoint() is the last chance, remove the failed buffer
from the checkpoint list and abort the journal
3. when checkpointing fails, don't update the journal super block to
prevent the journaled contents from being cleaned. For safety,
don't update j_tail and j_tail_sequence either
4. when checkpointing fails, notify this error to the ext3 layer so
that ext3 don't clear the needs_recovery flag, otherwise the
journaled contents are ignored and cleaned in the recovery phase
5. if the recovery fails, keep the needs_recovery flag
6. prevent cleanup_journal_tail() from being called between
__journal_drop_transaction() and journal_abort() (a race issue
between journal_flush() and __log_wait_for_space()
Signed-off-by: Hidehiro Kawai <hidehiro.kawai.ez@hitachi.com>
---
fs/jbd/checkpoint.c | 48 +++++++++++++++++++++++++++++++-----------
fs/jbd/journal.c | 28 +++++++++++++++++++-----
fs/jbd/recovery.c | 7 ++++--
include/linux/jbd.h | 2 -
4 files changed, 64 insertions(+), 21 deletions(-)
Index: linux-2.6.26-rc4/fs/jbd/checkpoint.c
===================================================================
--- linux-2.6.26-rc4.orig/fs/jbd/checkpoint.c
+++ linux-2.6.26-rc4/fs/jbd/checkpoint.c
@@ -93,7 +93,8 @@ static int __try_to_free_cp_buf(struct j
int ret = 0;
struct buffer_head *bh = jh2bh(jh);
- if (jh->b_jlist == BJ_None && !buffer_locked(bh) && !buffer_dirty(bh)) {
+ if (jh->b_jlist == BJ_None && !buffer_locked(bh) &&
+ !buffer_dirty(bh) && buffer_uptodate(bh)) {
JBUFFER_TRACE(jh, "remove from checkpoint list");
ret = __journal_remove_checkpoint(jh) + 1;
jbd_unlock_bh_state(bh);
@@ -160,21 +161,25 @@ static void jbd_sync_bh(journal_t *journ
* buffers. Note that we take the buffers in the opposite ordering
* from the one in which they were submitted for IO.
*
+ * Return 0 on success, and return <0 if some buffers have failed
+ * to be written out.
+ *
* Called with j_list_lock held.
*/
-static void __wait_cp_io(journal_t *journal, transaction_t *transaction)
+static int __wait_cp_io(journal_t *journal, transaction_t *transaction)
{
struct journal_head *jh;
struct buffer_head *bh;
tid_t this_tid;
int released = 0;
+ int ret = 0;
this_tid = transaction->t_tid;
restart:
/* Did somebody clean up the transaction in the meanwhile? */
if (journal->j_checkpoint_transactions != transaction ||
transaction->t_tid != this_tid)
- return;
+ return ret;
while (!released && transaction->t_checkpoint_io_list) {
jh = transaction->t_checkpoint_io_list;
bh = jh2bh(jh);
@@ -194,6 +199,9 @@ restart:
spin_lock(&journal->j_list_lock);
goto restart;
}
+ if (unlikely(!buffer_uptodate(bh)))
+ ret = -EIO;
+
/*
* Now in whatever state the buffer currently is, we know that
* it has been written out and so we can drop it from the list
@@ -203,6 +211,8 @@ restart:
journal_remove_journal_head(bh);
__brelse(bh);
}
+
+ return ret;
}
#define NR_BATCH 64
@@ -226,7 +236,8 @@ __flush_batch(journal_t *journal, struct
* Try to flush one buffer from the checkpoint list to disk.
*
* Return 1 if something happened which requires us to abort the current
- * scan of the checkpoint list.
+ * scan of the checkpoint list. Return <0 if the buffer has failed to
+ * be written out.
*
* Called with j_list_lock held and drops it if 1 is returned
* Called under jbd_lock_bh_state(jh2bh(jh)), and drops it
@@ -256,6 +267,9 @@ static int __process_buffer(journal_t *j
log_wait_commit(journal, tid);
ret = 1;
} else if (!buffer_dirty(bh)) {
+ ret = 1;
+ if (unlikely(!buffer_uptodate(bh)))
+ ret = -EIO;
J_ASSERT_JH(jh, !buffer_jbddirty(bh));
BUFFER_TRACE(bh, "remove from checkpoint");
__journal_remove_checkpoint(jh);
@@ -263,7 +277,6 @@ static int __process_buffer(journal_t *j
jbd_unlock_bh_state(bh);
journal_remove_journal_head(bh);
__brelse(bh);
- ret = 1;
} else {
/*
* Important: we are about to write the buffer, and
@@ -318,6 +331,7 @@ int log_do_checkpoint(journal_t *journal
* OK, we need to start writing disk blocks. Take one transaction
* and write it.
*/
+ result = 0;
spin_lock(&journal->j_list_lock);
if (!journal->j_checkpoint_transactions)
goto out;
@@ -334,7 +348,7 @@ restart:
int batch_count = 0;
struct buffer_head *bhs[NR_BATCH];
struct journal_head *jh;
- int retry = 0;
+ int retry = 0, err;
while (!retry && transaction->t_checkpoint_list) {
struct buffer_head *bh;
@@ -347,6 +361,8 @@ restart:
break;
}
retry = __process_buffer(journal, jh, bhs,&batch_count);
+ if (retry < 0)
+ result = retry;
if (!retry && (need_resched() ||
spin_needbreak(&journal->j_list_lock))) {
spin_unlock(&journal->j_list_lock);
@@ -371,14 +387,18 @@ restart:
* Now we have cleaned up the first transaction's checkpoint
* list. Let's clean up the second one
*/
- __wait_cp_io(journal, transaction);
+ err = __wait_cp_io(journal, transaction);
+ if (!result)
+ result = err;
}
out:
spin_unlock(&journal->j_list_lock);
- result = cleanup_journal_tail(journal);
if (result < 0)
- return result;
- return 0;
+ journal_abort(journal, result);
+ else
+ result = cleanup_journal_tail(journal);
+
+ return (result < 0) ? result : 0;
}
/*
@@ -394,8 +414,9 @@ out:
* This is the only part of the journaling code which really needs to be
* aware of transaction aborts. Checkpointing involves writing to the
* main filesystem area rather than to the journal, so it can proceed
- * even in abort state, but we must not update the journal superblock if
- * we have an abort error outstanding.
+ * even in abort state, but we must not update the super block if
+ * checkpointing may have failed. Otherwise, we would lose some metadata
+ * buffers which should be written-back to the filesystem.
*/
int cleanup_journal_tail(journal_t *journal)
@@ -404,6 +425,9 @@ int cleanup_journal_tail(journal_t *jour
tid_t first_tid;
unsigned long blocknr, freed;
+ if (is_journal_aborted(journal))
+ return 1;
+
/* OK, work out the oldest transaction remaining in the log, and
* the log block it starts at.
*
Index: linux-2.6.26-rc4/fs/jbd/journal.c
===================================================================
--- linux-2.6.26-rc4.orig/fs/jbd/journal.c
+++ linux-2.6.26-rc4/fs/jbd/journal.c
@@ -1122,9 +1122,12 @@ recovery_error:
*
* Release a journal_t structure once it is no longer in use by the
* journaled object.
+ * Return <0 if we couldn't clean up the journal.
*/
-void journal_destroy(journal_t *journal)
+int journal_destroy(journal_t *journal)
{
+ int err = 0;
+
/* Wait for the commit thread to wake up and die. */
journal_kill_thread(journal);
@@ -1147,11 +1150,16 @@ void journal_destroy(journal_t *journal)
J_ASSERT(journal->j_checkpoint_transactions == NULL);
spin_unlock(&journal->j_list_lock);
- /* We can now mark the journal as empty. */
- journal->j_tail = 0;
- journal->j_tail_sequence = ++journal->j_transaction_sequence;
if (journal->j_sb_buffer) {
- journal_update_superblock(journal, 1);
+ if (!is_journal_aborted(journal)) {
+ /* We can now mark the journal as empty. */
+ journal->j_tail = 0;
+ journal->j_tail_sequence =
+ ++journal->j_transaction_sequence;
+ journal_update_superblock(journal, 1);
+ } else {
+ err = -EIO;
+ }
brelse(journal->j_sb_buffer);
}
@@ -1161,6 +1169,8 @@ void journal_destroy(journal_t *journal)
journal_destroy_revoke(journal);
kfree(journal->j_wbuf);
kfree(journal);
+
+ return err;
}
@@ -1360,10 +1370,16 @@ int journal_flush(journal_t *journal)
spin_lock(&journal->j_list_lock);
while (!err && journal->j_checkpoint_transactions != NULL) {
spin_unlock(&journal->j_list_lock);
+ mutex_lock(&journal->j_checkpoint_mutex);
err = log_do_checkpoint(journal);
+ mutex_unlock(&journal->j_checkpoint_mutex);
spin_lock(&journal->j_list_lock);
}
spin_unlock(&journal->j_list_lock);
+
+ if (is_journal_aborted(journal))
+ return -EIO;
+
cleanup_journal_tail(journal);
/* Finally, mark the journal as really needing no recovery.
@@ -1385,7 +1401,7 @@ int journal_flush(journal_t *journal)
J_ASSERT(journal->j_head == journal->j_tail);
J_ASSERT(journal->j_tail_sequence == journal->j_transaction_sequence);
spin_unlock(&journal->j_state_lock);
- return err;
+ return 0;
}
/**
Index: linux-2.6.26-rc4/fs/jbd/recovery.c
===================================================================
--- linux-2.6.26-rc4.orig/fs/jbd/recovery.c
+++ linux-2.6.26-rc4/fs/jbd/recovery.c
@@ -223,7 +223,7 @@ do { \
*/
int journal_recover(journal_t *journal)
{
- int err;
+ int err, err2;
journal_superblock_t * sb;
struct recovery_info info;
@@ -261,7 +261,10 @@ int journal_recover(journal_t *journal)
journal->j_transaction_sequence = ++info.end_transaction;
journal_clear_revoke(journal);
- sync_blockdev(journal->j_fs_dev);
+ err2 = sync_blockdev(journal->j_fs_dev);
+ if (!err)
+ err = err2;
+
return err;
}
Index: linux-2.6.26-rc4/include/linux/jbd.h
===================================================================
--- linux-2.6.26-rc4.orig/include/linux/jbd.h
+++ linux-2.6.26-rc4/include/linux/jbd.h
@@ -908,7 +908,7 @@ extern int journal_set_features
(journal_t *, unsigned long, unsigned long, unsigned long);
extern int journal_create (journal_t *);
extern int journal_load (journal_t *journal);
-extern void journal_destroy (journal_t *);
+extern int journal_destroy (journal_t *);
extern int journal_recover (journal_t *journal);
extern int journal_wipe (journal_t *, int);
extern int journal_skip_recovery (journal_t *);
next prev parent reply other threads:[~2008-06-02 10:47 UTC|newest]
Thread overview: 46+ messages / expand[flat|nested] mbox.gz Atom feed top
2008-06-02 10:40 [PATCH 0/5] jbd: possible filesystem corruption fixes (take 2) Hidehiro Kawai
2008-06-02 10:43 ` [PATCH 1/5] jbd: strictly check for write errors on data buffers Hidehiro Kawai
2008-06-03 22:30 ` Andrew Morton
2008-06-04 10:19 ` Jan Kara
2008-06-04 18:19 ` Andrew Morton
2008-06-04 21:22 ` Theodore Tso
2008-06-04 21:58 ` Andrew Morton
2008-06-04 22:51 ` Theodore Tso
2008-06-05 9:35 ` Jan Kara
2008-06-05 11:33 ` Hidehiro Kawai
2008-06-05 14:29 ` Theodore Tso
2008-06-05 16:20 ` Andrew Morton
2008-06-05 18:49 ` Andreas Dilger
2008-06-09 10:09 ` Hidehiro Kawai
2008-06-11 12:35 ` Jan Kara
2008-06-12 13:19 ` Hidehiro Kawai
2008-06-05 3:28 ` Mike Snitzer
2008-06-04 21:58 ` Andreas Dilger
2008-06-04 10:53 ` Hidehiro Kawai
2008-06-02 10:45 ` [PATCH 2/5] jbd: ordered data integrity fix Hidehiro Kawai
2008-06-02 11:59 ` Jan Kara
2008-06-03 22:33 ` Andrew Morton
2008-06-04 10:55 ` Hidehiro Kawai
2008-06-02 10:46 ` [PATCH 3/5] jbd: abort when failed to log metadata buffers Hidehiro Kawai
2008-06-02 12:00 ` Jan Kara
2008-06-03 22:35 ` Andrew Morton
2008-06-04 10:57 ` Hidehiro Kawai
2008-06-02 10:47 ` Hidehiro Kawai [this message]
2008-06-02 12:44 ` [PATCH 4/5] jbd: fix error handling for checkpoint io Jan Kara
2008-06-03 4:31 ` Hidehiro Kawai
2008-06-03 4:40 ` Hidehiro Kawai
2008-06-03 5:11 ` Hidehiro Kawai
2008-06-03 5:20 ` Andrew Morton
2008-06-03 8:02 ` Jan Kara
2008-06-23 11:14 ` Hidehiro Kawai
2008-06-23 12:22 ` Jan Kara
2008-06-24 11:52 ` Hidehiro Kawai
2008-06-24 13:33 ` Jan Kara
2008-06-27 8:06 ` Hidehiro Kawai
2008-06-27 10:24 ` Jan Kara
2008-06-30 5:09 ` Hidehiro Kawai
2008-07-07 10:07 ` Jan Kara
2008-06-02 10:48 ` [PATCH 5/5] ext3: abort ext3 if the journal has aborted Hidehiro Kawai
2008-06-02 12:49 ` Jan Kara
2008-06-02 12:05 ` [PATCH 0/5] jbd: possible filesystem corruption fixes (take 2) Jan Kara
2008-06-03 4:30 ` Hidehiro Kawai
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=4843CFBD.7040706@hitachi.com \
--to=hidehiro.kawai.ez@hitachi.com \
--cc=adilger@clusterfs.com \
--cc=akpm@linux-foundation.org \
--cc=cmm@us.ibm.com \
--cc=jack@suse.cz \
--cc=jbacik@redhat.com \
--cc=linux-ext4@vger.kernel.org \
--cc=linux-kernel@vger.kernel.org \
--cc=satoshi.oshima.fk@hitachi.com \
--cc=sct@redhat.com \
--cc=tytso@mit.edu \
--cc=yumiko.sugita.yf@hitachi.com \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).