From: Andreas Dilger <adilger@sun.com>
To: Theodore Tso <tytso@MIT.EDU>
Cc: linux-ext4@vger.kernel.org, Girish Shilamkar <Girish.Shilamkar@sun.com>
Subject: Re: What to do when the journal checksum is incorrect
Date: Mon, 26 May 2008 12:24:28 -0600
Message-ID: <20080526182428.GT3516@webber.adilger.int>
In-Reply-To: <20080525113842.GE5970@mit.edu>

On May 25, 2008  07:38 -0400, Theodore Ts'o wrote:
> Well, what are the alternatives?  Remember, we could have potentially
> 50-100 megabytes of stale metadata that haven't been written to the
> filesystem.  And unlike ext2, we've deliberately held back writing
> back metadata by pinning it, so things could be much worse.  So let's
> tick off the possibilities:
> 
> * An individual data block is bad --- we write complete garbage into
>   the filesystem, which means in the worst case we lose 32 inodes
>   (unless that inode table block is repeated later in the journal), 1
>   directory block (causing files to land in lost+found), one bitmap
>   block (which e2fsck can regenerate), or a data block (if data=journalled).
> 
> * A journal descriptor block is bad --- if it's just a bit-flip, we
>   could end up writing a data block in the wrong place, which would be
>   bad; if it's complete garbage, we will probably assume the journal
>   ended early, and leave the filesystem silently badly corrupted.
> 
> * The journal commit block is bad --- probably we will just silently
>   assume the journal ended early, unless the bit-flip happened exactly
>   in the CRC field.
> 
> The most common case is that one or more individual data blocks in the
> journal are bad, and the question is whether writing that garbage into
> the filesystem is better or worse than aborting the journal right then
> and there.

You are focussing on the case where 1 or 2 filesystem blocks in the
journal are bad, but I suspect the real-world cases are more likely to
involve 1 or 2MB of bad data, or more.  Considering that a disk sector
is at least 4 or 64kB in size, and that problems like track misalignment
(overpowered seek), write failure (high-flying write), or device cache
reordering will result in a large number of bad blocks in the journal,
I don't think 1 or 2 bad filesystem blocks is a realistic failure
scenario anymore.

> The problem with only replaying the "good" part of the journal is the
> kernel then truncates the journal, and it leaves e2fsck with no way of
> doing anything intelligent afterwards.  So another possibility is to
> not replay the journal at all, and fail the mount unless the
> filesystem is being mounted read-only; but the question is whether we
> are better off not replaying the journal at *all*, or just replaying
> part of it.

I'd think that at a minimum we should replay the journal up to the bad
transaction.  That the current code also replays the bad transaction is
of course incorrect.  The probability that a later transaction has
begun checkpointing its blocks to the filesystem decreases with each
transaction after the bad one, so the probability of those changes
corrupting the filesystem is correspondingly lower.
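
As a rough sketch of "replay up to the bad transaction" (the struct and
function names here are made up for illustration and are not the JBD2
recovery code), the replay loop would simply stop at the first transaction
whose checksum does not verify:

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical in-memory summary of one journal transaction.
 * This is NOT the on-disk JBD2 layout, only an illustration. */
struct txn {
    uint32_t sequence;    /* transaction ID */
    bool     checksum_ok; /* did the commit checksum verify? */
};

/*
 * Return how many leading transactions are safe to replay:
 * everything before the first one whose checksum failed.
 * The bad transaction and everything after it are skipped.
 */
size_t replayable_prefix(const struct txn *txns, size_t count)
{
    size_t i;

    for (i = 0; i < count; i++) {
        if (!txns[i].checksum_ok)
            break;
    }
    return i;
}
```

With transactions 10, 11, 12, 13 where only 12 fails its checksum, this
returns 2, so only 10 and 11 would be replayed.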

> Consider that if /boot/grub/menu.lst got written, and one of its data
> blocks was previously a directory block that had since been deleted,
> but was in the journal and had been revoked, replaying part of the
> journal might make the system non-bootable.

Sure, such scenarios exist, but the architecture of ext3/4 is that the
data block will _likely_ have been rewritten in the same place.  The
more likely case is that some important filesystem metadata (itable,
indirect blocks of files, etc) is being overwritten, and corruption in
the journal is a laser-guided missile for finding all of the important
blocks in the filesystem and spreading that corruption to them.

> So the other alternative I seriously considered was not replaying the
> journal at all, and bailing out after seeing the bad checksum --- but
> that just defers the problem to e2fsck, and e2fsck can't really do
> anything much different, and the tools to allow a human to make a
> decision on a block by block basis in the journal don't exist, and
> even if they did would make more system administrators run screaming.
> 
> I suspect the *best* approach is to change the journal format one more
> time, and include a CRC on a per-block basis in the descriptor blocks,
> and a CRC for the entire descriptor block.  That way, we can decide
> what to replay or not on a per-block basis.

Yes, I was thinking exactly this same thing.  This would give the maximum
probability of the correct outcome, because only "correct" blocks are
checkpointed into the filesystem, and at least an old version of the
block is present in the filesystem (unless it is a new block).  The chance
also exists that a later transaction will even overwrite the bad block,
which will avoid even the need to invoke e2fsck.

This would need:
- a checksum in the per-block transaction record (tag).  One option is
  to keep an 8- or 16-bit checksum in the "flags" field, to keep it
  compatible with older JBD implementations.
- a checksum of the commit header and tags to ensure we can trust the
  per-block checksums, and we don't need a huge checksum for each block.
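
A minimal sketch of the two items above, under stated assumptions: the tag
layout, the choice of the high 16 bits of "flags", the CRC-16 used as the
checksum, and every name below are hypothetical, and the real on-disk tag is
big-endian rather than host-order.  It also assumes older implementations
only look at the low flag bits.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define TAG_FLAG_MASK  0x0000ffffu /* low bits: existing tag flags (assumed) */
#define TAG_CSUM_SHIFT 16          /* high 16 bits: hypothetical block checksum */

/* Hypothetical host-order tag, two 32-bit fields, no padding. */
struct block_tag {
    uint32_t blocknr;
    uint32_t flags;
};

/* Bitwise CRC-16 (polynomial 0x1021, initial value 0xffff) as a stand-in
 * for whatever small checksum would actually be chosen. */
uint16_t csum16(const uint8_t *data, size_t len)
{
    uint16_t crc = 0xffff;

    for (size_t i = 0; i < len; i++) {
        crc ^= (uint16_t)((uint16_t)data[i] << 8);
        for (int bit = 0; bit < 8; bit++)
            crc = (crc & 0x8000) ? (uint16_t)((crc << 1) ^ 0x1021)
                                 : (uint16_t)(crc << 1);
    }
    return crc;
}

/* Store the block's checksum in the high half of flags, keeping the low bits. */
void tag_set_csum(struct block_tag *tag, const uint8_t *block, size_t len)
{
    assert(tag != NULL && block != NULL);
    tag->flags = (tag->flags & TAG_FLAG_MASK) |
                 ((uint32_t)csum16(block, len) << TAG_CSUM_SHIFT);
}

/* Per-block check: does the journalled block match the checksum in its tag? */
bool tag_block_ok(const struct block_tag *tag, const uint8_t *block, size_t len)
{
    assert(tag != NULL && block != NULL);
    return (uint16_t)(tag->flags >> TAG_CSUM_SHIFT) == csum16(block, len);
}

/* Checksum over the tag array itself, so the per-block checksums can be trusted. */
bool tags_trusted(const struct block_tag *tags, size_t ntags, uint16_t stored)
{
    assert(tags != NULL);
    return csum16((const uint8_t *)tags, ntags * sizeof(*tags)) == stored;
}
```

Recovery would then replay a block only if tags_trusted() holds for its tag
array and tag_block_ok() holds for that block, and skip it otherwise so the
older copy in the filesystem is left alone.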

Cheers, Andreas
--
Andreas Dilger
Sr. Staff Engineer, Lustre Group
Sun Microsystems of Canada, Inc.

