From: Ric Wheeler <ricwheeler@gmail.com>
To: Andreas Dilger <adilger@sun.com>
Cc: Theodore Tso <tytso@MIT.EDU>,
	linux-ext4@vger.kernel.org,
	Girish Shilamkar <Girish.Shilamkar@sun.com>
Subject: Re: What to do when the journal checksum is incorrect
Date: Mon, 26 May 2008 17:28:30 -0400
Message-ID: <483B2B7E.3060906@gmail.com>
In-Reply-To: <20080526182428.GT3516@webber.adilger.int>

Andreas Dilger wrote:
> On May 25, 2008  07:38 -0400, Theodore Ts'o wrote:
>> Well, what are the alternatives?  Remember, we could have potentially
>> 50-100 megabytes of stale metadata that haven't been written to
>> filesystem.  And unlike ext2, we've deliberately held back writing
>> back metadata by pinning it so, things could be much worse.  So let's
>> tick off the possibilities:
>>
>> * An individual data block is bad --- we write complete garbage into
>>   the filesystem, which means in the worst case we lose 32 inodes
>>   (unless that inode table block is repeated later in the journal), 1
>>   directory block (causing files to land in lost+found), one bitmap
>>   block (which e2fsck can regenerate), or a data block (if data=journalled).
>>
>> * A journal descriptor block is bad --- if it's just a bit-flip, we
>>   could end up writing a data block in the wrong place, which would be
>>   bad; if it's complete garbage, we will probably assume the journal
>>   ended early, and leave the filesystem silently badly corrupted.
>>
>> * The journal commit block is bad --- probably we will just silently
>>   assume the journal ended early, unless the bit-flip happened exactly
>>   in the CRC field.
>>
>> The most common case is that one or more individual data blocks in the
>> journal are bad, and the question is whether writing that garbage into
>> the filesystem is better or worse than aborting the journal right then
>> and there.
> 
> You are focussing on the case where 1 or 2 filesystem blocks in the
> journal are bad, but I suspect the real-world cases are more likely to
> be 1 or 2MB of data are bad, or more.  Considering that a disk sector
> is at least 4 or 64kB in size, and problems like track misalignment
> (overpowered seek), write failure (high-flying write), or device cache
> reordering problems will result in a large number of bad blocks in the
> journal, I don't think 1 or 2 filesystem blocks is a realistic failure scenario
> anymore.

Disk sectors are still (almost always) 512 bytes today, but the industry is 
pushing hard to get 4 KB sectors out, since that promises better data 
protection and a denser layout. Disk arrays have internal "sectors" 
that can be really big (64 KB or bigger).

What seems to be most common is a small number of bad sectors that will be 
unreadable (IO errors on read). I would be surprised to see megabytes of 
contiguous errors, but you could see tens of kilobytes.

What the checksums will probably be most useful for catching is problems with 
memory parts, such as non-ECC DRAM in your server or bad DRAM in the disk cache 
itself. The interesting thing about these errors is that they tend to repeat 
(depending on where the stuck bit is), so you can see them all over the place.
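
To illustrate the "repeating" pattern, here is a minimal user-space sketch, not
jbd2 code. It assumes a hypothetical buffer with one bit stuck at 1 at a fixed
byte offset, and uses zlib.crc32 as a stand-in for whatever checksum the
journal actually uses. The function names and the 4096-byte block size are
assumptions made for the example.

```python
import zlib

BLOCK_SIZE = 4096  # assumed filesystem block size, for illustration only


def pass_through_stuck_bit(block: bytes, offset: int, bit: int) -> bytes:
    """Return the block as seen after passing through a buffer whose
    bit `bit` of byte `offset` is stuck at 1 (hypothetical fault model)."""
    out = bytearray(block)
    out[offset] |= 1 << bit
    return bytes(out)


def count_checksum_failures(blocks, offset: int, bit: int) -> int:
    """Checksum each block before the faulty buffer, verify after it,
    and count the mismatches."""
    failures = 0
    for block in blocks:
        expected = zlib.crc32(block)
        seen = pass_through_stuck_bit(block, offset, bit)
        if zlib.crc32(seen) != expected:
            failures += 1
    return failures
```

Every block whose data has a 0 at the stuck position fails verification, and
blocks that already have a 1 there pass, so the failures recur across many
unrelated blocks rather than clustering on one disk location.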

One thing that would be really neat is to actually put in counters to track the 
error rate and validate these assumptions.
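
As a rough idea of what such counters could record, here is a small sketch
under stated assumptions. The class name, the error kinds ("io_error",
"checksum_mismatch") and the per-block record() call are all hypothetical and
not an existing jbd2 interface. It tracks the overall error rate, a count per
error kind, and the longest contiguous run of bad blocks, which is the number
that would settle the "a few sectors" versus "megabytes" question.

```python
from collections import Counter


class JournalErrorStats:
    """Hypothetical counters for journal blocks examined during replay."""

    def __init__(self, block_size=4096):
        self.block_size = block_size   # assumed block size in bytes
        self.blocks_checked = 0
        self.errors = Counter()        # error kind -> number of blocks
        self.bad_runs = []             # lengths of contiguous bad runs
        self._run = 0                  # length of the run in progress

    def record(self, kind=None):
        """Record one block; kind is None for a good block."""
        self.blocks_checked += 1
        if kind is None:
            self._close_run()
        else:
            self.errors[kind] += 1
            self._run += 1

    def _close_run(self):
        if self._run:
            self.bad_runs.append(self._run)
            self._run = 0

    def summary(self):
        self._close_run()
        total = sum(self.errors.values())
        rate = total / self.blocks_checked if self.blocks_checked else 0.0
        return {
            "error_rate": rate,
            "by_kind": dict(self.errors),
            "longest_bad_run_bytes": max(self.bad_runs, default=0) * self.block_size,
        }
```

Calling record() once per journal block and summary() at the end of replay
would give the numbers needed to compare the failure models discussed above.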


ric
