Re: Bad blocks are killing us!

From: Bruce Lowekamp <brucelowekamp@gmail.com>
To: Neil Brown <neilb@cse.unsw.edu.au>
Cc: Guy Watkins <guy@watkins-home.com>, linux-raid@vger.kernel.org
Subject: Re: Bad blocks are killing us!
Date: Wed, 17 Nov 2004 16:58:21 -0500
Message-ID: <e9132f82041117135873807c77@mail.gmail.com>
In-Reply-To: <16793.11589.337840.541169@cse.unsw.edu.au>

2: Thanks for devoting the time for getting this done.  Personally,
for the PATA arrays I use, this approach is a bit overkill---if the
rewrite succeeds, it's ok (unless I start to see repeated errors, in
which case I yank the drive), if the rewrite doesn't succeed, it's
dead and I have to yank the drive.   I don't have any useful
diagnostic tools at linux user-level other than smart badblocks scans,
which would just confirm the bad sectors.  Personally, I wouldn't go
to the effort to keep (parts of) the drive in the array if it can't be
rewritten successfully---I've never seen a drive last long in that
situation, and I think that drive is really dead.  The only problems
I've had in practice have been with multiple accumulated read
errors---and rewriting those would make them go away quickly.  I would
just want the data rewritten at user level, and log the event so I can
monitor the array for failures and look at the smart output or take a
drive offline for testing (with vendor diag tools) if it starts to
have frequent errors.  Naturally, as long as the more complex approach
of kicking to user level allows the user-level to return immediately
to let the kernel rewrite the stripe, I think it's fine.
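
As a rough illustration of the "rewrite on read error, log it, and move on"
policy described above, here is a minimal sketch using toy in-memory disks
with single XOR parity. SimDisk, read_with_repair and the error log are
hypothetical names made up for this example, not md code, and the sketch
assumes a successful write makes the sector readable again.

```python
from functools import reduce


class SimDisk:
    """Toy in-memory disk; sectors listed in `bad` raise on read until rewritten."""

    def __init__(self, sectors, bad=()):
        self.sectors = list(sectors)
        self.bad = set(bad)

    def read(self, n):
        if n in self.bad:
            raise IOError(f"read error at sector {n}")
        return self.sectors[n]

    def write(self, n, data):
        self.sectors[n] = data
        # Assumption: a successful write clears the bad sector.
        self.bad.discard(n)


def xor_blocks(blocks):
    """XOR equal-length byte blocks together, column by column."""
    return bytes(reduce(lambda a, b: a ^ b, col) for col in zip(*blocks))


def read_with_repair(disks, idx, n, error_log):
    """Read sector n from disks[idx]; on a read error, rebuild it from the
    other disks (data plus parity), rewrite it, and log the event."""
    try:
        return disks[idx].read(n)
    except IOError:
        others = [d.read(n) for i, d in enumerate(disks) if i != idx]
        rebuilt = xor_blocks(others)
        disks[idx].write(n, rebuilt)
        error_log.append((idx, n))  # lets a monitor spot repeat offenders
        return rebuilt


# Two data disks plus one parity disk; disk 1 has an unreadable sector 0.
disks = [
    SimDisk([b"\x01\x10"]),
    SimDisk([b"\x02\x20"], bad={0}),
    SimDisk([b"\x03\x30"]),  # parity = disk 0 XOR disk 1
]
log = []
recovered = read_with_repair(disks, 1, 0, log)
```

A monitoring script could then count entries per disk in the log and flag a
drive once it crosses whatever threshold the admin considers "frequent".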

I agree that writing several megabytes is not an issue in any way. 
IMHO, feel free to hang the whole system for a few seconds if
necessary---no one should be using md in an RT-critical application,
and bad blocks are relatively rare.

3: The data scans are an interesting idea.  Right now I run daily smart
short scans and weekly smart long scans to try to catch any bad blocks
before I get multiple errors.  Assuming there aren't any uncaught CRC
errors, I feel comfortable with that approach, but the md-level
approach might be better.  But I'm not sure I see the point of
it---unless you have raid 6 with multiple parity blocks, if a disk
actually has the wrong information recorded on it I don't think you
can detect which drive is bad, just that one of them is.  So I don't
think you gain anything beyond what a standard smart long scan or just
cat'ing the raw device would give you in terms of forcing the whole
drive to be read.
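
To make the single-parity point concrete, here is a small sketch with toy
byte blocks (not md's on-disk layout). It shows that an XOR parity check
notices a silently corrupted block, but that "repairing" any one of the
disks from the others yields a consistent stripe, so the check alone cannot
say which disk is wrong. The function names are made up for this example.

```python
from functools import reduce


def xor_blocks(blocks):
    """XOR equal-length byte blocks together, column by column."""
    return bytes(reduce(lambda a, b: a ^ b, col) for col in zip(*blocks))


def stripe_consistent(data_blocks, parity_block):
    """True if the parity block matches the XOR of the data blocks."""
    return xor_blocks(data_blocks) == parity_block


def plausible_culprits(data_blocks, parity_block):
    """Indices (parity is the last index) of blocks that could be rewritten
    from the others so that the stripe becomes consistent."""
    blocks = list(data_blocks) + [parity_block]
    zero = bytes(len(parity_block))
    culprits = []
    for i in range(len(blocks)):
        others = blocks[:i] + blocks[i + 1:]
        repaired = blocks[:i] + [xor_blocks(others)] + blocks[i + 1:]
        if xor_blocks(repaired) == zero:
            culprits.append(i)
    return culprits


data = [b"\x11\x22", b"\x33\x44", b"\x55\x66"]
parity = xor_blocks(data)

# Flip one bit on "disk 1" without any read error being reported.
corrupted = list(data)
corrupted[1] = b"\x33\x45"

mismatch_detected = not stripe_consistent(corrupted, parity)
suspects = plausible_culprits(corrupted, parity)
```

Here the mismatch is detected, but every disk including parity ends up in
the suspect list, which is the limitation described above.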

Bruce

On Tue, 16 Nov 2004 09:27:17 +1100, Neil Brown <neilb@cse.unsw.edu.au> wrote:

>  2/ Look at recovering from failed reads that can be fixed by a
>     write.  I am considering leveraging the "bitmap resync" stuff for
>     this.  With the bitmap stuff in place, you can let the kernel kick
>     out a drive that has a read error, let user-space have a quick
>     look at the drive and see if it might be a recoverable error, and
>     then give the drive back to the kernel.  It will then do a partial
>     resync based on the bitmap information, thus writing the bad
>     blocks, and all should be fine.  This would mean re-writing
>     several megabytes instead of a few sectors, but I don't think that
>     is a big cost.  There are a few issues that make it a bit less
>     trivial than that, but it will probably be my starting point.
>     The new "faulty" personality will allow this to be tested easily.
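
The "several megabytes instead of a few sectors" trade-off in the quoted
proposal comes from the bitmap tracking whole chunks rather than individual
sectors. A minimal sketch of that arithmetic follows; the 4 MiB chunk size
and 512-byte sector size are assumptions chosen for illustration, not
values taken from md.

```python
SECTOR_BYTES = 512               # assumed sector size
CHUNK_BYTES = 4 * 1024 * 1024    # hypothetical bitmap chunk size


def dirty_chunks(bad_sectors, chunk_bytes=CHUNK_BYTES, sector_bytes=SECTOR_BYTES):
    """Map bad sector numbers to the set of bitmap chunks that contain them."""
    return sorted({(s * sector_bytes) // chunk_bytes for s in bad_sectors})


def resync_bytes(bad_sectors, chunk_bytes=CHUNK_BYTES, sector_bytes=SECTOR_BYTES):
    """Bytes rewritten if every chunk containing a bad sector is resynced."""
    return len(dirty_chunks(bad_sectors, chunk_bytes, sector_bytes)) * chunk_bytes


bad = [10, 11, 9000]
chunks = dirty_chunks(bad)            # chunks that would be marked dirty
rewritten = resync_bytes(bad)         # bytes a partial resync would write
minimum = len(bad) * SECTOR_BYTES     # bytes if only the bad sectors were written
```

With these assumed sizes, three bad sectors spread over two chunks cost
8 MiB of rewriting instead of 1536 bytes.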

-- 
Bruce Lowekamp  (lowekamp@cs.wm.edu)
Computer Science Dept, College of William and Mary
