linux-raid.vger.kernel.org archive mirror
* Checking consistency of Linux software RAID
@ 2003-06-30 12:58 Martin Bene
  2003-06-30 13:13 ` Gordon Henderson
                   ` (2 more replies)
  0 siblings, 3 replies; 14+ messages in thread
From: Martin Bene @ 2003-06-30 12:58 UTC (permalink / raw)
  To: linux-raid

Hi,

Administering quite a few systems with HW raid controllers, I've come to
really like a feature that seems to be missing from current SW raid: 

Scheduling a (weekly) complete media scan where all surfaces of all drives
get read; in case of read errors a repair is tried: the content for the
failed sector is reconstructed just as if the drive had completely failed and
rewritten to the failed sector; if reading works afterwards, regard the
repair as successful and continue using the drive.
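
A minimal sketch of the scan-and-repair pass described above, written as a toy in-memory model rather than anything a real controller or the md driver does. The names `UNREADABLE`, `xor_blocks` and `scrub` are hypothetical, drives are modelled as Python lists of sectors, and redundancy is assumed to be plain XOR parity (with two drives this degenerates to a mirror copy).

```python
from functools import reduce

UNREADABLE = None  # hypothetical marker: a sector that returns a read error


def xor_blocks(blocks):
    """XOR equal-length byte strings together (XOR parity; a copy if only one block)."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


def scrub(drives):
    """Read every sector of every drive and repair unreadable ones in place.

    drives[d][s] is the content of sector s on drive d, or UNREADABLE.
    Returns (repaired, unrecoverable), each a list of (drive, sector) pairs.
    """
    repaired, unrecoverable = [], []
    for s in range(len(drives[0])):
        for d, drive in enumerate(drives):
            if drive[s] is not UNREADABLE:
                continue  # sector read fine, nothing to do
            others = [other[s] for i, other in enumerate(drives) if i != d]
            if any(block is UNREADABLE for block in others):
                # A second failure in the same stripe: cannot reconstruct.
                unrecoverable.append((d, s))
                continue
            # Reconstruct as if the drive had failed, then rewrite the sector.
            drive[s] = xor_blocks(others)
            # Re-read to verify; in this model a rewrite always succeeds.
            if drive[s] is not UNREADABLE:
                repaired.append((d, s))
    return repaired, unrecoverable
```

In this model the drive stays in the array after a successful repair; only a stripe with two unreadable sectors is reported as unrecoverable.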

Is there any way to do this with SW raid? I truly hate situations where some
sectors on a drive fail silently and you don't notice until a 2nd drive dies
and you find you can't reconstruct your raid data because of silent "bitrot".

Thanks for any hints,

Martin

* Re: Checking consistency of Linux software RAID
@ 2003-07-09  8:06 Martin Bene
  0 siblings, 0 replies; 14+ messages in thread
From: Martin Bene @ 2003-07-09  8:06 UTC (permalink / raw)
  To: Bernd Schubert, Corey McGuire, linux-raid

Hi Bernd,

> /proc/mdstat is to monitor the status of your raid, so when 
> one drive fails it becomes dropped out of the raid-array. 
> Using mdadm you can monitor /proc/mdstat and it even can 
> send you a mail when one of your disks fails.

Yes, but it can only notify you of errors that it actually detects; as written
before, my concern is silent "bitrot", i.e. unaccessed data on the disks going
bad.

> So if you really want to scan your disk once a week, why not 
> running 'dd if=/dev/mdX of=/dev/null'? 
> So every block of every raid-disk should become
> read and the md-driver should automatically drop a failing 
> disk  out of the raid.

Umm, no: this only reads each logical block but doesn't read the redundant
information on raid1 or raid5. Meaning: even if a read of the whole MD device
works, it doesn't guarantee that all sectors of all physical devices can
actually be read.

To check for errors, scanning the low-level devices (/dev/sd??) could work,
but still won't help for errors as described further down.
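
A rough sketch of such a scan of one low-level device, assuming a Unix system and read permission on the device node. The function name `scan_device` and the chunk size are made up for illustration. It only detects unreadable regions and repairs nothing, and ordinary buffered reads like these may be served from the page cache rather than from the media.

```python
import os


def scan_device(path, chunk_size=64 * 1024):
    """Read `path` front to back and return (bytes_read, bad_offsets).

    Each chunk is read with os.pread so that a failed read at one offset
    does not stop the scan; the offset is recorded and the scan moves on.
    """
    bad_offsets = []
    bytes_read = 0
    fd = os.open(path, os.O_RDONLY)
    try:
        # Seeking to the end yields the size for regular files and block devices.
        size = os.lseek(fd, 0, os.SEEK_END)
        offset = 0
        while offset < size:
            want = min(chunk_size, size - offset)
            try:
                data = os.pread(fd, want, offset)
                bytes_read += len(data)
            except OSError:
                # Read error somewhere in this chunk: note where it started.
                bad_offsets.append(offset)
            offset += want
    finally:
        os.close(fd)
    return bytes_read, bad_offsets
```

It would be called once per component device of the array (not on /dev/mdX), for example `scan_device("/dev/sda1")`, where the device name is only a placeholder.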

> I guess you could even try to repair a disk when it became 
> dropped out of the raid by running some scripts, but since 
> I never trusted any disk that had failed once, I never worried 
> about it.

If a write is in progress during a power failure, chances are quite high that
you end up with at least one unreadable sector on the drive; repairing these
is quite OK and not a sign of the drive going bad. So having one sector bad
on drive 0 and another sector on drive 1 is not too far-fetched - currently,
there's no good way to recover from such a situation: if you hit the bad
sector on drive 0, drive 0 will be kicked from the array; when you hit the bad
sector on drive 1 during resync, resync will fail.
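
The scenario can be illustrated with a toy model of a two-way mirror; this is a sketch of the behaviour described above, not of the actual md driver code. `BAD`, `simulate_kick_policy` and `sectors_lost` are hypothetical names, and each drive is just a list of sectors.

```python
BAD = None  # hypothetical marker: an unreadable sector


def simulate_kick_policy(drive0, drive1):
    """Two-way mirror where any read error removes the whole drive.

    Reads every sector from drive 0; on the first error drive 0 is kicked
    and a resync is attempted, which has to read all of drive 1.
    Returns a list of event strings.
    """
    events = []
    for sector, block in enumerate(drive0):
        if block is BAD:
            events.append(f"read error on drive 0, sector {sector}: drive 0 kicked")
            break
    else:
        return events  # no read error, array stays healthy
    for sector, block in enumerate(drive1):
        if block is BAD:
            events.append(f"resync failed: drive 1 unreadable at sector {sector}")
            return events
    events.append("resync completed")
    return events


def sectors_lost(drive0, drive1):
    """Sectors unreadable on BOTH drives, i.e. the only ones truly lost."""
    return [s for s, (a, b) in enumerate(zip(drive0, drive1))
            if a is BAD and b is BAD]
```

With one bad sector on each drive at different positions, `sectors_lost` returns an empty list while `simulate_kick_policy` still ends in a failed resync: every sector is readable somewhere, yet the whole-drive policy cannot make use of that.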

With (some) hardware raid solutions, a media scan will find both errors and
rewrite the bad sectors with reconstructed data from the other drives. Quite a
useful feature but not yet possible with linux SW raid.

Bye, Martin


Thread overview: 14+ messages
2003-06-30 12:58 Checking consistency of Linux software RAID Martin Bene
2003-06-30 13:13 ` Gordon Henderson
2003-06-30 13:16   ` Lars Marowsky-Bree
2003-06-30 13:28     ` Gordon Henderson
2003-06-30 13:36       ` Lars Marowsky-Bree
2003-06-30 13:16 ` Lars Marowsky-Bree
2003-07-07 18:29 ` Bernd Schubert
2003-07-07 18:42   ` Corey McGuire
2003-07-08 16:51     ` Bernd Schubert
2003-07-08 21:23       ` software raid hangs Donghui Wen
2003-07-08 21:38         ` Matt Simonsen
2003-07-08 21:41           ` Donghui Wen
2003-07-08 21:47       ` Checking consistency of Linux software RAID Corey McGuire
  -- strict thread matches above, loose matches on Subject: below --
2003-07-09  8:06 Martin Bene