* How to avoid complete rebuild of RAID 6 array (6/8 active devices)
@ 2008-06-25 6:37 Dave Moon
2008-06-25 16:13 ` Andre Noll
0 siblings, 1 reply; 11+ messages in thread
From: Dave Moon @ 2008-06-25 6:37 UTC (permalink / raw)
To: linux-raid
Hi Everyone,
First off, a little background on my setup.
Ubuntu 7.10 i386 Server (2.6.22-14-server)
upgraded to
Ubuntu 8.04 i386 Server (2.6.24-19-server)
I have 8 SATA drives connected and the drives are organized into three
md RAID arrays as follows:
/dev/md1: ext3 partition mounted as /boot, composed of 8 members (RAID
1) (sda1/b1/c1/d1/e1/f1/g1/h1)
/dev/md2: ext3 partition mounted as /root, composed of 8 members (RAID
1) (sda2/b2/c2/d2/e2/f2/g2/h2)
/dev/md3: ext3 partition mounted as /mnt/raid-md3, composed of 8
members (RAID 6) (sda3/b3/c3/d3/e3/f3/g3/h3), this is the main data
partition holding 2.7TiBs worth of data
All the raid member partitions are set to type "fd" (Linux RAID
Autodetect).
Important Note: 6 of the drives are connected to two Sil3114 SATA
controller cards whilst 2 of the drives are connected to the on-board
SATA controller (I don't know which model it is).
After upgrading my Ubuntu installation to 8.04, on restart there was
an error message saying that my RAID arrays were degraded and that the
system was therefore unable to boot from them.
At the time, not knowing the cause of the sudden RAID failure, I
attempted to force mdadm to start the arrays anyway (the RAID 1 arrays
with 8 members each were no cause for concern, of course, but I wanted
to back up the data on the degraded md3 array as soon as possible).
Then it hit me: why would it recognize only 6 drives? Apparently the
kernel has compatibility problems with certain SATA controllers, and
my on-board controller chip was one of them.
Sure enough, after moving all 8 drives to the Silicon Image
controllers, the drives were all recognized without any problems.
If the missing drives had been recognized again before the array was
ever brought up again, everything would have been fine. But
unfortunately, I had already forced mdadm (with the --run switch) to
bring it online with 2 missing members.
This is when the problem began. I know that as soon as I re-add the
two missing drives back into the md3 (RAID 6) array, the system will
attempt to rebuild the array, using the information from the remaining
6 drives.
Given the size of the array and the type of disk drives being used
(off-the-shelf SATA drives with an unrecoverable bit error rate of
about 1 in 10^14 bits), I think there is a significant chance that the
system will encounter one or more bit errors during the rebuild.
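(Rough back-of-the-envelope, assuming the drives are about 500 GB
each, which I am only inferring from the 2.7 TiB of data: a rebuild
has to read roughly 6 x 500 GB = 3 x 10^12 bytes = 2.4 x 10^13 bits,
so at 1 error per 10^14 bits the expected number of read errors is
about 0.24, i.e. something like a one-in-five chance of hitting at
least one.)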
Anyway, I panicked and brought the md3 array down first to prevent
possible further damage.
So, at this stage what I'm wondering is:
1. If mdadm encounters a bit error during a RAID 6 rebuild, will it
just give up on that particular file and move on to recover other data
on the array? Or will it trash the entire array?
2. Is it possible to cheat mdadm by somehow replacing the new "raid
metadata" on the 6 drives with the old data on the 2 drives? Will it
make mdadm think the array is clean, consistent and nothing ever
happened?
Please do note that I did not write ANY new data onto the RAID 6
array from the time it was degraded until the time I brought it down
(with --stop).
Sorry for the long post and thank you for your time in advance. I
really hope to get this RAID array back up without data corruption
because I don't have a working backup of the array (I know, very
stupid of me).
Dave
* Re: How to avoid complete rebuild of RAID 6 array (6/8 active devices)
2008-06-25 6:37 How to avoid complete rebuild of RAID 6 array (6/8 active devices) Dave Moon
@ 2008-06-25 16:13 ` Andre Noll
2008-06-27 10:40 ` Neil Brown
0 siblings, 1 reply; 11+ messages in thread
From: Andre Noll @ 2008-06-25 16:13 UTC (permalink / raw)
To: Dave Moon; +Cc: linux-raid
On 15:37, Dave Moon wrote:
> 1. If mdadm encounters a bit error during a RAID 6 rebuild, will it
> just give up on that particular file and move on to recover other data
> on the array? Or will it trash the entire array?
The kernel will stop the array and give up.
> 2. Is it possible to cheat mdadm by somehow replacing the new "raid
> metadata" on the 6 drives with the old data on the 2 drives? Will it
> make mdadm think the array is clean, consistent and nothing ever
> happened?
> Please do note that I did not write ANY new data onto the RAID 6 array
> from the time it was degraded until the time I brought it down with (--
> stop).
Use --force, Luke. Man mdadm(8):
-f, --force Assemble the array even if some superblocks
appear out-of-date
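Concretely, something along these lines should do it (a sketch only;
the array and partition names are taken from your description and
should be double-checked, e.g. with mdadm --examine, before running
anything):

  # make sure the partially started array is stopped first
  mdadm --stop /dev/md3
  # force-assemble from all eight members, accepting the two
  # out-of-date superblocks
  mdadm --assemble --force /dev/md3 /dev/sd[a-h]3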
Andre
--
The only person who always got his work done by Friday was Robinson Crusoe
* Re: How to avoid complete rebuild of RAID 6 array (6/8 active devices)
2008-06-25 16:13 ` Andre Noll
@ 2008-06-27 10:40 ` Neil Brown
2008-06-29 21:58 ` Bill Davidsen
0 siblings, 1 reply; 11+ messages in thread
From: Neil Brown @ 2008-06-27 10:40 UTC (permalink / raw)
To: Andre Noll; +Cc: Dave Moon, linux-raid
On Wednesday June 25, maan@systemlinux.org wrote:
> On 15:37, Dave Moon wrote:
>
> > 1. If mdadm encounters a bit error during a RAID 6 rebuild, will it
> > just give up on that particular file and move on to recover other data
> > on the array? Or will it trash the entire array?
>
> The kernel will stop the array and give up.
Not quite. It will stop the recovery. It won't stop the whole array
though (I think...).
>
> > 2. Is it possible to cheat mdadm by somehow replacing the new "raid
> > metadata" on the 6 drives with the old data on the 2 drives? Will it
> > make mdadm think the array is clean, consistent and nothing ever
> > happened?
>
> > Please do note that I did not write ANY new data onto the RAID 6 array
> > from the time it was degraded until the time I brought it down with (--
> > stop).
>
> Use --force, Luke. Man mdadm(8):
>
> -f, --force Assemble the array even if some superblocks
> appear out-of-date
--force only updates enough superblocks to assemble a working array.
For raid6, that means n-2 drives. As there are already n-2 good
drives, it won't try any harder.
Your best bet is to recreate the array with --assume-clean.
Provided you keep the chunk size, order of devices, etc. the same, you
should get your array back.
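For example, something like this (only a sketch; the chunk size,
metadata version and exact device order below are assumptions and must
be verified against 'mdadm --examine' output from the old superblocks
before you run anything, because --create rewrites the superblocks):

  # record the current layout of every member first
  mdadm --examine /dev/sd[a-h]3
  # re-create the array in place without triggering a resync;
  # every parameter must match the original array exactly
  mdadm --create /dev/md3 --assume-clean --level=6 --raid-devices=8 \
        --chunk=64 --metadata=0.90 \
        /dev/sda3 /dev/sdb3 /dev/sdc3 /dev/sdd3 \
        /dev/sde3 /dev/sdf3 /dev/sdg3 /dev/sdh3

Then check the filesystem read-only before trusting the result.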
NeilBrown
* Re: How to avoid complete rebuild of RAID 6 array (6/8 active devices)
2008-06-27 10:40 ` Neil Brown
@ 2008-06-29 21:58 ` Bill Davidsen
2008-07-14 10:44 ` Matthias Urlichs
0 siblings, 1 reply; 11+ messages in thread
From: Bill Davidsen @ 2008-06-29 21:58 UTC (permalink / raw)
To: Neil Brown; +Cc: Andre Noll, Dave Moon, linux-raid
Neil Brown wrote:
> On Wednesday June 25, maan@systemlinux.org wrote:
>
>> On 15:37, Dave Moon wrote:
>>
>>
>>> 1. If mdadm encounters a bit error during a RAID 6 rebuild, will it
>>> just give up on that particular file and move on to recover other data
>>> on the array? Or will it trash the entire array?
>>>
>> The kernel will stop the array and give up.
>>
>
> Not quite. It will stop the recovery. It won't stop the whole array
> though (I think...).
>
>
>>> 2. Is it possible to cheat mdadm by somehow replacing the new "raid
>>> metadata" on the 6 drives with the old data on the 2 drives? Will it
>>> make mdadm think the array is clean, consistent and nothing ever
>>> happened?
>>>
>>> Please do note that I did not write ANY new data onto the RAID 6 array
>>> from the time it was degraded until the time I brought it down with (--
>>> stop).
>>>
>> Use --force, Luke. Man mdadm(8):
>>
>> -f, --force Assemble the array even if some superblocks
>> appear out-of-date
>>
>
> --force only updates enough superblocks to assemble a working array.
> For raid6, that means n-2 drives. As there are already n-2 good
> drives, it won't try any harder.
>
> Your best bet is to recreate the array with --assume-clean.
> Provided you keep the chunk size, order of devices, etc. the same,
> you should get your array back.
>
Then what? What's going to happen when he does a check?
Of course, building an array out of drives so unstable that you can't
safely *run* a check is another topic.
--
Bill Davidsen <davidsen@tmr.com>
"Woe unto the statesman who makes war without a reason that will still
be valid when the war is over..." Otto von Bismark
* Re: How to avoid complete rebuild of RAID 6 array (6/8 active devices)
2008-06-29 21:58 ` Bill Davidsen
@ 2008-07-14 10:44 ` Matthias Urlichs
2008-07-14 16:14 ` David Greaves
0 siblings, 1 reply; 11+ messages in thread
From: Matthias Urlichs @ 2008-07-14 10:44 UTC (permalink / raw)
To: linux-raid
On Sun, 29 Jun 2008 17:58:30 -0400, Bill Davidsen wrote:
> Of course, building an array out of drives so unstable that you
> can't safely *run* a check is another topic.
It's a topic that needs to be addressed sooner or later, however.
Let's face it, drives do develop bad spots.
Tossing a perfectly good drive because 0.0000064% of the data cannot
be read is wasteful (assuming a 64-kByte bad area on a 1-terabyte
disk).
My basic approach would be: whenever a read error is encountered,
tell the disk drive to fix the bad area (either by rewriting the
problem area, by hardware reallocation, or by using devmapper), fix
the data (either tell the RAID driver that this particular area needs
to be recovered, or do it in userspace), and re-add the drive.
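For the "rewrite the problem area" part, a minimal sketch might look
like this (the device name and sector number are placeholders, not
real values; the sector's old contents are destroyed and must then be
rebuilt from parity):

  # overwrite one unreadable 512-byte sector so the drive can
  # reallocate it from its spare pool
  dd if=/dev/zero of=/dev/sdX bs=512 count=1 seek=123456 oflag=direct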
So ... is there some userspace code which, given a bunch of RAID disks,
can rebuild the array? Limiting said rebuild to one particular area on
one particular disk should then be reasonably easy.
--
Matthias Urlichs
* Re: How to avoid complete rebuild of RAID 6 array (6/8 active devices)
2008-07-14 10:44 ` Matthias Urlichs
@ 2008-07-14 16:14 ` David Greaves
2008-07-14 16:54 ` David Lethe
2008-07-14 22:58 ` Matthias Urlichs
0 siblings, 2 replies; 11+ messages in thread
From: David Greaves @ 2008-07-14 16:14 UTC (permalink / raw)
To: Matthias Urlichs; +Cc: linux-raid
Matthias Urlichs wrote:
> On Sun, 29 Jun 2008 17:58:30 -0400, Bill Davidsen wrote:
>
>> Of course, building an array out of drives so unstable that you
>> can't safely *run* a check is another topic.
>
> It's a topic that needs to be addressed sooner or later, however.
>
> Let's face it, drives do develop bad spots.
>
> Tossing a perfectly good drive because 0.0000064% of the data cannot
> be read is wasteful (assuming a 64-kByte bad area on a 1-terabyte
> disk).
I've found that once a disk starts to go bad there is a very strong tendency for
it to continue to deteriorate.
So I don't replace disks because they have a bad sector; I replace them because
I suspect they will fail more as time goes by.
Sure, some don't - I don't want to take that chance.
David
* RE: How to avoid complete rebuild of RAID 6 array (6/8 active devices)
2008-07-14 16:14 ` David Greaves
@ 2008-07-14 16:54 ` David Lethe
2008-07-14 22:58 ` Matthias Urlichs
1 sibling, 0 replies; 11+ messages in thread
From: David Lethe @ 2008-07-14 16:54 UTC (permalink / raw)
To: David Greaves, Matthias Urlichs; +Cc: linux-raid
-----Original Message-----
From: linux-raid-owner@vger.kernel.org [mailto:linux-raid-owner@vger.kernel.org] On Behalf Of David Greaves
Sent: Monday, July 14, 2008 11:15 AM
To: Matthias Urlichs
Cc: linux-raid@vger.kernel.org
Subject: Re: How to avoid complete rebuild of RAID 6 array (6/8 active devices)
>I've found that once a disk starts to go bad there is a very strong tendency for
>it to continue to deteriorate.
>So I don't replace disks because they have a bad sector; I replace them because
>I suspect they will fail more as time goes by.
>Sure, some don't - I don't want to take that chance.
>David
Different disk technologies (SAS, SATA, FC, etc.) and the root causes for a "bad" sector vary widely. It is unwise to assume that a certain disk will have a "strong" tendency to fail after a "bad" sector.
Without analyzing the sense codes or S.M.A.R.T. logs returned by the failed read, you can't make any assumptions about how the error affects the health of the drive, or even just of the media.
Now if your experience is limited to consumer-class ATA/SATA disks, then I agree there is a higher probability of a 2nd failure than on enterprise-class fibre channel disks, but it is certainly not a "strong tendency" toward failure. There isn't even a likelihood of failure. (It is cause for concern, though, especially if the disk was just put into service.)
It could be an ECC error that was caused by improper power off or a bad cable.
David @ santools.com
* Re: How to avoid complete rebuild of RAID 6 array (6/8 active devices)
2008-07-14 16:14 ` David Greaves
2008-07-14 16:54 ` David Lethe
@ 2008-07-14 22:58 ` Matthias Urlichs
2008-07-14 23:54 ` Richard Scobie
2008-07-15 14:24 ` Keld Jørn Simonsen
1 sibling, 2 replies; 11+ messages in thread
From: Matthias Urlichs @ 2008-07-14 22:58 UTC (permalink / raw)
To: David Greaves; +Cc: linux-raid
Hi,
David Greaves:
> I've found that once a disk starts to go bad there is a very strong
> tendency for it to continue to deteriorate.
>
In my experience, that's true for older disks, but not necessarily for
those that are new and simply have a spot or two where the magnetizable
layer is a wee bit too thin.
However, even if they do in fact continue to deteriorate, the ability to
re-map the offending areas and continue gives me an order of magnitude
more time to deal with the problem.
In fact, as I said, there may be problems lurking on other disks which I
just haven't found yet (how often do you read all 5TB of your data?),
which means that a feature like this is the difference between being
able to recover and certain data loss, RAID-6 notwithstanding.
NB, one other problem I've observed (older kernel, I don't know if it's
been fixed) is that a resync is restarted from the beginning when a
fault on a second disk is encountered. BAD idea.
NB2, my ideal RAID recovery scenario looks like this:
* When a disk access fails, the offender is switched to write-only mode.
I.e., the kernel ignores it when reading, but still tries to write
correct data when something's updated.
* In order to re-sync a new disk, simply duplicate the old one if it
hasn't been removed yet; of course, you need to do "real" recovery for
the bad spots, and you need the aforementioned write-only code to
update both (when writing to the area that's already synced up).
The _huge_ advantage of this process would be that a re-sync does not
affect the array's read performance at all (other than the higher CPU
usage). For some people, that can be quite important.
Now where can I get the largish chunk of time required to implement all
of this ... oh well.
--
Matthias Urlichs | {M:U} IT Design @ m-u-it.de | smurf@smurf.noris.de
Disclaimer: The quote was selected randomly. Really. | http://smurf.noris.de
- -
The way to a man's heart is through the left ventricle.
* Re: How to avoid complete rebuild of RAID 6 array (6/8 active devices)
2008-07-14 22:58 ` Matthias Urlichs
@ 2008-07-14 23:54 ` Richard Scobie
2008-07-15 0:05 ` Matthias Urlichs
2008-07-15 14:24 ` Keld Jørn Simonsen
1 sibling, 1 reply; 11+ messages in thread
From: Richard Scobie @ 2008-07-14 23:54 UTC (permalink / raw)
To: Matthias Urlichs; +Cc: David Greaves, linux-raid
Matthias Urlichs wrote:
> In fact, as I said, there may be problems lurking on other disks which I
> just haven't found yet (how often do you read all 5TB of your data?),
> which means that a feature like this is the difference between being
> able to recover and certain data loss, RAID-6 notwithstanding.
This is why scheduled "check" or "repairs" should be done - see
Documentation/md.txt in the kernel source for details.
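For example (assuming the array is md3, as in the original post):

  # read every stripe; unreadable sectors are rewritten from parity
  echo check > /sys/block/md3/md/sync_action
  # or additionally correct any parity mismatches that are found
  echo repair > /sys/block/md3/md/sync_action
  # progress shows up in /proc/mdstat
  cat /proc/mdstat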
Regards,
Richard
* Re: How to avoid complete rebuild of RAID 6 array (6/8 active devices)
2008-07-14 23:54 ` Richard Scobie
@ 2008-07-15 0:05 ` Matthias Urlichs
0 siblings, 0 replies; 11+ messages in thread
From: Matthias Urlichs @ 2008-07-15 0:05 UTC (permalink / raw)
To: Richard Scobie; +Cc: David Greaves, linux-raid
Hi,
Richard Scobie:
> This is why scheduled "check" or "repairs" should be done - see
> Documentation/md.txt in the kernel source for details.
>
While I agree with that idea in general, the time required for such
checks tends to grow faster than the time to read a whole disk.
Besides: even if you run regular checks, your data is still going to
be safer if you keep the disk online until its replacement is ready.
--
Matthias Urlichs | {M:U} IT Design @ m-u-it.de | smurf@smurf.noris.de
Disclaimer: The quote was selected randomly. Really. | http://smurf.noris.de
- -
Cartoon Law IV: The time required for an object to fall twenty stories
is greater than or equal to the time it takes for whoever knocked it off
the ledge to spiral down twenty flights to attempt to capture it unbroken.
* Re: How to avoid complete rebuild of RAID 6 array (6/8 active devices)
2008-07-14 22:58 ` Matthias Urlichs
2008-07-14 23:54 ` Richard Scobie
@ 2008-07-15 14:24 ` Keld Jørn Simonsen
1 sibling, 0 replies; 11+ messages in thread
From: Keld Jørn Simonsen @ 2008-07-15 14:24 UTC (permalink / raw)
To: Matthias Urlichs; +Cc: David Greaves, linux-raid
On Tue, Jul 15, 2008 at 12:58:16AM +0200, Matthias Urlichs wrote:
> Hi,
>
> However, even if they do in fact continue to deteriorate, the ability to
> re-map the offending areas and continue gives me an order of magnitude
> more time to deal with the problem.
>
> In fact, as I said, there may be problems lurking on other disks which I
> just haven't found yet (how often do you read all 5TB of your data?),
> which means that a feature like this is the difference between being
> able to recover and certain data loss, RAID-6 notwithstanding.
One idea about this: one could read and rewrite the disks
periodically, say once a month. That way, single-bit errors that have
developed on the disks could be repaired, as the sector CRC/ECC still
recovers a one-bit error on read and the rewrite corrects it on the
medium. For a raid, if an error occurs, the sound data could be used,
and if the error persists after a rewrite on the bad disk, that data
should then be remapped to a sound area on the drive. Maybe people
have already implemented this. SMART data could also be consulted.
I thought of badblocks -n to do this, but a raid check could also be
the place to do it. When writing, one should of course take care that
nobody else is writing the same data.
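A sketch of the badblocks variant (the device name is a placeholder;
-n must only be used on a member that is not in active use):

  # non-destructive read-write test: read each block, write a test
  # pattern, verify, then restore the original contents
  badblocks -nsv /dev/sdX3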
best regards
keld