Do I understand my RAID6 correctly?


* Do I understand my RAID6 correctly?
@ 2011-04-12  8:39 Stefan G. Weichinger
  2011-04-12  9:23 ` Yann Ormanns
  0 siblings, 1 reply; 4+ messages in thread
From: Stefan G. Weichinger @ 2011-04-12  8:39 UTC (permalink / raw)
  To: linux-raid@vger.kernel.org

Today I had a drive fail in a customer's server.

It was part of a RAID6 which seems to have rebuilt onto a spare drive now.

Right now it looks like:

# mdadm -D /dev/md3
/dev/md3:
        Version : 00.90.03
  Creation Time : Thu Dec 20 17:47:07 2007
     Raid Level : raid6
     Array Size : 4391334912 (4187.90 GiB 4496.73 GB)
  Used Dev Size : 731889152 (697.98 GiB 749.45 GB)
   Raid Devices : 8
  Total Devices : 9
Preferred Minor : 3
    Persistence : Superblock is persistent

    Update Time : Tue Apr 12 10:27:45 2011
          State : clean
 Active Devices : 8
Working Devices : 8
 Failed Devices : 1
  Spare Devices : 0

     Chunk Size : 64K

           UUID : e848b637:ca2bde73:9f92f3cc:128cdbad
         Events : 0.47127534

    Number   Major   Minor   RaidDevice State
       0       8       33        0      active sync   /dev/sdc1
       1       8      177        1      active sync   /dev/sdl1
       2       8       65        2      active sync   /dev/sde1
       3       8       81        3      active sync   /dev/sdf1
       4       8       97        4      active sync   /dev/sdg1
       5       8      113        5      active sync   /dev/sdh1
       6       8      129        6      active sync   /dev/sdi1
       7       8      145        7      active sync   /dev/sdj1

       8       8      161        -      faulty spare   /dev/sdk1

My question (just to be sure):

Do I understand it correctly that the system has replaced the failed
/dev/sdk1 with a former spare drive (I don't know its device name right
now) and that I now have a valid RAID6 device with 8 drives in it?

So another 2 of the 8 drives could fail now without losing
data ...

correct?
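
One way to check this is a minimal sketch like the following, which reads
the device counts from the mdadm -D output above and derives how many
further failures the array could absorb. The field values are pasted in as
sample text; the helper name field and the variable names are only
illustrative.

```shell
#!/bin/sh
# Sample fields copied from the mdadm -D output above; on a live system
# they would come from running: mdadm -D /dev/md3
detail='   Raid Devices : 8
 Active Devices : 8
Working Devices : 8
 Failed Devices : 1'

# Print the value after " : " for the given field name (illustrative helper).
field() { printf '%s\n' "$detail" | sed -n "s/^ *$1 : //p"; }

raid_devices=$(field 'Raid Devices')
active_devices=$(field 'Active Devices')

# RAID6 stores two parity blocks per stripe, so it survives two missing members.
parity=2
missing=$((raid_devices - active_devices))
tolerable=$((parity - missing))

echo "members missing: $missing, further failures tolerated: $tolerable"
```

With Raid Devices and Active Devices both at 8, no member is missing, so
the full two-disk redundancy is in place.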

I have to tell the customer what to do, and the degree of redundancy
still available determines how urgent it is to get a new drive into the
system.

I assume I would remove /dev/sdk1 from md3, swap the drive, fdisk it and
re-add sdk1 to md3 (it is already marked failed, so the fail step isn't
necessary anymore). It would then be the new spare drive ... ?
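
As a rough sketch, that sequence could look like the dry run below. It only
prints the commands. It assumes the replacement disk shows up again as
/dev/sdk and that /dev/sdc is a healthy member whose partition table can
serve as a template; both are assumptions, and copying the layout with
sfdisk is just one alternative to partitioning by hand with fdisk.

```shell
#!/bin/sh
# Dry-run sketch: prints the commands, does not touch any device.
ARRAY=/dev/md3
OLD_PART=/dev/sdk1
TEMPLATE_DISK=/dev/sdc   # assumption: healthy member with the wanted layout
NEW_DISK=/dev/sdk        # assumption: new drive appears under the old name

run() { echo "+ $*"; }   # only prints; nothing is executed

plan() {
    # 1. Drop the already-failed member from the array.
    run "mdadm $ARRAY --remove $OLD_PART"
    # 2. (Physically swap the drive, then) copy the partition layout.
    run "sfdisk -d $TEMPLATE_DISK | sfdisk $NEW_DISK"
    # 3. Add the new partition; with all 8 members active it becomes a spare.
    run "mdadm $ARRAY --add ${NEW_DISK}1"
    # 4. Confirm the result.
    run "mdadm -D $ARRAY"
}

plan
```

Device names can change after a hot swap, so the printed commands should be
checked against the actual system before running any of them by hand.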

Thanks for refreshing my RAID-knowledge ;-)
Stefan


* Re: Do I understand my RAID6 correctly?
  2011-04-12  8:39 Do I understand my RAID6 correctly? Stefan G. Weichinger
@ 2011-04-12  9:23 ` Yann Ormanns
  2011-04-12 10:42   ` Stefan G. Weichinger
  0 siblings, 1 reply; 4+ messages in thread
From: Yann Ormanns @ 2011-04-12  9:23 UTC (permalink / raw)
  To: lists; +Cc: linux-raid@vger.kernel.org

Subject: Do I understand my RAID6 correctly?
From: Stefan G. Weichinger <lists@xunil.at>
To: Yann Ormanns <yann-ormanns@web.de>
Date: 2011-04-12 11:11 (+0200)

>      Array Size : 4391334912 (4187.90 GiB 4496.73 GB)
>   Used Dev Size : 731889152 (697.98 GiB 749.45 GB)
>    Raid Devices : 8
>   Total Devices : 9
> Preferred Minor : 3
>     Persistence : Superblock is persistent
> 
>  Active Devices : 8
> Working Devices : 8
>  Failed Devices : 1

You now have 9 devices in this array: 8 active members plus the failed
disk, which is now listed as a faulty spare. Capacity-wise, 8 * 750 GB
= 6 TB, minus 2 * 750 GB for parity, gives 4.5 TB. With all 8 members
active, the array can "lose" two disks without losing data, as you
already wrote.
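
The same arithmetic can be checked against the exact figures in the
mdadm -D output (which reports sizes in KiB). This is only a small sketch;
the variable names are made up for illustration, and it needs a shell with
64-bit arithmetic, which is the norm on current systems.

```shell
#!/bin/sh
# Values taken from the mdadm -D output quoted above (sizes in KiB).
used_dev_size_kib=731889152
raid_devices=8
parity_devices=2   # RAID6 uses two devices' worth of space for parity

# Usable capacity = per-device size * number of data-bearing devices.
usable_kib=$(( used_dev_size_kib * (raid_devices - parity_devices) ))

echo "usable: $usable_kib KiB"
```

The result, 4391334912 KiB, is exactly the Array Size that mdadm reports.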

Of course you can re-use /dev/sdk as a spare disk, but first you
should check why it failed (SMART data, for example).

You should also look at the drive models in use. E.g. if this array
uses 9x model XYZ from manufacturer ABC, more drives may fail in the
near future.
If the array uses mixed models, it should not be THAT urgent - but that
depends on the importance of the data...
I've read several times about people losing their RAID6 because they
did not mix hard drive models. In that case, a manufacturing fault can
have very bad consequences.

Best regards,
Yann


* Re: Do I understand my RAID6 correctly?
  2011-04-12  9:23 ` Yann Ormanns
@ 2011-04-12 10:42   ` Stefan G. Weichinger
  2011-04-12 17:17     ` Yann Ormanns
  0 siblings, 1 reply; 4+ messages in thread
From: Stefan G. Weichinger @ 2011-04-12 10:42 UTC (permalink / raw)
  To: linux-raid@vger.kernel.org; +Cc: Yann Ormanns

Am 12.04.2011 11:23, schrieb Yann Ormanns:

> You now have 9 devices in this array: 8 active members plus the failed
> disk, which is now listed as a faulty spare. Capacity-wise, 8 * 750 GB
> = 6 TB, minus 2 * 750 GB for parity, gives 4.5 TB. With all 8 members
> active, the array can "lose" two disks without losing data, as you
> already wrote.

Yep, fine.

> Of course you can re-use /dev/sdk as a spare disk, but first you
> should check why it failed (SMART data, for example).

I already exported the controller logs and will look through them for
SMART info. Unfortunately the controller does not allow the use of
smartmontools; I can only use the dedicated ICP Storage Manager.

> You should also look at the drive models in use. E.g. if this array
> uses 9x model XYZ from manufacturer ABC, more drives may fail in the
> near future.

uuuh

> If the array uses mixed models, it should not be THAT urgent - but that
> depends on the importance of the data...
> I've read several times about people losing their RAID6 because they
> did not mix hard drive models. In that case, a manufacturing fault can
> have very bad consequences.

Scary. Yes, the server uses the same model for all 9 devices.
From your domain I see that you seem to be located in Germany, so you
might know the manufacturer of the server: transtec.

I already opened a support ticket there, we still have a valid support
contract. Last time they sent a new drive, we'll see.

Thanks, Stefan


* Re: Do I understand my RAID6 correctly?
  2011-04-12 10:42   ` Stefan G. Weichinger
@ 2011-04-12 17:17     ` Yann Ormanns
  0 siblings, 0 replies; 4+ messages in thread
From: Yann Ormanns @ 2011-04-12 17:17 UTC (permalink / raw)
  To: lists; +Cc: linux-raid@vger.kernel.org

Subject: Re: Do I understand my RAID6 correctly?
From: Stefan G. Weichinger <lists@xunil.at>
To: Yann Ormanns <yann-ormanns@web.de>
Date: 2011-04-12 19:03 (+0200)
>
> I already exported the controller logs and will look through them for
> SMART info. Unfortunately the controller does not allow the use of
> smartmontools; I can only use the dedicated ICP Storage Manager.
> 

I recommend comparing the power-on hours and the serial numbers of the
disks. That way you can _perhaps_ predict which disk will have problems
next. Unfortunately, SMART data is not a reliable basis for predicting
hard disk failures.
For further information, you may want to take a look at this German
article:
http://www.heise.de/newsticker/meldung/Google-Studie-zur-Ausfallursache-von-Festplatten-147178.html
and/or at this document: http://labs.google.com/papers/disk_failures.pdf

> 
> Scary. Yes, the server uses the same model for all 9 devices.

Yeah, that's really scary - it shows that even with a RAID6 your data
is not absolutely safe. But I have to admit that this is "only" the
worst-case scenario - the chance that this situation occurs is really
small (but not zero).
I suppose your customer keeps his backups up to date even though he
uses a RAID6?

> From your domain I see that you seem to be located in Germany, so you
> might know the manufacturer of the server: transtec.
> 

No, I do not know this manufacturer - but I'm just a private user, so I
don't really have any experience with "real" servers :)

Best regards,
Yann
