* Do I understand my RAID6 correctly?
From: Stefan G. Weichinger @ 2011-04-12  8:39 UTC
To: linux-raid@vger.kernel.org

Today I had a drive fail in a customer's server. It was part of a RAID6
which seems to have rebuilt onto a spare drive now.

Right now it looks like:

# mdadm -D /dev/md3
/dev/md3:
        Version : 00.90.03
  Creation Time : Thu Dec 20 17:47:07 2007
     Raid Level : raid6
     Array Size : 4391334912 (4187.90 GiB 4496.73 GB)
  Used Dev Size : 731889152 (697.98 GiB 749.45 GB)
   Raid Devices : 8
  Total Devices : 9
Preferred Minor : 3
    Persistence : Superblock is persistent

    Update Time : Tue Apr 12 10:27:45 2011
          State : clean
 Active Devices : 8
Working Devices : 8
 Failed Devices : 1
  Spare Devices : 0

     Chunk Size : 64K

           UUID : e848b637:ca2bde73:9f92f3cc:128cdbad
         Events : 0.47127534

    Number   Major   Minor   RaidDevice State
       0       8       33        0      active sync   /dev/sdc1
       1       8      177        1      active sync   /dev/sdl1
       2       8       65        2      active sync   /dev/sde1
       3       8       81        3      active sync   /dev/sdf1
       4       8       97        4      active sync   /dev/sdg1
       5       8      113        5      active sync   /dev/sdh1
       6       8      129        6      active sync   /dev/sdi1
       7       8      145        7      active sync   /dev/sdj1

       8       8      161        -      faulty spare   /dev/sdk1

My question (just to be sure):

Do I understand it correctly that the system has replaced the failed
/dev/sdk1 with a former spare drive (dunno the device name now) and
that I now have a valid RAID6 device with 8 drives in it?

So out of the 8 drives, another 2 could fail now without losing data ...
correct?

I have to tell the customer what to do, and the degree of redundancy
available also determines how urgent it is to get a new drive into the
system.

I assume I would remove /dev/sdk1 from md3, swap the drive, fdisk it and
re-add sdk1 to md3 (it is already marked as failed, so the fail step
isn't necessary anymore).

It would then be the new spare drive ... ?

Thanks for refreshing my RAID-knowledge ;-)

Stefan
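
The replacement steps Stefan outlines could be sketched roughly as the
dry-run script below. It only prints the commands instead of running
them. The `run` helper and the variable names are illustrative
assumptions, not something from the thread, and the partitioning step is
left as a comment because it depends on the hardware.

```shell
#!/bin/sh
# Dry-run sketch of swapping the failed member of /dev/md3.
# Nothing is executed: run() only records and prints each command.
ARRAY=/dev/md3
FAILED=/dev/sdk1      # shown as "faulty spare" in the mdadm -D output
PLAN=""

run() {
    # Hypothetical helper: replace the body with "$@" to really execute.
    PLAN="${PLAN}$*
"
    echo "+ $*"
}

# 1. Drop the already-failed partition from the array (no --fail needed).
run mdadm "$ARRAY" --remove "$FAILED"

# 2. Physically swap the drive, then partition it (e.g. with fdisk) so
#    that the new sdk1 is at least as large as the other members.

# 3. Add the new partition. With 8 of 8 members active, it should be
#    kept as a spare rather than trigger a rebuild.
run mdadm "$ARRAY" --add "$FAILED"

# 4. Check the result.
run mdadm -D "$ARRAY"
```

After step 4 one would expect `Spare Devices : 1` and `Failed Devices : 0`
in the output, assuming the new drive is healthy.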
* Re: Do I understand my RAID6 correctly?
From: Yann Ormanns @ 2011-04-12  9:23 UTC
To: lists
Cc: linux-raid@vger.kernel.org

Subject: Do I understand my RAID6 correctly?
From: Stefan G. Weichinger <lists@xunil.at>
To: Yann Ormanns <yann-ormanns@web.de>
Date: 2011-04-12 11:11 (+0200)

> Array Size : 4391334912 (4187.90 GiB 4496.73 GB)
> Used Dev Size : 731889152 (697.98 GiB 749.45 GB)
> Raid Devices : 8
> Total Devices : 9
> Preferred Minor : 3
> Persistence : Superblock is persistent
>
> Active Devices : 8
> Working Devices : 8
> Failed Devices : 1

You now have 9 devices in this array: 8 active members (8 * 750 GB =
6 TB raw, minus 2 * 750 GB for parity = 4.5 TB usable) plus the failed
disk, which is now listed as a faulty spare.
That means that this array can "lose" two disks without losing data, as
you already wrote.
Of course you can re-use /dev/sdk as a spare disk, but first you should
check why it failed (SMART data, for example).
You should also have a look at the drive models in use. E.g. if this
array uses 9x model XYZ from manufacturer ABC, more drives may fail in
the near future.
If the array uses mixed models, it should not be THAT urgent - but that
depends on the importance of the data...
I've read several times of people losing their RAID6 because they did
not mix hard drive models. In that case, a manufacturing fault can have
very bad consequences.

Best regards,
Yann
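
Yann's capacity arithmetic can be checked against the exact figures in
the mdadm output. The snippet below is a minimal sketch; the variable
names are made up for illustration, and the sizes are the KiB values
mdadm reported.

```shell
#!/bin/sh
# Values taken from the mdadm -D output in the thread (sizes in KiB).
raid_devices=8
used_dev_size_kib=731889152
parity_devices=2      # RAID6 spends two members' worth of space on parity

# Usable space = (members - parity) * per-member size.
usable_kib=$(( (raid_devices - parity_devices) * used_dev_size_kib ))
# Convert KiB to decimal GB, rounded down.
usable_gb=$(( usable_kib * 1024 / 1000000000 ))

echo "usable: ${usable_kib} KiB (~${usable_gb} GB)"
echo "member failures tolerated: ${parity_devices}"
```

The computed 4391334912 KiB matches the `Array Size` line, which agrees
with the rough 4.5 TB estimate above.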
* Re: Do I understand my RAID6 correctly?
From: Stefan G. Weichinger @ 2011-04-12 10:42 UTC
To: linux-raid@vger.kernel.org
Cc: Yann Ormanns

On 12.04.2011 11:23, Yann Ormanns wrote:

> You now have 9 devices in this Array (750GB*8 = 6 TB, 6 TB - (2*750GB) =
> 4.5 TB). One of them is the failed spare disk. That means, that this
> array can "lose" two disks without losing data, as you already wrote.

Yep, fine.

> Of course you can re-use /dev/sdk as a spare disk, but before, you
> should check, why it failed (SMART data for example).

I already exported the controller logs and will look through them for
SMART info. Unfortunately the controller does not allow the use of
smartmontools; I can only use the vendor-specific ICP Storage Manager.

> You should also have a look at the used drive models. E.g. if this array
> uses 9x model XYZ from manufacturer ABC, perhaps more drives will fail in
> the next time.

uuuh

> If the array uses mixed models, it should not be THAT urgent - but that
> depends on the importance of the data...
> I've read several times of people losing their RAID6, because they did
> not mix the hard drive models. Then, a manufacturing fault can have very
> bad consequences.

Scary. Yes, the server uses the same model for all 9 devices.

From your domain I see that you seem to be located in Germany, so you
might know the manufacturer of the server: transtec.

I already opened a support ticket there; we still have a valid support
contract. Last time they sent a new drive, we'll see.

Thanks, Stefan
* Re: Do I understand my RAID6 correctly?
From: Yann Ormanns @ 2011-04-12 17:17 UTC
To: lists
Cc: linux-raid@vger.kernel.org

Subject: Re: Do I understand my RAID6 correctly?
From: Stefan G. Weichinger <lists@xunil.at>
To: Yann Ormanns <yann-ormanns@web.de>
Date: 2011-04-12 19:03 (+0200)

>
> I already exported the controller-logs and will look through for SMART
> info. Unfortunately the controller does not allow the use of
> smartmontools, I can only use the specific ICP Storage Manager.
>

I recommend comparing the power-on hours and the serial numbers of the
disks. That way you can _perhaps_ predict the next disk that will have
problems.
Unfortunately, SMART data is not a reliable basis for predicting hard
disk failures. For further information, you may want to take a look at
this German-language article:
http://www.heise.de/newsticker/meldung/Google-Studie-zur-Ausfallursache-von-Festplatten-147178.html
and / or at this document:
http://labs.google.com/papers/disk_failures.pdf

>
> Scary. Yes, the server uses the same model for all 9 devices.

Yeah, that's really scary - it shows that even with a RAID6 your data is
not absolutely safe. But I have to admit that this is "only" the
worst-case scenario - the chance of this situation occurring is really
small (but not zero).
I assume your customer keeps his backups up to date even though he uses
a RAID6?

> From your domain I see that you seem to be located in germany, so you
> might know the manufacturer of the server: transtec.
>

No, I do not know this manufacturer - but I'm just a private user, so I
don't really have any experience with "real" servers :)

Best regards,
Yann