* RAID6 reshape, 2 disk failures
From: Mathias Burén @ 2012-10-16 22:57 UTC
To: Linux-RAID

Hi list,

I started a reshape from 64K chunk size to 512K (now the default, IIRC).
During this time 2 disks failed, with some time in between. The first
one was removed by MD, so I shut down, removed the HDD, and continued
the reshape. After a while the second HDD failed. This is what it looks
like right now, with the second failed HDD still in, as you can see:

$ iostat -m
Linux 3.5.5-1-ck (ion)   10/16/2012   _x86_64_   (4 CPU)

avg-cpu:  %user   %nice %system %iowait  %steal   %idle
           8.93    7.81    5.40   15.57    0.00   62.28

Device:            tps    MB_read/s    MB_wrtn/s    MB_read    MB_wrtn
sda              38.93         0.00        13.09        939    8134936
sdb              59.37         5.19         2.60    3224158    1613418
sdf              59.37         5.19         2.60    3224136    1613418
sdc              59.37         5.19         2.60    3224134    1613418
sdd              59.37         5.19         2.60    3224151    1613418
sde              42.17         3.68         1.84    2289332    1145595
sdg              59.37         5.19         2.60    3224061    1613418
sdh               0.00         0.00         0.00          9          0
md0               0.06         0.00         0.00       2023          0
dm-0              0.06         0.00         0.00       2022          0

$ cat /proc/mdstat
Personalities : [raid6] [raid5] [raid4]
md0 : active raid6 sde1[0](F) sdg1[8] sdc1[5] sdd1[3] sdb1[4] sdf1[9]
      9751756800 blocks super 1.2 level 6, 64k chunk, algorithm 2 [7/5] [_UUUUU_]
      [================>....]  reshape = 84.6% (1650786304/1950351360) finish=2089.2min speed=2389K/sec

unused devices: <none>

$ sudo mdadm -D /dev/md0
[sudo] password for x:
/dev/md0:
        Version : 1.2
  Creation Time : Tue Oct 19 08:58:41 2010
     Raid Level : raid6
     Array Size : 9751756800 (9300.00 GiB 9985.80 GB)
  Used Dev Size : 1950351360 (1860.00 GiB 1997.16 GB)
   Raid Devices : 7
  Total Devices : 6
    Persistence : Superblock is persistent

    Update Time : Tue Oct 16 23:55:28 2012
          State : clean, degraded, reshaping
 Active Devices : 5
Working Devices : 5
 Failed Devices : 1
  Spare Devices : 0

         Layout : left-symmetric
     Chunk Size : 64K

 Reshape Status : 84% complete
  New Chunksize : 512K

           Name : ion:0  (local to host ion)
           UUID : e6595c64:b3ae90b3:f01133ac:3f402d20
         Events : 8386010

    Number   Major   Minor   RaidDevice State
       0       8       65        0      faulty spare rebuilding   /dev/sde1
       9       8       81        1      active sync   /dev/sdf1
       4       8       17        2      active sync   /dev/sdb1
       3       8       49        3      active sync   /dev/sdd1
       5       8       33        4      active sync   /dev/sdc1
       8       8       97        5      active sync   /dev/sdg1
       6       0        0        6      removed

What is confusing to me is that /dev/sde1 (which is failing) is
currently marked as rebuilding. But when I check iostat, it's far behind
the other drives in total I/O since the reshape started, and the I/O
hasn't actually changed for a few hours. This, together with _ instead
of U, leads me to believe that it's not actually being used. So why does
it say rebuilding?

I guess my question is whether it's possible for me to remove the drive,
or would I mess the array up? I am not going to do anything until the
reshape finishes, though.

Thanks,
Mathias
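A chunk-size reshape like the one described above is normally started
with mdadm --grow. The following is only a sketch -- the backup-file
path is an assumption, not the poster's actual command, and mdadm may or
may not insist on a backup file depending on version and layout:

  # reshape md0 to a 512K chunk, journalling stripes to a file outside the array
  $ sudo mdadm --grow /dev/md0 --chunk=512 --backup-file=/root/md0-reshape-backup

  # progress then shows up as a "reshape = x.x%" line
  $ cat /proc/mdstat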
* Re: RAID6 reshape, 2 disk failures
From: Stan Hoeppner @ 2012-10-17 2:29 UTC
To: Linux-RAID

On 10/16/2012 5:57 PM, Mathias Burén wrote:
> Hi list,
>
> I started a reshape from 64K chunk size to 512K (now the default, IIRC).
> During this time 2 disks failed, with some time in between. The first
> one was removed by MD, so I shut down, removed the HDD, and continued
> the reshape. After a while the second HDD failed. This is what it looks
> like right now, with the second failed HDD still in, as you can see:

Apparently you don't realize you're going through all of this for the
sake of a senseless change that will gain you nothing, and cost you
performance.

Large chunk sizes are murder for parity RAID due to the increased IO
bandwidth required during RMW cycles. The new 512KB default is way too
big. And with many random IO workloads even 64KB is a bit large. This
was discussed on this list in detail not long ago.

I guess one positive aspect is you've discovered problems with a couple
of drives. Better now than later I guess.

--
Stan

> [... iostat, /proc/mdstat and mdadm -D output from the original post
> quoted in full; snipped here ...]
>
> What is confusing to me is that /dev/sde1 (which is failing) is
> currently marked as rebuilding. But when I check iostat, it's far behind
> the other drives in total I/O since the reshape started, and the I/O
> hasn't actually changed for a few hours. This, together with _ instead
> of U, leads me to believe that it's not actually being used. So why does
> it say rebuilding?
>
> I guess my question is whether it's possible for me to remove the drive,
> or would I mess the array up? I am not going to do anything until the
> reshape finishes, though.
>
> Thanks,
> Mathias
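To put rough numbers on the RMW point (illustrative arithmetic, not from
the thread itself): with 7 drives in RAID6 there are 5 data chunks per
stripe, so the full stripe width grows from

  5 x  64 KB chunk =  320 KB per stripe (old layout)
  5 x 512 KB chunk = 2560 KB per stripe (new layout)

and any random write smaller than a full stripe forces md to read and
rewrite proportionally more data to recompute parity.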
* Re: RAID6 reshape, 2 disk failures
From: Chris Murphy @ 2012-10-17 3:06 UTC
To: linux-raid RAID

On Oct 16, 2012, at 4:57 PM, Mathias Burén wrote:

> I started a reshape from 64K chunk size to 512K

I agree with Stan, not a good idea, and also a waste of time. Do you
have check scrubs and extended offline smart tests scheduled for these
drives periodically?

> I guess my question is whether it's possible for me to remove the drive,
> or would I mess the array up? I am not going to do anything until the
> reshape finishes, though.

I think you should put in a replacement drive for sda (#6) and get it
rebuilding, as sde seems rather tenuous, before you decide to remove
sde.

You should find out why it's slow. 'smartctl -A /dev/sde' might reveal
this now, and you can issue it even while the reshape is occurring --
the command just polls the drive's existing SMART attribute values. If
it's the same model disk, connected the same way, as all the other
drives, I'd get rid of it.

Chris Murphy
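The periodic checks Chris asks about are typically driven through md's
sysfs interface plus smartctl; a minimal sketch (device names and
scheduling are illustrative, not taken from the poster's setup):

  # poll current SMART attributes; read-only and safe during a reshape
  $ sudo smartctl -A /dev/sde

  # start a long (extended) offline self-test on one member
  $ sudo smartctl -t long /dev/sde

  # kick off an md "check" scrub, then inspect the mismatch count when it finishes
  $ echo check | sudo tee /sys/block/md0/md/sync_action
  $ cat /sys/block/md0/md/mismatch_cnt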
* Re: RAID6 reshape, 2 disk failures
From: Mathias Burén @ 2012-10-17 8:03 UTC
To: Chris Murphy; Cc: linux-raid RAID

On 17 October 2012 04:06, Chris Murphy <lists@colorremedies.com> wrote:
>
> On Oct 16, 2012, at 4:57 PM, Mathias Burén wrote:
>
>> I started a reshape from 64K chunk size to 512K
>
> I agree with Stan, not a good idea, and also a waste of time. Do you
> have check scrubs and extended offline smart tests scheduled for these
> drives periodically?

Weekly scrubs and weekly offline self-tests. SMART always looked good,
until one drive died completely; the other has 5 uncorrectable sectors.
LCC is under 250K. They are WD20EARS drives.

There are basically no files under 8GB on the array, so I thought the
new chunk size made sense.

>> I guess my question is whether it's possible for me to remove the drive,
>> or would I mess the array up? I am not going to do anything until the
>> reshape finishes, though.
>
> I think you should put in a replacement drive for sda (#6) and get it
> rebuilding, as sde seems rather tenuous, before you decide to remove sde.
>
> You should find out why it's slow. 'smartctl -A /dev/sde' might reveal
> this now, and you can issue it even while the reshape is occurring --
> the command just polls the drive's existing SMART attribute values. If
> it's the same model disk, connected the same way, as all the other
> drives, I'd get rid of it.

It's slow because it's broken (see above).

Any idea why it says rebuilding, when it's not? Is it going to attempt
a rebuild after the reshape?

Regards,
Mathias
* Re: RAID6 reshape, 2 disk failures
From: Chris Murphy @ 2012-10-17 9:09 UTC
To: linux-raid RAID

On Oct 17, 2012, at 2:03 AM, Mathias Burén wrote:

> Weekly scrubs and weekly offline self-tests. SMART always looked good,
> until one drive died completely; the other has 5 uncorrectable sectors.
> LCC is under 250K. They are WD20EARS drives.

Color me confused. Uncorrectable sectors should produce a read error on
a check, which will cause the data to be reconstructed from parity and
written back to those sectors. A write to a bad sector, if persistent,
will cause it to be relocated. If that isn't possible, the disk is
toast: it can't reliably deal with bad sectors any more (out of reserve
sectors?). The smartmontools page has information on how to clear
uncorrectable sectors manually, but I'd think a check would do this.

> There are basically no files under 8GB on the array, so I thought the
> new chunk size made sense.

Yeah, it seems reasonable in that case. But unless it's benchmarked you
don't actually know if it matters.

> It's slow because it's broken (see above).
> Any idea why it says rebuilding, when it's not? Is it going to attempt
> a rebuild after the reshape?

Not sure. With two drives missing, you're in a very precarious
situation. I would not worry about this detail until you have the sda
(#6) replaced and rebuilt. Presumably the reshape must finish before the
rebuild will start, but I'm not sure of this.

What's dmesg reporting while all of this is going on?

Chris Murphy
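The manual clearing Chris refers to (the smartmontools bad-block
procedure) boils down to overwriting the affected LBA so the drive
reallocates it. This is a sketch only: the LBA is a placeholder to be
taken from the kernel log, and the write step destroys whatever is in
that sector, so it is only appropriate on a drive that is no longer an
active array member:

  # confirm the sector is actually unreadable (read-only)
  $ sudo hdparm --read-sector <LBA> /dev/sde

  # overwrite it so the drive can remap it; destructive for that sector
  $ sudo hdparm --write-sector <LBA> --yes-i-know-what-i-am-doing /dev/sde

  # see whether Current_Pending_Sector / Reallocated_Sector_Ct changed
  $ sudo smartctl -A /dev/sde | egrep 'Pending|Reallocated'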
[parent not found: <CADNH=7GaGCLdK2Rk_A6vPN+Th0z0QYT7mRV0KJH=CoAffuvb6w@mail.gmail.com>]
* Re: RAID6 reshape, 2 disk failures
From: Chris Murphy @ 2012-10-17 18:46 UTC
To: linux-raid RAID

On Oct 17, 2012, at 10:27 AM, Mathias Burén wrote:

> [419246.582409] end_request: I/O error, dev sde, sector 2343892776
> [419246.582492] md/raid:md0: read error not correctable (sector 2343890728 on sde1).
> [419246.582502] md/raid:md0: read error not correctable (sector 2343890736 on sde1).
> [419246.582511] md/raid:md0: read error not correctable (sector 2343890744 on sde1).
> [419246.582519] md/raid:md0: read error not correctable (sector 2343890752 on sde1).
> [419246.582527] md/raid:md0: read error not correctable (sector 2343890760 on sde1).
> [419246.582535] md/raid:md0: read error not correctable (sector 2343890768 on sde1).
> [419246.582543] md/raid:md0: read error not correctable (sector 2343890776 on sde1).
> [419246.582552] md/raid:md0: read error not correctable (sector 2343890784 on sde1).
> [419246.582560] md/raid:md0: read error not correctable (sector 2343890792 on sde1).
> [419246.582568] md/raid:md0: read error not correctable (sector ...
>
> You can see the first start of the reshape, then sde started freaking out.

A lot more than just 5 sectors. I'd replace the drive and the cable. If
it's under warranty, have it replaced. If not, maybe ATA secure erase
it, run an extended offline smart test, and use it down the road for
something not so important if it passes without further problems.

So basically, replace the dead sda drive ASAP so it can start
rebuilding.

I'd consider marking sde faulty so that it's neither being reshaped nor
rebuilt, and then replace it once the new sda is rebuilt. You can
probably replace them both at the same time and have them both
rebuilding; but I'm being a little conservative on how many changes you
make until you get yourself back to some level of redundancy.

Chris Murphy
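Marking the flaky member failed so it is left alone, as Chris suggests,
is a couple of mdadm --manage calls; a sketch, to be run only once the
reshape has completed:

  # explicitly fail and then detach the suspect member
  $ sudo mdadm --manage /dev/md0 --fail /dev/sde1
  $ sudo mdadm --manage /dev/md0 --remove /dev/sde1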
* Re: RAID6 reshape, 2 disk failures
From: Mathias Burén @ 2012-10-17 19:03 UTC
To: Chris Murphy; Cc: linux-raid RAID

On 17 October 2012 19:46, Chris Murphy <lists@colorremedies.com> wrote:
>
> On Oct 17, 2012, at 10:27 AM, Mathias Burén wrote:
>
>> [419246.582409] end_request: I/O error, dev sde, sector 2343892776
>> [... further "md/raid:md0: read error not correctable" lines snipped ...]
>>
>> You can see the first start of the reshape, then sde started freaking out.
>
> A lot more than just 5 sectors. I'd replace the drive and the cable. If
> it's under warranty, have it replaced. If not, maybe ATA secure erase
> it, run an extended offline smart test, and use it down the road for
> something not so important if it passes without further problems.

There are no CRC errors, so I doubt the cable is at fault. In any case,
I've RMA'd drives for less, and an RMA is underway for this drive. I
just need to wait for the reshape to finish so I can get into the
server. Btw, with a few holes drilled this bad boy holds 7 3.5" HDDs no
problem: http://www.antec.com/productPSU.php?id=30&pid=3

> So basically, replace the dead sda drive ASAP so it can start rebuilding.

Hm, where do you get sda from? sda is the OS disk, an old SSD. (It
currently holds the reshape backup file.)

> I'd consider marking sde faulty so that it's neither being reshaped nor
> rebuilt, and then replace it once the new sda is rebuilt. You can
> probably replace them both at the same time and have them both
> rebuilding; but I'm being a little conservative on how many changes you
> make until you get yourself back to some level of redundancy.

Mathias
* Re: RAID6 reshape, 2 disk failures
From: Chris Murphy @ 2012-10-17 19:35 UTC
To: linux-raid RAID

On Oct 17, 2012, at 1:03 PM, Mathias Burén wrote:

> Hm, where do you get sda from? sda is the OS disk, an old SSD. (It
> currently holds the reshape backup file.)

Oh. Misread. So it's sdh that's dead, and sde that's dying/pooping bad
sector bullets.

Chris Murphy
* Re: RAID6 reshape, 2 disk failures
From: Stan Hoeppner @ 2012-10-18 11:56 UTC
To: linux-raid RAID

On 10/17/2012 2:03 PM, Mathias Burén wrote:

> There are no CRC errors, so I doubt the cable is at fault. In any case,
> I've RMA'd drives for less, and an RMA is underway for this drive. I
> just need to wait for the reshape to finish so I can get into the
> server. Btw, with a few holes drilled this bad boy holds 7 3.5" HDDs no
> problem: http://www.antec.com/productPSU.php?id=30&pid=3

It would seem you didn't mod the airflow of the case along with the
increased drive count. The NSK1380 has really poor airflow to begin
with: a single PSU-mounted 120mm super-low-RPM fan. Antec is currently
shipping the NSK1380 with an additional PCI-slot centrifugal fan to help
overcome the limitations of the native design.

You bought crap drives, WD20EARS, then improperly modded a case to house
more than twice the design limit of HDDs.

I'd say you stacked the deck against yourself here, Mathias.

--
Stan
* Re: RAID6 reshape, 2 disk failures
From: Mathias Burén @ 2012-10-18 12:17 UTC
To: stan; Cc: linux-raid RAID

On 18 October 2012 12:56, Stan Hoeppner <stan@hardwarefreak.com> wrote:

> It would seem you didn't mod the airflow of the case along with the
> increased drive count. The NSK1380 has really poor airflow to begin
> with: a single PSU-mounted 120mm super-low-RPM fan. Antec is currently
> shipping the NSK1380 with an additional PCI-slot centrifugal fan to help
> overcome the limitations of the native design.
>
> You bought crap drives, WD20EARS, then improperly modded a case to house
> more than twice the design limit of HDDs.
>
> I'd say you stacked the deck against yourself here, Mathias.

Now now, the setup is working like a charm. Disk failures happen all the
time. There's an additional 120mm fan at the bottom, blowing up towards
the 7 HDDs. I bought "crap" drives because they were cheap.

In the 2 years a total of 3 drives have failed, but the array has never
failed. I'm very pleased with it (HTPC, with an ION board and a 4x SATA
PCI-E controller for E10).

Mathias
* Re: RAID6 reshape, 2 disk failures
From: Mathias Burén @ 2012-10-18 17:11 UTC
To: stan; Cc: linux-raid RAID

On 18 October 2012 13:17, Mathias Burén <mathias.buren@gmail.com> wrote:

> [... earlier quoted exchange snipped ...]
>
> In the 2 years a total of 3 drives have failed, but the array has never
> failed. I'm very pleased with it (HTPC, with an ION board and a 4x SATA
> PCI-E controller for E10).

Just to follow up, the reshape succeeded and I'll now shut down and RMA
/dev/sde. Thanks all for the answers.

[748891.476091] md: md0: reshape done.
[748891.505225] RAID conf printout:
[748891.505235]  --- level:6 rd:7 wd:5
[748891.505241]  disk 0, o:0, dev:sde1
[748891.505246]  disk 1, o:1, dev:sdf1
[748891.505251]  disk 2, o:1, dev:sdb1
[748891.505257]  disk 3, o:1, dev:sdd1
[748891.505263]  disk 4, o:1, dev:sdc1
[748891.505268]  disk 5, o:1, dev:sdg1
[748891.535219] RAID conf printout:
[748891.535229]  --- level:6 rd:7 wd:5
[748891.535236]  disk 0, o:0, dev:sde1
[748891.535242]  disk 1, o:1, dev:sdf1
[748891.535246]  disk 2, o:1, dev:sdb1
[748891.535251]  disk 3, o:1, dev:sdd1
[748891.535256]  disk 4, o:1, dev:sdc1
[748891.535261]  disk 5, o:1, dev:sdg1
[748891.548477] RAID conf printout:
[748891.548483]  --- level:6 rd:7 wd:5
[748891.548487]  disk 1, o:1, dev:sdf1
[748891.548491]  disk 2, o:1, dev:sdb1
[748891.548494]  disk 3, o:1, dev:sdd1
[748891.548498]  disk 4, o:1, dev:sdc1
[748891.548501]  disk 5, o:1, dev:sdg1

ion ~ $ cat /proc/mdstat
Personalities : [raid6] [raid5] [raid4]
md0 : active raid6 sde1[0](F) sdg1[8] sdc1[5] sdd1[3] sdb1[4] sdf1[9]
      9751756800 blocks super 1.2 level 6, 512k chunk, algorithm 2 [7/5] [_UUUUU_]

unused devices: <none>

ion ~ $ sudo mdadm -D /dev/md0
[sudo] password for:
/dev/md0:
        Version : 1.2
  Creation Time : Tue Oct 19 08:58:41 2010
     Raid Level : raid6
     Array Size : 9751756800 (9300.00 GiB 9985.80 GB)
  Used Dev Size : 1950351360 (1860.00 GiB 1997.16 GB)
   Raid Devices : 7
  Total Devices : 6
    Persistence : Superblock is persistent

    Update Time : Thu Oct 18 11:19:35 2012
          State : clean, degraded
 Active Devices : 5
Working Devices : 5
 Failed Devices : 1
  Spare Devices : 0

         Layout : left-symmetric
     Chunk Size : 512K

           Name : ion:0  (local to host ion)
           UUID : e6595c64:b3ae90b3:f01133ac:3f402d20
         Events : 8678539

    Number   Major   Minor   RaidDevice State
       0       0        0        0      removed
       9       8       81        1      active sync   /dev/sdf1
       4       8       17        2      active sync   /dev/sdb1
       3       8       49        3      active sync   /dev/sdd1
       5       8       33        4      active sync   /dev/sdc1
       8       8       97        5      active sync   /dev/sdg1
       6       0        0        6      removed

       0       8       65        -      faulty spare   /dev/sde1

ion ~ $ sudo mdadm --manage /dev/md0 --remove /dev/sde1
mdadm: hot removed /dev/sde1 from /dev/md0

ion ~ $ sudo mdadm -D /dev/md0
/dev/md0:
        Version : 1.2
  Creation Time : Tue Oct 19 08:58:41 2010
     Raid Level : raid6
     Array Size : 9751756800 (9300.00 GiB 9985.80 GB)
  Used Dev Size : 1950351360 (1860.00 GiB 1997.16 GB)
   Raid Devices : 7
  Total Devices : 5
    Persistence : Superblock is persistent

    Update Time : Thu Oct 18 18:09:54 2012
          State : clean, degraded
 Active Devices : 5
Working Devices : 5
 Failed Devices : 0
  Spare Devices : 0

         Layout : left-symmetric
     Chunk Size : 512K

           Name : ion:0  (local to host ion)
           UUID : e6595c64:b3ae90b3:f01133ac:3f402d20
         Events : 8678542

    Number   Major   Minor   RaidDevice State
       0       0        0        0      removed
       9       8       81        1      active sync   /dev/sdf1
       4       8       17        2      active sync   /dev/sdb1
       3       8       49        3      active sync   /dev/sdd1
       5       8       33        4      active sync   /dev/sdc1
       8       8       97        5      active sync   /dev/sdg1
       6       0        0        6      removed

ion ~ $
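When the RMA replacements arrive, re-populating the two empty slots is
one --add per disk; a sketch assuming the new drives show up as /dev/sdX
and /dev/sdY and have been partitioned to match a surviving member:

  $ sudo mdadm --manage /dev/md0 --add /dev/sdX1
  $ sudo mdadm --manage /dev/md0 --add /dev/sdY1

  # md starts recovery onto the new members automatically
  $ watch -n 10 cat /proc/mdstat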
* Re: RAID6 reshape, 2 disk failures
From: Chris Murphy @ 2012-10-18 19:54 UTC
To: linux-raid RAID

On Oct 18, 2012, at 11:11 AM, Mathias Burén wrote:

> Just to follow up, the reshape succeeded and I'll now shut down and RMA
> /dev/sde. Thanks all for the answers.

Yeah, but two days later and you still are critically degraded, without
either failed disk replaced and rebuilding. You're one tiny problem away
from that whole array collapsing and you're worried about this one fussy
disk? I don't understand your delay in immediately getting a replacement
drive in this array unless you really don't care about the data at all,
in which case why have a RAID6?

Sure, what are the odds of a 3rd drive dying… *shrug* Seems like an
unwise risk, tempting fate like this.

Chris Murphy
* Re: RAID6 reshape, 2 disk failures
From: Mathias Burén @ 2012-10-18 20:17 UTC
To: Chris Murphy; Cc: linux-raid RAID

On 18 October 2012 20:54, Chris Murphy <lists@colorremedies.com> wrote:
>
> Yeah, but two days later and you still are critically degraded, without
> either failed disk replaced and rebuilding. You're one tiny problem away
> from that whole array collapsing and you're worried about this one fussy
> disk? I don't understand your delay in immediately getting a replacement
> drive in this array unless you really don't care about the data at all,
> in which case why have a RAID6?
>
> Sure, what are the odds of a 3rd drive dying… *shrug* Seems like an
> unwise risk, tempting fate like this.

There are no more dying drives in the array; 2 out of 7 died, and they
are going out for RMA soon (when I can get to the post office). I did
see 5 pending sectors on one HDD after the reshape finished, though.

I don't care much about the data (it's not critical); RAID6 is just so I
can have one large volume, some speed increase and a bit of redundancy.
If I had them all as single volumes I'd have to use mhddfs or something
to make it look like one logical volume. Or even use some kind of LVM,
perhaps.

Mathias
* Re: RAID6 reshape, 2 disk failures
From: Stan Hoeppner @ 2012-10-18 20:58 UTC
To: Mathias Burén; Cc: Linux RAID

Lack of a List-Post header got me again... sorry for the dup, Mathias.

On 10/18/2012 3:17 PM, Mathias Burén wrote:

> If I had them all as single volumes I'd have to use mhddfs or something
> to make it look like one logical volume. Or even use some kind of LVM,
> perhaps.

Or md/RAID --linear. Given the list you're posting to, I'm surprised you
forgot this option. But running without redundancy of any kind will
cause more trouble than you currently have.

Last words of advice:

In the future, spend a little more per drive and get units that will
live longer. Also, mounting a 120mm fan in the bottom of that Antec cube
chassis blowing "up" on the drives simply circulates the hot air already
inside the chassis. It does not increase the CFM of cool air intake nor
the exhaust of hot air. So T_case is pretty much the same as before you
put the 2nd 120mm fan in there. And T_case is the temp that determines
drive life.

"Silent chassis" and RAID are mutually exclusive. You'll rarely, if
ever, properly cool multiple HDDs, of any persuasion, in a "silent"
chassis. To make it silent the fans must turn at very low RPM, thus
yielding very low CFM, thus yielding high device temps.

--
Stan
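For the single-big-volume-without-parity option Stan mentions, an md
linear array is about the simplest form; a sketch with placeholder
devices (note it carries no redundancy at all -- one dead disk takes the
filesystem on top with it):

  $ sudo mdadm --create /dev/md1 --level=linear --raid-devices=7 /dev/sd[b-h]1
  $ cat /proc/mdstat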
* Offtopic: on case (was: R: RAID6 reshape, 2 disk failures)
From: Carabetta Giulio @ 2012-10-19 14:32 UTC
To: stan@hardwarefreak.com, Mathias Burén; Cc: Linux RAID

Sorry for the OT, but...

> "Silent chassis" and RAID are mutually exclusive. You'll rarely, if
> ever, properly cool multiple HDDs, of any persuasion, in a "silent"
> chassis. To make it silent the fans must turn at very low RPM, thus
> yielding very low CFM, thus yielding high device temps.

You are right, I know that very well... Also, I'm looking for a
compromise between temperature and noise: what do you think about this
case?

http://www.lian-li.com/v2/en/product/product06.php?pr_index=480&cl_index=1&sc_index=26&ss_index=67&g=f

Giulio Carabetta
* Re: Offtopic: on case
From: Stan Hoeppner @ 2012-10-19 16:44 UTC
To: Carabetta Giulio; Cc: Mathias Burén, Linux RAID

On 10/19/2012 9:32 AM, Carabetta Giulio wrote:

> Sorry for the OT, but...
...
> You are right, I know that very well... Also, I'm looking for a
> compromise between temperature and noise: what do you think about this
> case?
> http://www.lian-li.com/v2/en/product/product06.php?pr_index=480&cl_index=1&sc_index=26&ss_index=67&g=f

It should keep six 3.5" SATA drives within normal operating temp range,
even though the hole-punched cage frame is inefficient, along with the
lateral vs longitudinal orientation. Going lateral saved them 2" on case
depth, which is critical to their aesthetics.

They could have eliminated the side and bottom intake grilles and the
top exhaust fan by rotating the PSU 180 degrees, reversing its fan, and
adding a small director vane on the back of the case. This would
decrease total noise by 3-5 dB without impacting cooling capacity.

There are two reasons I've never been big on Lian Li cases:

1. You pay a 3-5x premium for aesthetics and the name
2. Airflow is an engineering afterthought -- aesthetics comes first

Point 2 is interesting regarding this case. I'm surprised to see a huge
blue glowing front intake grille on a Lian Li. They've heretofore always
been about the Apple clean-lines look: brushed aluminum with as few
interruptions as possible, which is why they had typically located media
bays and device connectors on the sides, not the front.

--
Stan
* Re: RAID6 reshape, 2 disk failures
From: Chris Murphy @ 2012-10-18 21:28 UTC
To: linux-raid RAID

On Oct 18, 2012, at 2:17 PM, Mathias Burén wrote:

> There are no more dying drives in the array; 2 out of 7 died, and they
> are going out for RMA soon (when I can get to the post office).

Right, but 2 of those 7, you're saying, gave you no warning they were
about to die, which means you have 5 of 7 which could easily do the same
thing any moment now. For having such cheap drives you'd think you could
have at least one on standby, hot spare or not.

You realize that WDC is within their rights, if they knew you were using
RAID6 with these drives, to refuse the RMA? These are not raid5/6
drives. They're not 24x7-use drives.

Chris Murphy
* Re: RAID6 reshape, 2 disk failures
From: NeilBrown @ 2012-10-21 22:31 UTC
To: Mathias Burén; Cc: Chris Murphy, linux-raid RAID

On Wed, 17 Oct 2012 09:03:11 +0100 Mathias Burén <mathias.buren@gmail.com> wrote:

> Any idea why it says rebuilding, when it's not? Is it going to attempt
> a rebuild after the reshape?

Minor bug in mdadm. The device is clearly faulty:

    Number   Major   Minor   RaidDevice State
       0       8       65        0      faulty spare rebuilding   /dev/sde1

so mdadm should never suggest that it is also spare and rebuilding. I'll
fix that.

What is a little odd is that 'RaidDevice' is still '0'. Normally when a
device fails the RaidDevice gets set to '-1'. This doesn't happen
immediately though - md waits until all pending requests have completed,
then disassociates from the device and sets raid_disk to -1.

So you seem to have caught it before the device was fully quiesced ....
or some bug has slipped in and devices that get errors don't fully
quiesce any more....

NeilBrown
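If the same ambiguity comes up again, the kernel's own view of a member
can be read straight from sysfs, independent of how mdadm -D renders it;
a sketch (dev-sde1 is just this array's example name):

  # per-device state flags as md sees them, e.g. "faulty", "in_sync", "spare"
  $ cat /sys/block/md0/md/dev-sde1/state

  # the raid_disk slot; reads "none" once md has fully disassociated the device
  $ cat /sys/block/md0/md/dev-sde1/slot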