* RAID problems

From: Roy Sigurd Karlsbakk @ 2002-07-25 11:54
To: Kernel mailing list

hi all

What is there to do when the following happens:

A 16-drive RAID fails, giving me an error message saying that 4 drives
have gone dead. In fact only one has.

How can I hack the superblock on the remaining disks to bring them "up",
so the kernel can start using the spare?

roy
-- 
Roy Sigurd Karlsbakk, Datavaktmester

Computers are like air conditioners. They stop working when you open Windows.
* Re: RAID problems

From: Neil Brown @ 2002-07-25 12:03
To: Roy Sigurd Karlsbakk; +Cc: Kernel mailing list

On Thursday July 25, roy@karlsbakk.net wrote:
> hi all
>
> What is there to do when the following happens:
>
> A 16-drive RAID fails, giving me an error message saying that 4 drives
> have gone dead. In fact only one has.
>
> How can I hack the superblock on the remaining disks to bring them
> "up", so the kernel can start using the spare?

Get mdadm:

  http://www.cse.unsw.edu.au/~neilb/source/mdadm/

Read the man page, particularly the section on --assemble with --force,
and use mdadm to re-assemble the array.

NeilBrown
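For Roy's concrete situation, Neil's advice would look roughly like the
following. This is a sketch only: the device names are hypothetical, and
you should check your own mdadm man page before running anything like it
on a broken array.

```shell
# Hypothetical device names; substitute the real members of your array.
# First inspect what each superblock recorded (event counts, device state):
mdadm --examine /dev/sd[a-p]1

# Forced assembly: mdadm picks the freshest superblocks and clears the
# spurious "failed" marks, rather than trusting the stale on-disk state:
mdadm --assemble --force /dev/md0 /dev/sd[a-p]1

# Then add back (a replacement for) the one disk that genuinely died,
# so the rebuild onto the spare can proceed:
mdadm /dev/md0 --add /dev/sdq1
```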
* Re: RAID problems

From: Jakob Oestergaard @ 2002-07-25 12:11
To: Roy Sigurd Karlsbakk; +Cc: Kernel mailing list

On Thu, Jul 25, 2002 at 01:54:04PM +0200, Roy Sigurd Karlsbakk wrote:
> hi all
>
> What is there to do when the following happens:
>
> A 16-drive RAID fails, giving me an error message saying that 4 drives
> have gone dead. In fact only one has.
>
> How can I hack the superblock on the remaining disks to bring them
> "up", so the kernel can start using the spare?

Just like the last time you asked this question on linux-kernel, the
answer is in the Software RAID HOWTO, section 6.1, and it is still
available at

  http://unthought.net/Software-RAID.HOWTO/Software-RAID.HOWTO-6.html#ss6.1

Nothing has changed :)

If you feel that the answer there is inadequate, please let me know.

-- 
................................................................
:   jakob@unthought.net   : And I see the elder races,         :
:.........................: putrid forms of man                :
:   Jakob Østergaard      : See him rise and claim the earth,  :
:        OZ9ABN           : his downfall is at hand.           :
:.........................:............{Konkhra}...............:
* Re: RAID problems

From: Bill Davidsen @ 2002-07-30 17:25
To: Jakob Oestergaard; +Cc: Roy Sigurd Karlsbakk, Kernel mailing list

On Thu, 25 Jul 2002, Jakob Oestergaard wrote:
> Just like the last time you asked this question on linux-kernel, the
> answer is in the Software RAID HOWTO, section 6.1, and it is still
> available at
>
>   http://unthought.net/Software-RAID.HOWTO/Software-RAID.HOWTO-6.html#ss6.1
>
> Nothing has changed :)
>
> If you feel that the answer there is inadequate, please let me know.

I think he did, by asking for help again. You might well have pointed
him at a newsgroup or mailing list (there was one for RAID) so he could
get some interactive support.

Why does no one seem surprised that one drive failed and the system
marked four bad instead of using the spare in the first place? That's a
more interesting question.

-- 
bill davidsen <davidsen@tmr.com>
  CTO, TMR Associates, Inc
Doing interesting things with little computers since 1979.
* Re: RAID problems

From: Jakob Oestergaard @ 2002-07-30 17:55
To: Bill Davidsen; +Cc: Kernel mailing list

On Tue, Jul 30, 2002 at 01:25:37PM -0400, Bill Davidsen wrote:
> I think he did, by asking for help again. You might well have pointed
> him at a newsgroup or mailing list (there was one for RAID) so he
> could get some interactive support.

linux-raid@vger.kernel.org

Roy knows this.

> Why does no one seem surprised that one drive failed and the system
> marked four bad instead of using the spare in the first place? That's
> a more interesting question.

As stated in the above URL (no shit, really!), this can happen for
example if some device hangs the SCSI bus.

Did *anyone* read that section?!? ;)

If someone feels the explanation there could be better, please just send
the better explanation to me and it will get in. Really, this section is
one of the few sections that I did *not* update in the HOWTO, because I
really felt that it was still both adequate (since no one has demanded
elaboration) and correct.

-- 
................................................................
:   jakob@unthought.net   : And I see the elder races,         :
:.........................: putrid forms of man                :
:   Jakob Østergaard      : See him rise and claim the earth,  :
:        OZ9ABN           : his downfall is at hand.           :
:.........................:............{Konkhra}...............:
* Re: RAID problems

From: Bill Davidsen @ 2002-07-31 2:46
To: Jakob Oestergaard; +Cc: Kernel mailing list

On Tue, 30 Jul 2002, Jakob Oestergaard wrote:
> On Tue, Jul 30, 2002 at 01:25:37PM -0400, Bill Davidsen wrote:
> > Why does no one seem surprised that one drive failed and the system
> > marked four bad instead of using the spare in the first place?
> > That's a more interesting question.
>
> As stated in the above URL (no shit, really!), this can happen for
> example if some device hangs the SCSI bus.

I think you misread my comment; it was not "why doesn't the
documentation say this" but rather "why does software RAID have this
problem?" I know this can happen in theory, but it seems that the docs
imply that this isn't a surprise in practice. I've been running systems
with SCSI RAID and hardware controllers since 1989 (or maybe 1990); I've
got EMC and HDS boxes, aaraid and ipc controllers, Promise and CMC(??)
on system boards, and a number of systems running software RAID. And of
all the drive failures I've had, exactly one had more than one drive
fail, and that was in a power {something} issue which took out multiple
drives and system boards on many systems.

I'm just surprised that the software RAID doesn't have better luck with
this. I don't see any magic other than maybe a bus reset the firmware
would be doing, and I'm wondering why this seems to be common with
Linux. Or am I misreading the frequency with which it happens?

> Did *anyone* read that section?!? ;)
>
> If someone feels the explanation there could be better, please just
> send the better explanation to me and it will get in. Really, this
> section is one of the few sections that I did *not* update in the
> HOWTO, because I really felt that it was still both adequate (since
> no one has demanded elaboration) and correct.

The words are clear; I'm surprised at the behaviour. Yes, I know that's
not your thing.

-- 
bill davidsen <davidsen@tmr.com>
  CTO, TMR Associates, Inc
Doing interesting things with little computers since 1979.
* Re: RAID problems

From: jeff millar @ 2002-07-31 3:18
To: Bill Davidsen, Jakob Oestergaard; +Cc: Kernel mailing list

----- Original Message -----
From: "Bill Davidsen" <davidsen@tmr.com>
To: "Jakob Oestergaard" <jakob@unthought.net>
Cc: "Kernel mailing list" <linux-kernel@vger.kernel.org>
Sent: Tuesday, July 30, 2002 10:46 PM
Subject: Re: RAID problems

> I'm just surprised that the software RAID doesn't have better luck
> with this. I don't see any magic other than maybe a bus reset the
> firmware would be doing, and I'm wondering why this seems to be common
> with Linux. Or am I misreading the frequency with which it happens?
> [...]

In the 3 weeks since installing Linux software raid-5 (3 x 80 GB), I
have had to reinitialize the raid devices twice (mkraid --force): once
due to an abrupt shutdown, and once due to some weird ATA/ATAPI/drive
problem that caused a disk to begin "clicking" spasmodically and left
the raid array all out of whack.

Linux software raid seems very fragile and very scary to recover. I feel
a much stronger need for backup with raid than without it.

Raid needs an automatic way to maintain device synchronization. Why
should I have to...

  manually examine the device data (lsraid)
  find two devices that match
  mark the others failed in /etc/raidtab
  reinitialize the raid devices... putting all data at risk
  hot add the "failed" device
  wait for it to recover (hours)
  change /etc/raidtab again
  retest everything

This is 10 times worse than e2fsck, and much more error prone. The file
system gurus worked hard on journalling to minimize this kind of risk.

jeff
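For reference, the "mark the others failed in /etc/raidtab" step jeff
describes refers to the raidtools configuration file. A minimal sketch
of such a file (device names and sizes hypothetical, matching jeff's
3-disk RAID-5) looks like this, using the `failed-disk` directive from
the Software RAID HOWTO:

```
# /etc/raidtab -- hypothetical 3-disk RAID-5.  Marking a member with
# "failed-disk" tells mkraid to leave it out of initialization, so its
# possibly stale contents are not folded into the rebuilt array.
raiddev /dev/md0
    raid-level              5
    nr-raid-disks           3
    persistent-superblock   1
    chunk-size              32
    device                  /dev/hda1
    raid-disk               0
    device                  /dev/hdc1
    raid-disk               1
    device                  /dev/hde1
    failed-disk             2
```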
* Re: RAID problems

From: Neil Brown @ 2002-07-31 3:32
To: jeff millar; +Cc: Bill Davidsen, Jakob Oestergaard, Kernel mailing list

On Tuesday July 30, wa1hco@adelphia.net wrote:
> Raid needs an automatic way to maintain device synchronization. Why
> should I have to...
>   manually examine the device data (lsraid)
>   find two devices that match
>   mark the others failed in /etc/raidtab
>   reinitialize the raid devices... putting all data at risk
>   hot add the "failed" device
>   wait for it to recover (hours)
>   change /etc/raidtab again
>   retest everything
>
> This is 10 times worse than e2fsck, and much more error prone. The
> file system gurus worked hard on journalling to minimize this kind of
> risk.

Part of the answer is to use mdadm:

  http://www.cse.unsw.edu.au/~neilb/source/mdadm/

  mdadm --assemble --force ...

will do a lot of that for you.

Another part of the answer is that raid5 should never mark two drives as
failed. There really isn't any point. If they are both really failed,
you've lost your data anyway. If it is really a cable failure, then it
should be easier to get back to where you started from. I hope to have
raid5 working better in this respect in 2.6.

A final part of the answer is that even perfect raid software cannot
make up for buggy drivers, shoddy hard drives, or flaky cabling.

NeilBrown
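The arithmetic behind Neil's point can be illustrated with a toy model
(this is a sketch in Python, not the md driver's code): with one member
of a RAID-5 stripe missing, its contents are exactly the XOR of the
survivors plus parity, but with two members gone there is one equation
and two unknowns, so nothing is recoverable.

```python
from functools import reduce

def xor_blocks(blocks):
    """XOR equal-length byte strings together, column by column."""
    return bytes(reduce(lambda a, b: a ^ b, col) for col in zip(*blocks))

# A toy 4-disk RAID-5 stripe: three data blocks plus one parity block.
data = [b"AAAA", b"BBBB", b"CCCC"]
parity = xor_blocks(data)            # parity = D0 ^ D1 ^ D2

# One disk "fails": its block is the XOR of the survivors and parity.
rebuilt = xor_blocks([data[0], data[2], parity])
assert rebuilt == data[1]            # single failure: fully recoverable

# With two disks gone, survivors give only D1 ^ D2 = D0 ^ parity --
# the XOR of the missing pair, not either block individually, which is
# why marking a second drive failed loses the array.
```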
* Re: RAID problems

From: Bill Davidsen @ 2002-07-31 14:54
To: Neil Brown; +Cc: jeff millar, Jakob Oestergaard, Kernel mailing list

On Wed, 31 Jul 2002, Neil Brown wrote:
> Part of the answer is to use mdadm:
>
>   http://www.cse.unsw.edu.au/~neilb/source/mdadm/
>
>   mdadm --assemble --force ...
>
> will do a lot of that for you.

Got it! Thank you, I'll give this a cautious try as soon as I have a
system I can boink, perhaps this weekend, definitely if it rains and I
don't have to do things outside ;-)

> Another part of the answer is that raid5 should never mark two drives
> as failed. There really isn't any point.
> If they are both really failed, you've lost your data anyway.
> If it is really a cable failure, then it should be easier to get back
> to where you started from.
> I hope to have raid5 working better in this respect in 2.6.

Do the drivers currently do a bus reset after a device fail? That is,
after the last attempt to access the device before returning an error?
If the retries cause the failed device to hang the bus again, the final
reset might restore function on the other drives. Clearly that's not a
cure-all, and I think it would have to be at the driver level. I know
some drivers do try a reset, because they log it, but I don't remember
if they do a final reset when they give up.

> A final part of the answer is that even perfect raid software cannot
> make up for buggy drivers, shoddy hard drives, or flaky cabling.

No question there, but not always shoddy... I got some drives from a
major vendor I'd not name without asking a lawyer, and until I got a
firmware upgrade two weeks later they were total crap. It would seem
that Linux does something which Win/NT doesn't in terms of command
sequences. I'm guessing an error in the state table stuff, but {vendor}
wouldn't say.

-- 
bill davidsen <davidsen@tmr.com>
  CTO, TMR Associates, Inc
Doing interesting things with little computers since 1979.
* Re: RAID problems

From: Eyal Lebedinsky @ 2002-07-31 13:35
To: Kernel mailing list

jeff millar wrote: [trimmed]
> Raid needs an automatic way to maintain device synchronization. Why
> should I have to...
>   manually examine the device data (lsraid)
>   find two devices that match
>   mark the others failed in /etc/raidtab
>   reinitialize the raid devices... putting all data at risk
>   hot add the "failed" device
>   wait for it to recover (hours)

There is no need to wait here; go ahead and remount it now if you need
it.

>   change /etc/raidtab again
>   retest everything

-- 
Eyal Lebedinsky (eyal@eyal.emu.id.au) <http://samba.org/eyal/>
* Re: RAID problems

From: Bill Davidsen @ 2002-07-31 18:21
To: jeff millar; +Cc: Jakob Oestergaard, Kernel mailing list

On Tue, 30 Jul 2002, jeff millar wrote:
> In the 3 weeks since installing Linux software raid-5 (3 x 80 GB), I
> have had to reinitialize the raid devices twice (mkraid --force): once
> due to an abrupt shutdown, and once due to some weird ATA/ATAPI/drive
> problem that caused a disk to begin "clicking" spasmodically and left
> the raid array all out of whack.
>
> Linux software raid seems very fragile and very scary to recover. I
> feel a much stronger need for backup with raid than without it.

I'm happy to say my experience has been better. Back when swraid was a
patch I built a kernel:

  -rw-r--r--   1 root       763429 Dec 14  1999 k2.2.13s3r

and set up a four-drive RAID-0+1. It has recovered from every problem
with nothing more than a hot add. You certainly have had bad luck, but I
don't think it's typical. The situation when a drive fails and the
system stays up seems to be pretty good; back in the days of 340MB IDE
drives I tested it more than I wanted ;-)

As you note, recovery if the system goes down is somewhat painful and
manual.

-- 
bill davidsen <davidsen@tmr.com>
  CTO, TMR Associates, Inc
Doing interesting things with little computers since 1979.
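The "nothing more than a hot add" recovery Bill describes was, with the
raidtools of the day, roughly the following. Device names here are
hypothetical; check raidtools documentation before trying this, and note
that mdadm users would use `mdadm /dev/md0 --fail/--remove/--add`
instead.

```shell
# If the kernel has not already kicked the flaky member, remove it:
raidhotremove /dev/md0 /dev/hdc1

# Hot-add it (or its replacement) back; md resyncs in the background:
raidhotadd /dev/md0 /dev/hdc1

# Watch the rebuild progress while the array stays usable:
cat /proc/mdstat
```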
* Re: RAID problems

From: Jakob Oestergaard @ 2002-07-31 3:42
To: Bill Davidsen; +Cc: Kernel mailing list, linux-raid

On Tue, Jul 30, 2002 at 10:46:55PM -0400, Bill Davidsen wrote:
> I think you misread my comment; it was not "why doesn't the
> documentation say this" but rather "why does software RAID have this
> problem?"

Ok, my bad. Btw., I CC'ed linux-raid, as this is getting a little OT for
linux-kernel. Let's let the thread migrate to linux-raid instead...

> I know this can happen in theory, but it seems that the docs imply
> that this isn't a surprise in practice. I've been running systems with
> SCSI RAID and hardware controllers since 1989 [...]

I would say it's not a surprise as such, but it's something that really
should be a very, very rare occurrence. I've seen it maybe once on a
production system, having run MD on quite a few computers for the past
half decade. And I've had a few handfuls of people asking me about it
over the years.

> I'm just surprised that the software RAID doesn't have better luck
> with this [...] and I'm wondering why this seems to be common with
> Linux.

I don't have the impression that it is common on stable hardware. Can
anyone who runs SW RAID on a number (greater than 1) of machines comment
on this?

However, some people run their RAID-5 arrays on the same SCSI busses as
their Zip drives, their scanners, and five other el-cheapo almost-scsi
devices, and that is just *bound* to cause this kind of mess when one of
the devices decides to lock up the bus. You don't see this with HW raid
because you don't put your $15 almost-scsi magic-foo device on your $2k
HW RAID controller.

There might be other simple reasons why some HW cards don't show this
behaviour: they might simply maintain their superblocks differently from
Linux SW RAID. I have *no* idea how current controllers do this.

> Or am I misreading the frequency with which it happens?

I hope so ;) At least in the "stable hardware" situation. Comments,
please...

> The words are clear; I'm surprised at the behaviour. Yes, I know
> that's not your thing.

I *will* be surprised if it turns out that this is really a common
occurrence for people. :)

-- 
................................................................
:   jakob@unthought.net   : And I see the elder races,         :
:.........................: putrid forms of man                :
:   Jakob Østergaard      : See him rise and claim the earth,  :
:        OZ9ABN           : his downfall is at hand.           :
:.........................:............{Konkhra}...............: