* 5 drives lost in an inactive 15 drive raid 6 system due to cable problem - how to recover?
From: Norman White @ 2010-09-08 17:22 UTC
To: linux-raid

We have a 15-drive Addonics array with three 5-port SATA port multipliers.
One of the SAS cables was knocked out of one of the port multipliers, and
mdadm now sees 9 drives, a spare, and 5 failed, removed drives (after the
cabling problem was fixed).

An mdadm -E on each of the drives shows the 5 drives that were uncabled
still reporting the original configuration of 14 drives plus a spare, while
the other 10 drives report 9 drives, a spare, and 5 failed, removed drives.

We are very confident that there was no I/O going on at the time, but we
are not sure how to proceed.

One obvious thing to do is simply:

   mdadm --assemble --force --assume-clean /dev/md0 sd[b,c, ... , p]

but we are getting different advice about what --force will do in this
situation. The last thing we want to do is wipe the array.

Another option would be to fiddle with the superblocks with mddump so that
they all see the same 15 drives in the same configuration, and then
assemble the array.

Yet another suggestion was to recreate the array configuration and hope
that the data wouldn't be touched.

And still another suggestion is to create the array with one drive missing,
so that it is degraded and won't rebuild.

Any pointers on how to proceed would be helpful. Restoring 30TB takes a
long time.

Best,
Norman White
* Re: 5 drives lost in an inactive 15 drive raid 6 system due to cable problem - how to recover?
From: Stan Hoeppner @ 2010-09-08 18:47 UTC
To: linux-raid

Norman White put forth on 9/8/2010 12:22 PM:

> We have a 15-drive Addonics array with three 5-port SATA port multipliers.
> One of the SAS cables was knocked out of one of the port multipliers...

Exactly how/why was the SAS cable "knocked out"?  Is this racked gear?  Fix
that problem before worrying about the array state, or you may find yourself
here again in the future.

As for bringing the array back up with guesswork versus starting a restore
now: if you blow the array trying to fix it with mdadm hacks, you can always
still run the restore later.  Just make sure you do a lot of verification
after you bring the array back up and it _seems_ fine.  Don't trust surface
appearances, especially when talking about 30TB of data.

Just my $0.02.

--
Stan
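[One way to act on the verification advice above, once the array is
assembled again, is md's built-in consistency check. A minimal sketch,
assuming the array comes back as /dev/md0 (the sysfs paths are the standard
md interface; the array name is taken from the thread):

    # Kick off a read-only parity/consistency check of the whole array.
    echo check > /sys/block/md0/md/sync_action

    # Watch progress; the check shows up like a resync in /proc/mdstat.
    cat /proc/mdstat

    # When it finishes, mismatch_cnt reports how many sectors disagreed;
    # anything other than 0 deserves a closer look before trusting the data.
    cat /sys/block/md0/md/mismatch_cnt

File-level verification against backups is still the stronger test, but a
clean check is a reasonable first hurdle.]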
* Re: 5 drives lost in an inactive 15 drive raid 6 system due to cable problem - how to recover?
From: Mikael Abrahamsson @ 2010-09-08 20:22 UTC
To: Norman White; +Cc: linux-raid

On Wed, 8 Sep 2010, Norman White wrote:

> Any pointers on how to proceed would be helpful. Restoring 30TB takes a
> long time.

Look into the archives of this list; this has happened to numerous people
and most have gotten out of it OK.

Make sure you have a recent kernel and the latest mdadm. I have several
times successfully used --assemble --force in your situation, with good
results. My recommendation is that --create --assume-clean is the absolute
last resort, for when everything else fails.

--
Mikael Abrahamsson    email: swmike@swm.pp.se
* Re: 5 drives lost in an inactive 15 drive raid 6 system due to cable problem - how to recover?
From: Neil Brown @ 2010-09-08 21:35 UTC
To: Norman White; +Cc: linux-raid

On Wed, 08 Sep 2010 13:22:30 -0400
Norman White <nwhite@stern.nyu.edu> wrote:

> We have a 15-drive Addonics array with three 5-port SATA port multipliers.
> One of the SAS cables was knocked out of one of the port multipliers, and
> mdadm now sees 9 drives, a spare, and 5 failed, removed drives (after the
> cabling problem was fixed).
>
> An mdadm -E on each of the drives shows the 5 drives that were uncabled
> still reporting the original configuration of 14 drives plus a spare,
> while the other 10 drives report 9 drives, a spare, and 5 failed, removed
> drives.
>
> We are very confident that there was no I/O going on at the time, but we
> are not sure how to proceed.
>
> One obvious thing to do is simply:
>
>    mdadm --assemble --force --assume-clean /dev/md0 sd[b,c, ... , p]
>
> but we are getting different advice about what --force will do in this
> situation. The last thing we want to do is wipe the array.

What sort of different advice?  From whom?

This should either do exactly what you want, or nothing at all.  I suspect
the former.  To be more confident I would need to see the output of

   mdadm -E /dev/sd[b-p]

NeilBrown

> Another option would be to fiddle with the superblocks with mddump so
> that they all see the same 15 drives in the same configuration, and then
> assemble the array.
>
> Yet another suggestion was to recreate the array configuration and hope
> that the data wouldn't be touched.
>
> And still another suggestion is to create the array with one drive
> missing, so that it is degraded and won't rebuild.
>
> Any pointers on how to proceed would be helpful. Restoring 30TB takes a
> long time.
>
> Best,
> Norman White
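[A small sketch of one way to gather the output Neil asks for in a single
pass, assuming the members still appear as /dev/sdb through /dev/sdp as in
the original report (device names can change across reboots):

    # Capture the examine output of every member into one file that can be
    # reviewed or posted to the list before any assemble attempt is made.
    for d in /dev/sd[b-p]; do
        echo "===== $d ====="
        mdadm -E "$d"
    done > mdadm-examine-all.txt
]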
* Re: 5 drives lost in an inactive 15 drive raid 6 system due to cable problem - how to recover?
From: Norman White @ 2010-09-10 15:18 UTC
To: Neil Brown; +Cc: linux-raid

On 9/8/2010 5:35 PM, Neil Brown wrote:
> On Wed, 08 Sep 2010 13:22:30 -0400
> Norman White <nwhite@stern.nyu.edu> wrote:
>
>> We have a 15-drive Addonics array with three 5-port SATA port multipliers.
>> One of the SAS cables was knocked out of one of the port multipliers, and
>> mdadm now sees 9 drives, a spare, and 5 failed, removed drives (after the
>> cabling problem was fixed).
>>
>> An mdadm -E on each of the drives shows the 5 drives that were uncabled
>> still reporting the original configuration of 14 drives plus a spare,
>> while the other 10 drives report 9 drives, a spare, and 5 failed, removed
>> drives.
>>
>> We are very confident that there was no I/O going on at the time, but we
>> are not sure how to proceed.
>>
>> One obvious thing to do is simply:
>>
>>    mdadm --assemble --force --assume-clean /dev/md0 sd[b,c, ... , p]
>>
>> but we are getting different advice about what --force will do in this
>> situation. The last thing we want to do is wipe the array.
>
> What sort of different advice?  From whom?
>
> This should either do exactly what you want, or nothing at all.  I suspect
> the former.  To be more confident I would need to see the output of
>
>    mdadm -E /dev/sd[b-p]
>
> NeilBrown

Just to close this out: I sent Neil Brown the output of mdadm -E
/dev/sd[b-p] and he agreed it looked clean. I then did an

   mdadm --assemble --force /dev/md0 sd[b-p]

and got a message that /dev/sdb was busy, no superblock. I rebooted the
system and reissued the mdadm --assemble --force. Voila, /dev/md0 was back.
Initial tests indicate no data loss.

We have, of course (as suggested by some on this list), more securely
attached the SAS cables to the back of the Addonics array so this can't
happen again. The Silicon Image port multipliers only seem to have push-in
connections that don't lock at all, just a pressure fit. We have to be very
careful working around the box.

On the other hand, we have a 30TB RAID 6 array (about 21TB formatted, with
a hot spare) that is extremely fast and inexpensive (~$4k). We are
considering buying another, attaching several arrays to a dedicated server,
and putting it all in a protected environment.

Thank you very much Neil. We owe you.

Best,
Norman White

>> Another option would be to fiddle with the superblocks with mddump so
>> that they all see the same 15 drives in the same configuration, and then
>> assemble the array.
>>
>> Yet another suggestion was to recreate the array configuration and hope
>> that the data wouldn't be touched.
>>
>> And still another suggestion is to create the array with one drive
>> missing, so that it is degraded and won't rebuild.
>>
>> Any pointers on how to proceed would be helpful. Restoring 30TB takes a
>> long time.
>>
>> Best,
>> Norman White
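[For anyone in the same spot, the steps Norman reports reduce to roughly the
following sketch, under this thread's assumptions (array /dev/md0, members
/dev/sdb through /dev/sdp); an explicit --stop stands in for the reboot that
freed the busy /dev/sdb here:

    # If a partial or stale assembly is holding members busy, stop it first
    # (in the thread a reboot achieved the same thing).
    mdadm --stop /dev/md0

    # Force-assemble from the existing superblocks. --force lets mdadm
    # accept members whose metadata records a stale failure; it updates the
    # superblocks but does not touch the array data itself.
    mdadm --assemble --force /dev/md0 /dev/sd[b-p]

    # Confirm the result before mounting or writing anything.
    mdadm --detail /dev/md0
    cat /proc/mdstat
]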
* Re: 5 drives lost in an inactive 15 drive raid 6 system due to cable problem - how to recover?
From: Mr. James W. Laferriere @ 2010-09-10 17:47 UTC
To: Norman White; +Cc: linux-raid maillist

Hello Norman,

On Fri, 10 Sep 2010, Norman White wrote:
> On 9/8/2010 5:35 PM, Neil Brown wrote:
>> On Wed, 08 Sep 2010 13:22:30 -0400
>> Norman White <nwhite@stern.nyu.edu> wrote:
...snip...
>
> On the other hand, we have a 30TB RAID 6 array (about 21TB formatted, with
> a hot spare) that is extremely fast and inexpensive (~$4k). We are
> considering buying another, attaching several arrays to a dedicated
> server, and putting it all in a protected environment.

Would you please elaborate on the construction of this array?  Or, if it
was purchased, its manufacturer and product number?  I find such a device
very interesting, and maybe many others will as well.

TIA, JimL

--
+------------------------------------------------------------------+
| James W. Laferriere     | System Techniques    | Give me VMS      |
| Network&System Engineer | 3237 Holden Road     | Give me Linux    |
| babydr@baby-dragons.com | Fairbanks, AK. 99709 | only on AXP      |
+------------------------------------------------------------------+
* Re: 5 drives lost in an inactive 15 drive raid 6 system due to cable problem - how to recover?
From: Norman White @ 2010-09-10 18:51 UTC
To: Mr. James W. Laferriere; +Cc: linux-raid maillist

On 9/10/2010 1:47 PM, Mr. James W. Laferriere wrote:
> Hello Norman,
>
> On Fri, 10 Sep 2010, Norman White wrote:
>> On 9/8/2010 5:35 PM, Neil Brown wrote:
>>> On Wed, 08 Sep 2010 13:22:30 -0400
>>> Norman White <nwhite@stern.nyu.edu> wrote:
> ...snip...
>>
>> On the other hand, we have a 30TB RAID 6 array (about 21TB formatted,
>> with a hot spare) that is extremely fast and inexpensive (~$4k). We are
>> considering buying another, attaching several arrays to a dedicated
>> server, and putting it all in a protected environment.
>
> Would you please elaborate on the construction of this array?  Or, if it
> was purchased, its manufacturer and product number?  I find such a device
> very interesting, and maybe many others will as well.
>
> TIA, JimL

You can find more detail on the construction of the array (costs,
components, performance, issues, etc.) on my blog at
http://researchcomputing.blogspot.com/

Best,
Norman White
* Re: 5 drives lost in an inactive 15 drive raid 6 system due to cable problem - how to recover?
From: CoolCold @ 2010-09-10 19:39 UTC
To: Neil Brown; +Cc: Norman White, linux-raid

Neil, can you share that decision-making algorithm?

We have servers with "lucky" aic9410 and LSI 1068E controllers which
sometimes hang the system, after which drive(s) are dropped. In simple
cases, when one drive has a different event count, forcing the assembly is
enough; but in other cases, when, say, 8 drives are dropped as in Kyler's
case (http://marc.info/?l=linux-raid&m=127534131202696&w=2), examine shows
different information for drives from the same array, e.g.
http://lairds.us/temp/ucmeng_md/20100526/examine_sdj1 versus
http://lairds.us/temp/ucmeng_md/20100526/examine_sda1. Keeping in mind that
drives can come up in a different order on every boot, it's not so
straightforward to work out the "right" options.

I promise to put that knowledge on the wiki.

On Thu, Sep 9, 2010 at 1:35 AM, Neil Brown <neilb@suse.de> wrote:
> On Wed, 08 Sep 2010 13:22:30 -0400
> Norman White <nwhite@stern.nyu.edu> wrote:
>
>> We have a 15-drive Addonics array with three 5-port SATA port
>> multipliers. One of the SAS cables was knocked out of one of the port
>> multipliers, and mdadm now sees 9 drives, a spare, and 5 failed, removed
>> drives (after the cabling problem was fixed).
>>
>> An mdadm -E on each of the drives shows the 5 drives that were uncabled
>> still reporting the original configuration of 14 drives plus a spare,
>> while the other 10 drives report 9 drives, a spare, and 5 failed,
>> removed drives.
>>
>> We are very confident that there was no I/O going on at the time, but we
>> are not sure how to proceed.
>>
>> One obvious thing to do is simply:
>>
>>    mdadm --assemble --force --assume-clean /dev/md0 sd[b,c, ... , p]
>>
>> but we are getting different advice about what --force will do in this
>> situation. The last thing we want to do is wipe the array.
>
> What sort of different advice?  From whom?
>
> This should either do exactly what you want, or nothing at all.  I
> suspect the former.  To be more confident I would need to see the output
> of
>
>    mdadm -E /dev/sd[b-p]
>
> NeilBrown
>
>> Another option would be to fiddle with the superblocks with mddump so
>> that they all see the same 15 drives in the same configuration, and then
>> assemble the array.
>>
>> Yet another suggestion was to recreate the array configuration and hope
>> that the data wouldn't be touched.
>>
>> And still another suggestion is to create the array with one drive
>> missing, so that it is degraded and won't rebuild.
>>
>> Any pointers on how to proceed would be helpful. Restoring 30TB takes a
>> long time.
>>
>> Best,
>> Norman White

--
Best regards,
[COOLCOLD-RIPN]
* Re: 5 drives lost in an inactive 15 drive raid 6 system due to cable problem - how to recover?
From: Neil Brown @ 2010-09-10 21:24 UTC
To: CoolCold; +Cc: Norman White, linux-raid

On Fri, 10 Sep 2010 23:39:08 +0400 CoolCold <coolthecold@gmail.com> wrote:

> Neil, can you share that decision-making algorithm?

Nothing special really.  I just confirmed that the metadata contents agreed
with what Norman said had happened.

Sometimes people do other things to arrays first, don't realise the
consequences, and don't report them.  So if I base my "just use --force,
that should work" on what they tell me, I am sometimes wrong - because what
they tell me is flawed.  So I am now more cautious and try to base it only
on what mdadm tells me.

And I didn't really need to see the mdadm output.  As I said before, mdadm
-Af should either do the right thing or nothing.  But I know people can be
very protective of their multi-gigabyte data sets, so if I can be a bit
more definite, it helps them.  Also, I like to see the mdadm -E output from
odd failures as it might show me something that needs fixing.

Largely as a consequence of watching how people interact with their failed
arrays, the next release of mdadm will be a lot more cautious about letting
you add a device to an array - particularly a failed array.  People often
seem to do this thinking it means "add this back in", but really it means
"make this a spare and attach it to the array".

In this particular case, I checked that what each device thought of its own
role in the array was compatible with what the newest device thought, where
"compatible" means either they agreed, or the newer one reported a clean
failure.  If any device thought it was a spare and shouldn't have, or two
devices both thought they filled the same role, that would have been a
concern.

NeilBrown

> We have servers with "lucky" aic9410 and LSI 1068E controllers which
> sometimes hang the system, after which drive(s) are dropped. In simple
> cases, when one drive has a different event count, forcing the assembly
> is enough; but in other cases, when, say, 8 drives are dropped as in
> Kyler's case (http://marc.info/?l=linux-raid&m=127534131202696&w=2),
> examine shows different information for drives from the same array, e.g.
> http://lairds.us/temp/ucmeng_md/20100526/examine_sdj1 versus
> http://lairds.us/temp/ucmeng_md/20100526/examine_sda1. Keeping in mind
> that drives can come up in a different order on every boot, it's not so
> straightforward to work out the "right" options.
>
> I promise to put that knowledge on the wiki.
>
> ...snip...
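[A rough illustration of the kind of cross-check Neil describes, assuming
the member devices from this thread (/dev/sdb through /dev/sdp); the exact
field names differ between 0.90 and 1.x superblocks, so the grep pattern is
only indicative:

    # Print, for each member, the fields that have to be mutually
    # consistent before a forced assemble: the event count, the array state
    # that member last recorded, and the role the member claims for itself.
    for d in /dev/sd[b-p]; do
        echo "== $d =="
        mdadm -E "$d" | grep -E 'Events|Array State|Device Role|State :'
    done

    # Warning signs, per the explanation above: a device claiming to be a
    # spare when it shouldn't be, two devices claiming the same role, or
    # event counts that disagree by more than a forced assemble can
    # reconcile.
]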
End of thread (newest message: 2010-09-10 21:24 UTC)

Thread overview: 9+ messages

2010-09-08 17:22 5 drives lost in an inactive 15 drive raid 6 system due to cable problem - how to recover? Norman White
2010-09-08 18:47 ` Stan Hoeppner
2010-09-08 20:22 ` Mikael Abrahamsson
2010-09-08 21:35 ` Neil Brown
2010-09-10 15:18   ` Norman White
2010-09-10 17:47     ` Mr. James W. Laferriere
2010-09-10 18:51       ` Norman White
2010-09-10 19:39   ` CoolCold
2010-09-10 21:24     ` Neil Brown