* recovering from a controller failure
From: Kyler Laird @ 2010-05-29 19:07 UTC
To: linux-raid

Recently a drive failed on one of our file servers. The machine has
three RAID6 arrays (15 x 1TB drives each, plus spares). I let the spare
rebuild and then started the process of replacing the drive.

Unfortunately I'd misplaced the list of drive IDs, so I generated a new
list in order to identify the failed drive. I used "smartctl" and made
a quick script to scan all 48 drives and generate pretty output. That
was a mistake. After running it a couple of times, one of the
controllers failed and several disks in the first array were marked
failed.

I worked on the machine for a while. (It has an NFS root.) I got some
information from it before it rebooted (via watchdog). I've dumped all
of the information here:
http://lairds.us/temp/ucmeng_md/

In mdstat_0 you can see the status of the arrays right after the
controller failure. mdstat_1 shows the status after reboot.

sys_block shows a listing of the block devices, so you can see that the
problem drives are on controller 1.

The examine_sd?1 files show -E output from each drive in md0. Note that
the Events count is different for the drives on the problem controller.

I'd like to know if this is something I can recover. I do have backups,
but it's a huge pain to recover this much data.

Thank you.

--kyler
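A minimal sketch of the kind of "scan all drives with smartctl" script Kyler describes might look like the following. It is a hypothetical reconstruction, not the original script; the device globs, smartctl options, and output format are all assumptions. As the rest of the thread shows, issuing SMART queries in a tight loop against drives behind this controller is exactly what provoked the failure, so it is shown for illustration only.

  #!/bin/sh
  # Hypothetical reconstruction of a drive-inventory script (NOT the original).
  # Pulls the model and serial number of every disk via SMART identify data.
  for dev in /dev/sd? /dev/sd??; do
      [ -b "$dev" ] || continue
      model=$(smartctl -i "$dev" | awk -F: '/Device Model/  {sub(/^ */, "", $2); print $2}')
      serial=$(smartctl -i "$dev" | awk -F: '/Serial Number/ {sub(/^ */, "", $2); print $2}')
      printf '%-12s %-28s %s\n' "$dev" "$model" "$serial"
  done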
* Re: recovering from a controller failure
From: Berkey B Walker @ 2010-05-29 19:46 UTC
To: Kyler Laird; +Cc: linux-raid

To me, things do not look good for a quick fix. It kinda looks like you
killed it. Any info about the details of how things died, and exactly
what you did after things started going south? What are you using for a
controller? It sounds like it is ready for the dump. Any messages from
the controller itself?

b-

Kyler Laird wrote:
> Recently a drive failed on one of our file servers. The machine has
> three RAID6 arrays (15 x 1TB drives each, plus spares). I let the spare
> rebuild and then started the process of replacing the drive.
>
> Unfortunately I'd misplaced the list of drive IDs, so I generated a new
> list in order to identify the failed drive. I used "smartctl" and made
> a quick script to scan all 48 drives and generate pretty output. That
> was a mistake. After running it a couple of times, one of the
> controllers failed and several disks in the first array were marked
> failed.
>
> I worked on the machine for a while. (It has an NFS root.) I got some
> information from it before it rebooted (via watchdog). I've dumped all
> of the information here:
> http://lairds.us/temp/ucmeng_md/
>
> In mdstat_0 you can see the status of the arrays right after the
> controller failure. mdstat_1 shows the status after reboot.
>
> sys_block shows a listing of the block devices, so you can see that the
> problem drives are on controller 1.
>
> The examine_sd?1 files show -E output from each drive in md0. Note that
> the Events count is different for the drives on the problem controller.
>
> I'd like to know if this is something I can recover. I do have backups,
> but it's a huge pain to recover this much data.
>
> Thank you.
>
> --kyler
* Re: recovering from a controller failure
From: Kyler Laird @ 2010-05-29 20:44 UTC
To: linux-raid

On Sat, May 29, 2010 at 03:46:31PM -0400, Berkey B Walker wrote:
> To me, things do not look good for a quick fix. It kinda looks like
> you killed it. Any info about the details of how things died,

I used smartctl multiple times on all drives in quick succession.

> and exactly what you did after things started going south?

I collected information.
http://lairds.us/temp/ucmeng_md/mdstat_0

> What are you using for a controller?

http://lairds.us/temp/ucmeng_md/lspci

03:00.0 SCSI storage controller: LSI Logic / Symbios Logic SAS1068E
PCI-Express Fusion-MPT SAS (rev 04)

> It sounds like it is ready for the dump.
> Any messages from the controller, itself?

I didn't capture any before the reboot.

Thank you.

--kyler
* Re: recovering from a controller failure
From: Richard @ 2010-05-29 21:18 UTC
To: Kyler Laird; +Cc: linux-raid

Kyler Laird wrote:
> I'd like to know if this is something I can recover. I do have backups,
> but it's a huge pain to recover this much data.

This happened to me before I discovered that the LSI SAS1068E no longer
reliably tolerates querying via smartd/smartctl.

Have a look at https://bugzilla.kernel.org/show_bug.cgi?id=14831

and there is a patch that seems to fix it here:

http://lkml.org/lkml/2010/4/26/335

Use hdparm if you need serial numbers.

In the half dozen or so tests I have done, where more than 2 drives have
been thrown out of md RAID6 arrays due to these controller resets,
reassembly using --force has worked with no data corruption, but this
may have been good luck.

Regards,

Richard
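A sketch of the hdparm approach Richard suggests for collecting serial numbers. The device globs are an assumption; hdparm -I reads the drive's ATA identify data rather than issuing SMART commands, which is presumably why it is suggested here.

  for dev in /dev/sd? /dev/sd??; do
      [ -b "$dev" ] || continue
      echo "== $dev =="
      # hdparm -I prints "Model Number:" and "Serial Number:" lines for ATA drives
      hdparm -I "$dev" 2>/dev/null | grep -E 'Model Number|Serial Number'
  done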
* Re: recovering from a controller failure
From: Kyler Laird @ 2010-05-29 21:36 UTC
To: linux-raid

On Sun, May 30, 2010 at 09:18:21AM +1200, Richard wrote:
> This happened to me before I discovered that the LSI SAS1068E no longer
> reliably tolerates querying via smartd/smartctl.
>
> Have a look at https://bugzilla.kernel.org/show_bug.cgi?id=14831
>
> and there is a patch that seems to fix it here:
>
> http://lkml.org/lkml/2010/4/26/335

Good news! I appreciate the information. I'm planning to update these
machines with new kernels and will include this patch.

> Use hdparm if you need serial numbers.

The labels Sun puts on the drives have numbers from the "device model."
I will see if hdparm yields those numbers...once this is all settled.
Thanks for the suggestion.

> In the half dozen or so tests I have done, where more than 2 drives
> have been thrown out of md RAID6 arrays due to these controller
> resets, reassembly using --force has worked with no data corruption,
> but this may have been good luck.

Wow! That's encouraging. I would feel amazingly more confident if
someone would give me the exact command to try. This is not a good
time for me to exercise my ignorance by experimenting.

Thank you for your helpful insight!

--kyler
* Re: recovering from a controller failure
From: Richard @ 2010-05-29 21:38 UTC
To: Kyler Laird; +Cc: linux-raid

Kyler Laird wrote:
> Wow! That's encouraging. I would feel amazingly more confident if
> someone would give me the exact command to try. This is not a good
> time for me to exercise my ignorance by experimenting.

mdadm -A -f /dev/mdX

Regards,

Richard
* Re: recovering from a controller failure
From: Kyler Laird @ 2010-05-29 21:45 UTC
To: linux-raid

On Sun, May 30, 2010 at 09:38:50AM +1200, Richard wrote:
> mdadm -A -f /dev/mdX

root@00144ff2a334:/# mdadm -A -f /dev/md0
mdadm: /dev/md0 not identified in config file.

These are net-booted file servers. They share a root file system so I
rely on auto-detection of the RAID partitions.

I appreciate the hand holding.

--kyler
* Re: recovering from a controller failure
From: Richard @ 2010-05-29 21:50 UTC
To: Kyler Laird; +Cc: linux-raid

Kyler Laird wrote:
> On Sun, May 30, 2010 at 09:38:50AM +1200, Richard wrote:
>
>> mdadm -A -f /dev/mdX
>
> root@00144ff2a334:/# mdadm -A -f /dev/md0
> mdadm: /dev/md0 not identified in config file.
>
> These are net-booted file servers. They share a root file system so I
> rely on auto-detection of the RAID partitions.
>
> I appreciate the hand holding.

How about adding entries to your mdadm.conf file containing the UUID of
/dev/md0, eg:

ARRAY /dev/md8 level=raid6 num-devices=16 UUID=38a06a50:ce3fc204:728edfb7:4f4cdd43

note this should be all one line.

mdadm -D /dev/md0 should get you the UUID.

Regards,

Richard
* Re: recovering from a controller failure
From: Kyler Laird @ 2010-05-30 0:15 UTC
To: linux-raid

On Sun, May 30, 2010 at 09:50:26AM +1200, Richard wrote:
> How about adding entries to your mdadm.conf file containing the UUID
> of /dev/md0, eg:
>
> ARRAY /dev/md8 level=raid6 num-devices=16 UUID=38a06a50:ce3fc204:728edfb7:4f4cdd43
>
> note this should be all one line.

I'll be happy to do that.

> mdadm -D /dev/md0 should get you the UUID.

root@00144ff2a334:/# mdadm -D /dev/md0
mdadm: md device /dev/md0 does not appear to be active.

So...how do I get the UUIDs? I tried blkid and got this.
http://lairds.us/temp/ucmeng_md/uuids

Those UUIDs are far from unique.

Thanks!

--kyler
* Re: recovering from a controller failure
From: Richard @ 2010-05-30 0:28 UTC
To: Linux RAID Mailing List

Kyler Laird wrote:
> So...how do I get the UUIDs? I tried blkid and got this.
> http://lairds.us/temp/ucmeng_md/uuids
> Those UUIDs are far from unique.

How about

mdadm --examine /dev/sdX

where sdX is a component of the failed array. If the drive was
partitioned prior to being md'ed, you will need to use that partition,
e.g. sda1.

Regards,

Richard
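With this many members, a small loop makes the per-component check less tedious. A sketch; the partition globs are an assumption, and the sed expression assumes the "UUID : ..." line layout that mdadm --examine prints for these superblocks:

  for part in /dev/sd?1 /dev/sd??1; do
      [ -b "$part" ] || continue
      # grab the array UUID line (the first "UUID :" line in the output)
      uuid=$(mdadm --examine "$part" 2>/dev/null | sed -n 's/.* UUID : //p' | head -n 1)
      [ -n "$uuid" ] && printf '%-14s %s\n' "$part" "$uuid"
  done | sort -k2        # members of the same array end up grouped together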
* Re: recovering from a controller failure
From: Richard @ 2010-05-30 0:54 UTC
To: Linux RAID Mailing List

Richard wrote:
> Kyler Laird wrote:
>
>> So...how do I get the UUIDs? I tried blkid and got this.
>> http://lairds.us/temp/ucmeng_md/uuids
>> Those UUIDs are far from unique.

Sorry, I just checked the link showing the blkids. These are almost
certainly correct, and they are only unique between arrays. This is the
whole point: all members of an array have the same UUID, so that mdadm
knows which devices are part of the same array.

UUIDs should always be part of an mdadm.conf so that there can be no
ambiguity.

Regards,

Richard
* RE: recovering from a controller failure
From: Leslie Rhorer @ 2010-05-30 3:33 UTC
To: 'Kyler Laird', linux-raid

> On Sun, May 30, 2010 at 09:50:26AM +1200, Richard wrote:
>
> > How about adding entries to your mdadm.conf file containing the UUID
> > of /dev/md0, eg:
> >
> > ARRAY /dev/md8 level=raid6 num-devices=16 UUID=38a06a50:ce3fc204:728edfb7:4f4cdd43
> >
> > note this should be all one line.
>
> I'll be happy to do that.
>
> > mdadm -D /dev/md0 should get you the UUID.
>
> root@00144ff2a334:/# mdadm -D /dev/md0
> mdadm: md device /dev/md0 does not appear to be active.
>
> So...how do I get the UUIDs? I tried blkid and got this.
> http://lairds.us/temp/ucmeng_md/uuids
> Those UUIDs are far from unique.

After all your drives are visible, of course:

`mdadm --examine /dev/sd* /dev/hd* > <filename>`
`more <filename>`

Make note of the array UUID for each drive. When done,

`mdadm --assemble --assume-clean /dev/mdX /dev/<drive0> /dev/<drive1> /dev/<drive2> ...etc`

where <drive0>, <drive1>, etc are all members of the same array UUID.

Mount the file system, and fsck it. Once everything is verified good,

`echo repair > /sys/block/mdX/md/sync_action`
`mdadm --examine --scan >> /etc/mdadm/mdadm.conf`
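For the repair step, the same sysfs directory can be used to watch progress and see how many inconsistencies were found. A sketch, assuming the array comes up as /dev/md0:

  echo repair > /sys/block/md0/md/sync_action
  cat /sys/block/md0/md/sync_action     # reports "repair" while running, "idle" when finished
  cat /sys/block/md0/md/mismatch_cnt    # sectors found inconsistent (rewritten during repair)
  watch -n 60 cat /proc/mdstat          # rough progress and speed estimate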
* Re: recovering from a controller failure
From: CoolCold @ 2010-05-30 13:17 UTC
To: Leslie Rhorer; +Cc: Kyler Laird, linux-raid

On Sun, May 30, 2010 at 7:33 AM, Leslie Rhorer <lrhorer@satx.rr.com> wrote:
> After all your drives are visible, of course:
>
> `mdadm --examine /dev/sd* /dev/hd* > <filename>`
> `more <filename>`
>
> Make note of the array UUID for each drive. When done,
>
> `mdadm --assemble --assume-clean /dev/mdX /dev/<drive0> /dev/<drive1> /dev/<drive2> ...etc`
>
> where <drive0>, <drive1>, etc are all members of the same array UUID.
>
> Mount the file system, and fsck it. Once everything is verified good,
>
> `echo repair > /sys/block/mdX/md/sync_action`

Taking into account that the "Events" fields differ between the disks on
the 1st and 2nd controllers, an interesting question for me is: what will
happen on this "repair"?

And what does this "Events" field really mean? I didn't find a
description in the man pages.

> `mdadm --examine --scan >> /etc/mdadm/mdadm.conf`

--
Best regards,
[COOLCOLD-RIPN]
* RE: recovering from a controller failure
From: Leslie Rhorer @ 2010-05-30 22:38 UTC
To: linux-raid

> -----Original Message-----
> From: linux-raid-owner@vger.kernel.org [mailto:linux-raid-owner@vger.kernel.org] On Behalf Of CoolCold
> Sent: Sunday, May 30, 2010 8:18 AM
> To: Leslie Rhorer
> Cc: Kyler Laird; linux-raid@vger.kernel.org
> Subject: Re: recovering from a controller failure
>
> > `mdadm --assemble --assume-clean /dev/mdX /dev/<drive0> /dev/<drive1> /dev/<drive2> ...etc`
> >
> > where <drive0>, <drive1>, etc are all members of the same array UUID.
> >
> > Mount the file system, and fsck it. Once everything is verified good,
> >
> > `echo repair > /sys/block/mdX/md/sync_action`
>
> Taking into account that the "Events" fields differ between the disks on
> the 1st and 2nd controllers, an interesting question for me is: what will
> happen on this "repair"?

Note that should be --force, not --assume-clean. The --assume-clean
switch would be used if you re-created the array, not just re-assembled
it. Once the array is assembled, the repair function will re-establish
the redundancy within the array. Any stripes whose data does not match
the calculated value required to produce the upper layer information are
re-written.

> And what does this "Events" field really mean? I didn't find a
> description in the man pages.

I believe a number of things. For one thing, it is used to keep track of
which version of data resides on each drive whenever an array event is
encountered. The value of the events counter in the members of an array
should not differ by more than 1, or mdadm kicks the drive out of the
array. I expect it may also be used during forced re-assembly and/or
during a resync of a RAID1 system to help determine which version of a
stripe is correct.
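A quick way to see the Events spread Leslie describes across all members of the affected array. The partition glob is an assumption based on the 15 members plus spare mentioned earlier, and the awk field position assumes the "Events : N" line layout of mdadm --examine:

  for part in /dev/sd[a-p]1; do
      printf '%-12s Events: %s\n' "$part" \
          "$(mdadm --examine "$part" | awk '/Events/ {print $3; exit}')"
  done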
* Re: recovering from a controller failure
From: CoolCold @ 2010-05-31 8:33 UTC
To: Leslie Rhorer; +Cc: linux-raid

On Mon, May 31, 2010 at 2:38 AM, Leslie Rhorer <lrhorer@satx.rr.com> wrote:
> Note that should be --force, not --assume-clean. The --assume-clean
> switch would be used if you re-created the array, not just re-assembled
> it. Once the array is assembled, the repair function will re-establish
> the redundancy within the array. Any stripes whose data does not match
> the calculated value required to produce the upper layer information are
> re-written.

That's it - as you can see, there are 15 drives in the RAID6 array.
Examining the disks from sda to sdh shows the drives active with an
event count of 0.159; sdi to sdp have an event count of 0.168 and show
sd[a-i] as faulty. So I'm guessing there is no way to know which part of
the array is "right", and I guess they are desynced.

> I believe a number of things. For one thing, it is used to keep track of
> which version of data resides on each drive whenever an array event is
> encountered. The value of the events counter in the members of an array
> should not differ by more than 1, or mdadm kicks the drive out of the
> array.

I thought something similar, but interestingly, in this situation the
drives have event count values like "0.168" and "0.159"...

> I expect it may also be used during forced re-assembly and/or
> during a resync of a RAID1 system to help determine which version of a
> stripe is correct.

--
Best regards,
[COOLCOLD-RIPN]
* RE: recovering from a controller failure
From: Leslie Rhorer @ 2010-05-31 8:50 UTC
To: 'CoolCold'; +Cc: linux-raid

> > Once the array is assembled, the repair function will re-establish the
> > redundancy within the array. Any stripes whose data does not match the
> > calculated value required to produce the upper layer information are
> > re-written.
>
> That's it - as you can see, there are 15 drives in the RAID6 array.
> Examining the disks from sda to sdh shows the drives active with an
> event count of 0.159; sdi to sdp have an event count of 0.168 and show
> sd[a-i] as faulty. So I'm guessing there is no way to know which part of
> the array is "right", and I guess they are desynced.

I deleted the original e-mails while cleaning out my inbox a few hours
ago, so I can't look at your original response, but I've never seen
fractional event counts. Some of mine are in the millions. In any case,
if the corruption is bad enough, you may indeed lose some data.
Remember, however, that unless this was a brand new array, or the data
on the array was undergoing a truly phenomenal amount of thrashing, most
of the data on the drives is probably consistent, or at least consistent
enough to allow recovery. Some, however, possibly even a large amount,
may be toast. That's one reason why you have backups.
* Re: recovering from a controller failure
From: Richard Scobie @ 2010-05-30 18:55 UTC
To: Leslie Rhorer; +Cc: 'Kyler Laird', linux-raid

Leslie Rhorer wrote:
> `mdadm --assemble --assume-clean /dev/mdX /dev/<drive0> /dev/<drive1>
> /dev/<drive2> ...etc`

--assume-clean is not an option for assemble, the --force option is
required.

Regards,

Richard
* RE: recovering from a controller failure
From: Leslie Rhorer @ 2010-05-30 22:23 UTC
To: 'Richard Scobie'; +Cc: 'Kyler Laird', linux-raid

Oops! You're right.

> -----Original Message-----
> From: Richard Scobie [mailto:richard@sauce.co.nz]
> Sent: Sunday, May 30, 2010 1:55 PM
> To: Leslie Rhorer
> Cc: 'Kyler Laird'; linux-raid@vger.kernel.org
> Subject: Re: recovering from a controller failure
>
> Leslie Rhorer wrote:
>
> > `mdadm --assemble --assume-clean /dev/mdX /dev/<drive0> /dev/<drive1>
> > /dev/<drive2> ...etc`
>
> --assume-clean is not an option for assemble, the --force option is
> required.
>
> Regards,
>
> Richard
* Re: recovering from a controller failure
From: Richard @ 2010-05-29 21:59 UTC
To: Kyler Laird; +Cc: linux-raid

Kyler Laird wrote:
> On Sun, May 30, 2010 at 09:38:50AM +1200, Richard wrote:
>
>> mdadm -A -f /dev/mdX

You will probably need to stop the array first, if it isn't already,
prior to doing this:

mdadm -S /dev/md0

Regards,

Richard
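Putting the two suggestions together, the sequence would look roughly like the following. The member list is an assumption and should come from the --examine output; it should only be run once all listed members report the same array UUID.

  mdadm -S /dev/md0                       # stop the inactive/partially assembled array
  mdadm -A -f /dev/md0 /dev/sd[a-o]1      # forced assemble from the named members
  cat /proc/mdstat                        # confirm the array came up, and with how many members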
* Re: recovering from a controller failure
From: Berkey B Walker @ 2010-05-29 21:43 UTC
To: Richard; +Cc: Kyler Laird, linux-raid

Good find, Richard. Simplifies things a lot. I liked the phrase
"Abusively looping", as that was a technique I used to use (30 yr. ago).

b-

Richard wrote:
> Kyler Laird wrote:
>
>> I'd like to know if this is something I can recover. I do have backups
>> but it's a huge pain to recover this much data.
>
> This happened to me before I discovered that the LSI SAS1068E no longer
> reliably tolerates querying via smartd/smartctl.
>
> Have a look at https://bugzilla.kernel.org/show_bug.cgi?id=14831
>
> and there is a patch that seems to fix it here:
>
> http://lkml.org/lkml/2010/4/26/335
>
> Use hdparm if you need serial numbers.
>
> In the half dozen or so tests I have done, where more than 2 drives
> have been thrown out of md RAID6 arrays due to these controller resets,
> reassembly using --force has worked with no data corruption, but this
> may have been good luck.
>
> Regards,
>
> Richard
* Re: recovering from a controller failure
From: Kyler Laird @ 2010-05-31 18:27 UTC
To: linux-raid

I appreciate the help that everyone here has been providing with this
frustrating problem.

It looks like there's agreement that I need to use "--force" to assemble
the array with the disk devices specified. Here's my first cut at a
command to try:

mdadm --force --assemble /dev/md0 /dev/sda1 /dev/sdb1 /dev/sdc1 /dev/sdd1 /dev/sde1 /dev/sdf1 /dev/sdg1 /dev/sdh1 /dev/sdi1 /dev/sdj1 /dev/sdk1 /dev/sdl1 /dev/sdm1 /dev/sdn1 /dev/sdo1

http://lairds.us/temp/ucmeng_md/suggested_recovery

I'm sure I'm missing something. Corrections are welcome. (It would be
comforting if mdadm had a "--dry-run" option.)

Thank you, all!

--kyler
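Short of a real dry run, one sanity check before pulling the trigger is to confirm that every member named in the command reports the same array UUID, and to note the Events spread. A read-only sketch over the same member list (nothing here writes to the array):

  for part in /dev/sd[a-o]1; do
      echo "== $part =="
      mdadm --examine "$part" | grep -E 'UUID|Events|State'
  done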
* Re: recovering from a controller failure
From: Kyler Laird @ 2010-06-01 15:49 UTC
To: linux-raid

I decided to try simply using "--force" to assemble the array. It seems
to have worked.

http://lairds.us/temp/ucmeng_md/suggested_recovery

As you can see, it didn't use /dev/sdah1, starting the RAID6 array with
one drive missing.

Can I safely --add this drive, or was there a reason it wasn't used? I
also plan to add the spare (/dev/sdp1).

Thanks for all the help!

--kyler
* Re: recovering from a controller failure
From: Richard Scobie @ 2010-06-01 19:15 UTC
To: Kyler Laird; +Cc: linux-raid

Kyler Laird wrote:
> I decided to try simply using "--force" to assemble the array. It seems
> to have worked.
> http://lairds.us/temp/ucmeng_md/suggested_recovery
> As you can see, it didn't use /dev/sdah1, starting the RAID6 array with
> one drive missing.
>
> Can I safely --add this drive, or was there a reason it wasn't used? I
> also plan to add the spare (/dev/sdp1).

It would be prudent to remove /dev/sdah1 and use smartctl on a non-LSI
SAS controller, or another machine, to check that it has not failed.

If it has not, prior to re-adding it, I would perform an fsck on the
filesystem to make sure there are no errors.

Regards,

Richard
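A sketch of that sequence. The device names come from the messages above, but the smartctl flags, the filesystem type, and the assumption that the filesystem sits directly on /dev/md0 (and is unmounted) are not from the thread:

  # On another machine (or a non-LSI controller), check the suspect drive:
  smartctl -H -A /dev/sdX         # overall health verdict plus the SMART attribute table

  # Back on the file server, with the filesystem unmounted:
  fsck -n /dev/md0                # read-only pass first; rerun without -n if it finds problems

  # Only once the filesystem checks out, re-add the missing member and the spare:
  mdadm --add /dev/md0 /dev/sdah1
  mdadm --add /dev/md0 /dev/sdp1
  cat /proc/mdstat                # watch the rebuild progress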