linux-raid.vger.kernel.org archive mirror
* recovering from a controller failure
@ 2010-05-29 19:07 Kyler Laird
  2010-05-29 19:46 ` Berkey B Walker
  2010-05-29 21:18 ` Richard
  0 siblings, 2 replies; 23+ messages in thread
From: Kyler Laird @ 2010-05-29 19:07 UTC (permalink / raw)
  To: linux-raid

Recently a drive failed on one of our file servers.  The machine has
three RAID6 arrays (15 1TB each plus spares).  I let the spare rebuild
and then started the process of replacing the drive.

Unfortunately I'd misplaced the list of drive IDs so I generated a new
list in order to identify the failed drive.  I used "smartctl" and made
a quick script to scan all 48 drives and generate pretty output.  That
was a mistake.  After running it a couple times one of the controllers
failed and several disks in the first array were failed.
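(For what it's worth, the script was nothing fancy - roughly a loop along
these lines.  This is a sketch rather than the exact script, and the
device globs are only illustrative:

	for d in /dev/sd[a-z] /dev/sd[a-z][a-z]; do
		[ -b "$d" ] || continue
		echo "=== $d ==="
		smartctl -i "$d" | egrep 'Device Model|Serial Number'
	done

i.e. just repeated smartctl identify queries against every drive.)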

I worked on the machine for awhile.  (It has an NFS root.)  I got some
information from it before it rebooted (via watchdog).  I've dumped all
of the information here.
	http://lairds.us/temp/ucmeng_md/

In mdstat_0 you can see the status of the arrays right after the
controller failure.  mdstat_1 shows the status after reboot.

sys_block shows a listing of the block devices so you can see that the
problem drives are on controller 1.

The examine_sd?1 files show -E output from each drive in md0.  Note that
the Events count is different for the drives on the problem controller.
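(A quick way to see that from the dump is, for example:

	grep Events examine_sd?1

run in that directory.)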

I'd like to know if this is something I can recover.  I do have backups
but it's a huge pain to recover this much data.

Thank you.

--kyler

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: recovering from a controller failure
  2010-05-29 19:07 recovering from a controller failure Kyler Laird
@ 2010-05-29 19:46 ` Berkey B Walker
  2010-05-29 20:44   ` Kyler Laird
  2010-05-29 21:18 ` Richard
  1 sibling, 1 reply; 23+ messages in thread
From: Berkey B Walker @ 2010-05-29 19:46 UTC (permalink / raw)
  To: Kyler Laird; +Cc: linux-raid

To me, things do not look good for a quick fix.  It kinda looks like you 
killed it.  Any info about the details of how things died, and exactly 
what you did after things started going south?  What are you using for a 
controller?  It sounds like it is ready for the dump.  Any messages from 
the controller itself?
b-

Kyler Laird wrote:
> Recently a drive failed on one of our file servers.  The machine has
> three RAID6 arrays (15 1TB each plus spares).  I let the spare rebuild
> and then started the process of replacing the drive.
>
> Unfortunately I'd misplaced the list of drive IDs so I generated a new
> list in order to identify the failed drive.  I used "smartctl" and made
> a quick script to scan all 48 drives and generate pretty output.  That
> was a mistake.  After running it a couple times one of the controllers
> failed and several disks in the first array were failed.
>
> I worked on the machine for awhile.  (It has an NFS root.)  I got some
> information from it before it rebooted (via watchdog).  I've dumped all
> of the information here.
> 	http://lairds.us/temp/ucmeng_md/
>
> In mdstat_0 you can see the status of the arrays right after the
> controller failure.  mdstat_1 shows the status after reboot.
>
> sys_block shows a listing of the block devices so you can see that the
> problem drives are on controller 1.
>
> The examine_sd?1 files show -E output from each drive in md0.  Note that
> the Events count is different for the drives on the problem controller.
>
> I'd like to know if this is something I can recover.  I do have backups
> but it's a huge pain to recover this much data.
>
> Thank you.
>
> --kyler
>
>    

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: recovering from a controller failure
  2010-05-29 19:46 ` Berkey B Walker
@ 2010-05-29 20:44   ` Kyler Laird
  0 siblings, 0 replies; 23+ messages in thread
From: Kyler Laird @ 2010-05-29 20:44 UTC (permalink / raw)
  To: linux-raid

On Sat, May 29, 2010 at 03:46:31PM -0400, Berkey B Walker wrote:

> To me, things do not look good for a quick fix.  It kinda looks like
> you killed it.  Any info about the details of how things died,

I used smartctl multiple times on all drives in quick succession.

> and
> exactly what you did after things started going south?

I collected information.
	http://lairds.us/temp/ucmeng_md/mdstat_0

> What are you
> using for a controller? 

	http://lairds.us/temp/ucmeng_md/lspci
	03:00.0 SCSI storage controller: LSI Logic / Symbios Logic
	SAS1068E PCI-Express Fusion-MPT SAS (rev 04)

> It sounds like it is ready for the dump.
> Any messages from the controller, itself?

I didn't capture any before the reboot.
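If it happens again I'll try to capture the kernel log before the box
goes down, e.g. something along the lines of:

	dmesg | egrep -i 'mpt|sas|scsi' > /tmp/controller_messages

(the filter is just a guess at what would catch the controller/driver
messages).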

Thank you.

--kyler

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: recovering from a controller failure
  2010-05-29 19:07 recovering from a controller failure Kyler Laird
  2010-05-29 19:46 ` Berkey B Walker
@ 2010-05-29 21:18 ` Richard
  2010-05-29 21:36   ` Kyler Laird
  2010-05-29 21:43   ` Berkey B Walker
  1 sibling, 2 replies; 23+ messages in thread
From: Richard @ 2010-05-29 21:18 UTC (permalink / raw)
  To: Kyler Laird; +Cc: linux-raid

Kyler Laird wrote:

> I'd like to know if this is something I can recover.  I do have backups
> but it's a huge pain to recover this much data.

This happened to me before I discovered that the LSI SAS1068E no longer 
reliably tolerates querying via smartd/smartctl.

Have a look at https://bugzilla.kernel.org/show_bug.cgi?id=14831

and there is a patch that seems to fix it here:

http://lkml.org/lkml/2010/4/26/335

Use hdparm if you need serial numbers.
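For example, something along these lines (assuming hdparm can reach the
drives through your controller; the globs are only illustrative):

	for d in /dev/sd? /dev/sd??; do
		[ -b "$d" ] || continue
		echo -n "$d: "
		hdparm -i "$d" | grep -o 'SerialNo=[^ ]*'
	done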

In the half dozen or so tests I have done, where more than 2 drives 
have been thrown out of md RAID6 arrays due to these controller resets,
reassembly using --force has worked with no data corruption, but this 
may have been good luck.

Regards,

Richard

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: recovering from a controller failure
  2010-05-29 21:18 ` Richard
@ 2010-05-29 21:36   ` Kyler Laird
  2010-05-29 21:38     ` Richard
  2010-05-29 21:43   ` Berkey B Walker
  1 sibling, 1 reply; 23+ messages in thread
From: Kyler Laird @ 2010-05-29 21:36 UTC (permalink / raw)
  To: linux-raid

On Sun, May 30, 2010 at 09:18:21AM +1200, Richard wrote:

> This happened to me before I discovered that the LSI SAS1068E no longer
> reliably tolerates querying via smartd/smartctl.
> 
> Have a look at https://bugzilla.kernel.org/show_bug.cgi?id=14831
> 
> and there is a patch that seems to fix it here:
> 
> http://lkml.org/lkml/2010/4/26/335

Good news!  I appreciate the information.  I'm planning to update these
machines with new kernels and will include this patch.

> Use hdparm if you need serial numbers.

The labels Sun puts on the drives have numbers from the "device model."
I will see if hdparm yields those numbers...once this is all settled. 
Thanks for the suggestion.

> In the half dozen or so tests I have done, where more than 2
> drives have been thrown out of md RAID6 arrays due to these
> controller resets,
> reassembly using --force has worked with no data corruption, but
> this may have been good luck.

Wow!  That's encouraging.  I would feel amazingly more confident if
someone would give me the exact command to try.  This is not a good
time for me to exercise my ignorance by experimenting.

Thank you for your helpful insight!

--kyler

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: recovering from a controller failure
  2010-05-29 21:36   ` Kyler Laird
@ 2010-05-29 21:38     ` Richard
  2010-05-29 21:45       ` Kyler Laird
  0 siblings, 1 reply; 23+ messages in thread
From: Richard @ 2010-05-29 21:38 UTC (permalink / raw)
  To: Kyler Laird; +Cc: linux-raid

Kyler Laird wrote:

> Wow!  That's encouraging.  I would feel amazingly more confident if
> someone would give me the exact command to try.  This is not a good
> time for me to exercise my ignorance by experimenting.

mdadm -A -f /dev/mdX

Regards,

Richard

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: recovering from a controller failure
  2010-05-29 21:18 ` Richard
  2010-05-29 21:36   ` Kyler Laird
@ 2010-05-29 21:43   ` Berkey B Walker
  1 sibling, 0 replies; 23+ messages in thread
From: Berkey B Walker @ 2010-05-29 21:43 UTC (permalink / raw)
  To: Richard; +Cc: Kyler Laird, linux-raid


Good find, Richard.  Simplifies things a lot.  I liked the phrase 
"Abusively looping", as that was a technique I used to use (30 yr. ago).
b-


Richard wrote:
> Kyler Laird wrote:
>
>> I'd like to know if this is something I can recover.  I do have backups
>> but it's a huge pain to recover this much data.
>
> This happened to me before I discovered that the LSI SAS1068E no longer 
> reliably tolerates querying via smartd/smartctl.
>
> Have a look at https://bugzilla.kernel.org/show_bug.cgi?id=14831
>
> and there is a patch that seems to fix it here:
>
> http://lkml.org/lkml/2010/4/26/335
>
> Use hdparm if you need serial numbers.
>
> In the half dozen or so tests I have done, where more than 2 
> drives have been thrown out of md RAID6 arrays due to these controller 
> resets,
> reassembly using --force has worked with no data corruption, but this 
> may have been good luck.
>
> Regards,
>
> Richard
>

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: recovering from a controller failure
  2010-05-29 21:38     ` Richard
@ 2010-05-29 21:45       ` Kyler Laird
  2010-05-29 21:50         ` Richard
  2010-05-29 21:59         ` Richard
  0 siblings, 2 replies; 23+ messages in thread
From: Kyler Laird @ 2010-05-29 21:45 UTC (permalink / raw)
  To: linux-raid

On Sun, May 30, 2010 at 09:38:50AM +1200, Richard wrote:

> mdadm -A -f /dev/mdX

	root@00144ff2a334:/# mdadm -A -f /dev/md0
	mdadm: /dev/md0 not identified in config file.

These are net-booted file servers.  They share a root file system so I
rely on auto-detection of the RAID partitions.

I appreciate the hand holding.

--kyler

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: recovering from a controller failure
  2010-05-29 21:45       ` Kyler Laird
@ 2010-05-29 21:50         ` Richard
  2010-05-30  0:15           ` Kyler Laird
  2010-05-29 21:59         ` Richard
  1 sibling, 1 reply; 23+ messages in thread
From: Richard @ 2010-05-29 21:50 UTC (permalink / raw)
  To: Kyler Laird; +Cc: linux-raid

Kyler Laird wrote:
> On Sun, May 30, 2010 at 09:38:50AM +1200, Richard wrote:
> 
>> mdadm -A -f /dev/mdX
> 
> 	root@00144ff2a334:/# mdadm -A -f /dev/md0
> 	mdadm: /dev/md0 not identified in config file.
> 
> These are net-booted file servers.  They share a root file system so I
> rely on auto-detection of the RAID partitions.
> 
> I appreciate the hand holding.

How about adding entries to your mdadm.conf file containing the UUID of 
/dev/md0, eg:

ARRAY /dev/md8 level=raid6 num-devices=16 
UUID=38a06a50:ce3fc204:728edfb7:4f4cdd43

note this should be all one line.

mdadm -D /dev/md0 should get you the UUID.

Regards,

Richard

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: recovering from a controller failure
  2010-05-29 21:45       ` Kyler Laird
  2010-05-29 21:50         ` Richard
@ 2010-05-29 21:59         ` Richard
  1 sibling, 0 replies; 23+ messages in thread
From: Richard @ 2010-05-29 21:59 UTC (permalink / raw)
  To: Kyler Laird; +Cc: linux-raid

Kyler Laird wrote:
> On Sun, May 30, 2010 at 09:38:50AM +1200, Richard wrote:
> 
>> mdadm -A -f /dev/mdX

You will probably need to stop the array first, if it's not already
stopped, prior to doing this:

mdadm -S /dev/md0
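So the whole sequence, once the array can be identified (via mdadm.conf
or by listing the member devices explicitly), would be roughly:

	mdadm -S /dev/md0
	mdadm -A -f /dev/md0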

Regards,

Richard

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: recovering from a controller failure
  2010-05-29 21:50         ` Richard
@ 2010-05-30  0:15           ` Kyler Laird
  2010-05-30  0:28             ` Richard
  2010-05-30  3:33             ` Leslie Rhorer
  0 siblings, 2 replies; 23+ messages in thread
From: Kyler Laird @ 2010-05-30  0:15 UTC (permalink / raw)
  To: linux-raid

On Sun, May 30, 2010 at 09:50:26AM +1200, Richard wrote:

> How about adding entries to your mdadm.conf file containing the UUID
> of /dev/md0, eg:
> 
> ARRAY /dev/md8 level=raid6 num-devices=16
> UUID=38a06a50:ce3fc204:728edfb7:4f4cdd43
> 
> note this should be all one line.

I'll be happy to do that.

> mdadm -D /dev/md0 should get you the UUID.

	root@00144ff2a334:/# mdadm -D /dev/md0
	mdadm: md device /dev/md0 does not appear to be active.

So...how do I get the UUIDs?  I tried blkid and got this.
	http://lairds.us/temp/ucmeng_md/uuids
Those UUIDs are far from unique.

Thanks!

--kyler

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: recovering from a controller failure
  2010-05-30  0:15           ` Kyler Laird
@ 2010-05-30  0:28             ` Richard
  2010-05-30  0:54               ` Richard
  2010-05-30  3:33             ` Leslie Rhorer
  1 sibling, 1 reply; 23+ messages in thread
From: Richard @ 2010-05-30  0:28 UTC (permalink / raw)
  To: Linux RAID Mailing List

Kyler Laird wrote:

> So...how do I get the UUIDs?  I tried blkid and got this.
> 	http://lairds.us/temp/ucmeng_md/uuids
> Those UUIDs are far from unique.

How about mdadm --examine /dev/sdX

where sdX is a component of the failed array.  If the drive was partitioned
prior to being md'ed, you will need that partition, e.g. sda1.
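To pull the UUIDs off all the candidates in one go, something along
these lines should work (the glob is only an example):

	for d in /dev/sd[a-p]1; do
		echo "$d:"
		mdadm --examine "$d" | grep -i uuid
	done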

Regards,

Richard


^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: recovering from a controller failure
  2010-05-30  0:28             ` Richard
@ 2010-05-30  0:54               ` Richard
  0 siblings, 0 replies; 23+ messages in thread
From: Richard @ 2010-05-30  0:54 UTC (permalink / raw)
  To: Linux RAID Mailing List

Richard wrote:
> Kyler Laird wrote:
> 
>> So...how do I get the UUIDs?  I tried blkid and got this.
>>     http://lairds.us/temp/ucmeng_md/uuids
>> Those UUIDs are far from unique.

Sorry, I just checked the link showing the blkid output.

These are almost certainly correct, and they are only unique between arrays.

This is the whole point - all members of an array have the same UUID so 
that mdadm knows which devices are part of the same array.

UUIDs should always be part of an mdadm.conf so that there can be no 
ambiguity.
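mdadm can also generate those ARRAY lines for you from the superblocks,
e.g.:

	mdadm --examine --scan

and the output can be appended to mdadm.conf once you have checked it.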

Regards,

Richard

^ permalink raw reply	[flat|nested] 23+ messages in thread

* RE: recovering from a controller failure
  2010-05-30  0:15           ` Kyler Laird
  2010-05-30  0:28             ` Richard
@ 2010-05-30  3:33             ` Leslie Rhorer
  2010-05-30 13:17               ` CoolCold
  2010-05-30 18:55               ` Richard Scobie
  1 sibling, 2 replies; 23+ messages in thread
From: Leslie Rhorer @ 2010-05-30  3:33 UTC (permalink / raw)
  To: 'Kyler Laird', linux-raid

> On Sun, May 30, 2010 at 09:50:26AM +1200, Richard wrote:
> 
> > How about adding entries to your mdadm.conf file containing the UUID
> > of /dev/md0, eg:
> >
> > ARRAY /dev/md8 level=raid6 num-devices=16
> > UUID=38a06a50:ce3fc204:728edfb7:4f4cdd43
> >
> > note this should be all one line.
> 
> I'll be happy to do that.
> 
> > mdadm -D /dev/md0 should get you the UUID.
> 
> 	root@00144ff2a334:/# mdadm -D /dev/md0
> 	mdadm: md device /dev/md0 does not appear to be active.
> 
> So...how do I get the UUIDs?  I tried blkid and got this.
> 	http://lairds.us/temp/ucmeng_md/uuids
> Those UUIDs are far from unique.

	After all your drives are visible, of course:

`mdadm --examine /dev/sd* /dev/hd* > <filename>`
`more <filename>`

Make note of the array UUID for each drive.  When done,

`mdadm --assemble --assume-clean /dev/mdX /dev/<drive0> /dev/<drive1>
/dev/<drive2> ...etc`

where <drive0>, <drive1>, etc are all members of the same array UUID.

	Mount the file system, and fsck it.  Once everything is verified
good,

`echo repair > /sys/block/mdX/md/sync_action`
`mdadm --examine --scan >> /etc/mdadm/mdadm.conf`


^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: recovering from a controller failure
  2010-05-30  3:33             ` Leslie Rhorer
@ 2010-05-30 13:17               ` CoolCold
  2010-05-30 22:38                 ` Leslie Rhorer
  2010-05-30 18:55               ` Richard Scobie
  1 sibling, 1 reply; 23+ messages in thread
From: CoolCold @ 2010-05-30 13:17 UTC (permalink / raw)
  To: Leslie Rhorer; +Cc: Kyler Laird, linux-raid

On Sun, May 30, 2010 at 7:33 AM, Leslie Rhorer <lrhorer@satx.rr.com> wrote:
>> On Sun, May 30, 2010 at 09:50:26AM +1200, Richard wrote:
>>
>> > How about adding entries to your mdadm.conf file containing the UUID
>> > of /dev/md0, eg:
>> >
>> > ARRAY /dev/md8 level=raid6 num-devices=16
>> > UUID=38a06a50:ce3fc204:728edfb7:4f4cdd43
>> >
>> > note this should be all one line.
>>
>> I'll be happy to do that.
>>
>> > mdadm -D /dev/md0 should get you the UUID.
>>
>>       root@00144ff2a334:/# mdadm -D /dev/md0
>>       mdadm: md device /dev/md0 does not appear to be active.
>>
>> So...how do I get the UUIDs?  I tried blkid and got this.
>>       http://lairds.us/temp/ucmeng_md/uuids
>> Those UUIDs are far from unique.
>
>        After all your drives are visible, of course:
>
> `mdadm --examine /dev/sd* /dev/hd* > <filename>`
> `more <filename>`
>
> Make note of the array UUID for each drive.  When done,
>
> `mdadm --assemble --assume-clean /dev/mdX /dev/<drive0> /dev/<drive1>
> /dev/<drive2> ...etc`
>
> where <drive0>, <drive1>, etc are all members of the same array UUID.
>
>        Mount the file system, and fsck it.  Once everything is verified
> good,
>
> `echo repair > /sys/block/mdX/md/sync_action`
Taking into account that the "Events" fields differ on the disks from
the 1st and 2nd controllers, an interesting question for me is: what
will happen on this "repair"?
And what does this "Events" field really mean?  I didn't find a
description in the man pages.

> `mdadm --examine --scan >> /etc/mdadm/mdadm.conf`
>
>



-- 
Best regards,
[COOLCOLD-RIPN]

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: recovering from a controller failure
  2010-05-30  3:33             ` Leslie Rhorer
  2010-05-30 13:17               ` CoolCold
@ 2010-05-30 18:55               ` Richard Scobie
  2010-05-30 22:23                 ` Leslie Rhorer
  1 sibling, 1 reply; 23+ messages in thread
From: Richard Scobie @ 2010-05-30 18:55 UTC (permalink / raw)
  To: Leslie Rhorer; +Cc: 'Kyler Laird', linux-raid

Leslie Rhorer wrote:

> `mdadm --assemble --assume-clean /dev/mdX /dev/<drive0>  /dev/<drive1>
> /dev/<drive2>  ...etc`

--assume-clean is not an option for assemble; the --force option is 
required.

Regards,

Richard


^ permalink raw reply	[flat|nested] 23+ messages in thread

* RE: recovering from a controller failure
  2010-05-30 18:55               ` Richard Scobie
@ 2010-05-30 22:23                 ` Leslie Rhorer
  0 siblings, 0 replies; 23+ messages in thread
From: Leslie Rhorer @ 2010-05-30 22:23 UTC (permalink / raw)
  To: 'Richard Scobie'; +Cc: 'Kyler Laird', linux-raid

	Oops!  You're right.

> -----Original Message-----
> From: Richard Scobie [mailto:richard@sauce.co.nz]
> Sent: Sunday, May 30, 2010 1:55 PM
> To: Leslie Rhorer
> Cc: 'Kyler Laird'; linux-raid@vger.kernel.org
> Subject: Re: recovering from a controller failure
> 
> Leslie Rhorer wrote:
> 
> > `mdadm --assemble --assume-clean /dev/mdX /dev/<drive0>  /dev/<drive1>
> > /dev/<drive2>  ...etc`
> 
> --assume-clean is not an option for assemble, the --force option is
> required.
> 
> Regards,
> 
> Richard



^ permalink raw reply	[flat|nested] 23+ messages in thread

* RE: recovering from a controller failure
  2010-05-30 13:17               ` CoolCold
@ 2010-05-30 22:38                 ` Leslie Rhorer
  2010-05-31  8:33                   ` CoolCold
  0 siblings, 1 reply; 23+ messages in thread
From: Leslie Rhorer @ 2010-05-30 22:38 UTC (permalink / raw)
  To: linux-raid



> -----Original Message-----
> From: linux-raid-owner@vger.kernel.org [mailto:linux-raid-
> owner@vger.kernel.org] On Behalf Of CoolCold
> Sent: Sunday, May 30, 2010 8:18 AM
> To: Leslie Rhorer
> Cc: Kyler Laird; linux-raid@vger.kernel.org
> Subject: Re: recovering from a controller failure
> 
> On Sun, May 30, 2010 at 7:33 AM, Leslie Rhorer <lrhorer@satx.rr.com>
> wrote:
> >> On Sun, May 30, 2010 at 09:50:26AM +1200, Richard wrote:
> >>
> >> > How about adding entries to your mdadm.conf file containing the UUID
> >> > of /dev/md0, eg:
> >> >
> >> > ARRAY /dev/md8 level=raid6 num-devices=16
> >> > UUID=38a06a50:ce3fc204:728edfb7:4f4cdd43
> >> >
> >> > note this should be all one line.
> >>
> >> I'll be happy to do that.
> >>
> >> > mdadm -D /dev/md0 should get you the UUID.
> >>
> >>       root@00144ff2a334:/# mdadm -D /dev/md0
> >>       mdadm: md device /dev/md0 does not appear to be active.
> >>
> >> So...how do I get the UUIDs?  I tried blkid and got this.
> >>       http://lairds.us/temp/ucmeng_md/uuids
> >> Those UUIDs are far from unique.
> >
> >        After all your drives are visible, of course:
> >
> > `mdadm --examine /dev/sd* /dev/hd* > <filename>`
> > `more <filename>`
> >
> > Make note of the array UUID for each drive.  When done,
> >
> > `mdadm --assemble --assume-clean /dev/mdX /dev/<drive0> /dev/<drive1>
> > /dev/<drive2> ...etc`
> >
> > where <drive0>, <drive1>, etc are all members of the same array UUID.
> >
> >        Mount the file system, and fsck it.  Once everything is verified
> > good,
> >
> > `echo repair > /sys/block/mdX/md/sync_action`
> Taking into account that the "Events" fields differ on the disks from
> the 1st and 2nd controllers, an interesting question for me is: what
> will happen on this "repair"?

	Note that should be --force, not --assume-clean.  The --assume-clean
switch would be used if you re-created the array, not just re-assembled it.
Once the array is assembled, the repair function will re-establish the
redundancy within the array.  Any stripes whose data does not match the
calculated value required to produce the upper layer information are
re-written.
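	Putting it together, a rough sketch of the corrected sequence (the
device names, mount point and member list are only examples, and this
assumes the filesystem sits directly on the md device) would be:

`mdadm --examine /dev/sd[a-o]1 > /tmp/examine.txt`
`mdadm --assemble --force /dev/md0 /dev/sda1 /dev/sdb1 ... /dev/sdo1`
`fsck -n /dev/md0`
`mount /dev/md0 /mnt/recovered`
`echo repair > /sys/block/md0/md/sync_action`
`mdadm --examine --scan >> /etc/mdadm/mdadm.conf`

with the full member list spelled out in the assemble command, and a
read-only fsck pass before mounting anything.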

> And what does this "Events" field really mean?  I didn't find a
> description in the man pages.

	I believe a number of things.  For one thing, it is used to keep
track of which version of data resides in each drive, whenever an array
event is encountered.  The value of the events counter in the members of an
array should not be different by more than 1, or mdadm kicks the drive out
of the array.  I expect it may also be used during forced re-assembly and /
or during a resync of a RAID1 system to help determine which version of a
stripe is correct.
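	A quick way to compare the counters across the members is something
like:

`mdadm --examine /dev/sd[a-p]1 | egrep '^/dev/|Events'`

which prints each device name followed by its Events line (adjust the
glob to the devices in question).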


^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: recovering from a controller failure
  2010-05-30 22:38                 ` Leslie Rhorer
@ 2010-05-31  8:33                   ` CoolCold
  2010-05-31  8:50                     ` Leslie Rhorer
  0 siblings, 1 reply; 23+ messages in thread
From: CoolCold @ 2010-05-31  8:33 UTC (permalink / raw)
  To: Leslie Rhorer; +Cc: linux-raid

On Mon, May 31, 2010 at 2:38 AM, Leslie Rhorer <lrhorer@satx.rr.com> wrote:
>
>
>> -----Original Message-----
>> From: linux-raid-owner@vger.kernel.org [mailto:linux-raid-
>> owner@vger.kernel.org] On Behalf Of CoolCold
>> Sent: Sunday, May 30, 2010 8:18 AM
>> To: Leslie Rhorer
>> Cc: Kyler Laird; linux-raid@vger.kernel.org
>> Subject: Re: recovering from a controller failure
>>
>> On Sun, May 30, 2010 at 7:33 AM, Leslie Rhorer <lrhorer@satx.rr.com>
>> wrote:
>> >> On Sun, May 30, 2010 at 09:50:26AM +1200, Richard wrote:
>> >>
>> >> > How about adding entries to your mdadm.conf file containing the UUID
>> >> > of /dev/md0, eg:
>> >> >
>> >> > ARRAY /dev/md8 level=raid6 num-devices=16
>> >> > UUID=38a06a50:ce3fc204:728edfb7:4f4cdd43
>> >> >
>> >> > note this should be all one line.
>> >>
>> >> I'll be happy to do that.
>> >>
>> >> > mdadm -D /dev/md0 should get you the UUID.
>> >>
>> >>       root@00144ff2a334:/# mdadm -D /dev/md0
>> >>       mdadm: md device /dev/md0 does not appear to be active.
>> >>
>> >> So...how do I get the UUIDs?  I tried blkid and got this.
>> >>       http://lairds.us/temp/ucmeng_md/uuids
>> >> Those UUIDs are far from unique.
>> >
>> >        After all your drives are visible, of course:
>> >
>> > `mdadm --examine /dev/sd* /dev/hd* > <filename>`
>> > `more <filename>`
>> >
>> > Make note of the array UUID for each drive.  When done,
>> >
>> > `mdadm --assemble --assume-clean /dev/mdX /dev/<drive0> /dev/<drive1>
>> > /dev/<drive2> ...etc`
>> >
>> > where <drive0>, <drive1>, etc are all members of the same array UUID.
>> >
>> >        Mount the file system, and fsck it.  Once everything is verified
>> > good,
>> >
>> > `echo repair > /sys/block/mdX/md/sync_action`
>> Taking into account that the "Events" fields differ on the disks from
>> the 1st and 2nd controllers, an interesting question for me is: what
>> will happen on this "repair"?
>
>        Note that should be --force, not --assume-clean.  The --assume-clean
> switch would be used if you re-created the array, not just re-assembled it.
> Once the array is assembled, the repair function will re-establish the
> redundancy within the array.  Any stripes whose data does not match the
> calculated value required to produce the upper layer information are
> re-written.
That's it - as you can see, there are 15 drives in the RAID6 array.
Examining the disks from sda to sdh shows them active with an event
count of 0.159; sdi to sdp have an event count of 0.168 and show that
sd[a-i] are faulty.  So I'm guessing there is no way to know which part
of the array is "right", and I guess they are desynced.

>
>> And what does this "Events" field really mean?  I didn't find a
>> description in the man pages.
>
>        I believe a number of things.  For one thing, it is used to keep
> track of which version of data resides in each drive, whenever an array
> event is encountered.  The value of the events counter in the members of an
> array should not be different by more than 1, or mdadm kicks the drive out
> of the array.
I'd thought something similar, but interestingly, in this situation the
drives have event count values like "0.168" and "0.159"...
> I expect it may also be used during forced re-assembly and /
> or during a resync of a RAID1 system to help determine which version of a
> stripe is correct.
>
>



-- 
Best regards,
[COOLCOLD-RIPN]

^ permalink raw reply	[flat|nested] 23+ messages in thread

* RE: recovering from a controller failure
  2010-05-31  8:33                   ` CoolCold
@ 2010-05-31  8:50                     ` Leslie Rhorer
  0 siblings, 0 replies; 23+ messages in thread
From: Leslie Rhorer @ 2010-05-31  8:50 UTC (permalink / raw)
  To: 'CoolCold'; +Cc: linux-raid

> > Once the array is assembled, the repair function will re-establish the
> > redundancy within the array.  Any stripes whose data does not match the
> > calculated value required to produce the upper layer information are
> > re-written.
> That's it - as you can see there are 15 drives in raid6 array. Examine
> on disks from sda to sdh shows drives active and event count is 0.159,
> sdi to sdp events count is 0.168 and show that sd[a-i] are faulty. So
> I'm guessing there is no way to know which part of array is "right"
> and i guess they are desynced.

	I deleted the original e-mails while cleaning out my inbox a few
hours ago, so I can't look at your original response, but I've never seen
fractional event counts.  Some of mine are in the millions.

	In any case, if the corruption is bad enough, you may indeed lose
some data.  Remember, however, that unless this was a brand new array, or
the data on the array was undergoing a truly phenomenal amount of thrashing,
most of the data on the drives is probably consistent, or at least
consistent enough to allow recovery.  Some, however, possibly even a large
amount, may be toast.  That's one reason why you have backups.


^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: recovering from a controller failure
@ 2010-05-31 18:27 Kyler Laird
  2010-06-01 15:49 ` Kyler Laird
  0 siblings, 1 reply; 23+ messages in thread
From: Kyler Laird @ 2010-05-31 18:27 UTC (permalink / raw)
  To: linux-raid

I appreciate the help that everyone here has been providing with this
frustrating problem.  It looks like there's agreement that I need to
use "--force" to assemble the array with the disk devices specified. 
Here's my first cut at a command to try:
	mdadm --force --assemble /dev/md0 /dev/sda1 /dev/sdb1 /dev/sdc1 /dev/sdd1 /dev/sde1 /dev/sdf1 /dev/sdg1 /dev/sdh1 /dev/sdi1 /dev/sdj1 /dev/sdk1 /dev/sdl1 /dev/sdm1 /dev/sdn1 /dev/sdo1
	http://lairds.us/temp/ucmeng_md/suggested_recovery
I'm sure I'm missing something.  Corrections are welcome.

(It would be comforting if mdadm had a "--dry-run" option.)
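About the closest substitute I can think of is re-checking the
superblocks right before assembling, e.g.:

	mdadm --examine /dev/sd[a-o]1 | egrep '^/dev/|Events|State'

so the member list, states and event counts can at least be eyeballed
first.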

Thank you, all!

--kyler

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: recovering from a controller failure
  2010-05-31 18:27 Kyler Laird
@ 2010-06-01 15:49 ` Kyler Laird
  2010-06-01 19:15   ` Richard Scobie
  0 siblings, 1 reply; 23+ messages in thread
From: Kyler Laird @ 2010-06-01 15:49 UTC (permalink / raw)
  To: linux-raid

I decided to try simply using "--force" to assemble the array.  It seems
to have worked.
 	http://lairds.us/temp/ucmeng_md/suggested_recovery
As you can see, it didn't use /dev/sdah1, starting the RAID6 array with
one drive missing.

Can I safely --add this drive, or was there a reason it wasn't used?  I
also plan to add the spare (/dev/sdp1).
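(If it is safe, I assume the re-add would just be along the lines of

	mdadm /dev/md0 --add /dev/sdah1
	mdadm /dev/md0 --add /dev/sdp1

followed by watching the rebuild in /proc/mdstat - corrections welcome.)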

Thanks for all the help!

--kyler

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: recovering from a controller failure
  2010-06-01 15:49 ` Kyler Laird
@ 2010-06-01 19:15   ` Richard Scobie
  0 siblings, 0 replies; 23+ messages in thread
From: Richard Scobie @ 2010-06-01 19:15 UTC (permalink / raw)
  To: Kyler Laird; +Cc: linux-raid

Kyler Laird wrote:
> I decided to try simply using "--force" to assemble the array.  It seems
> to have worked.
>   	http://lairds.us/temp/ucmeng_md/suggested_recovery
> As you can see, it didn't use /dev/sdah1, starting the RAID6 array with
> one drive missing.
>
> I can safely --add this drive or was there a reason it wasn't used?  I
> also plan to add the spare (/dev/sdp1).

It would be prudent to remove /dev/sdah1 and use smartctl on a non-LSI 
SAS controller or another machine, to check that it has not failed.

If it has not failed, then prior to re-adding it, I would perform an fsck 
on the filesystem to make sure there are no errors.
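For the filesystem check, a read-only pass first is probably the safest,
e.g. (assuming the filesystem sits directly on the md device):

	fsck -n /dev/md0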

Regards,

Richard

^ permalink raw reply	[flat|nested] 23+ messages in thread

end of thread, other threads:[~2010-06-01 19:15 UTC | newest]

Thread overview: 23+ messages
-- links below jump to the message on this page --
2010-05-29 19:07 recovering from a controller failure Kyler Laird
2010-05-29 19:46 ` Berkey B Walker
2010-05-29 20:44   ` Kyler Laird
2010-05-29 21:18 ` Richard
2010-05-29 21:36   ` Kyler Laird
2010-05-29 21:38     ` Richard
2010-05-29 21:45       ` Kyler Laird
2010-05-29 21:50         ` Richard
2010-05-30  0:15           ` Kyler Laird
2010-05-30  0:28             ` Richard
2010-05-30  0:54               ` Richard
2010-05-30  3:33             ` Leslie Rhorer
2010-05-30 13:17               ` CoolCold
2010-05-30 22:38                 ` Leslie Rhorer
2010-05-31  8:33                   ` CoolCold
2010-05-31  8:50                     ` Leslie Rhorer
2010-05-30 18:55               ` Richard Scobie
2010-05-30 22:23                 ` Leslie Rhorer
2010-05-29 21:59         ` Richard
2010-05-29 21:43   ` Berkey B Walker
  -- strict thread matches above, loose matches on Subject: below --
2010-05-31 18:27 Kyler Laird
2010-06-01 15:49 ` Kyler Laird
2010-06-01 19:15   ` Richard Scobie
