linux-raid.vger.kernel.org archive mirror
* Need urgent help in fixing raid5 array
@ 2008-12-05 17:03 Mike Myers
  2008-12-06  0:18 ` Mike Myers
  0 siblings, 1 reply; 46+ messages in thread
From: Mike Myers @ 2008-12-05 17:03 UTC (permalink / raw)
  To: linux-raid

I have a problem repairing a raid5 array that I really need some help with.  I must be missing something here.

I have 2 raid5 arrays combined with LVM into a common logical volume, with XFS running on top of that.  Both arrays have 7 1 TB disks in them.  I moved a controller card around so that I could install a new Intel Gb Ethernet card in one of the PCI-E slots.  That went fine, except one of the SATA cables got knocked loose, so one of the disks in /dev/md2 went offline.  Linux booted fine, started md2 with 6 elements in it, and everything was fine with md2 in a degraded state.  I fixed the cable problem and hot-added that drive to the array, but since it was now out of sync, md began a rebuild.  No problem.

Around 60% through the resync, smartd started reporting problems with one of the other drives in the array.  Then that drive was ejected from the degraded array, causing the raid to stop and the LVM volume to go offline.  Ugh...

Ok, so it looks from the SMART data that that disk had been having a lot of problems and was failing.  As it happens, I had a new 1 TB disk arrive the same day, and I pressed it into service here.  I used sfdisk -d olddisk | sfdisk newdisk to copy the partition table from the old drive to the new one, and then used ddrescue to copy the data from the old partition (/dev/sdo1) to the new one (/dev/sdp1).  That worked pretty well; just 12 kB couldn't be recovered.
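
Spelled out as commands, the clone step was roughly this (the ddrescue
log file name is just illustrative):

  # copy the partition table from the failing disk to the new one
  sfdisk -d /dev/sdo | sfdisk /dev/sdp
  # clone the member partition; the log lets ddrescue resume and retry
  ddrescue -f /dev/sdo1 /dev/sdp1 sdo1-rescue.log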

So I remove the old disk, re-add the new disk, and attempt to start the array with the new (cloned) 1 TB disk in the old disk's stead.  Even though the UUIDs, magic numbers and events fields are all the same, md thinks the cloned disk is a spare, and doesn't start the array.  What am I missing here?  Why doesn't it view it as a member, just like the old disk, and start the array?
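
One quick way to compare what md recorded on the clone versus a known
good member (sdb1 here is illustrative; the interesting fields are the
Events counter and the role on the 'this' line):

  mdadm --examine /dev/sdp1 | egrep 'UUID|Events|^this'
  mdadm --examine /dev/sdb1 | egrep 'UUID|Events|^this'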

thx
mike

* Re: Need urgent help in fixing raid5 array
  2008-12-05 17:03 Mike Myers
@ 2008-12-06  0:18 ` Mike Myers
  2008-12-06  0:24   ` Justin Piszcz
  0 siblings, 1 reply; 46+ messages in thread
From: Mike Myers @ 2008-12-06  0:18 UTC (permalink / raw)
  To: Mike Myers, linux-raid

Anyone?  What am I missing here?

thx
mike

* Re: Need urgent help in fixing raid5 array
  2008-12-06  0:18 ` Mike Myers
@ 2008-12-06  0:24   ` Justin Piszcz
  2008-12-06  0:47     ` Mike Myers
  2008-12-06  0:52     ` David Lethe
  0 siblings, 2 replies; 46+ messages in thread
From: Justin Piszcz @ 2008-12-06  0:24 UTC (permalink / raw)
  To: Mike Myers; +Cc: linux-raid


You can try this as a last resort:
http://www.mail-archive.com/linux-raid@vger.kernel.org/msg07815.html

(mdadm with --create and --assume-clean), but only use this as a last
resort.  When I had two disk failures, I was able to see some of the data,
but ultimately it was lost.  Bottom line?  I don't use raid5 anymore, raid6
only; the 3ware docs recommend raid6 over raid5 if you use more than 4
disks and have the capability, and I agree.

Some others on the list may have other, less intrusive ideas, so only use
the above method as a LAST RESORT.  I was able to assemble the array, but I
had problems getting xfs_repair to fix the filesystem.
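
Roughly, the method in that link boils down to the sketch below; every
parameter (level, chunk, layout, and especially the device order) has to
be read back from mdadm --examine first.  The device names here are
placeholders and the chunk/layout values are only examples; get any of
them wrong and you scramble the array:

  # rewrite the superblocks in place without resyncing: DESTRUCTIVE
  # if level/chunk/layout/device-order differ from the original;
  # pass the keyword 'missing' for any slot with no usable disk
  mdadm --create /dev/md2 --assume-clean --level=5 --raid-devices=7 \
        --chunk=128 --layout=left-symmetric \
        /dev/sdX1 /dev/sdY1 /dev/sdZ1 ...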


* Re: Need urgent help in fixing raid5 array
  2008-12-06  0:24   ` Justin Piszcz
@ 2008-12-06  0:47     ` Mike Myers
  2008-12-06  0:51       ` Justin Piszcz
  2008-12-06  0:52     ` David Lethe
  1 sibling, 1 reply; 46+ messages in thread
From: Mike Myers @ 2008-12-06  0:47 UTC (permalink / raw)
  To: Justin Piszcz; +Cc: linux-raid

Thanks very much.  All the disks I am trying to assemble have the same event count and UUID, which is why I don't understand why it's not assembling.  I'll try the --assume-clean option and see if that helps.

It would be great to understand how md determines whether the drives are in sync with each other.  I thought the UUID and event count were all you needed...

thx
mike

* Re: Need urgent help in fixing raid5 array
  2008-12-06  0:47     ` Mike Myers
@ 2008-12-06  0:51       ` Justin Piszcz
  2008-12-06  0:58         ` Mike Myers
  2008-12-06 19:02         ` Mike Myers
  0 siblings, 2 replies; 46+ messages in thread
From: Justin Piszcz @ 2008-12-06  0:51 UTC (permalink / raw)
  To: Mike Myers; +Cc: linux-raid

Only use it as a LAST resort.  Check the mailing list archives, or wait
for Neil or someone else who's had a similar issue and can maybe help more
here.


* RE: Need urgent help in fixing raid5 array
  2008-12-06  0:24   ` Justin Piszcz
  2008-12-06  0:47     ` Mike Myers
@ 2008-12-06  0:52     ` David Lethe
  1 sibling, 0 replies; 46+ messages in thread
From: David Lethe @ 2008-12-06  0:52 UTC (permalink / raw)
  To: Justin Piszcz, Mike Myers; +Cc: linux-raid



Mike - Really strongly consider bringing in a pro, or at the very least,
buying some scratch disks so you can work with copies.  Forced assembly
doesn't look at the data if you are degraded, so it has high potential
to make things worse.  Too many things are going on here ... If the data
is backed up and you just want to save some time with an experiment,
then go forward with --assume-clean.

But it will likely destroy a large chunk of your data in the process,
and destroy it forever.
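
If you do buy scratch disks, the copies are just raw clones of each
member, something like this (device names are placeholders; sdx/sdy
stand for the scratch disks):

  # clone every member, then experiment only on the clones
  ddrescue -f /dev/sdb1 /dev/sdx1 sdb1.log
  ddrescue -f /dev/sdc1 /dev/sdy1 sdc1.log
  # ...and so on for each member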

David




* Re: Need urgent help in fixing raid5 array
  2008-12-06  0:51       ` Justin Piszcz
@ 2008-12-06  0:58         ` Mike Myers
  2008-12-06 19:02         ` Mike Myers
  1 sibling, 0 replies; 46+ messages in thread
From: Mike Myers @ 2008-12-06  0:58 UTC (permalink / raw)
  To: Justin Piszcz; +Cc: linux-raid

Ok, I'll wait some more before doing that and see if Neil or someone else pipes in.  I am really trying to avoid recreating the superblock structures, though it's pretty clear what the sequence of devices is from doing examines on them.

thx
mike

* Re: Need urgent help in fixing raid5 array
  2008-12-06  0:51       ` Justin Piszcz
  2008-12-06  0:58         ` Mike Myers
@ 2008-12-06 19:02         ` Mike Myers
  2008-12-06 19:30           ` Mike Myers
  1 sibling, 1 reply; 46+ messages in thread
From: Mike Myers @ 2008-12-06 19:02 UTC (permalink / raw)
  To: Justin Piszcz; +Cc: linux-raid

Ok, here is some more info on this odd problem.

I have seven 1 TB drives in a raid5 array: sdb1 sdc1 sdf1 sdh1 sdi1 sdj1 sdk1.

They all have the same UUID, and the events count is the same for each except sdk1, which I think was the disk being resynced.  As I understand it, while the array is being resynced, the events counter on the newly added drive differs from the others.  The other drives have the same events count and UUID, and all show clean.  When I try to assemble the array from the 7 drives, it tells me there are 5 drives and 1 spare, not enough to start the array.  If I remove sdk1 (the drive with the different event number on it), I get the exact same message.

By removing one drive at a time from the assemble command, I determined that md thinks sdh1 is the spare, even though its events count is the same, its UUID is the same, and the checksum says it's OK.  Why does it think this drive is a spare and not a data drive?

I had cloned the data drive that had failed and got almost everything copied over, all but 12 kB.  So I think it's fine and is not the problem.

How does md decide which drive is a spare and which is an active synced drive, etc.?  I can't seem to find a document that outlines all this.
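
(The one-at-a-time check I did amounts to reading the 'this' line of
each superblock, roughly like this, with the member names as above:

  # 'active sync' vs 'spare' on the 'this' line is the role md
  # recorded for that device in its own superblock
  for d in sdb1 sdc1 sdf1 sdh1 sdi1 sdj1 sdk1; do
      printf '%s: ' $d
      mdadm --examine /dev/$d | grep '^this'
  done
)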

Thx
mike

* Re: Need urgent help in fixing raid5 array
  2008-12-06 19:02         ` Mike Myers
@ 2008-12-06 19:30           ` Mike Myers
  2008-12-06 20:14             ` Mike Myers
  0 siblings, 1 reply; 46+ messages in thread
From: Mike Myers @ 2008-12-06 19:30 UTC (permalink / raw)
  To: Mike Myers, Justin Piszcz; +Cc: linux-raid

/dev/sdk1:
          Magic : a92b4efc
        Version : 00.90.00
           UUID : e70e0697:a10a5b75:941dd76f:196d9e4e
  Creation Time : Tue Aug 19 21:31:10 2008
     Raid Level : raid5
  Used Dev Size : 976759936 (931.51 GiB 1000.20 GB)
     Array Size : 5860559616 (5589.07 GiB 6001.21 GB)
   Raid Devices : 7
  Total Devices : 7
Preferred Minor : 2

    Update Time : Thu Dec  4 15:32:09 2008
          State : clean
 Active Devices : 6
Working Devices : 7
 Failed Devices : 0
  Spare Devices : 1
       Checksum : ab1934d5 - correct
         Events : 0.1436484

         Layout : left-symmetric
     Chunk Size : 128K

      Number   Major   Minor   RaidDevice State
this     6       8      161        6      active sync   /dev/sdk1

   0     0       0        0        0      removed
   1     1       8       81        1      active sync   /dev/sdf1
   2     2       8      145        2      active sync   /dev/sdj1
   3     3       8       17        3      active sync   /dev/sdb1
   4     4       8       33        4      active sync   /dev/sdc1
   5     5       8      129        5      active sync   /dev/sdi1
   6     6       8      161        6      active sync   /dev/sdk1
   7     7       8      113        7      spare   /dev/sdh1


Here is the examine from sdh1 (which I thought was the disk being replaced, but which now appears to be the spare):

/dev/sdh1:
          Magic : a92b4efc
        Version : 00.90.00
           UUID : e70e0697:a10a5b75:941dd76f:196d9e4e
  Creation Time : Tue Aug 19 21:31:10 2008
     Raid Level : raid5
  Used Dev Size : 976759936 (931.51 GiB 1000.20 GB)
     Array Size : 5860559616 (5589.07 GiB 6001.21 GB)
   Raid Devices : 7
  Total Devices : 8
Preferred Minor : 2

    Update Time : Fri Dec  5 08:15:16 2008
          State : clean
 Active Devices : 5
Working Devices : 7
 Failed Devices : 1
  Spare Devices : 2
       Checksum : ab1a2d37 - correct
         Events : 0.1438064

         Layout : left-symmetric
     Chunk Size : 128K

      Number   Major   Minor   RaidDevice State
this     8       8      113        8      spare   /dev/sdh1

   0     0       0        0        0      removed
   1     1       8       81        1      active sync   /dev/sdf1
   2     2       8      145        2      active sync   /dev/sdj1
   3     3       8       17        3      active sync   /dev/sdb1
   4     4       8       33        4      active sync   /dev/sdc1
   5     5       8      129        5      active sync   /dev/sdi1
   6     6       0        0        6      faulty removed
   7     7       8      241        7      spare   /dev/sdp1
   8     8       8      113        8      spare   /dev/sdh1


And here is the output of the examine of a known good member sdb1:


/dev/sdb1:
          Magic : a92b4efc
        Version : 00.90.00
           UUID : e70e0697:a10a5b75:941dd76f:196d9e4e
  Creation Time : Tue Aug 19 21:31:10 2008
     Raid Level : raid5
  Used Dev Size : 976759936 (931.51 GiB 1000.20 GB)
     Array Size : 5860559616 (5589.07 GiB 6001.21 GB)
   Raid Devices : 7
  Total Devices : 8
Preferred Minor : 2

    Update Time : Fri Dec  5 08:15:16 2008
          State : clean
 Active Devices : 5
Working Devices : 7
 Failed Devices : 1
  Spare Devices : 2
       Checksum : ab1a2cd3 - correct
         Events : 0.1438064

         Layout : left-symmetric
     Chunk Size : 128K

      Number   Major   Minor   RaidDevice State
this     3       8       17        3      active sync   /dev/sdb1

   0     0       0        0        0      removed
   1     1       8       81        1      active sync   /dev/sdf1
   2     2       8      145        2      active sync   /dev/sdj1
   3     3       8       17        3      active sync   /dev/sdb1
   4     4       8       33        4      active sync   /dev/sdc1
   5     5       8      129        5      active sync   /dev/sdi1
   6     6       0        0        6      faulty removed
   7     7       8      241        7      spare   /dev/sdp1
   8     8       8      113        8      spare   /dev/sdh1


Any more ideas as to what's going on?

Thanks,
Mike

* Re: Need urgent help in fixing raid5 array
  2008-12-06 19:30           ` Mike Myers
@ 2008-12-06 20:14             ` Mike Myers
  0 siblings, 0 replies; 46+ messages in thread
From: Mike Myers @ 2008-12-06 20:14 UTC (permalink / raw)
  To: Mike Myers, Justin Piszcz; +Cc: linux-raid

Ok, I seem to have recovered...  Once I realized that even though the event count on sdk1 was slightly different from the rest, I could confirm it was the new drive holding the cloned data from the original failing drive, I did an assemble with the --force option, and the array came up just fine.  I rebooted for good measure; LVM and XFS came up fine on boot, and all the files are there and perfectly accessible.  There was about 12 kB of data that couldn't be recovered, but since this storage volume holds mostly large TV recordings, I think it will be OK.  It would have been very hard to track down which file those sectors were in, in any case.

I then added the spare again, and the system is now rebuilding just fine (should be done in about 5 hours)...
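
For the archives, the recovery boiled down to roughly this (member
names per the examines I posted earlier; I'm showing sdh1 as the
re-added spare for illustration):

  mdadm --stop /dev/md2
  # --force accepts members whose event counts disagree slightly
  mdadm --assemble --force /dev/md2 \
        /dev/sdb1 /dev/sdc1 /dev/sdf1 /dev/sdi1 /dev/sdj1 /dev/sdk1
  mdadm /dev/md2 --add /dev/sdh1   # then let the rebuild run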

Thanks for all the advice everyone.

Thx
mike

* Re: Need urgent help in fixing raid5 array
@ 2009-01-01 15:31 Mike Myers
  0 siblings, 0 replies; 46+ messages in thread
From: Mike Myers @ 2009-01-01 15:31 UTC (permalink / raw)
  To: linux-raid

Well, thanks for all your help last month.  As I posted, things came
back up and I survived the failure.  Now I have yet another problem.
:(  After 5 years of running a Linux server as a dedicated NAS, I am
hitting some very weird problems.  This server started as a single-
processor AMD system with 4 320 GB drives and has been upgraded
multiple times, so that it is now a quad-core Intel rack-mounted 4U
system with 14 1 TB drives, and I have never lost data in any of the
upgrades of CPU, motherboard, disk controller hardware, or disk
drives.  Now, after last month's near-death experience, I am faced with
another serious problem in less than a month.  Any help you guys could
give me would be most appreciated.  This is a sucky way to start the
new year.

The array I had problems with last month (md2, comprised of 7 1 TB drives
in a RAID5 config) is running just fine.
md1, which is built of 7 1 TB Hitachi 7K1000 drives, is now having
problems.  We returned from a 10-day family visit with everything
running just fine.  There was a brief power outage today, about 3 minutes,
but I can't see how that could be related, as the server is on a high-
quality rack-mount 3U APC UPS that handled the outage just fine.  I was
working on the system getting X to work again after an nvidia driver
update, and when that was working fine, I checked the disks to discover
that md1 was in a degraded state, with /dev/sdl1 kicked out of the
array (removed).  I tried to do a dd from the drive to verify its
location in the rack, but I got an I/O error.  This was most odd, so I
went to the rack and pulled the disk and reinserted it.  No system
log entries recorded the device being pulled or re-installed.  So I am
thinking that a cable has somehow come loose.  I power the system
down, pull it out of the rack, and look at the cable that goes to the
drive; everything looks fine.

So I reboot the system, and now the array won't come online because,
in addition to the drive that shows as (removed), one of the other
drives shows as a faulty spare.
Well, learning from the last go-around, I reassemble the array with the
--force option, and the array comes back up.  But LVM won't come back
up because it sees the physical volume that maps to md1 as missing.
Now I am very concerned.  After trying a bunch of things, I do a
pvcreate with the missing UUID on md1, restart the VG, and the logical
volume comes back up.  I was thinking I may have told LVM to use an
array of bad data, but to my surprise, I mounted the filesystem and
everything looked intact!  Ok, sometimes you win.  So I do one more
reboot to get the system back up in multiuser so I can back up some of
the more important media stored on the volume (it's got about 10 TB
used, but most of that is PVR recordings; there is a lot of ripped
music and DVDs that I really don't want to re-rip) on another server
that has some space on it while I figure out what has been happening.
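
For anyone repeating the pvcreate step, recreating a lost PV label is
normally driven from the automatic LVM metadata backup, roughly like
this (the UUID and VG name are placeholders; pvcreate here only
rewrites the label, not the data):

  pvcreate --uuid '<old-PV-UUID>' \
           --restorefile /etc/lvm/backup/<vgname> /dev/md1
  vgcfgrestore <vgname>
  vgchange -ay <vgname>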

The reboot again fails because of a problem with md1.  This time, another
one of the drives shows as removed (/dev/sdm1), and I can't reassemble
the array with the --force option.  It is acting like /dev/sdl1 (the
other removed unit): even though I can read from the drives fine and
their UUID is fine, etc., md does not consider them part of the
array.  /dev/sdo1 (which was the drive that looked like a faulty spare)
seems OK when trying to do the assemble.  sdm1 seemed just fine before
the reboot and was showing no problems.  The two are not hooked up
on the same controller cable (a SAS-to-SATA fanout), and the LSI MPT
controller card seems to talk to the other disks just fine.

Anyway, I have no idea what's going on.  When I try to add sdm1 or sdl1
back into the array, md complains the device is busy, which is very odd
because it's not part of another array or doing anything else in the
system.
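
The checks I know to try for the 'device is busy' complaint (a sketch;
the usual culprit is a stale, half-assembled md instance still claiming
the disk):

  cat /proc/mdstat           # any inactive array still holding sdl1/sdm1?
  mdadm --stop /dev/md1      # stop a stale instance before re-assembling
  dmsetup table              # device-mapper mappings sitting on the disks?
  fuser -v /dev/sdl1         # processes holding the device open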

Any idea as to what could be happening here?  I am beyond frustrated.

thanks,
Mike

* Re: Need urgent help in fixing raid5 array
       [not found] <451872.61166.qm@web30802.mail.mud.yahoo.com>
@ 2009-01-01 15:40 ` Justin Piszcz
  2009-01-01 17:51   ` Mike Myers
  0 siblings, 1 reply; 46+ messages in thread
From: Justin Piszcz @ 2009-01-01 15:40 UTC (permalink / raw)
  To: Mike Myers; +Cc: linux-raid, john lists



On Thu, 1 Jan 2009, Mike Myers wrote:


If you are using a hotswap chassis, then it has some sort of SATA
backplane.  I have seen backplanes go bad in the past; that would be
my first replacement.

Justin.

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: Need urgent help in fixing raid5 array
  2009-01-01 15:40 ` Need urgent help in fixing raid5 array Justin Piszcz
@ 2009-01-01 17:51   ` Mike Myers
  2009-01-01 18:29     ` Justin Piszcz
  0 siblings, 1 reply; 46+ messages in thread
From: Mike Myers @ 2009-01-01 17:51 UTC (permalink / raw)
  To: Justin Piszcz; +Cc: linux-raid, john lists

The disks that are problematic are still online as far as the OS can tell.  I can do a dd from them and pull off data at the normal speeds, so if the backplane were bad I don't see how I could still read them at full speed.  I can try moving them to another slot, however (I have a 20-slot SATA backplane in there), and see if that changes how md deals with it.

The OS sees the drive, it initializes fine, but md shows it as removed and won't let me add it back to the array because the "device is busy".  I guess I don't understand the criteria md uses to accept a drive.  The UUID looks fine, and if the event count is off, the -f flag should take care of that.  I've never seen a "device busy" failure on an add before.
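
(What I'd compare between a kicked drive and a known-good member --
sdd1 below is just a stand-in for one of the good drives:

  mdadm --examine /dev/sdl1 | egrep 'UUID|Events|State'
  mdadm --examine /dev/sdd1 | egrep 'UUID|Events|State'

My understanding is that if the array UUIDs match and only the event
counts differ, --force on assemble should pull a drive back in.)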

thx
mike




----- Original Message ----
From: Justin Piszcz <jpiszcz@lucidpixels.com>
To: Mike Myers <mikesm559@yahoo.com>
Cc: linux-raid@vger.kernel.org; john lists <john4lists@gmail.com>
Sent: Thursday, January 1, 2009 7:40:21 AM
Subject: Re: Need urgent help in fixing raid5 array



On Thu, 1 Jan 2009, Mike Myers wrote:

> [...]

If you are using a hotswap chassis, then it has some sort of SATA
backplane.  I have seen backplanes go bad in the past; that would be
my first replacement.

Justin.



      

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: Need urgent help in fixing raid5 array
  2009-01-01 17:51   ` Mike Myers
@ 2009-01-01 18:29     ` Justin Piszcz
  2009-01-01 18:40       ` Jon Nelson
  2009-01-02  6:19       ` Mike Myers
  0 siblings, 2 replies; 46+ messages in thread
From: Justin Piszcz @ 2009-01-01 18:29 UTC (permalink / raw)
  To: Mike Myers; +Cc: linux-raid, john lists

I think some output would be pertinent here:

mdadm -D /dev/md0..1..2 etc

cat /proc/mdstat

dmesg/syslog of the errors you are seeing etc
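
Concretely, something like this -- array and device names assumed from
your description, adjust as needed:

for md in /dev/md1 /dev/md2; do mdadm -D $md; done
cat /proc/mdstat
dmesg | tail -100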



On Thu, 1 Jan 2009, Mike Myers wrote:

> The disks that are problematic are still online as far as the OS can tell.  I can do a dd from them and pull off data at the normal speeds, so if the backplane were bad I don't see how I could still read them at full speed.  I can try moving them to another slot, however (I have a 20-slot SATA backplane in there), and see if that changes how md deals with it.
>
> The OS sees the drive, it initializes fine, but md shows it as removed and won't let me add it back to the array because the "device is busy".  I guess I don't understand the criteria md uses to accept a drive.  The UUID looks fine, and if the event count is off, the -f flag should take care of that.  I've never seen a "device busy" failure on an add before.
>
> thx
> mike
>
> [...]

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: Need urgent help in fixing raid5 array
  2009-01-01 18:29     ` Justin Piszcz
@ 2009-01-01 18:40       ` Jon Nelson
  2009-01-01 20:38         ` Mike Myers
  2009-01-02  6:19       ` Mike Myers
  1 sibling, 1 reply; 46+ messages in thread
From: Jon Nelson @ 2009-01-01 18:40 UTC (permalink / raw)
  To: Justin Piszcz; +Cc: Mike Myers, linux-raid, john lists

Also the contents of /etc/mdadm.conf


On Thu, Jan 1, 2009 at 12:29 PM, Justin Piszcz <jpiszcz@lucidpixels.com> wrote:
> I think some output would be pertinent here:
>
> mdadm -D /dev/md0..1..2 etc
>
> cat /proc/mdstat
>
> dmesg/syslog of the errors you are seeing etc
>
> [...]



-- 
Jon

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: Need urgent help in fixing raid5 array
  2009-01-01 18:40       ` Jon Nelson
@ 2009-01-01 20:38         ` Mike Myers
  0 siblings, 0 replies; 46+ messages in thread
From: Mike Myers @ 2009-01-01 20:38 UTC (permalink / raw)
  To: Jon Nelson, Justin Piszcz; +Cc: linux-raid, john lists

Thanks for the help.  After a night of being powered off, it now appears that one of the 8-port MPT cards won't initialize properly or be seen by the BIOS (and consequently by Linux).  The funny thing is that this card didn't connect any of the troublesome drives, but rather the drives in md2, which were working just fine.  It looks like a hardware failure, since swapping slots doesn't fix it, and if I swap the 8087 connectors, whatever drives are on the "good" card are detected and come up in Linux.

I have an old Marvell-based 4-port SATA controller that I will dig out and see if I can't get running, and if I move the ports around I should be able to get all the drives visible again.

I must have sinned grievously somehow to suffer this problem on the first day of the year.  And I don't even work for a Wall Street firm.

More later.

thx
mike






----- Original Message ----
From: Jon Nelson <jnelson-linux-raid@jamponi.net>
To: Justin Piszcz <jpiszcz@lucidpixels.com>
Cc: Mike Myers <mikesm559@yahoo.com>; linux-raid@vger.kernel.org; john lists <john4lists@gmail.com>
Sent: Thursday, January 1, 2009 10:40:18 AM
Subject: Re: Need urgent help in fixing raid5 array

Also the contents of /etc/mdadm.conf


On Thu, Jan 1, 2009 at 12:29 PM, Justin Piszcz <jpiszcz@lucidpixels.com> wrote:
> I think some output would be pertinent here:
>
> mdadm -D /dev/md0..1..2 etc
>
> cat /proc/mdstat
>
> dmesg/syslog of the errors you are seeing etc
>
> [...]



-- 
Jon



      

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: Need urgent help in fixing raid5 array
  2009-01-01 18:29     ` Justin Piszcz
  2009-01-01 18:40       ` Jon Nelson
@ 2009-01-02  6:19       ` Mike Myers
  2009-01-02 12:10         ` Justin Piszcz
  2009-01-05 22:11         ` Neil Brown
  1 sibling, 2 replies; 46+ messages in thread
From: Mike Myers @ 2009-01-02  6:19 UTC (permalink / raw)
  To: Justin Piszcz; +Cc: linux-raid, john lists

Ok, the bad MPT board is out, replaced by an SI3132, and I rejiggered the drives around so that all the drives are connected.  That brought me back to the main problem: md2 is running fine, but md1 won't assemble because only 5 of the 7 drives are being accepted.

Here is the data you requested:

(none):~ # cat /etc/mdadm.conf
DEVICE partitions
ARRAY /dev/md0 level=raid0 UUID=9412e7e1:fd56806c:0f9cc200:95c7ed98
ARRAY /dev/md3 level=raid0 UUID=67999c69:4a9ca9f9:7d4d6b81:91c98b1f
ARRAY /dev/md1 level=raid5 UUID=b737af5c:7c0a70a9:99a648a0:7f693c7d
ARRAY /dev/md2 level=raid5 UUID=e70e0697:a10a5b75:941dd76f:196d9e4e
#ARRAY /dev/md2 level=raid0 UUID=658369ee:23081b79:c990e3a2:15f38c70
#ARRAY /dev/md3 level=raid0 UUID=e2c910ae:0052c38e:a5e19298:0d057e34
MAILADDR root

(md0 and md3 are old arrays that have since been removed - no disks with their UUIDs are in the system)
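
(Side note: I should probably drop those stale ARRAY lines, or
regenerate the file from what is actually on the disks and merge it by
hand -- a sketch, not something I have run yet:

  mdadm --examine --scan > /tmp/mdadm.conf.new
  # review, add the DEVICE and MAILADDR lines back, then replace
  # /etc/mdadm.conf with the result
)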

(none):~> mdadm -D /dev/md1
mdadm: md device /dev/md1 does not appear to be active.


(none):~> mdadm -D /dev/md2
/dev/md2:
        Version : 00.90.03
  Creation Time : Tue Aug 19 21:31:10 2008
     Raid Level : raid5
     Array Size : 5860559616 (5589.07 GiB 6001.21 GB)
  Used Dev Size : 976759936 (931.51 GiB 1000.20 GB)
   Raid Devices : 7
  Total Devices : 7
Preferred Minor : 2
    Persistence : Superblock is persistent

    Update Time : Thu Jan  1 21:59:20 2009
          State : clean
 Active Devices : 7
Working Devices : 7
 Failed Devices : 0
  Spare Devices : 0

         Layout : left-symmetric
     Chunk Size : 128K

           UUID : e70e0697:a10a5b75:941dd76f:196d9e4e
         Events : 0.1438838

    Number   Major   Minor   RaidDevice State
       0       8      209        0      active sync   /dev/sdn1
       1       8      129        1      active sync   /dev/sdi1
       2       8      177        2      active sync   /dev/sdl1
       3       8       17        3      active sync   /dev/sdb1
       4       8       33        4      active sync   /dev/sdc1
       5       8       65        5      active sync   /dev/sde1
       6       8      193        6      active sync   /dev/sdm1


(md1 is composed of sdd1 sdf1 sdg1 sdh1 sdj1 sdk1 sdo1)

(none):~> mdadm --examine /dev/sdd1 /dev/sdf1 /dev/sdg1 /dev/sdh1
/dev/sdd1:
          Magic : a92b4efc
        Version : 1.0
    Feature Map : 0x1
     Array UUID : b737af5c:7c0a70a9:99a648a0:7f693c7d
           Name : 1
  Creation Time : Fri Nov 23 12:15:39 2007
     Raid Level : raid5
   Raid Devices : 7

 Avail Dev Size : 1953519728 (931.51 GiB 1000.20 GB)
     Array Size : 11721117696 (5589.06 GiB 6001.21 GB)
  Used Dev Size : 1953519616 (931.51 GiB 1000.20 GB)
   Super Offset : 1953519984 sectors
          State : clean
    Device UUID : 8ea6369b:cfd1c103:845a1a65:d8b1f254

Internal Bitmap : -234 sectors from superblock
    Update Time : Wed Dec 31 22:43:01 2008
       Checksum : ce94ad09 - correct
         Events : 2295122

         Layout : left-symmetric
     Chunk Size : 128K

    Array Slot : 7 (0, failed, failed, failed, 3, 4, failed, 5, 6)
   Array State : u__uuUu 4 failed
/dev/sdf1:
          Magic : a92b4efc
        Version : 1.0
    Feature Map : 0x1
     Array UUID : b737af5c:7c0a70a9:99a648a0:7f693c7d
           Name : 1
  Creation Time : Fri Nov 23 12:15:39 2007
     Raid Level : raid5
   Raid Devices : 7

 Avail Dev Size : 1953519728 (931.51 GiB 1000.20 GB)
     Array Size : 11721117696 (5589.06 GiB 6001.21 GB)
  Used Dev Size : 1953519616 (931.51 GiB 1000.20 GB)
   Super Offset : 1953519984 sectors
          State : clean
    Device UUID : 50c2e80e:e36efc92:5ddac3b0:4d847236

Internal Bitmap : -234 sectors from superblock
    Update Time : Wed Dec 31 22:43:01 2008
       Checksum : feaab82b - correct
         Events : 2295122

         Layout : left-symmetric
     Chunk Size : 128K

    Array Slot : 5 (0, failed, failed, failed, 3, 4, failed, 5, 6)
   Array State : u__uUuu 4 failed
/dev/sdg1:
          Magic : a92b4efc
        Version : 1.0
    Feature Map : 0x1
     Array UUID : b737af5c:7c0a70a9:99a648a0:7f693c7d
           Name : 1
  Creation Time : Fri Nov 23 12:15:39 2007
     Raid Level : raid5
   Raid Devices : 7

 Avail Dev Size : 1953519728 (931.51 GiB 1000.20 GB)
     Array Size : 11721117696 (5589.06 GiB 6001.21 GB)
  Used Dev Size : 1953519616 (931.51 GiB 1000.20 GB)
   Super Offset : 1953519984 sectors
          State : clean
    Device UUID : c9809a0c:bd4eabbe:c110a056:0cdd3691

Internal Bitmap : -234 sectors from superblock
    Update Time : Fri Jan  2 17:30:13 2009
       Checksum : 28b13f46 - correct
         Events : 2295116

         Layout : left-symmetric
     Chunk Size : 128K

    Array Slot : 0 (0, 1, failed, failed, 3, 4, failed, 5, 6)
   Array State : Uu_uuuu 3 failed
/dev/sdh1:
          Magic : a92b4efc
        Version : 1.0
    Feature Map : 0x1
     Array UUID : b737af5c:7c0a70a9:99a648a0:7f693c7d
           Name : 1
  Creation Time : Fri Nov 23 12:15:39 2007
     Raid Level : raid5
   Raid Devices : 7

 Avail Dev Size : 1953519728 (931.51 GiB 1000.20 GB)
     Array Size : 11721117696 (5589.06 GiB 6001.21 GB)
  Used Dev Size : 1953519616 (931.51 GiB 1000.20 GB)
   Super Offset : 1953519984 sectors
          State : clean
    Device UUID : c9809a0c:bd4eabbe:c110a056:0cdd3691

Internal Bitmap : -234 sectors from superblock
    Update Time : Wed Dec 31 22:43:01 2008
       Checksum : 28abe59d - correct
         Events : 2295122

         Layout : left-symmetric
     Chunk Size : 128K

    Array Slot : 0 (0, failed, failed, failed, 3, 4, failed, 5, 6)
   Array State : U__uuuu 4 failed



(none):~> mdadm --examine /dev/sdj1 /dev/sdk1 /dev/sdo1
/dev/sdj1:
          Magic : a92b4efc
        Version : 1.0
    Feature Map : 0x1
     Array UUID : b737af5c:7c0a70a9:99a648a0:7f693c7d
           Name : 1
  Creation Time : Fri Nov 23 12:15:39 2007
     Raid Level : raid5
   Raid Devices : 7

 Avail Dev Size : 1953519728 (931.51 GiB 1000.20 GB)
     Array Size : 11721117696 (5589.06 GiB 6001.21 GB)
  Used Dev Size : 1953519616 (931.51 GiB 1000.20 GB)
   Super Offset : 1953519984 sectors
          State : clean
    Device UUID : c61e1d1a:b123f01a:4098ab5e:e8932eb6

Internal Bitmap : -234 sectors from superblock
    Update Time : Wed Dec 31 22:43:01 2008
       Checksum : bf7696f0 - correct
         Events : 2295122

         Layout : left-symmetric
     Chunk Size : 128K

    Array Slot : 8 (0, failed, failed, failed, 3, 4, failed, 5, 6)
   Array State : u__uuuU 4 failed
/dev/sdk1:
          Magic : a92b4efc
        Version : 1.0
    Feature Map : 0x1
     Array UUID : b737af5c:7c0a70a9:99a648a0:7f693c7d
           Name : 1
  Creation Time : Fri Nov 23 12:15:39 2007
     Raid Level : raid5
   Raid Devices : 7

 Avail Dev Size : 1953519728 (931.51 GiB 1000.20 GB)
     Array Size : 11721117696 (5589.06 GiB 6001.21 GB)
  Used Dev Size : 1953519616 (931.51 GiB 1000.20 GB)
   Super Offset : 1953519984 sectors
          State : clean
    Device UUID : f1417b9d:64d9c93d:c32d16e8:470ab7af

Internal Bitmap : -234 sectors from superblock
    Update Time : Wed Dec 31 22:43:01 2008
       Checksum : e8a17bad - correct
         Events : 2295122

         Layout : left-symmetric
     Chunk Size : 128K

    Array Slot : 4 (0, failed, failed, failed, 3, 4, failed, 5, 6)
   Array State : u__Uuuu 4 failed
/dev/sdo1:
          Magic : a92b4efc
        Version : 1.0
    Feature Map : 0x1
     Array UUID : b737af5c:7c0a70a9:99a648a0:7f693c7d
           Name : 1
  Creation Time : Fri Nov 23 12:15:39 2007
     Raid Level : raid5
   Raid Devices : 7

 Avail Dev Size : 1953519728 (931.51 GiB 1000.20 GB)
     Array Size : 11721117696 (5589.06 GiB 6001.21 GB)
  Used Dev Size : 1953519616 (931.51 GiB 1000.20 GB)
   Super Offset : 1953519984 sectors
          State : clean
    Device UUID : c9809a0c:bd4eabbe:c110a056:0cdd3691

Internal Bitmap : -234 sectors from superblock
    Update Time : Fri Jan  2 17:17:40 2009
       Checksum : 28b13bcd - correct
         Events : 2294980

         Layout : left-symmetric
     Chunk Size : 128K

    Array Slot : 0 (0, 1, failed, failed, 3, 4, failed, 5, 6)
   Array State : Uu_uuuu 3 failed


(none):~> cat /proc/mdstat
Personalities : [raid6] [raid5] [raid4]
md2 : active raid5 sdn1[0] sdm1[6] sde1[5] sdc1[4] sdb1[3] sdl1[2] sdi1[1]
      5860559616 blocks level 5, 128k chunk, algorithm 2 [7/7] [UUUUUUU]

md1 : inactive sdh1[0](S) sdj1[8](S) sdd1[7](S) sdf1[5](S) sdk1[4](S)
      4883799040 blocks super 1.0

unused devices: <none>
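
(One thing I notice above: md1 is sitting there inactive but still
claiming its member devices as spares, which I'm guessing is the same
mechanism behind the earlier "device is busy" errors on --add.  Cheap
to test before the next assemble attempt, in any case:

  mdadm --stop /dev/md1
  mdadm --assemble /dev/md1 --force /dev/sd[dfghjko]1
)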


I'm not seeing any errors on boot - all the drives come up now.  It's just that md can't put md1 back together again.  Once that's fixed, I can try LVM and see if I can get the filesystem online.
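
(If md1 does come up, the LVM side should just be the usual sequence,
assuming the vg name survived the earlier pvcreate -- names are
placeholders:

  vgscan
  vgchange -ay <vgname>
  mount /dev/<vgname>/<lvname> /mnt
)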

Anything else that would be helpful?

I am happy to attach the whole bootup log, but it's a little long...

thanks VERY much!

Mike




 



----- Original Message ----
From: Justin Piszcz <jpiszcz@lucidpixels.com>
To: Mike Myers <mikesm559@yahoo.com>
Cc: linux-raid@vger.kernel.org; john lists <john4lists@gmail.com>
Sent: Thursday, January 1, 2009 10:29:15 AM
Subject: Re: Need urgent help in fixing raid5 array

I think some output would be pertinent here:

mdadm -D /dev/md0..1..2 etc

cat /proc/mdstat

dmesg/syslog of the errors you are seeing etc



[...]



      

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: Need urgent help in fixing raid5 array
  2009-01-02  6:19       ` Mike Myers
@ 2009-01-02 12:10         ` Justin Piszcz
  2009-01-02 18:12           ` Mike Myers
  2009-01-05 22:11         ` Neil Brown
  1 sibling, 1 reply; 46+ messages in thread
From: Justin Piszcz @ 2009-01-02 12:10 UTC (permalink / raw)
  To: Mike Myers; +Cc: linux-raid, john lists



On Thu, 1 Jan 2009, Mike Myers wrote:

> Ok, the bad MPT board is out, replaced by an SI3132, and I rejiggered the drives around so that all the drives are connected.  That brought me back to the main problem: md2 is running fine, but md1 won't assemble because only 5 of the 7 drives are being accepted.
>
> [...]

What happens if you assemble with --force using the five good drives
plus one or the other of the drives that are not assembling (that is,
assemble in degraded mode)?

For the two disks that have 'failed', can you show their SMART stats?
I am curious to see them.

Worst case, which I do not recommend unless it is your last resort, is
to re-create the array with --assume-clean using exactly the same
options and device order you used originally; get anything wrong and
you will cause filesystem corruption.
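
A sketch of what that re-create would look like, with the parameters
taken from your --examine output (metadata 1.0, 128K chunk,
left-symmetric, internal bitmap); the seven member names below are
placeholders, and they MUST be listed in the original slot order:

mdadm --create /dev/md1 --assume-clean --metadata=1.0 \
      --level=5 --raid-devices=7 --chunk=128 \
      --layout=left-symmetric --bitmap=internal \
      /dev/sdA1 /dev/sdB1 /dev/sdC1 /dev/sdD1 \
      /dev/sdE1 /dev/sdF1 /dev/sdG1

Again: absolute last resort.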

I recommend you switch to RAID-6 with an array that big btw :)

Justin.


^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: Need urgent help in fixing raid5 array
  2009-01-02 12:10         ` Justin Piszcz
@ 2009-01-02 18:12           ` Mike Myers
  2009-01-02 18:22             ` Justin Piszcz
  0 siblings, 1 reply; 46+ messages in thread
From: Mike Myers @ 2009-01-02 18:12 UTC (permalink / raw)
  To: Justin Piszcz; +Cc: linux-raid, john lists

Thanks for the response.  When I try to assemble the array with just 6 disks (the 5 good ones plus one of sdo1 or sdg1), I get:

(none):~> mdadm /dev/md1 --assemble /dev/sdf1   /dev/sdh1 /dev/sdj1 /dev/sdk1 /dev/sdo1 /dev/sdd1 --force
mdadm: /dev/md1 assembled from 5 drives - not enough to start the array.


(none):~> mdadm /dev/md1 --assemble /dev/sdf1  /dev/sdg1 /dev/sdh1 /dev/sdj1 /dev/sdk1  /dev/sdd1 --force
mdadm: /dev/md1 assembled from 5 drives - not enough to start the array.

As for the SMART info:

(none):~> smartctl -i /dev/sdo1
smartctl 5.39 2008-05-08 21:56 [x86_64-unknown-linux-gnu] (local build)
Copyright (C) 2002-8 by Bruce Allen, http://smartmontools.sourceforge.net

=== START OF INFORMATION SECTION ===
Model Family:     Hitachi Deskstar 7K1000
Device Model:     Hitachi HDS721010KLA330
Serial Number:    GTJ000PAG552VC
Firmware Version: GKAOA70M
User Capacity:    1,000,204,886,016 bytes
Device is:        In smartctl database [for details use: -P show]
ATA Version is:   7
ATA Standard is:  ATA/ATAPI-7 T13 1532D revision 1
Local Time is:    Fri Jan  2 09:32:07 2009 PST
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

and 

(none):~> smartctl -i /dev/sdg1
smartctl 5.39 2008-05-08 21:56 [x86_64-unknown-linux-gnu] (local build)
Copyright (C) 2002-8 by Bruce Allen, http://smartmontools.sourceforge.net

=== START OF INFORMATION SECTION ===
Model Family:     Hitachi Deskstar 7K1000
Device Model:     Hitachi HDS721010KLA330
Serial Number:    GTA000PAG5R0AA
Firmware Version: GKAOA70M
User Capacity:    1,000,204,886,016 bytes
Device is:        In smartctl database [for details use: -P show]
ATA Version is:   7
ATA Standard is:  ATA/ATAPI-7 T13 1532D revision 1
Local Time is:    Fri Jan  2 10:04:55 2009 PST
SMART support is: Available - device has SMART capability.
SMART support is: Enabled


When I tried to read the SMART data from sdo1, the drive went offline and I got a controller error!


Here's what I get talking to sdg1:

(none):~> smartctl -l error  /dev/sdg1
smartctl 5.39 2008-05-08 21:56 [x86_64-unknown-linux-gnu] (local build)
Copyright (C) 2002-8 by Bruce Allen, http://smartmontools.sourceforge.net

=== START OF READ SMART DATA SECTION ===
SMART Error Log Version: 1
ATA Error Count: 1
        CR = Command Register [HEX]
        FR = Features Register [HEX]
        SC = Sector Count Register [HEX]
        SN = Sector Number Register [HEX]
        CL = Cylinder Low Register [HEX]
        CH = Cylinder High Register [HEX]
        DH = Device/Head Register [HEX]
        DC = Device Command Register [HEX]
        ER = Error register [HEX]
        ST = Status register [HEX]
Powered_Up_Time is measured from power on, and printed as
DDd+hh:mm:SS.sss where DD=days, hh=hours, mm=minutes,
SS=sec, and sss=millisec. It "wraps" after 49.710 days.

Error 1 occurred at disk power-on lifetime: 6388 hours (266 days + 4 hours)
  When the command that caused the error occurred, the device was active or idle.

  After command completion occurred, registers were:
  ER ST SC SN CL CH DH
  -- -- -- -- -- -- --
  84 51 00 00 00 00 a0

  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
  -- -- -- -- -- -- -- --  ----------------  --------------------
  ec 00 00 00 00 00 a0 08   6d+23:19:44.200  IDENTIFY DEVICE
  25 00 01 01 00 00 00 04   6d+23:19:44.000  READ DMA EXT
  25 00 80 be 1b ba ef ff   6d+23:19:42.500  READ DMA EXT
  25 00 c0 7f 1b ba e0 08   6d+23:19:42.500  READ DMA EXT
  25 00 40 3f 1b ba e0 08   6d+23:19:30.300  READ DMA EXT



As for RAID6: well, this array started off as a 3-disk RAID5 and then got incrementally grown as capacity needs grew.  I wasn't going to go beyond 7 disks in the raid set anyway, and since you can't reshape raid5 into raid6, the idea was to keep enough spare capacity to have another few TB of disk available to move the raid set's data to.  Since I use XFS, I can't just move data off and then shrink the filesystem to minimize what's needed.  md and XFS make it easy to add disks, but very hard to remove them.  :-(

It looks like my best bet is to try to get sdg1 back into the raid set somehow, but I don't understand why md isn't assembling it into the set.  Should I try to clone sdo1 to a new disk, or sdg1?  But I am not sure what help that would be if md won't assemble with it.

thx
mike



----- Original Message ----
From: Justin Piszcz <jpiszcz@lucidpixels.com>
To: Mike Myers <mikesm559@yahoo.com>
Cc: linux-raid@vger.kernel.org; john lists <john4lists@gmail.com>
Sent: Friday, January 2, 2009 4:10:20 AM
Subject: Re: Need urgent help in fixing raid5 array



[...]

What happens if you assemble with --force using the five good drives
plus one or the other of the drives that are not assembling (that is,
assemble in degraded mode)?

For the two disks that have 'failed', can you show their SMART stats?
I am curious to see them.

Worst case, which I do not recommend unless it is your last resort, is
to re-create the array with --assume-clean using exactly the same
options and device order you used originally; get anything wrong and
you will cause filesystem corruption.

I recommend you switch to RAID-6 with an array that big btw :)

Justin.


      

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: Need urgent help in fixing raid5 array
  2009-01-02 18:12           ` Mike Myers
@ 2009-01-02 18:22             ` Justin Piszcz
  2009-01-02 18:46               ` Mike Myers
  0 siblings, 1 reply; 46+ messages in thread
From: Justin Piszcz @ 2009-01-02 18:22 UTC (permalink / raw)
  To: Mike Myers; +Cc: linux-raid, john lists



On Fri, 2 Jan 2009, Mike Myers wrote:

> Thanks for the response.  When I try and assemble the array with just 6 disks (the 5 good ones and one of sdo1 or sdg1) I get:
>
> (none):~> mdadm /dev/md1 --assemble /dev/sdf1   /dev/sdh1 /dev/sdj1 /dev/sdk1 /dev/sdo1 /dev/sdd1 --force
> mdadm: /dev/md1 assembled from 5 drives - not enough to start the array.
>
>
> (none):~> mdadm /dev/md1 --assemble /dev/sdf1  /dev/sdg1 /dev/sdh1 /dev/sdj1 /dev/sdk1  /dev/sdd1 --force
> mdadm: /dev/md1 assembled from 5 drives - not enough to start the array.
>
> As for the smart info:
>
> (none):~> smartctl -i /dev/sdo1
> smartctl 5.39 2008-05-08 21:56 [x86_64-unknown-linux-gnu] (local build)
> Copyright (C) 2002-8 by Bruce Allen, http://smartmontools.sourceforge.net
>
> === START OF INFORMATION SECTION ===
> Model Family:     Hitachi Deskstar 7K1000
> Device Model:     Hitachi HDS721010KLA330
> Serial Number:    GTJ000PAG552VC
> Firmware Version: GKAOA70M
> User Capacity:    1,000,204,886,016 bytes
> Device is:        In smartctl database [for details use: -P show]
> ATA Version is:   7
> ATA Standard is:  ATA/ATAPI-7 T13 1532D revision 1
> Local Time is:    Fri Jan  2 09:32:07 2009 PST
> SMART support is: Available - device has SMART capability.
> SMART support is: Enabled
>
> and
>
> (none):~> smartctl -i /dev/sdg1
> smartctl 5.39 2008-05-08 21:56 [x86_64-unknown-linux-gnu] (local build)
> Copyright (C) 2002-8 by Bruce Allen, http://smartmontools.sourceforge.net
>
> === START OF INFORMATION SECTION ===
> Model Family:     Hitachi Deskstar 7K1000
> Device Model:     Hitachi HDS721010KLA330
> Serial Number:    GTA000PAG5R0AA
> Firmware Version: GKAOA70M
> User Capacity:    1,000,204,886,016 bytes
> Device is:        In smartctl database [for details use: -P show]
> ATA Version is:   7
> ATA Standard is:  ATA/ATAPI-7 T13 1532D revision 1
> Local Time is:    Fri Jan  2 10:04:55 2009 PST
> SMART support is: Available - device has SMART capability.
> SMART support is: Enabled
>
>
> When I tried to read the smart data from the sdo1, the drive went offline and I get a controller error!
I would figure out why this happens first and fix it if possible. Backplane?
Cable? Controller?  Btw: the interesting bits from smartctl are the attribute
statistics; we need to see smartctl -a output so we can see the value of each
identifier.

>
>
> Here's what I get talking to sdg1:
>
> (none):~> smartctl -l error  /dev/sdg1
> smartctl 5.39 2008-05-08 21:56 [x86_64-unknown-linux-gnu] (local build)
> Copyright (C) 2002-8 by Bruce Allen, http://smartmontools.sourceforge.net
>
> === START OF READ SMART DATA SECTION ===
> SMART Error Log Version: 1
> ATA Error Count: 1
>        CR = Command Register [HEX]
>        FR = Features Register [HEX]
>        SC = Sector Count Register [HEX]
>        SN = Sector Number Register [HEX]
>        CL = Cylinder Low Register [HEX]
>        CH = Cylinder High Register [HEX]
>        DH = Device/Head Register [HEX]
>        DC = Device Command Register [HEX]
>        ER = Error register [HEX]
>        ST = Status register [HEX]
> Powered_Up_Time is measured from power on, and printed as
> DDd+hh:mm:SS.sss where DD=days, hh=hours, mm=minutes,
> SS=sec, and sss=millisec. It "wraps" after 49.710 days.
>
> Error 1 occurred at disk power-on lifetime: 6388 hours (266 days + 4 hours)
>  When the command that caused the error occurred, the device was active or idle.
>
>  After command completion occurred, registers were:
>  ER ST SC SN CL CH DH
>  -- -- -- -- -- -- --
>  84 51 00 00 00 00 a0
>
>  Commands leading to the command that caused the error were:
>  CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
>  -- -- -- -- -- -- -- --  ----------------  --------------------
>  ec 00 00 00 00 00 a0 08   6d+23:19:44.200  IDENTIFY DEVICE
>  25 00 01 01 00 00 00 04   6d+23:19:44.000  READ DMA EXT
>  25 00 80 be 1b ba ef ff   6d+23:19:42.500  READ DMA EXT
>  25 00 c0 7f 1b ba e0 08   6d+23:19:42.500  READ DMA EXT
>  25 00 40 3f 1b ba e0 08   6d+23:19:30.300  READ DMA EXT
>
>
>
> As for RAID6, well this array started off as a 3 disk RAID5, and then got incrementally grown as capacity needs grew.  I wasn't going to go beyond 7 disks in the raid set, and since you can't reshape raid5 into raid6, I would need another few TB of disk available to move the raid set data to first.  Since I use XFS, I can't just move data off and then shrink the filesystem to free up disks.  md and XFS make it easy to add disks, but very hard to remove them.  :-(
>
> It looks like my best bet is to try and get sdg1 back into the raid set somehow, but I don't understand why md isn't assembling it into the set.  Should I try to clone sdo1 to a new disk, or sdg1?  But I am not sure what help that would be if md won't assemble with it.
>
> thx
> mike

As far as re-assembling the array, I would wait for Neil or someone who has
done this a few times, but you need to find out why disks are giving I/O errors.

If you run:

dd if=/dev/sda of=/dev/null bs=1M &
dd if=/dev/sdb of=/dev/null bs=1M &

for each disk, can you do that for all disks in the raid array and then
see if any errors occur?  If you flood your system with that much I/O and
it doesn't have any problems, I'd say you're good to go, but if you run those
commands backgrounded/simultaneously and drives start dropping
left and right, I'd wonder about the backplane myself..
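
e.g. something like the following, with the device range adjusted to your
actual members (the pattern here is just a guess):

for d in /dev/sd[b-o]; do dd if="$d" of=/dev/null bs=1M & done
wait                 # let all the background reads finish
dmesg | tail -n 50   # then look for link resets or I/O errors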

Justin.



^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: Need urgent help in fixing raid5 array
  2009-01-02 18:22             ` Justin Piszcz
@ 2009-01-02 18:46               ` Mike Myers
  2009-01-02 18:57                 ` Justin Piszcz
  0 siblings, 1 reply; 46+ messages in thread
From: Mike Myers @ 2009-01-02 18:46 UTC (permalink / raw)
  To: Justin Piszcz; +Cc: linux-raid, john lists

Well, I can read from sdg1 just fine.  It seems to work ok, at least for a few GB of data.   I'll try this on some of the other disks, but it is possible for me to pull the disks out of the backplane and run the SFF-8087 fanout cables direct to each drive and bypass the backplane completely.  It certainly would be easy to do this for at least the sdo1 drive and see if I can get better results going direct to the disk.  I have moved the disks around the backplane a bit to deal with the issues of the controller failure, so I am pretty sure it's not just one bad slot or the like.

So you've seen a backplane fail in a way that the disks come up fine at boot but have corrupted data transfers across them?  I wonder about the sata cables in that case as well.  I could hook up a pair of PMP's to my SI3132's and bypass the SFF-8087 cables as well.

Thx
Mike

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: Need urgent help in fixing raid5 array
  2009-01-02 18:46               ` Mike Myers
@ 2009-01-02 18:57                 ` Justin Piszcz
  2009-01-02 20:46                   ` Mike Myers
                                     ` (3 more replies)
  0 siblings, 4 replies; 46+ messages in thread
From: Justin Piszcz @ 2009-01-02 18:57 UTC (permalink / raw)
  To: Mike Myers; +Cc: linux-raid, john lists



On Fri, 2 Jan 2009, Mike Myers wrote:

> Well, I can read from sdg1 just fine.  It seems to work ok, at least for a few GB of data.   I'll try this on some of the other disks, but it is possible for me to pull the disks out of the backplane and run the SFF-8087 fanout cables direct to each drive and bypass the backplane completely.  It certainly would be easy to do this for at least the sdo1 drive and see if I can get better results going direct to the disk.  I have moved the disks around the backplane a bit to deal with the issues of the controller failure, so I am pretty sure it's not just one bad slot or the like.
>
> So you've seen a backplane fail in a way that the disks come up fine at boot but have corrupted data transfers across them?  I wonder about the sata cables in that case as well.  I could hook up a pair of PMP's to my SI3132's and bypass the SFF-8087 cables as well.

1. Try by-passing the backplane.
2. Bad cables will usually cause the smart identifier UDMA_CRC_Error_Count to
    climb quite high; if it is 0 or close to it, the cable is unlikely the
    issue (a quick check is sketched below).
3. I have seen all kinds of weirdness with bad backplanes: drives dropping out
    of the array, drives producing I/O errors, etc.
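
For example, checking that counter on each member in turn (/dev/sdX is a
placeholder for each drive; the raw value is the last column):

smartctl -A /dev/sdX | grep UDMA_CRC_Error_Count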

Justin.


^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: Need urgent help in fixing raid5 array
  2009-01-02 18:57                 ` Justin Piszcz
@ 2009-01-02 20:46                   ` Mike Myers
  2009-01-02 20:56                   ` Mike Myers
                                     ` (2 subsequent siblings)
  3 siblings, 0 replies; 46+ messages in thread
From: Mike Myers @ 2009-01-02 20:46 UTC (permalink / raw)
  To: Justin Piszcz; +Cc: linux-raid, john lists

Well, it looks like (maybe) you could be right about the backplane.  Shortly after replying to you, md2 went off and threw two drives.  So this is too much of a coincidence, or I am having a really bad time with a bunch of disks!

I had 3 5in3 backplanes from a previous incarnation of the server around, so I moved all the disks from the new system into the old backplanes, and hooked up power and cables etc...  They are now all online in those backplanes.

Md1 looks like it's still in the same state; it can't assemble from 5 drives.

Md2, when it came up, said it couldn't assemble from 3 drives.  (It was working fine when I booted it in the old backplane.)  I told it to assemble using the --force option, and it adjusted two drives' events, and so now it complains that it can't assemble from 5 drives too.

If I were taking hits due to a bad backplane, could it be responsible for putting these arrays in such a bad state, even after I cleared the bad backplane?

I'll probe around using the smart tools to see if I have a bad cable.  Meanwhile I have two new 8 port controllers on order, to see if I am having more controller related grief.

Any ideas as to how to try reassembling these guys?  I really don't want to try the create --assume-clean approach.

Thx
mike

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: Need urgent help in fixing raid5 array
  2009-01-02 18:57                 ` Justin Piszcz
  2009-01-02 20:46                   ` Mike Myers
@ 2009-01-02 20:56                   ` Mike Myers
  2009-01-02 21:37                   ` Mike Myers
  2009-01-03  4:19                   ` Mike Myers
  3 siblings, 0 replies; 46+ messages in thread
From: Mike Myers @ 2009-01-02 20:56 UTC (permalink / raw)
  To: Justin Piszcz; +Cc: linux-raid, john lists

BTW, several of the disks that md didn't want to assemble have perfectly valid uuid info on them.  If I try to manually specify the devices in the --assemble line, and add the ones that are missing (but have a valid uuid for the array), even with the force option, md refuses to assemble them.  The disks can't all be bad, can they?  That's 4 drives out of 14 that would have all had to go bad at once.
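
For reference, this is roughly the loop I am using to compare what each member
claims (the partition list is approximate):

for d in /dev/sd[b-o]1; do
    echo "== $d"
    mdadm --examine "$d" | egrep 'UUID|Events|this'
done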

thx
Mike

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: Need urgent help in fixing raid5 array
  2009-01-02 18:57                 ` Justin Piszcz
  2009-01-02 20:46                   ` Mike Myers
  2009-01-02 20:56                   ` Mike Myers
@ 2009-01-02 21:37                   ` Mike Myers
  2009-01-03  4:19                   ` Mike Myers
  3 siblings, 0 replies; 46+ messages in thread
From: Mike Myers @ 2009-01-02 21:37 UTC (permalink / raw)
  To: Justin Piszcz; +Cc: linux-raid, john lists

BTW, here is the smart error listing for one of the devices that md seems to refuse to add:

 smartctl -l error  /dev/sdb1
smartctl 5.39 2008-05-08 21:56 [x86_64-unknown-linux-gnu] (local build)
Copyright (C) 2002-8 by Bruce Allen, http://smartmontools.sourceforge.net

=== START OF READ SMART DATA SECTION ===
SMART Error Log Version: 1
ATA Error Count: 1
        CR = Command Register [HEX]
        FR = Features Register [HEX]
        SC = Sector Count Register [HEX]
        SN = Sector Number Register [HEX]
        CL = Cylinder Low Register [HEX]
        CH = Cylinder High Register [HEX]
        DH = Device/Head Register [HEX]
        DC = Device Command Register [HEX]
        ER = Error register [HEX]
        ST = Status register [HEX]
Powered_Up_Time is measured from power on, and printed as
DDd+hh:mm:SS.sss where DD=days, hh=hours, mm=minutes,
SS=sec, and sss=millisec. It "wraps" after 49.710 days.

Error 1 occurred at disk power-on lifetime: 6388 hours (266 days + 4 hours)
  When the command that caused the error occurred, the device was active or idle.

  After command completion occurred, registers were:
  ER ST SC SN CL CH DH
  -- -- -- -- -- -- --
  84 51 00 00 00 00 a0

  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
  -- -- -- -- -- -- -- --  ----------------  --------------------
  ec 00 00 00 00 00 a0 08   6d+23:19:44.200  IDENTIFY DEVICE
  25 00 01 01 00 00 00 04   6d+23:19:44.000  READ DMA EXT
  25 00 80 be 1b ba ef ff   6d+23:19:42.500  READ DMA EXT
  25 00 c0 7f 1b ba e0 08   6d+23:19:42.500  READ DMA EXT
  25 00 40 3f 1b ba e0 08   6d+23:19:30.300  READ DMA EXT


It looks like a good disk.

thx
mike

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: Need urgent help in fixing raid5 array
  2009-01-02 18:57                 ` Justin Piszcz
                                     ` (2 preceding siblings ...)
  2009-01-02 21:37                   ` Mike Myers
@ 2009-01-03  4:19                   ` Mike Myers
  2009-01-03  4:43                     ` Guy Watkins
  3 siblings, 1 reply; 46+ messages in thread
From: Mike Myers @ 2009-01-03  4:19 UTC (permalink / raw)
  To: Justin Piszcz; +Cc: linux-raid, john lists

Ok, good news and bad news.  I finally got all the disks connected and bypassed the backplane.  Md2 starts with 6 members in a degraded mode.  Md1 is still having the same problem.  In doing an examine on each member disk, I discovered that 8 disks had the superblock referencing md2's UUID.  The other thing is that only 6 had the UUID of md1, which is supposed to have 7 members.  One of the two (sdf1) that has the superblock of md2 (but is not active in the array) is also a Hitachi, which it shouldn't be (md2 is a seagate 7200.11 array). This appears to be the missing md1 disk.  I don't understand how it got the other raid array's info, but things are weird here.

That was the good news.  The bad news is that when I tried to assemble md1 with all the md1 members plus sdf1 (the disk that thinks it's part of md2), I mistakenly used it as the target for the mdadm assemble command.  Ugh.

So I typed: mdadm /dev/sdf1 --assemble /dev/sdb1 /dev/sdc1 /dev/sdd1 /dev/sde1 /dev/sdf1 /dev/sdi1 /dev/sdj1 --force

So now sdf1, instead of having the wrong superblock, has no superblock.  Am I completely hosed at this point?  I probably needed to figure out a way to get this disk a new superblock anyway, but I suspect things are even harder to fix at this point.

Any ideas as to how to fix this?  Is there another superblock somewhere else on the disk that I can recover the proper info from?
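
For what it's worth, I understand the 0.90 superblock is a single copy that
lives in the last 64K-aligned 64K of the partition, so I assume something like
this would show whether the magic is really gone:

sz=$(blockdev --getsize64 /dev/sdf1)
off=$(( sz / 65536 * 65536 - 65536 ))    # start of the 0.90 superblock region
dd if=/dev/sdf1 bs=1 skip=$off count=4 2>/dev/null | od -An -tx1
# expect "fc 4e 2b a9" (0xa92b4efc little-endian) if a superblock is present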

Thanks,
mike

^ permalink raw reply	[flat|nested] 46+ messages in thread

* RE: Need urgent help in fixing raid5 array
  2009-01-03  4:19                   ` Mike Myers
@ 2009-01-03  4:43                     ` Guy Watkins
  2009-01-03  5:02                       ` Mike Myers
  0 siblings, 1 reply; 46+ messages in thread
From: Guy Watkins @ 2009-01-03  4:43 UTC (permalink / raw)
  To: 'Mike Myers', 'Justin Piszcz'
  Cc: linux-raid, 'john lists'

} -----Original Message-----
} From: linux-raid-owner@vger.kernel.org [mailto:linux-raid-
} owner@vger.kernel.org] On Behalf Of Mike Myers
} Sent: Friday, January 02, 2009 11:20 PM
} To: Justin Piszcz
} Cc: linux-raid@vger.kernel.org; john lists
} Subject: Re: Need urgent help in fixing raid5 array
} 
} Ok, good news and bad news.  I finally got all the disks connected and
} bypassed the backplane.  Md2 starts with 6 members in a degraded mode.
} Md1 is still having the same problem.  In doing an examine on each member
} disk, I discovered that 8 disks had the superblock referencing md2's UUID.
} The other thing is that only 6 had the UUID of md1, which is supposed to
} have 7 members.  One of the two (sdf1) that has the superblock of md2 (but
} not active in the array) is also a Hitachi, which it shouldn't be (md2 is
} a seagate 7200.11 array). This appears to be the missing md1 disk.  I
} don't understand how it got the other raid array's info, but things are
} weird here.
} 
} That was the good news.  The bad news is that when I tried to assemble md1
} with all the md1 members plus sdf1 (the disk that thinks it's part of md2),
} I mistakenly used it as the target for the mdadm assemble command.  Ugh.
} 
} So I typed: mdadm /dev/sdf1 --assemble /dev/sdb1 /dev/sdc1 /dev/sdd1
} /dev/sde1 /dev/sdf1 /dev/sdi1 /dev/sdj1 --force
} 
} So now sdf1, instead of having the wrong superblock, has no superblock.  Am
} I completely hosed at this point?  I probably needed to figure out a way
} to get this disk a new superblock anyway, but I suspect things are
} even harder to fix at this point.
} 
} Any ideas as to how to fix this?  Is there another superblock somewhere
} else on the disk that I can recover the proper info from?
} 
} Thanks,
} mike

I don't consider myself an expert, however...

I think you should only assemble it with 6 of 7 disks.  Leave out the one
that you think has the most wrong data.  If this works, the array will not
try to sync anything.  So, no data damaged.  Then test the data; one
non-destructive way to do that is sketched below.  Once you
are really sure the data is as good as it can be, then add the missing disk;
it will resync at that time.  However, 1 bad block on any of the 6 disks
will cause a failure.
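
Since there is LVM and XFS on top here, that test might look roughly like
this (the VG/LV names are placeholders):

vgchange -ay                             # activate the volume group on the degraded array
xfs_repair -n /dev/VG/LV                 # check-only pass, writes nothing
mount -o ro,norecovery /dev/VG/LV /mnt   # read-only mount, skip log replay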

Then, switch to RAID6 ASAP!  :)

Guy

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: Need urgent help in fixing raid5 array
  2009-01-03  4:43                     ` Guy Watkins
@ 2009-01-03  5:02                       ` Mike Myers
  2009-01-03 12:46                         ` John Robinson
  0 siblings, 1 reply; 46+ messages in thread
From: Mike Myers @ 2009-01-03  5:02 UTC (permalink / raw)
  To: Guy Watkins, Justin Piszcz; +Cc: linux-raid, john lists

I have tried that.  It still complains about only having 4 disks to start the array (if I don't tell it to use sdf1).

I have been unable to explain why md refuses to use some of the members even though they have good superblock info on them, even with the force command.  There are two members of md1 that are online and seem to have proper superblock info, but md doesn't assemble md1 with them.

Is there a place (besides the code) where md's specifics about how it assembles members are documented?

Thx
mike

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: Need urgent help in fixing raid5 array
  2009-01-03  5:02                       ` Mike Myers
@ 2009-01-03 12:46                         ` John Robinson
  2009-01-03 15:49                           ` Mike Myers
  0 siblings, 1 reply; 46+ messages in thread
From: John Robinson @ 2009-01-03 12:46 UTC (permalink / raw)
  To: Mike Myers; +Cc: linux-raid

On 03/01/2009 05:02, Mike Myers wrote:
> I have tried that.  It still complains about only having 4 disks to start the array (if I don't tell it to use sdf1).
> 
> I have been unable to explain why md refuses to use some of the members even though they have good superblock info on them, even with the force command.  There are two members of md1 that are online and seem to have proper superblock info, but md doesn't assemble md1 with them.  
> 
> Is there a place (besides the code) where md's specifics about how it assembles members are documented?  

I'm absolutely no expert here, but I vaguely recall one of the 
developers recently noting that there was a minor bug in `mdadm 
--assemble --force` whereby if you didn't mention the broken member(s) 
first on the command line, and the early member(s) were good, the later 
member(s) didn't get forced. So in your case, you might try mentioning 
your array members in a different order, as long as you don't blame me 
when it eats your cat, or whatever.

Aha, here it is: http://marc.info/?l=linux-raid&m=122938233431234&w=2

Not quite what I said, but not a zillion miles off :-) Good luck.

Cheers,

John.

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: Need urgent help in fixing raid5 array
  2009-01-03 12:46                         ` John Robinson
@ 2009-01-03 15:49                           ` Mike Myers
  2009-01-03 16:14                             ` John Robinson
  0 siblings, 1 reply; 46+ messages in thread
From: Mike Myers @ 2009-01-03 15:49 UTC (permalink / raw)
  To: John Robinson; +Cc: linux-raid

Ugh.  This would explain a lot!  I'll try this out and see if it can help get md1 back online.  Is the only way to regenerate a superblock on a member doing a create with the assume-clean option?

thx
mike

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: Need urgent help in fixing raid5 array
  2009-01-03 15:49                           ` Mike Myers
@ 2009-01-03 16:14                             ` John Robinson
  2009-01-03 16:47                               ` Mike Myers
  2009-01-03 19:03                               ` Mike Myers
  0 siblings, 2 replies; 46+ messages in thread
From: John Robinson @ 2009-01-03 16:14 UTC (permalink / raw)
  To: Mike Myers; +Cc: linux-raid

On 03/01/2009 15:49, Mike Myers wrote:
> Ugh.  This would explain a lot!  I'll try this out and see if it can help get md1 back online.

Good luck; the rest of that thread's probably worth a read before 
starting in too, just to see whether you need to mention your two dirty 
members first or last.

As Guy suggested earlier in this thread, you might try doing your 
reassemble while missing out the one with the apparently completely 
hosed superblock, to at least get the thing up in degraded mode, then 
test fsck it (e.g. `e2fsck -n`) and mount it read-only to see if you've 
still got any data. And perhaps take a backup then.

>  Is the only way to regenerate a superblock on a member doing a create with the assume-clean option?

I'm not sure, but I expect so. Seconding what Justin said much earlier 
in this thread, personally I'd wait until one of the gurus arrives, in 
their shining armour and on their white charger, before trying this.

Cheers,

John.

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: Need urgent help in fixing raid5 array
  2009-01-03 16:14                             ` John Robinson
@ 2009-01-03 16:47                               ` Mike Myers
  2009-01-03 19:03                               ` Mike Myers
  1 sibling, 0 replies; 46+ messages in thread
From: Mike Myers @ 2009-01-03 16:47 UTC (permalink / raw)
  To: John Robinson; +Cc: linux-raid

I tried doing the assemble without the one with the trashed superblock, but got the same result.  It may be that the order of devices is a problem.  I'll play around with this, knowing about the bug, and see how I fare.

I am running lvm and xfs on top of the raid set, so knowing if it's working is somewhat more complicated, but I will continue to hold off on the assume-clean option for a while longer.

I have about 4 TB of storage on other servers that I can dump a lot of the data to for backup once I get this thing running again, then I'll reconfigure everything from scratch.

Thanks so much for the help.

thx
mike

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: Need urgent help in fixing raid5 array
  2009-01-03 16:14                             ` John Robinson
  2009-01-03 16:47                               ` Mike Myers
@ 2009-01-03 19:03                               ` Mike Myers
  1 sibling, 0 replies; 46+ messages in thread
From: Mike Myers @ 2009-01-03 19:03 UTC (permalink / raw)
  To: John Robinson; +Cc: linux-raid

Ok, all the devices are marked clean, so that doesn't appear to be the problem.  But thanks to your link I was reminded that the assemble command has a verbose option.  It gives us a much better clue:

# mdadm /dev/md1 --assemble /dev/sdb1 /dev/sdc1 /dev/sdd1 /dev/sde1 /dev/sdi1 /dev/sdj1 --force --verbose

md: md1 stopped.
md: unbind<sdi1>
md: export_rdev(sdi1)
md: unbind<sde1>
md: export_rdev(sde1)
md: unbind<sdd1>
md: export_rdev(sdd1)
md: unbind<sdc1>
md: export_rdev(sdc1)
md: unbind<sdf1>
md: export_rdev(sdf1)
mdadm: looking for devices for /dev/md1
mdadm: /dev/sdb1 is identified as a member of /dev/md1, slot 0
mdadm: /dev/sdc1 is identified as a member of /dev/md1, slot 4
mdadm: /dev/sdd1 is identified as a member of /dev/md1, slot 5
mdadm: /dev/sde1 is identified as a member of /dev/md1, slot 6
mdadm: /dev/sdi1 is identified as a member of /dev/md1, slot 0
mdadm: /dev/sdj1 is identified as a member of /dev/md1, slot 0
mdadm: no uptodate device for slot 1 of /dev/md1
mdadm: no uptodate device for slot 2 of /dev/md1
mdadm: no uptodate device for slot 3 of /dev/md1
md: bind<sdc1>
mdadm: added /dev/sdc1 to /dev/md1 as 4
md: bind<sdd1>
mdadm: added /dev/sdd1 to /dev/md1 as 5
md: bind<sdc1>
mdadm: added /dev/sde1 to /dev/md1 as 6
md: bind<sdi1>
mdadm: added /dev/sdi1 to /dev/md1 as 0
mdadm: /dev/md1 assembled from 4 drives - not enough to start the array.


Any ideas as to what the issue here is?  How did the slot info get corrupted?  How can I tell which slots these drives belong to?  I have backups of the system, so if this is mentioned in a log file, I can probably get it, but the device names are all different now because of the controller failure and the bypassing of the backplane.

thx
mike

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: Need urgent help in fixing raid5 array
  2009-01-02  6:19       ` Mike Myers
  2009-01-02 12:10         ` Justin Piszcz
@ 2009-01-05 22:11         ` Neil Brown
  2009-01-05 22:22           ` Mike Myers
  1 sibling, 1 reply; 46+ messages in thread
From: Neil Brown @ 2009-01-05 22:11 UTC (permalink / raw)
  To: Mike Myers; +Cc: Justin Piszcz, linux-raid, john lists

On Thursday January 1, mikesm559@yahoo.com wrote:
> 
> (md1 is comprised of sdd1 sdf1 sdg1 sdh1 sdj1 sdk1 sdo1) 
> 
> (none):~> mdadm --examine /dev/sdd1 /dev/sdf1 /dev/sdg1 /dev/sdh1
....
> (none):~> mdadm --examine /dev/sdj1 /dev/sdk1 /dev/sdo1
....

So, of these 7 devices, 5 of them (d1 f1 h1 j1 k1) think they are
good members of the array with current metadata, and the other two (g1
and o1) have old/wrong metadata.

This means that you need to recreate the array.

The questions: which of g1 and o1 has more recent valid data, and
which should be device '1' and which should be device '2'?
Depending on the answers to these two questions, the command for you
will be one of the following 4 possibilities.

mdadm -C /dev/md1 -e1.0 -l5 -n7 -c128 -b internal /dev/sdh1 /dev/sdg1 missing /dev/sdk1 /dev/sdf1 /dev/sdd1 /dev/sdj1
mdadm -C /dev/md1 -e1.0 -l5 -n7 -c128 -b internal /dev/sdh1 /dev/sdo1 missing /dev/sdk1 /dev/sdf1 /dev/sdd1 /dev/sdj1
mdadm -C /dev/md1 -e1.0 -l5 -n7 -c128 -b internal /dev/sdh1 missing /dev/sdg1 /dev/sdk1 /dev/sdf1 /dev/sdd1 /dev/sdj1
mdadm -C /dev/md1 -e1.0 -l5 -n7 -c128 -b internal /dev/sdh1 missing /dev/sdo1 /dev/sdk1 /dev/sdf1 /dev/sdd1 /dev/sdj1

These commands will not change the data on the device, just the
metadata.
Once you have created the array, you try to validate the data (fsck or
similar).
If it looks bad, stop the array and try a different command.
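
So each attempt is a cycle along these lines, taking the first of the four
commands as an example (the validation step depends on what sits on top; the
VG/LV names below are placeholders):

mdadm -C /dev/md1 -e1.0 -l5 -n7 -c128 -b internal /dev/sdh1 /dev/sdg1 missing /dev/sdk1 /dev/sdf1 /dev/sdd1 /dev/sdj1
vgchange -ay && mount -o ro,norecovery /dev/VG/LV /mnt   # read-only; inspect some files
umount /mnt; vgchange -an; mdadm --stop /dev/md1         # if it looks bad, stop and retry another order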


Note: the metadata on g1 and o1 is very strange.  It looks like an old
copy of the metadata from sdh1, so it could be that one of g1 or o1
is really the first drive in the array, and h1 is one of the two
'missing' devices.  So if none of the 4 commands I gave work, try
other permutations with o1 or g1 first, and h1 second or third.

NeilBrown

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: Need urgent help in fixing raid5 array
  2009-01-05 22:11         ` Neil Brown
@ 2009-01-05 22:22           ` Mike Myers
  2009-01-05 22:53             ` NeilBrown
  0 siblings, 1 reply; 46+ messages in thread
From: Mike Myers @ 2009-01-05 22:22 UTC (permalink / raw)
  To: Neil Brown; +Cc: Justin Piszcz, linux-raid, john lists

Thanks!  I see what you are doing here.  So since none of these commands actually change the underlying data, if I get the order right, the array will come up and the LVM superblock will be visible, and then I can try and bring the filesystem online?  If I get the order wrong, I can just try it again with another combination.  Do I have that right?

I should probably print out all the existing metadata and save it, since it will be wiped out by the create.

How could the drives get the bad metadata on them?  I've played with software raid for about 4 years and have never seen anything this strange.

thx
mike


^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: Need urgent help in fixing raid5 array
  2009-01-05 22:22           ` Mike Myers
@ 2009-01-05 22:53             ` NeilBrown
  2009-01-06  2:46               ` Mike Myers
  0 siblings, 1 reply; 46+ messages in thread
From: NeilBrown @ 2009-01-05 22:53 UTC (permalink / raw)
  To: Mike Myers; +Cc: Justin Piszcz, linux-raid, john lists

On Tue, January 6, 2009 9:22 am, Mike Myers wrote:
> Thanks!  I see what you are doing here.  So since none of these commands
> actually change the underlying data, if I get the order right, the array
> will come up and the LVM superblock will be visible, and then I can try
> and bring the filesystem online?  If I get the order wrong, I can just try
> it again with another combination.  Do I have that right?

Exactly, yes.

>
> I should probably print out all the existing metadata and save it, since
> it will be wiped out by the create.

Certainly a good idea.

>
> How could the drives get the bad metadata on them?  I've played with
> software raid for about 4 years and have never seen anything this strange.

I really cannot think how they could get the particular bad metadata that
they did.  "mdadm --add" will change the metadata, but only to make the
device appear to be a spare, which isn't the case here.
"mdadm --assemble --force" can re-write the metadata, but again I cannot
imagine it making the particular change that has been made.  Maybe there
is a bug in there somewhere.

NeilBrown


^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: Need urgent help in fixing raid5 array
  2009-01-05 22:53             ` NeilBrown
@ 2009-01-06  2:46               ` Mike Myers
  2009-01-06  4:00                 ` NeilBrown
  0 siblings, 1 reply; 46+ messages in thread
From: Mike Myers @ 2009-01-06  2:46 UTC (permalink / raw)
  To: NeilBrown; +Cc: Justin Piszcz, linux-raid, john lists

BTW, don't I need to use the --assume-clean option in the create operation to have this work right?

Thanks,
Mike


^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: Need urgent help in fixing raid5 array
  2009-01-06  2:46               ` Mike Myers
@ 2009-01-06  4:00                 ` NeilBrown
  2009-01-06  5:55                   ` Mike Myers
  2009-01-06  6:24                   ` Mike Myers
  0 siblings, 2 replies; 46+ messages in thread
From: NeilBrown @ 2009-01-06  4:00 UTC (permalink / raw)
  To: Mike Myers; +Cc: Justin Piszcz, linux-raid, john lists

On Tue, January 6, 2009 1:46 pm, Mike Myers wrote:
> BTW, don't I need to use the --assume-clean option in the create operation
> to have this work right?

No.  When you create a degraded raid5, it is always assumed to be clean,
because it doesn't make any sense for it to be dirty.
Using --assume-clean wouldn't hurt, but it won't make any
difference.

NeilBrown


^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: Need urgent help in fixing raid5 array
  2009-01-06  4:00                 ` NeilBrown
@ 2009-01-06  5:55                   ` Mike Myers
  2009-01-06 23:23                     ` Neil Brown
  2009-01-06  6:24                   ` Mike Myers
  1 sibling, 1 reply; 46+ messages in thread
From: Mike Myers @ 2009-01-06  5:55 UTC (permalink / raw)
  To: NeilBrown; +Cc: Justin Piszcz, linux-raid, john lists

Neil, the devices have moved around a bit since my post with the --examine results for each drive, as I've been trying to bypass a possibly bad SATA backplane and a known-bad SATA controller.  So I have to recheck the array positions of the drives.  A few questions about how md acts:

1) When I grow an array, do the existing members of the array maintain their slot positions?  So if I had 4 drives sda1, sdb1, sdc1 and sdd1 as part of a RAID5 array, and then added sde1, would sde1 take slot 4 of the array, or do the array slots get reset during a reshape operation?

The reason I ask is that from the smartctl -a data for each drive, I can get the total powered on hours for the drive.  If the drive has a lot of hours on it, it will have been added earlier than another drive, and if the array positions are constant, then I can kind of reconstruct the array order on an array that's been built incrementally by looking at drive power-on times.
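
For reference, I collect these with something like the following (the
glob is an example; I'd adjust it to the real members):

  for d in /dev/sd[b-j]; do
      printf '%s: ' "$d"
      smartctl -a "$d" | grep -i Power_On_Hours
  done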

2) If a drive goes bad and is replaced by a spare, does the spare take the original array slot of the faulty drive?

3) It appears the slot number minus 1 is the member number?  That is, if I do an examine on /dev/sdc1, it tells me it's got slot 5 of md1.  But when I do an assemble operation with the --verbose flag, it says /dev/sdc1 was "added as 4".  The reason I ask is: if that's true, what would a slot number of 0 mean in terms of what --assemble is supposed to add it as?  When I do the assemble, it's added as 0, which I don't understand if the slot number is supposed to be one higher than the member number.

4) /dev/sdf1 (the new device name) thinks it's part of md2 (when I do an examine), but it can't be, because md2 is all seagate and already has 7 members in it (the right number of drives).  So it must be part of md1, which is missing a member.  When I first tried to reassemble md1, it said it only found 5 good drives and couldn't start; now it says it only finds 4 good drives.  So I assume sdf1 is one of the 5 good ones but got a weird superblock written to it.  Other than the drive hours trick I thought of earlier, is there any way to determine what its slot number should have been, since I am missing slots 1, 2, and 3, and have 3 candidates for slot 0?

Lastly, it REALLY would make life a LOT easier if the device names didn't change every time a drive is plugged into a different controller slot, or if controller slots didn't change based on boot order, etc.  It is a pain in the rear when you have a hardware outage, or a disk that isn't detected properly on boot and is then hot-added, to have its /dev/sdX1 name change.  I know it's not md's fault that it works this way, but in a hot-swap world it makes it very hard to document drive configurations and map devices under Linux to physical drives.

Thanks a lot Neil.

Mike

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: Need urgent help in fixing raid5 array
  2009-01-06  4:00                 ` NeilBrown
  2009-01-06  5:55                   ` Mike Myers
@ 2009-01-06  6:24                   ` Mike Myers
  2009-01-06 23:31                     ` Neil Brown
  1 sibling, 1 reply; 46+ messages in thread
From: Mike Myers @ 2009-01-06  6:24 UTC (permalink / raw)
  To: NeilBrown; +Cc: Justin Piszcz, linux-raid, john lists

BTW, in the original email I sent that had the --examine info for each of these array members, three devices have the same device UUID and array slot, and two of them share an older event count, and one has a slightly newer event count.  Which of these should be the real array slot 0?  And I notice that one of the members in that email had a device UUID that I can't find anymore (I suspect it's the current sdf1 that thinks it's part of md2).  In that email, it had array slot 4, which is one of the missing devices in the current family (that I assume --assemble would add as "3").  It also has 9663 hours on it, which makes it part of the original set of 4 members for this raid5 array.  The drive in slot 5 only has 7630 hours on it, so it should have been added later as part of a --grow operation.

Does all that make sense?  If so, then sdb1 (which says it's slot 0), sdi1 (at 9671 hours, which also thinks it's slot 0), sdj1 (at 9194 hours, which also says it's 0), and sdf1 (at 9663 hours, which apparently used to think it was slot 4) should be the original 4 drives of the array.  How can I figure out which is the real slot 0, and which are slots 1 and 2, if sdb1, sdi1 and sdj1 all have the same event count, array slot id (0) and device UUID?

This is way harder work than should be needed to fix a problem.  :-)  But I am sure glad you gurus know how this stuff is supposed to work!

Thx
Mike

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: Need urgent help in fixing raid5 array
  2009-01-06  5:55                   ` Mike Myers
@ 2009-01-06 23:23                     ` Neil Brown
  0 siblings, 0 replies; 46+ messages in thread
From: Neil Brown @ 2009-01-06 23:23 UTC (permalink / raw)
  To: Mike Myers; +Cc: Justin Piszcz, linux-raid, john lists

On Monday January 5, mikesm559@yahoo.com wrote:
> Neil, the devices have moved around a bit since my post with the
> --examine results for each drive, as I've been trying to bypass a
> possibly bad SATA backplane and a known-bad SATA controller.  So I
> have to recheck the array positions of the drives.  A few questions
> about how md acts: 
> 
> 1)  When I grow an array, do the existing members of the array
>     maintain their slot positions?  So if I had 4 drives sda1 sdb1
>     sdc1 and sdd1 as part of a RAID5 array, and then add sde1, would
>     sde1 take slot 4 of the array, or do the array slots get reset
>     during a reshape operation?   

Slots do not get reset.

> 
> The reason I ask is that from the smartctl -a data for each drive, I
> can get the total powered on hours for the drive.  If the drive has
> a lot of hours on it, it will have been added earlier than another
> drive, and if the array positions are constant, then I can kind of
> reconstruct the array order on an array that's been built
> incrementally by looking at drive power-on times. 
> 
> 2) If a drive goes bad and is replaced by a spare, does the spare
>    take the original array slot of the faulty drive? 

With v1.x metadata you need to be careful of the difference between the
'slot' and the 'role' of each drive.

The 'slot' is an arbitrary number that is assigned when the device is
added to the array, whether as a full member or as a spare.  It
doesn't change.
The 'role' is how the drive is functioning in the array.  When a device
is added to an array, its role is set to 'spare'.  When a drive
fails and this spare is used to recover that drive, the 'role' is
changed to match the role of the original drive.  But the slot stays
the same.
If you look at the "Array Slot" line in the "mdadm --examine" output
you will see something like 
    Array Slot : 8 (0, failed, failed, failed, 3, 4, failed, 5, 6)

That means that this device occupies slot 8, and that for the whole
array:
  The device in slot 0 is being used in role 0 (first live device in array)
  The devices in slots 1,2,3 have all failed
  The device in slot 4 has role 3
  The device in slot 5 has role 4
  The device in slot 6 has failed
  The devices in slots 7 and 8 have roles 5 and 6
So this device is in role 6.
This can be confirmed by looking at the "Array State" line:
   Array State : u__uuuU 4 failed

There is no device ('_') with role 1 or 2; this device ('U') is in the
last role, role 6, and roles 0,3,4,5 are filled by other working
devices in the array ('u').


> 
> 3)  It appears the slot number minus 1 is the member number?  That
>     is, if I do an examine on /dev/sdc1, it tells me it's got slot 5
>     of md1.  But when I do an assemble operation with the --verbose
>     flag, it says /dev/sdc1 was "added as 4".  The reason I ask is:
>     if that's true, what would a slot number of 0 mean in terms of
>     what --assemble is supposed to add it as?  When I do the
>     assemble, it's added as 0, which I don't understand if the slot
>     number is supposed to be one higher than the member number.

There is no simple arithmetical relationship between slot number and
member number (aka 'role').  They are assigned independently.

> 
> 4)  /dev/sdf1 (the new device name) thinks it's part of md2 (when I
>     do an examine), but can't be, because md2 is all seagate and
>     already has 7 members in it (the right number of drives).  So it
>     must be part of md1, which is missing a member.  When I first
>     tried to reassemble md1, it said it only found 5 good drives,
>     and couldn't start; now it says it only finds 4 good drives.
>     So I assume sdf1 is one of the 5 good ones but got a weird
>     superblock written to it.  Other than the drive hours trick I
>     thought of earlier, is there any way to determine what its slot
>     number should have been, since I am missing slots 1, 2, and 3,
>     and have 3 candidates for slot 0? 

No.  That information lives in the superblock and is displayed by
--examine.  If the superblock has been corrupted somehow, then that
information is gone.

> 
> Lastly, it REALLY would make life a LOT easier if the device names
> didn't change every time a drive is plugged into a different
> controller slot, or if controller slots didn't change based on
> boot order, etc.  It is a pain in the rear when you have a hardware
> outage, or a disk that isn't detected properly on boot and is then
> hot-added, to have its /dev/sdX1 name change.  I know it's not md's
> fault that it works this way, but in a hot-swap world it makes it
> very hard to document drive configurations and map devices under
> Linux to physical drives. 

Yes, it can be a pain.  The various links in e.g. /dev/disk/by-uuid are
supposed to make that a bit more manageable.  Whether they succeed is
less clear.

NeilBrown

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: Need urgent help in fixing raid5 array
  2009-01-06  6:24                   ` Mike Myers
@ 2009-01-06 23:31                     ` Neil Brown
  2009-01-06 23:54                       ` Mike Myers
  2009-01-13  5:38                       ` Mike Myers
  0 siblings, 2 replies; 46+ messages in thread
From: Neil Brown @ 2009-01-06 23:31 UTC (permalink / raw)
  To: Mike Myers; +Cc: Justin Piszcz, linux-raid, john lists

On Monday January 5, mikesm559@yahoo.com wrote:
> BTW, in the original email I sent that had the --examine info for
> each of these array members, three devices have the same device UUID
> and array slot, and two of them share an older event count, and one
> has a slightly newer event count.  Which of these should be the real
> array slot 0?  And I notice that one of the members in that email
> had a device UUID that I can't find anymore (I suspect it's the
> current sdf1 that thinks it's part of md2).  In that email, it had
> array slot 4, which is one of the missing devices in the current
> family (that I assume --assemble would add as "3").  It also has
> 9663 hours on it, which makes it part of the original set of 4
> members for this raid5 array.  The drive in slot 5 only has 7630
> hours on it, so it should have been added later as part of a --grow
> operation. 
> 
> Does all that make sense?  If so, then sdb1 (which says it's slot
> 0), sdi1 (at 9671 hours, which also thinks it's slot 0), sdj1 (at
> 9194 hours, which also says it's 0), and sdf1 (at 9663 hours, which
> apparently used to think it was slot 4) should be the original 4
> drives of the array.  How can I figure out which is the real slot 0,
> and which are slots 1 and 2, if sdb1, sdi1 and sdj1 all have the
> same event count, array slot id (0) and device UUID? 

I had noticed the slot number was repeated.  I hadn't noticed the
device uuid was the same, though I guess that makes sense.  Somehow
the superblock for one device has been written to the other devices.
It is not really possible to be sure which is the original without
knowing how this happened, though I suspect that the one with the
higher event count is more likely to be the original.

Being a software guy, I tend to like to blame hardware, and I wonder
if your problematic backplane managed to send write requests to the
wrong drive somehow.  If it did, then my expectation of your success
just went down a few notches. :-(

The only way to find out which device is which is to try various
combinations and see which gives you access to the most
consistent data.

> 
> This is way harder work than should be needed to fix a problem.  :-)
> But I am sure glad you gurus know how this stuff is supposed to
> work! 

I'm happy to help as much as I can... I just hope your hardware hasn't
done too much damage...

NeilBrown

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: Need urgent help in fixing raid5 array
  2009-01-06 23:31                     ` Neil Brown
@ 2009-01-06 23:54                       ` Mike Myers
  2009-01-07  0:19                         ` NeilBrown
  2009-01-13  5:38                       ` Mike Myers
  1 sibling, 1 reply; 46+ messages in thread
From: Mike Myers @ 2009-01-06 23:54 UTC (permalink / raw)
  To: Neil Brown; +Cc: Justin Piszcz, linux-raid, john lists

Thanks for this and the previous explanation of how roles and slots work.  I should be able to try a few combinations and see.  At this point, I am not sure if the issue was caused by a bad backplane, a bad controller, or a bad disk.  I can't tell for sure that the backplane was bad, but I have a replacement sitting at my desk now, so I can go ahead and replace it just to be sure.  The LSI MPT controller that failed was connected only to drives in md2, but that array is up and running fine, so I don't think it broke something when it failed.

I had seen two SMART alerts indicating a drive was failing, which is what caused me to try to replace the kicked drive with a new one and do a rebuild; that was the event that started this chain of events.  I swapped the drive (part of md1), but the OS did not indicate that the SATA port went down, and it did not init the new drive.  When I rebooted the system (suspecting a temporary problem with the controller), everything went to hell.  I suspect this initial failure was due to the backplane problem, but there may have been some corruption on the disks as well.

I may have fat-fingered something after the reboot that caused the bad superblock to be written to sdf1: the device names may have changed on boot without my catching it (I may have done a hot swap a month ago when I had my first near-death experience with md2), leading me to use the wrong device in an mdadm command.  But it's hard to tell now.

With 15 hot-swap drives in the system, I can tell you that device-name changing is fraught with peril.  I am unfamiliar with the /dev/disk/by-uuid functionality.  Is that documented in a howto somewhere?  How is that supposed to work?

thx
mike

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: Need urgent help in fixing raid5 array
  2009-01-06 23:54                       ` Mike Myers
@ 2009-01-07  0:19                         ` NeilBrown
  0 siblings, 0 replies; 46+ messages in thread
From: NeilBrown @ 2009-01-07  0:19 UTC (permalink / raw)
  To: Mike Myers; +Cc: Justin Piszcz, linux-raid, john lists

On Wed, January 7, 2009 10:54 am, Mike Myers wrote:

>
> With 15 hotswap drives in the system, I can tell you that device name
> changing is fraught with peril.  I am unfamilar with the /dev/disk/by-uuid
> functionality.  Is that documented in a howto somewhere?  How is that
> supposed to work?
>

/dev/disk/by-id is probably the one you want.
I don't know about documentation, but if you

 ls -l /dev/disk/by-id | grep -v part

and have a look it should make sense.
I get:
lrwxrwxrwx 1 root root  9 2009-01-07 11:10 ata-ST3160812AS_4LS5YYHQ -> ../../sda
lrwxrwxrwx 1 root root  9 2009-01-07 11:10 ata-ST3160812AS_4LS5YZDG -> ../../sdf
lrwxrwxrwx 1 root root  9 2009-01-07 11:10 ata-ST3160812AS_4LS5YZJQ -> ../../sdc
lrwxrwxrwx 1 root root  9 2009-01-07 11:10 ata-ST3160812AS_4LS5Z05D -> ../../sdd
lrwxrwxrwx 1 root root  9 2009-01-07 11:10 ata-ST3160812AS_4LS5Z0B6 -> ../../sde
lrwxrwxrwx 1 root root  9 2009-01-07 11:10 ata-ST3160812AS_4LS5Z0BH -> ../../sdb
lrwxrwxrwx 1 root root 11 2009-01-07 11:10 md-uuid-8fd0af3f:4fbb94ea:12cc2127:f9855db5 -> ../../md_d6
lrwxrwxrwx 1 root root  9 2009-01-07 11:10 scsi-SATA_ST3160812AS_4LS5YYHQ -> ../../sda
lrwxrwxrwx 1 root root  9 2009-01-07 11:10 scsi-SATA_ST3160812AS_4LS5YZDG -> ../../sdf
lrwxrwxrwx 1 root root  9 2009-01-07 11:10 scsi-SATA_ST3160812AS_4LS5YZJQ -> ../../sdc
lrwxrwxrwx 1 root root  9 2009-01-07 11:10 scsi-SATA_ST3160812AS_4LS5Z05D -> ../../sdd
lrwxrwxrwx 1 root root  9 2009-01-07 11:10 scsi-SATA_ST3160812AS_4LS5Z0B6 -> ../../sde
lrwxrwxrwx 1 root root  9 2009-01-07 11:10 scsi-SATA_ST3160812AS_4LS5Z0BH -> ../../sdb


which shows you my 6 drives (each listed twice) with their model numbers
and serial numbers.
If I move the drives around, they will get different /dev/sdX names, but
if you look at the lines in disk/by-id you can be sure that a given name
will always refer to the same physical device.
This is all managed by 'udev'.
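
One handy consequence: the stable names work anywhere a device path
does, so commands survive reshuffles.  For example (using one of my
drives above; partitions get a -partN style suffix):

  smartctl -a /dev/disk/by-id/ata-ST3160812AS_4LS5YYHQ
  mdadm --examine /dev/disk/by-id/ata-ST3160812AS_4LS5YYHQ-part1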

NeilBrown


^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: Need urgent help in fixing raid5 array
  2009-01-06 23:31                     ` Neil Brown
  2009-01-06 23:54                       ` Mike Myers
@ 2009-01-13  5:38                       ` Mike Myers
  2009-01-13  5:57                         ` Mike Myers
  1 sibling, 1 reply; 46+ messages in thread
From: Mike Myers @ 2009-01-13  5:38 UTC (permalink / raw)
  To: Neil Brown; +Cc: Justin Piszcz, linux-raid, john lists

Ok, still more help needed.  I finally got enough time scheduled tonight to try recreating the raid array as per our conversation.  When doing the create as you outlined in your earlier post, mdadm -C says the first two disks are part of an existing raid array (I assume this is a normal "error" for this sort of situation and can be ignored), but for each of the last 4 devices I specify on the command line, it says: mdadm: cannot open /dev/sdc1: Device or resource busy (it gives this error for each of the 4 devices).

The devices are online though.  I can do an mdadm --examine on them, dd from them, and do smartctl operations on them.  Why would md think they were busy?

Thx
Mike

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: Need urgent help in fixing raid5 array
  2009-01-13  5:38                       ` Mike Myers
@ 2009-01-13  5:57                         ` Mike Myers
  0 siblings, 0 replies; 46+ messages in thread
From: Mike Myers @ 2009-01-13  5:57 UTC (permalink / raw)
  To: Mike Myers, Neil Brown; +Cc: Justin Piszcz, linux-raid, john lists

Figured this out.  I had to stop md1 even though md couldn't assemble it; the "good" devices were still held by the half-assembled array.
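
In other words, roughly (the create line stands for whatever
permutation is being tested):

  cat /proc/mdstat          # md1 still shows up (inactive), holding its members
  mdadm --stop /dev/md1     # release the "good" devices
  mdadm --create /dev/md1 ...   # now the create can open them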

Will let you know how it goes.

thx
mike

^ permalink raw reply	[flat|nested] 46+ messages in thread

end of thread, other threads:[~2009-01-13  5:57 UTC | newest]

Thread overview: 46+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
     [not found] <451872.61166.qm@web30802.mail.mud.yahoo.com>
2009-01-01 15:40 ` Need urgent help in fixing raid5 array Justin Piszcz
2009-01-01 17:51   ` Mike Myers
2009-01-01 18:29     ` Justin Piszcz
2009-01-01 18:40       ` Jon Nelson
2009-01-01 20:38         ` Mike Myers
2009-01-02  6:19       ` Mike Myers
2009-01-02 12:10         ` Justin Piszcz
2009-01-02 18:12           ` Mike Myers
2009-01-02 18:22             ` Justin Piszcz
2009-01-02 18:46               ` Mike Myers
2009-01-02 18:57                 ` Justin Piszcz
2009-01-02 20:46                   ` Mike Myers
2009-01-02 20:56                   ` Mike Myers
2009-01-02 21:37                   ` Mike Myers
2009-01-03  4:19                   ` Mike Myers
2009-01-03  4:43                     ` Guy Watkins
2009-01-03  5:02                       ` Mike Myers
2009-01-03 12:46                         ` John Robinson
2009-01-03 15:49                           ` Mike Myers
2009-01-03 16:14                             ` John Robinson
2009-01-03 16:47                               ` Mike Myers
2009-01-03 19:03                               ` Mike Myers
2009-01-05 22:11         ` Neil Brown
2009-01-05 22:22           ` Mike Myers
2009-01-05 22:53             ` NeilBrown
2009-01-06  2:46               ` Mike Myers
2009-01-06  4:00                 ` NeilBrown
2009-01-06  5:55                   ` Mike Myers
2009-01-06 23:23                     ` Neil Brown
2009-01-06  6:24                   ` Mike Myers
2009-01-06 23:31                     ` Neil Brown
2009-01-06 23:54                       ` Mike Myers
2009-01-07  0:19                         ` NeilBrown
2009-01-13  5:38                       ` Mike Myers
2009-01-13  5:57                         ` Mike Myers
2009-01-01 15:31 Mike Myers
  -- strict thread matches above, loose matches on Subject: below --
2008-12-05 17:03 Mike Myers
2008-12-06  0:18 ` Mike Myers
2008-12-06  0:24   ` Justin Piszcz
2008-12-06  0:47     ` Mike Myers
2008-12-06  0:51       ` Justin Piszcz
2008-12-06  0:58         ` Mike Myers
2008-12-06 19:02         ` Mike Myers
2008-12-06 19:30           ` Mike Myers
2008-12-06 20:14             ` Mike Myers
2008-12-06  0:52     ` David Lethe

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).