* Need urgent help in fixing raid5 array
@ 2008-12-05 17:03 Mike Myers
From: Mike Myers @ 2008-12-05 17:03 UTC (permalink / raw)
To: linux-raid
I have a problem repairing a raid5 array that I really need some help with. I must be missing something here.
I have 2 raid5 arrays combined with LVM into a common logical volume, with XFS running on top of that. Both arrays have 7 1 TB disks in them. I moved a controller card around so that I could install a new Intel gigabit Ethernet card in one of the PCI-E slots. That went fine, except one of the SATA cables got knocked loose, so one of the disks in /dev/md2 went offline. Linux booted fine, started md2 with 6 elements in it, and everything was fine with md2 in a degraded state. I fixed the cable problem and hot-added that drive to the array, but since it was now out of sync, md began a rebuild. No problem.
Around 60% through the resync, smartd started reporting problems with one of the other drives in the array. Then that drive was ejected from the degraded array, causing the raid to stop and the LVM volume to go offline. Ugh...
Ok, so it looks from the SMART data that that disk had been having a lot of problems and was failing. As it happens, I had a new 1 TB disk arrive the same day, and I pressed it into service here. I used sfdisk -d olddisk | sfdisk newdisk to copy the partition table from the old drive to the new one, and then used ddrescue to copy the data from the old partition (/dev/sdo1) to the new one (/dev/sdp1). That worked pretty well; just 12 kB couldn't be recovered.
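In case it helps anyone searching the archives later, that procedure boils down to something like this (a sketch; the ddrescue map file argument is an extra, not part of the commands above, but it lets an interrupted copy resume):

    # copy the partition table from the failing disk to its replacement
    sfdisk -d /dev/sdo | sfdisk /dev/sdp

    # copy the raid member partition, skipping unreadable sectors; the
    # map file (assumed path) records progress so the copy can resume
    ddrescue /dev/sdo1 /dev/sdp1 /root/sdo1.map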
So I remove the old disk, re-add the new disk, and attempt to start the array with the new (cloned) 1 TB disk in the old disk's stead. Even though the UUIDs, magic numbers and events fields are all the same, md thinks the cloned disk is a spare, and doesn't start the array. What am I missing here? Why doesn't it view it as a member, the way it did the old disk, and just start it?
thx
mike
* Re: Need urgent help in fixing raid5 array
@ 2008-12-06 0:18 Mike Myers
From: Mike Myers @ 2008-12-06 0:18 UTC (permalink / raw)
To: Mike Myers, linux-raid

Anyone? What am I missing here?

thx
mike
* Re: Need urgent help in fixing raid5 array
@ 2008-12-06 0:24 Justin Piszcz
From: Justin Piszcz @ 2008-12-06 0:24 UTC (permalink / raw)
To: Mike Myers; +Cc: linux-raid

You can try this as a last resort:
http://www.mail-archive.com/linux-raid@vger.kernel.org/msg07815.html

(mdadm with --create and --assume-clean) but only use this as a last resort. When I had two disk failures, I was able to see some of the data, but ultimately it was lost. Bottom line? I don't use raid5 anymore, raid6 only. The 3ware docs recommend that if you use more than 4 disks you should use raid6 if you have the capability, and I agree.

Some others on the list may have other, less intrusive ideas; only use the above method as a LAST RESORT. I was able to assemble the array, but I had problems getting xfs_repair to fix the filesystem.
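For context, the recreate in that link amounts to roughly the sketch below. Every parameter (level, chunk size, layout, device order) must match the original array exactly or the data is scrambled for good; the values here are illustrative only, taken from the --examine output posted later in this thread:

    # LAST RESORT: rewrites the superblocks in place; data blocks stay
    # intact only if every parameter matches the original array.
    # Confirm level, chunk, layout and slot order with mdadm --examine
    # on each member before running anything like this.
    mdadm --create /dev/md2 --assume-clean --level=5 --chunk=128 \
          --layout=left-symmetric --raid-devices=7 \
          missing /dev/sdf1 /dev/sdj1 /dev/sdb1 /dev/sdc1 /dev/sdi1 /dev/sdk1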
* Re: Need urgent help in fixing raid5 array
@ 2008-12-06 0:47 Mike Myers
From: Mike Myers @ 2008-12-06 0:47 UTC (permalink / raw)
To: Justin Piszcz; +Cc: linux-raid

Thanks very much. All the disks I am trying to assemble have the same event count and UUID, which is why I don't understand why it's not assembling. I'll try the assume-clean option and see if that helps.

It would be great to understand how md determines if the drives are in sync with each other. I thought the UUID and event count was all you needed...

thx
mike
* Re: Need urgent help in fixing raid5 array
@ 2008-12-06 0:51 Justin Piszcz
From: Justin Piszcz @ 2008-12-06 0:51 UTC (permalink / raw)
To: Mike Myers; +Cc: linux-raid

Only use it as a LAST resort. Check the mailing list, or wait for Neil or someone else who has had a similar issue and can maybe help more here.
* Re: Need urgent help in fixing raid5 array
@ 2008-12-06 0:58 Mike Myers
From: Mike Myers @ 2008-12-06 0:58 UTC (permalink / raw)
To: Justin Piszcz; +Cc: linux-raid

Ok, I'll wait some more before doing that and see if Neil or someone else pipes in. I am really trying to avoid recreating the superblock structures, though it's pretty clear what the sequence of devices is from doing examines on them.

thx
mike
* Re: Need urgent help in fixing raid5 array
@ 2008-12-06 19:02 Mike Myers
From: Mike Myers @ 2008-12-06 19:02 UTC (permalink / raw)
To: Justin Piszcz; +Cc: linux-raid

Ok, here is some more info on this odd problem. I have seven 1 TB drives in the raid5 array:

sdb1 sdc1 sdf1 sdh1 sdi1 sdj1 sdk1

They all have the same UUID, and Events is the same for each except sdk1, which I think was the disk being resynced. As I understand it, when the array is being resynced, the events counter on the newly added drive is different than the others. The other drives have the same events and UUID, and all show clean.

When I try to assemble the array of the 7 drives, it tells me there are 5 drives and 1 spare, not enough to start the array. If I remove sdk1 (the drive with the different event number on it), I get the exact same message. By removing one drive at a time from the assemble command, I determined that md thinks that sdh1 is the spare, even though the events are the same, the UUID is the same, and the checksum says it's ok. Why does it think this drive is a spare and not a data drive? I had cloned the data drive that had failed and got almost everything copied over, all but 12 kB, so I think it's fine and is not being a problem.

How does md decide which drive is a spare and which is an active synced drive, etc.? I can't seem to find a document that outlines all this.

Thx
mike
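A quick way to compare what each superblock actually claims, side by side (a sketch over the seven members listed above; full --examine output follows in the next message):

    # print the event count, state, and this drive's own slot/role
    # as recorded in each member's superblock
    for d in /dev/sd[bcfhijk]1; do
        echo "== $d"
        mdadm --examine $d | egrep 'Events|State|this'
    done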
* Re: Need urgent help in fixing raid5 array
@ 2008-12-06 19:30 Mike Myers
From: Mike Myers @ 2008-12-06 19:30 UTC (permalink / raw)
To: Mike Myers, Justin Piszcz; +Cc: linux-raid

Here is the examine from sdk1 (the drive with the different event count):

/dev/sdk1:
          Magic : a92b4efc
        Version : 00.90.00
           UUID : e70e0697:a10a5b75:941dd76f:196d9e4e
  Creation Time : Tue Aug 19 21:31:10 2008
     Raid Level : raid5
  Used Dev Size : 976759936 (931.51 GiB 1000.20 GB)
     Array Size : 5860559616 (5589.07 GiB 6001.21 GB)
   Raid Devices : 7
  Total Devices : 7
Preferred Minor : 2

    Update Time : Thu Dec  4 15:32:09 2008
          State : clean
 Active Devices : 6
Working Devices : 7
 Failed Devices : 0
  Spare Devices : 1
       Checksum : ab1934d5 - correct
         Events : 0.1436484

         Layout : left-symmetric
     Chunk Size : 128K

      Number   Major   Minor   RaidDevice   State
this     6       8      161        6        active sync   /dev/sdk1

   0     0       0        0        0        removed
   1     1       8       81        1        active sync   /dev/sdf1
   2     2       8      145        2        active sync   /dev/sdj1
   3     3       8       17        3        active sync   /dev/sdb1
   4     4       8       33        4        active sync   /dev/sdc1
   5     5       8      129        5        active sync   /dev/sdi1
   6     6       8      161        6        active sync   /dev/sdk1
   7     7       8      113        7        spare         /dev/sdh1

Here is the examine from sdh1 (which I thought was the disk being replaced but now appears to be the spare):

/dev/sdh1:
          Magic : a92b4efc
        Version : 00.90.00
           UUID : e70e0697:a10a5b75:941dd76f:196d9e4e
  Creation Time : Tue Aug 19 21:31:10 2008
     Raid Level : raid5
  Used Dev Size : 976759936 (931.51 GiB 1000.20 GB)
     Array Size : 5860559616 (5589.07 GiB 6001.21 GB)
   Raid Devices : 7
  Total Devices : 8
Preferred Minor : 2

    Update Time : Fri Dec  5 08:15:16 2008
          State : clean
 Active Devices : 5
Working Devices : 7
 Failed Devices : 1
  Spare Devices : 2
       Checksum : ab1a2d37 - correct
         Events : 0.1438064

         Layout : left-symmetric
     Chunk Size : 128K

      Number   Major   Minor   RaidDevice   State
this     8       8      113        8        spare         /dev/sdh1

   0     0       0        0        0        removed
   1     1       8       81        1        active sync   /dev/sdf1
   2     2       8      145        2        active sync   /dev/sdj1
   3     3       8       17        3        active sync   /dev/sdb1
   4     4       8       33        4        active sync   /dev/sdc1
   5     5       8      129        5        active sync   /dev/sdi1
   6     6       0        0        6        faulty removed
   7     7       8      241        7        spare         /dev/sdp1
   8     8       8      113        8        spare         /dev/sdh1

And here is the output of the examine of a known good member, sdb1:

/dev/sdb1:
          Magic : a92b4efc
        Version : 00.90.00
           UUID : e70e0697:a10a5b75:941dd76f:196d9e4e
  Creation Time : Tue Aug 19 21:31:10 2008
     Raid Level : raid5
  Used Dev Size : 976759936 (931.51 GiB 1000.20 GB)
     Array Size : 5860559616 (5589.07 GiB 6001.21 GB)
   Raid Devices : 7
  Total Devices : 8
Preferred Minor : 2

    Update Time : Fri Dec  5 08:15:16 2008
          State : clean
 Active Devices : 5
Working Devices : 7
 Failed Devices : 1
  Spare Devices : 2
       Checksum : ab1a2cd3 - correct
         Events : 0.1438064

         Layout : left-symmetric
     Chunk Size : 128K

      Number   Major   Minor   RaidDevice   State
this     3       8       17        3        active sync   /dev/sdb1

   0     0       0        0        0        removed
   1     1       8       81        1        active sync   /dev/sdf1
   2     2       8      145        2        active sync   /dev/sdj1
   3     3       8       17        3        active sync   /dev/sdb1
   4     4       8       33        4        active sync   /dev/sdc1
   5     5       8      129        5        active sync   /dev/sdi1
   6     6       0        0        6        faulty removed
   7     7       8      241        7        spare         /dev/sdp1
   8     8       8      113        8        spare         /dev/sdh1

Any more ideas as to what's going on?

Thanks,
Mike
* Re: Need urgent help in fixing raid5 array
@ 2008-12-06 20:14 Mike Myers
From: Mike Myers @ 2008-12-06 20:14 UTC (permalink / raw)
To: Mike Myers, Justin Piszcz; +Cc: linux-raid

Ok, I seem to have recovered... Once I realized that even though the event number on sdk1 was slightly different than the rest, I could confirm it was the new drive that had the cloned data from the original failing drive, I did an assemble with the --force option, and the array came up just fine. I rebooted for good measure, and LVM and XFS came up fine on boot, and all the files are there and perfectly accessible. There was about 12 kB of data that couldn't be recovered, but since this storage volume stores mostly large TV recordings, I think it will be ok. It would have been very hard to track down which file those sectors were in, in any case. I then added the spare again and the system is now rebuilding just fine (should be done in about 5 hrs)...

Thanks for all the advice everyone.

Thx
mike
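For the archives, the recovery that worked here boils down to the sequence below (a sketch; member names and slot order as in the --examine output above, and it is sdk1's higher event count that --force overrides):

    # make sure nothing is holding the members, then force-assemble
    mdadm --stop /dev/md2
    mdadm --assemble --force /dev/md2 \
          /dev/sdb1 /dev/sdc1 /dev/sdf1 /dev/sdi1 /dev/sdj1 /dev/sdk1

    # re-add the spare; the rebuild starts automatically
    mdadm /dev/md2 --add /dev/sdh1

    # watch resync progress
    cat /proc/mdstat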
* RE: Need urgent help in fixing raid5 array
@ 2008-12-06 0:52 David Lethe
From: David Lethe @ 2008-12-06 0:52 UTC (permalink / raw)
To: Justin Piszcz, Mike Myers; +Cc: linux-raid

Mike -
Really strongly consider bringing in a pro, or at the very least buying some scratch disks so you can work with copies. Forced assembly doesn't look at the data if you are degraded, so it has high potential to make things worse. Too many things going on here...

If the data is backed up and you just want to save some time with an experiment, then go forward with assume-clean. But it will likely destroy a large chunk of your data in the process, destroy it forever.

David
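Working with copies can be as simple as imaging each member onto a scratch disk before experimenting (a sketch; sdX1 and sdY1 are placeholders for a member and a scratch partition of at least the same size):

    # image one array member onto a scratch partition, padding
    # unreadable blocks instead of aborting on the first error
    dd if=/dev/sdX1 of=/dev/sdY1 bs=1M conv=noerror,sync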
* Re: Need urgent help in fixing raid5 array
@ 2009-01-01 15:31 Mike Myers
From: Mike Myers @ 2009-01-01 15:31 UTC (permalink / raw)
To: linux-raid
Well, thanks for all your help last month. As I posted, things came back up and I survived the failure. Now I have yet another problem. :( After 5 years of running a Linux server as a dedicated NAS, I am hitting some very weird problems. This server started as a single-processor AMD system with 4 320 GB drives, and has been upgraded multiple times so that it is now a quad-core Intel rackmounted 4U system with 14 1 TB drives, and I have never lost data in any of the upgrades of CPU, motherboard, disk controller hardware and disk drives. Now, after last month's near-death experience, I am faced with another serious problem in less than a month. Any help you guys could give me would be most appreciated. This is a sucky way to start the new year.
The array I had problems with last month (md2, comprised of 7 1 TB drives in a RAID5 config) is running just fine.
md1, which is built of 7 1 TB Hitachi 7K1000 drives, is now having problems. We returned from a 10-day family visit with everything running just fine. There was a brief power outage today, about 3 minutes, but I can't see how that could be related, as the server is on a high quality rackmount 3U APC UPS that handled the outage just fine. I was working on the system getting X to work again after an nvidia driver update, and when that was working fine, I checked the disks to discover that md1 was in a degraded state, with /dev/sdl1 kicked out of the array (removed). I tried to do a dd from the drive to verify its location in the rack, but I got an I/O error. This was most odd, so I went to the rack and pulled the disk and reinserted it. No system log entries recorded the device being pulled or re-installed. So I am thinking that a cable has somehow come loose. I power the system down, pull it out of the rack, look at the cable that goes to the drive, and everything looks fine.
So I reboot the system, and now the array won't come online because, in addition to the drive that shows as (removed), one of the other drives shows as a faulty spare. Well, learning from the last go-around, I reassemble the array with the --force option, and the array comes back up. But LVM won't come back up because it sees the physical volume that maps to md1 as missing. Now I am very concerned. After trying a bunch of things, I do a pvcreate with the missing UUID on md1, restart the vg, and the logical volume comes back up. I was thinking I may have told lvm to use an array of bad data, but to my surprise, I mounted the filesystem and everything looked intact! Ok, sometimes you win. So I do one more reboot to get the system back up in multiuser so I can back up some of the more important media stored on the volume (it's got about 10 TB used, but most of that is PVR recordings; there is a lot of ripped music and DVDs that I really don't want to re-rip) on another server that has some space on it, while I figure out what has been happening.
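For anyone hitting the same "missing PV" state, the usual sequence is something like this (a sketch; the UUID and the placeholder names in angle brackets are hypothetical, the real ones come from the missing-PV error message and the newest metadata backup under /etc/lvm/archive):

    # recreate the PV label on md1 with the UUID that LVM expects
    pvcreate --uuid "<missing-pv-uuid>" \
             --restorefile /etc/lvm/archive/<vgname>_00042.vg /dev/md1

    # restore the VG metadata and reactivate the volume group
    vgcfgrestore --file /etc/lvm/archive/<vgname>_00042.vg <vgname>
    vgchange -ay <vgname>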
The reboot again fails because of a problem with md1. This time, another one of the drives shows as removed (/dev/sdm1), and I can't reassemble the array with a --force option. It is acting like /dev/sdl1 (the other removed unit): even though I can read from the drives fine, their UUIDs are fine, etc., md does not consider them part of the array. /dev/sdo1 (which was the drive that looked like a faulty spare) seems OK when trying to do the assemble. sdm1 seemed just fine before the reboot and was showing no problems. The two are not hooked up on the same controller cable (a SAS-to-SATA fanout), and the LSI MPT controller card seems to talk to the other disks just fine.

Anyway, I have no idea as to what's going on. When I try to add sdm1 or sdl1 back into the array, md complains the device is busy, which is very odd because it's not part of another array or doing anything else in the system.
Any idea as to what could be happening here? I am beyond frustrated.
thanks,
Mike
* Re: Need urgent help in fixing raid5 array
@ 2009-01-01 15:40 Justin Piszcz
From: Justin Piszcz @ 2009-01-01 15:40 UTC (permalink / raw)
To: Mike Myers; +Cc: linux-raid, john lists

If you are using a hotswap chassis, then it has some sort of SATA backplane. I have seen backplanes go bad in the past; that would be my first replacement.

Justin.
* Re: Need urgent help in fixing raid5 array
  2009-01-01 15:40 ` Justin Piszcz
@ 2009-01-01 17:51   ` Mike Myers
  2009-01-01 18:29     ` Justin Piszcz
  0 siblings, 1 reply; 46+ messages in thread
From: Mike Myers @ 2009-01-01 17:51 UTC (permalink / raw)
To: Justin Piszcz; +Cc: linux-raid, john lists

The disks that are problematic are still online as far as the OS can
tell. I can dd from them and pull off data at normal speeds, so if the
backplane were the problem, I don't understand why that would work. I
can try moving them to another slot, however (I have a 20-slot SATA
backplane in there), and see if that changes how md deals with them.

The OS sees the drive and it inits fine, but md shows it as removed and
won't let me add it back to the array because "the device is busy". I
guess I don't understand the criteria md uses to accept a drive back.
The UUID looks fine, and if the event count is off, then the -f flag
should take care of that. I've never seen a "device busy" failure on an
add before.

thx
mike

----- Original Message ----
From: Justin Piszcz <jpiszcz@lucidpixels.com>
To: Mike Myers <mikesm559@yahoo.com>
Cc: linux-raid@vger.kernel.org; john lists <john4lists@gmail.com>
Sent: Thursday, January 1, 2009 7:40:21 AM
Subject: Re: Need urgent help in fixing raid5 array

> If you are using a hotswap chassis, then it has some sort of SATA
> backplane. I have seen backplanes go bad in the past; that would be
> my first replacement.
[snip]

^ permalink raw reply	[flat|nested] 46+ messages in thread
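(A sketch of how a "device busy" on --add is usually run down -- the
device name follows this thread, but the half-assembled-array explanation
is an assumption to be verified, not a diagnosis:)

    # an inactive, partially assembled array still claims its members
    cat /proc/mdstat          # is md1 listed as inactive with (S) members?
    mdadm --stop /dev/md1     # releases devices claimed by the inactive array
    lsof /dev/sdl1            # anything else holding the device open?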
* Re: Need urgent help in fixing raid5 array
  2009-01-01 17:51 ` Mike Myers
@ 2009-01-01 18:29   ` Justin Piszcz
  2009-01-01 18:40     ` Jon Nelson
  2009-01-02  6:19     ` Mike Myers
  0 siblings, 2 replies; 46+ messages in thread
From: Justin Piszcz @ 2009-01-01 18:29 UTC (permalink / raw)
To: Mike Myers; +Cc: linux-raid, john lists

I think some output would be pertinent here:

mdadm -D /dev/md0..1..2 etc

cat /proc/mdstat

dmesg/syslog of the errors you are seeing etc

On Thu, 1 Jan 2009, Mike Myers wrote:
[snip]

^ permalink raw reply	[flat|nested] 46+ messages in thread
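(The requested output can be gathered in one pass -- a sketch; the md
device names match this thread, and the member glob is an assumption:)

    mdadm -D /dev/md1 /dev/md2         # array-level detail
    cat /proc/mdstat                   # kernel's view of all arrays
    mdadm --examine /dev/sd[b-o]1      # per-member superblocks
    dmesg | tail -n 100                # recent controller/disk errors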
* Re: Need urgent help in fixing raid5 array
  2009-01-01 18:29 ` Justin Piszcz
@ 2009-01-01 18:40   ` Jon Nelson
  2009-01-01 20:38     ` Mike Myers
  1 sibling, 1 reply; 46+ messages in thread
From: Jon Nelson @ 2009-01-01 18:40 UTC (permalink / raw)
To: Justin Piszcz; +Cc: Mike Myers, linux-raid, john lists

Also the contents of /etc/mdadm.conf

On Thu, Jan 1, 2009 at 12:29 PM, Justin Piszcz <jpiszcz@lucidpixels.com> wrote:
> I think some output would be pertinent here:
>
> mdadm -D /dev/md0..1..2 etc
>
> cat /proc/mdstat
>
> dmesg/syslog of the errors you are seeing etc
[snip]

--
Jon

^ permalink raw reply	[flat|nested] 46+ messages in thread
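(One way to sanity-check that file against what is actually on disk -- a
sketch, assuming nothing about the local config beyond what the thread
shows:)

    mdadm --detail --scan     # ARRAY lines for currently running arrays
    mdadm --examine --scan    # ARRAY lines reconstructed from superblocks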
* Re: Need urgent help in fixing raid5 array
  2009-01-01 18:40 ` Jon Nelson
@ 2009-01-01 20:38   ` Mike Myers
  0 siblings, 0 replies; 46+ messages in thread
From: Mike Myers @ 2009-01-01 20:38 UTC (permalink / raw)
To: Jon Nelson, Justin Piszcz; +Cc: linux-raid, john lists

Thanks for the help. After a night of being powered off, it now appears
that one of the 8-port MPT cards won't initialize properly and be seen
by the BIOS (and consequently by Linux). The funny thing is that this
card didn't connect any of the troublesome drives, but rather the drives
in md2, which were working just fine. It looks like a hardware failure,
since swapping slots doesn't fix it, and if I swap the 8087 connectors,
whatever drives are on the "good" card are detected and come up in
Linux.

I have an old Marvell-based 4-port SATA controller that I will dig out
and see if I can get running; if I move the ports around, I should be
able to get all the drives visible again.

I must have sinned grievously somehow to suffer this problem on the
first day of the year. And I don't even work for a Wall Street firm.
More later.

thx
mike

----- Original Message ----
From: Jon Nelson <jnelson-linux-raid@jamponi.net>
To: Justin Piszcz <jpiszcz@lucidpixels.com>
Cc: Mike Myers <mikesm559@yahoo.com>; linux-raid@vger.kernel.org; john lists <john4lists@gmail.com>
Sent: Thursday, January 1, 2009 10:40:18 AM
Subject: Re: Need urgent help in fixing raid5 array

> Also the contents of /etc/mdadm.conf
[snip]

^ permalink raw reply	[flat|nested] 46+ messages in thread
* Re: Need urgent help in fixing raid5 array
  2009-01-01 18:29 ` Justin Piszcz
  2009-01-01 18:40   ` Jon Nelson
@ 2009-01-02  6:19   ` Mike Myers
  2009-01-02 12:10     ` Justin Piszcz
  2009-01-05 22:11     ` Neil Brown
  1 sibling, 2 replies; 46+ messages in thread
From: Mike Myers @ 2009-01-02  6:19 UTC (permalink / raw)
To: Justin Piszcz; +Cc: linux-raid, john lists

Ok, the bad MPT board is out, replaced by a SI3132, and I rejiggered the
drives around so that all the drives are connected. It brought me back
to the main problem: md2 is running fine; md1 cannot assemble with only
5 drives out of the 7. Here is the data you requested:

(none):~ # cat /etc/mdadm.conf
DEVICE partitions
ARRAY /dev/md0 level=raid0 UUID=9412e7e1:fd56806c:0f9cc200:95c7ed98
ARRAY /dev/md3 level=raid0 UUID=67999c69:4a9ca9f9:7d4d6b81:91c98b1f
ARRAY /dev/md1 level=raid5 UUID=b737af5c:7c0a70a9:99a648a0:7f693c7d
ARRAY /dev/md2 level=raid5 UUID=e70e0697:a10a5b75:941dd76f:196d9e4e
#ARRAY /dev/md2 level=raid0 UUID=658369ee:23081b79:c990e3a2:15f38c70
#ARRAY /dev/md3 level=raid0 UUID=e2c910ae:0052c38e:a5e19298:0d057e34
MAILADDR root

(md0 and md3 are old arrays that have since been removed -- no disks
with their UUIDs are in the system)

(none):~> mdadm -D /dev/md1
mdadm: md device /dev/md1 does not appear to be active.

(none):~> mdadm -D /dev/md2
/dev/md2:
        Version : 00.90.03
  Creation Time : Tue Aug 19 21:31:10 2008
     Raid Level : raid5
     Array Size : 5860559616 (5589.07 GiB 6001.21 GB)
  Used Dev Size : 976759936 (931.51 GiB 1000.20 GB)
   Raid Devices : 7
  Total Devices : 7
Preferred Minor : 2
    Persistence : Superblock is persistent

    Update Time : Thu Jan  1 21:59:20 2009
          State : clean
 Active Devices : 7
Working Devices : 7
 Failed Devices : 0
  Spare Devices : 0

         Layout : left-symmetric
     Chunk Size : 128K

           UUID : e70e0697:a10a5b75:941dd76f:196d9e4e
         Events : 0.1438838

    Number   Major   Minor   RaidDevice State
       0       8      209        0      active sync   /dev/sdn1
       1       8      129        1      active sync   /dev/sdi1
       2       8      177        2      active sync   /dev/sdl1
       3       8       17        3      active sync   /dev/sdb1
       4       8       33        4      active sync   /dev/sdc1
       5       8       65        5      active sync   /dev/sde1
       6       8      193        6      active sync   /dev/sdm1

(md1 is comprised of sdd1 sdf1 sdg1 sdh1 sdj1 sdk1 sdo1)

(none):~> mdadm --examine /dev/sdd1 /dev/sdf1 /dev/sdg1 /dev/sdh1
/dev/sdd1:
          Magic : a92b4efc
        Version : 1.0
    Feature Map : 0x1
     Array UUID : b737af5c:7c0a70a9:99a648a0:7f693c7d
           Name : 1
  Creation Time : Fri Nov 23 12:15:39 2007
     Raid Level : raid5
   Raid Devices : 7
 Avail Dev Size : 1953519728 (931.51 GiB 1000.20 GB)
     Array Size : 11721117696 (5589.06 GiB 6001.21 GB)
  Used Dev Size : 1953519616 (931.51 GiB 1000.20 GB)
   Super Offset : 1953519984 sectors
          State : clean
    Device UUID : 8ea6369b:cfd1c103:845a1a65:d8b1f254
Internal Bitmap : -234 sectors from superblock
    Update Time : Wed Dec 31 22:43:01 2008
       Checksum : ce94ad09 - correct
         Events : 2295122
         Layout : left-symmetric
     Chunk Size : 128K
     Array Slot : 7 (0, failed, failed, failed, 3, 4, failed, 5, 6)
    Array State : u__uuUu 4 failed
/dev/sdf1:
          Magic : a92b4efc
        Version : 1.0
    Feature Map : 0x1
     Array UUID : b737af5c:7c0a70a9:99a648a0:7f693c7d
           Name : 1
  Creation Time : Fri Nov 23 12:15:39 2007
     Raid Level : raid5
   Raid Devices : 7
 Avail Dev Size : 1953519728 (931.51 GiB 1000.20 GB)
     Array Size : 11721117696 (5589.06 GiB 6001.21 GB)
  Used Dev Size : 1953519616 (931.51 GiB 1000.20 GB)
   Super Offset : 1953519984 sectors
          State : clean
    Device UUID : 50c2e80e:e36efc92:5ddac3b0:4d847236
Internal Bitmap : -234 sectors from superblock
    Update Time : Wed Dec 31 22:43:01 2008
       Checksum : feaab82b - correct
         Events : 2295122
         Layout : left-symmetric
     Chunk Size : 128K
     Array Slot : 5 (0, failed, failed, failed, 3, 4, failed, 5, 6)
    Array State : u__uUuu 4 failed
/dev/sdg1:
          Magic : a92b4efc
        Version : 1.0
    Feature Map : 0x1
     Array UUID : b737af5c:7c0a70a9:99a648a0:7f693c7d
           Name : 1
  Creation Time : Fri Nov 23 12:15:39 2007
     Raid Level : raid5
   Raid Devices : 7
 Avail Dev Size : 1953519728 (931.51 GiB 1000.20 GB)
     Array Size : 11721117696 (5589.06 GiB 6001.21 GB)
  Used Dev Size : 1953519616 (931.51 GiB 1000.20 GB)
   Super Offset : 1953519984 sectors
          State : clean
    Device UUID : c9809a0c:bd4eabbe:c110a056:0cdd3691
Internal Bitmap : -234 sectors from superblock
    Update Time : Fri Jan  2 17:30:13 2009
       Checksum : 28b13f46 - correct
         Events : 2295116
         Layout : left-symmetric
     Chunk Size : 128K
     Array Slot : 0 (0, 1, failed, failed, 3, 4, failed, 5, 6)
    Array State : Uu_uuuu 3 failed
/dev/sdh1:
          Magic : a92b4efc
        Version : 1.0
    Feature Map : 0x1
     Array UUID : b737af5c:7c0a70a9:99a648a0:7f693c7d
           Name : 1
  Creation Time : Fri Nov 23 12:15:39 2007
     Raid Level : raid5
   Raid Devices : 7
 Avail Dev Size : 1953519728 (931.51 GiB 1000.20 GB)
     Array Size : 11721117696 (5589.06 GiB 6001.21 GB)
  Used Dev Size : 1953519616 (931.51 GiB 1000.20 GB)
   Super Offset : 1953519984 sectors
          State : clean
    Device UUID : c9809a0c:bd4eabbe:c110a056:0cdd3691
Internal Bitmap : -234 sectors from superblock
    Update Time : Wed Dec 31 22:43:01 2008
       Checksum : 28abe59d - correct
         Events : 2295122
         Layout : left-symmetric
     Chunk Size : 128K
     Array Slot : 0 (0, failed, failed, failed, 3, 4, failed, 5, 6)
    Array State : U__uuuu 4 failed

(none):~> mdadm --examine /dev/sdj1 /dev/sdk1 /dev/sdo1
/dev/sdj1:
          Magic : a92b4efc
        Version : 1.0
    Feature Map : 0x1
     Array UUID : b737af5c:7c0a70a9:99a648a0:7f693c7d
           Name : 1
  Creation Time : Fri Nov 23 12:15:39 2007
     Raid Level : raid5
   Raid Devices : 7
 Avail Dev Size : 1953519728 (931.51 GiB 1000.20 GB)
     Array Size : 11721117696 (5589.06 GiB 6001.21 GB)
  Used Dev Size : 1953519616 (931.51 GiB 1000.20 GB)
   Super Offset : 1953519984 sectors
          State : clean
    Device UUID : c61e1d1a:b123f01a:4098ab5e:e8932eb6
Internal Bitmap : -234 sectors from superblock
    Update Time : Wed Dec 31 22:43:01 2008
       Checksum : bf7696f0 - correct
         Events : 2295122
         Layout : left-symmetric
     Chunk Size : 128K
     Array Slot : 8 (0, failed, failed, failed, 3, 4, failed, 5, 6)
    Array State : u__uuuU 4 failed
/dev/sdk1:
          Magic : a92b4efc
        Version : 1.0
    Feature Map : 0x1
     Array UUID : b737af5c:7c0a70a9:99a648a0:7f693c7d
           Name : 1
  Creation Time : Fri Nov 23 12:15:39 2007
     Raid Level : raid5
   Raid Devices : 7
 Avail Dev Size : 1953519728 (931.51 GiB 1000.20 GB)
     Array Size : 11721117696 (5589.06 GiB 6001.21 GB)
  Used Dev Size : 1953519616 (931.51 GiB 1000.20 GB)
   Super Offset : 1953519984 sectors
          State : clean
    Device UUID : f1417b9d:64d9c93d:c32d16e8:470ab7af
Internal Bitmap : -234 sectors from superblock
    Update Time : Wed Dec 31 22:43:01 2008
       Checksum : e8a17bad - correct
         Events : 2295122
         Layout : left-symmetric
     Chunk Size : 128K
     Array Slot : 4 (0, failed, failed, failed, 3, 4, failed, 5, 6)
    Array State : u__Uuuu 4 failed
/dev/sdo1:
          Magic : a92b4efc
        Version : 1.0
    Feature Map : 0x1
     Array UUID : b737af5c:7c0a70a9:99a648a0:7f693c7d
           Name : 1
  Creation Time : Fri Nov 23 12:15:39 2007
     Raid Level : raid5
   Raid Devices : 7
 Avail Dev Size : 1953519728 (931.51 GiB 1000.20 GB)
     Array Size : 11721117696 (5589.06 GiB 6001.21 GB)
  Used Dev Size : 1953519616 (931.51 GiB 1000.20 GB)
   Super Offset : 1953519984 sectors
          State : clean
    Device UUID : c9809a0c:bd4eabbe:c110a056:0cdd3691
Internal Bitmap : -234 sectors from superblock
    Update Time : Fri Jan  2 17:17:40 2009
       Checksum : 28b13bcd - correct
         Events : 2294980
         Layout : left-symmetric
     Chunk Size : 128K
     Array Slot : 0 (0, 1, failed, failed, 3, 4, failed, 5, 6)
    Array State : Uu_uuuu 3 failed

(none):~> cat /proc/mdstat
Personalities : [raid6] [raid5] [raid4]
md2 : active raid5 sdn1[0] sdm1[6] sde1[5] sdc1[4] sdb1[3] sdl1[2] sdi1[1]
      5860559616 blocks level 5, 128k chunk, algorithm 2 [7/7] [UUUUUUU]

md1 : inactive sdh1[0](S) sdj1[8](S) sdd1[7](S) sdf1[5](S) sdk1[4](S)
      4883799040 blocks super 1.0

unused devices: <none>

I'm not seeing any errors on boot -- all the drives come up now. It's
just that md can't put md1 back together again. Once that happens, I can
try with LVM and see if I can't get the filesystem online. Anything else
that would be helpful? I am happy to attach the whole bootup log, but
it's a little long...

thanks VERY much!
Mike

----- Original Message ----
From: Justin Piszcz <jpiszcz@lucidpixels.com>
To: Mike Myers <mikesm559@yahoo.com>
Cc: linux-raid@vger.kernel.org; john lists <john4lists@gmail.com>
Sent: Thursday, January 1, 2009 10:29:15 AM
Subject: Re: Need urgent help in fixing raid5 array

> I think some output would be pertinent here:
>
> mdadm -D /dev/md0..1..2 etc
>
> cat /proc/mdstat
>
> dmesg/syslog of the errors you are seeing etc
[snip]
^ permalink raw reply	[flat|nested] 46+ messages in thread
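(The fields md uses to decide membership can be pulled side by side from
the --examine output above -- a sketch; the glob matches the md1 members
listed above. Worth noting from that output: sdg1, sdh1 and sdo1 all
report the same Device UUID and all claim Array Slot 0, and sdg1/sdo1
carry older event counts than the rest, which is the sort of anomaly this
comparison makes easy to spot.)

    mdadm --examine /dev/sd[dfghjko]1 | \
        egrep '^/dev|Device UUID|Update Time|Events|Array Slot|Array State'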
* Re: Need urgent help in fixing raid5 array
  2009-01-02  6:19 ` Mike Myers
@ 2009-01-02 12:10   ` Justin Piszcz
  2009-01-02 18:12     ` Mike Myers
  1 sibling, 1 reply; 46+ messages in thread
From: Justin Piszcz @ 2009-01-02 12:10 UTC (permalink / raw)
To: Mike Myers; +Cc: linux-raid, john lists

On Thu, 1 Jan 2009, Mike Myers wrote:

> Ok, the bad MPT board is out, replaced by a SI3132, and I rejiggered
> the drives around so that all the drives are connected. It brought me
> back to the main problem: md2 is running fine; md1 cannot assemble
> with only 5 drives out of the 7.
[snip]

What happens if you use assemble and force with the five good drives and
one or the other of the ones that are not assembling (to assemble in
degraded mode)?

For the two disks that have 'failed', can you show their SMART stats? I
am curious to see them.

Worst case -- which I do not recommend unless it is your last resort --
is to re-create the array with --assume-clean using the same options you
used originally; get any of those options wrong, though, and it will
cause filesystem corruption.

I recommend you switch to RAID-6 with an array that big, btw :)

Justin.

^ permalink raw reply	[flat|nested] 46+ messages in thread
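(A sketch of that last-resort re-create, with the parameters filled in
from the --examine output earlier in the thread -- metadata 1.0, chunk
128K, left-symmetric, 7 devices. The device order is NOT known here and
getting it wrong destroys the data, so the member list is deliberately
left as a placeholder:)

    # LAST RESORT: rewrites the superblocks and trusts the existing data
    mdadm --create /dev/md1 --assume-clean --metadata=1.0 \
          --level=5 --layout=left-symmetric --chunk=128 \
          --raid-devices=7 /dev/sd?1 ...   # members in original slot order
    # mount read-only and verify the filesystem before trusting the array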
* Re: Need urgent help in fixing raid5 array
  2009-01-02 12:10 ` Justin Piszcz
@ 2009-01-02 18:12   ` Mike Myers
  2009-01-02 18:22     ` Justin Piszcz
  0 siblings, 1 reply; 46+ messages in thread
From: Mike Myers @ 2009-01-02 18:12 UTC (permalink / raw)
To: Justin Piszcz; +Cc: linux-raid, john lists

Thanks for the response. When I try to assemble the array with just 6
disks (the 5 good ones and one of sdo1 or sdg1), I get:

(none):~> mdadm /dev/md1 --assemble /dev/sdf1 /dev/sdh1 /dev/sdj1 /dev/sdk1 /dev/sdo1 /dev/sdd1 --force
mdadm: /dev/md1 assembled from 5 drives - not enough to start the array.

(none):~> mdadm /dev/md1 --assemble /dev/sdf1 /dev/sdg1 /dev/sdh1 /dev/sdj1 /dev/sdk1 /dev/sdd1 --force
mdadm: /dev/md1 assembled from 5 drives - not enough to start the array.

As for the SMART info:

(none):~> smartctl -i /dev/sdo1
smartctl 5.39 2008-05-08 21:56 [x86_64-unknown-linux-gnu] (local build)
Copyright (C) 2002-8 by Bruce Allen, http://smartmontools.sourceforge.net

=== START OF INFORMATION SECTION ===
Model Family:     Hitachi Deskstar 7K1000
Device Model:     Hitachi HDS721010KLA330
Serial Number:    GTJ000PAG552VC
Firmware Version: GKAOA70M
User Capacity:    1,000,204,886,016 bytes
Device is:        In smartctl database [for details use: -P show]
ATA Version is:   7
ATA Standard is:  ATA/ATAPI-7 T13 1532D revision 1
Local Time is:    Fri Jan  2 09:32:07 2009 PST
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

and

(none):~> smartctl -i /dev/sdg1
smartctl 5.39 2008-05-08 21:56 [x86_64-unknown-linux-gnu] (local build)
Copyright (C) 2002-8 by Bruce Allen, http://smartmontools.sourceforge.net

=== START OF INFORMATION SECTION ===
Model Family:     Hitachi Deskstar 7K1000
Device Model:     Hitachi HDS721010KLA330
Serial Number:    GTA000PAG5R0AA
Firmware Version: GKAOA70M
User Capacity:    1,000,204,886,016 bytes
Device is:        In smartctl database [for details use: -P show]
ATA Version is:   7
ATA Standard is:  ATA/ATAPI-7 T13 1532D revision 1
Local Time is:    Fri Jan  2 10:04:55 2009 PST
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

When I tried to read the SMART data from sdo1, the drive went offline
and I got a controller error!

Here's what I get talking to sdg1:

(none):~> smartctl -l error /dev/sdg1
smartctl 5.39 2008-05-08 21:56 [x86_64-unknown-linux-gnu] (local build)
Copyright (C) 2002-8 by Bruce Allen, http://smartmontools.sourceforge.net

=== START OF READ SMART DATA SECTION ===
SMART Error Log Version: 1
ATA Error Count: 1
        CR = Command Register [HEX]
        FR = Features Register [HEX]
        SC = Sector Count Register [HEX]
        SN = Sector Number Register [HEX]
        CL = Cylinder Low Register [HEX]
        CH = Cylinder High Register [HEX]
        DH = Device/Head Register [HEX]
        DC = Device Command Register [HEX]
        ER = Error register [HEX]
        ST = Status register [HEX]
Powered_Up_Time is measured from power on, and printed as
DDd+hh:mm:SS.sss where DD=days, hh=hours, mm=minutes,
SS=sec, and sss=millisec. It "wraps" after 49.710 days.

Error 1 occurred at disk power-on lifetime: 6388 hours (266 days + 4 hours)
  When the command that caused the error occurred, the device was active or idle.

  After command completion occurred, registers were:
  ER ST SC SN CL CH DH
  -- -- -- -- -- -- --
  84 51 00 00 00 00 a0

  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
  -- -- -- -- -- -- -- --  ----------------  --------------------
  ec 00 00 00 00 00 a0 08   6d+23:19:44.200  IDENTIFY DEVICE
  25 00 01 01 00 00 00 04   6d+23:19:44.000  READ DMA EXT
  25 00 80 be 1b ba ef ff   6d+23:19:42.500  READ DMA EXT
  25 00 c0 7f 1b ba e0 08   6d+23:19:42.500  READ DMA EXT
  25 00 40 3f 1b ba e0 08   6d+23:19:30.300  READ DMA EXT

As for RAID6: this array started off as a 3-disk RAID5 and then got
incrementally grown as capacity needs grew. I wasn't going to go beyond
7 disks in the raid set, but since you can't reshape raid5 into raid6,
I'd need another few TB of disk available to move the raid set's data
onto first. Since I use XFS, I can't just move data off and then shrink
the filesystem to minimize the space needed. md and XFS make it easy to
add disks, but very hard to remove them. :-(

It looks like my best bet is to try to get sdg1 back into the raid set
somehow, but I don't understand why md isn't assembling it into the set.
Should I try to clone sdo1 to a new disk, or sdg1? But I am not sure
what help that would be if md won't assemble with it.

thx
mike

----- Original Message ----
From: Justin Piszcz <jpiszcz@lucidpixels.com>
To: Mike Myers <mikesm559@yahoo.com>
Cc: linux-raid@vger.kernel.org; john lists <john4lists@gmail.com>
Sent: Friday, January 2, 2009 4:10:20 AM
Subject: Re: Need urgent help in fixing raid5 array
[snip]

^ permalink raw reply	[flat|nested] 46+ messages in thread
* Re: Need urgent help in fixing raid5 array
  2009-01-02 18:12 ` Mike Myers
@ 2009-01-02 18:22   ` Justin Piszcz
  2009-01-02 18:46     ` Mike Myers
  0 siblings, 1 reply; 46+ messages in thread
From: Justin Piszcz @ 2009-01-02 18:22 UTC (permalink / raw)
To: Mike Myers; +Cc: linux-raid, john lists

On Fri, 2 Jan 2009, Mike Myers wrote:
[snip]
> When I tried to read the SMART data from sdo1, the drive went offline
> and I got a controller error!

I would figure out why this happens first and fix it if possible.
Backplane? Cable? Controller?

Btw: those are just the identification bits from smartctl -- we need to
see smartctl -a so we can see the statistics for each of the attribute
identifiers.

> It looks like my best bet is to try to get sdg1 back into the raid
> set somehow, but I don't understand why md isn't assembling it into
> the set. Should I try to clone sdo1 to a new disk, or sdg1? But I am
> not sure what help that would be if md won't assemble with it.

As far as re-assembling the array, I would wait for Neil or someone who
has done this a few times, but you need to find out why disks are giving
I/O errors. If you run:

dd if=/dev/sda of=/dev/null bs=1M &
dd if=/dev/sdb of=/dev/null bs=1M &

for each disk -- can you do that for all disks in the raid array and
then see if any errors occur? If you flood your system with that much
I/O and it doesn't have any problems, I'd say you're good to go; but if
you run those commands backgrounded/simultaneously and drives start
dropping left and right, I'd wonder about the backplane myself.

Justin.

^ permalink raw reply	[flat|nested] 46+ messages in thread
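(Justin's per-disk dd test, wrapped in a loop -- a sketch; the glob is an
assumption and should be narrowed to the actual md1 member disks:)

    # read every member end-to-end in parallel to stress the whole path
    for d in /dev/sd[dfghjko]; do
        dd if="$d" of=/dev/null bs=1M &
    done
    wait
    dmesg | tail     # any I/O errors will have landed here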
* Re: Need urgent help in fixing raid5 array 2009-01-02 18:22 ` Justin Piszcz @ 2009-01-02 18:46 ` Mike Myers 2009-01-02 18:57 ` Justin Piszcz 0 siblings, 1 reply; 46+ messages in thread From: Mike Myers @ 2009-01-02 18:46 UTC (permalink / raw) To: Justin Piszcz; +Cc: linux-raid, john lists Well, I can read from sdg1 just fine. It seems to work ok, at least for a few GB of data. I'll try this on some of the other disks, but it is possible for to pull the disks out of the backplane and run the SFF-8087 fanout cables direct to each drive and bypass the backplane completely. It certainly would be easy to do this for the at least the sdo1 drive and see if I can get better results going direct to the disk. I have moved the disks around the backplane a bit to deal with the issues of the controller failure, so I am pretty sure it's not just one bad slot or the like. So you've seen a backplane fail in away that the disks come up fine at boot but have corrupted data transfers across them? I wonder about the sata cables in that case as well. I could hook up a pair of PMP's to my SI3132's and bypass the 8077 cables as well. Thx Mike ----- Original Message ---- From: Justin Piszcz <jpiszcz@lucidpixels.com> To: Mike Myers <mikesm559@yahoo.com> Cc: linux-raid@vger.kernel.org; john lists <john4lists@gmail.com> Sent: Friday, January 2, 2009 10:22:29 AM Subject: Re: Need urgent help in fixing raid5 array On Fri, 2 Jan 2009, Mike Myers wrote: > Thanks for the response. When I try and assemble the array with just 6 disks (the 5 good ones and one of sdo1 or sdg1) I get: > > (none):~> mdadm /dev/md1 --assemble /dev/sdf1 /dev/sdh1 /dev/sdj1 /dev/sdk1 /dev/sdo1 /dev/sdd1 --force > mdadm: /dev/md1 assembled from 5 drives - not enough to start the array. > > > (none):~> mdadm /dev/md1 --assemble /dev/sdf1 /dev/sdg1 /dev/sdh1 /dev/sdj1 /dev/sdk1 /dev/sdd1 --force > mdadm: /dev/md1 assembled from 5 drives - not enough to start the array. > > As for the smart info: > > (none):~> smartctl -i /dev/sdo1 > smartctl 5.39 2008-05-08 21:56 [x86_64-unknown-linux-gnu] (local build) > Copyright (C) 2002-8 by Bruce Allen, http://smartmontools.sourceforge.net > > === START OF INFORMATION SECTION === > Model Family: Hitachi Deskstar 7K1000 > Device Model: Hitachi HDS721010KLA330 > Serial Number: GTJ000PAG552VC > Firmware Version: GKAOA70M > User Capacity: 1,000,204,886,016 bytes > Device is: In smartctl database [for details use: -P show] > ATA Version is: 7 > ATA Standard is: ATA/ATAPI-7 T13 1532D revision 1 > Local Time is: Fri Jan 2 09:32:07 2009 PST > SMART support is: Available - device has SMART capability. > SMART support is: Enabled > > and > > (none):~> smartctl -i /dev/sdg1 > smartctl 5.39 2008-05-08 21:56 [x86_64-unknown-linux-gnu] (local build) > Copyright (C) 2002-8 by Bruce Allen, http://smartmontools.sourceforge.net > > === START OF INFORMATION SECTION === > Model Family: Hitachi Deskstar 7K1000 > Device Model: Hitachi HDS721010KLA330 > Serial Number: GTA000PAG5R0AA > Firmware Version: GKAOA70M > User Capacity: 1,000,204,886,016 bytes > Device is: In smartctl database [for details use: -P show] > ATA Version is: 7 > ATA Standard is: ATA/ATAPI-7 T13 1532D revision 1 > Local Time is: Fri Jan 2 10:04:55 2009 PST > SMART support is: Available - device has SMART capability. > SMART support is: Enabled > > > When I tried to read the smart data from the sdo1, the drive went offline and I get a controller error! I would figure out why this happens first and fix it if possible. Backplane? Cable? Controller? 
Btw: the interesting bits from smartctl: we need to see smartctl -a so we can see the statistics for each of the identifiers.

> Here's what I get talking to sdg1:
>
> (none):~> smartctl -l error /dev/sdg1
> smartctl 5.39 2008-05-08 21:56 [x86_64-unknown-linux-gnu] (local build)
> Copyright (C) 2002-8 by Bruce Allen, http://smartmontools.sourceforge.net
>
> === START OF READ SMART DATA SECTION ===
> SMART Error Log Version: 1
> ATA Error Count: 1
>   CR = Command Register [HEX]
>   FR = Features Register [HEX]
>   SC = Sector Count Register [HEX]
>   SN = Sector Number Register [HEX]
>   CL = Cylinder Low Register [HEX]
>   CH = Cylinder High Register [HEX]
>   DH = Device/Head Register [HEX]
>   DC = Device Command Register [HEX]
>   ER = Error register [HEX]
>   ST = Status register [HEX]
> Powered_Up_Time is measured from power on, and printed as
> DDd+hh:mm:SS.sss where DD=days, hh=hours, mm=minutes,
> SS=sec, and sss=millisec. It "wraps" after 49.710 days.
>
> Error 1 occurred at disk power-on lifetime: 6388 hours (266 days + 4 hours)
> When the command that caused the error occurred, the device was active or idle.

Justin.
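The full attribute dump Justin asks for would come from `smartctl -a` against the whole device; a minimal sketch, with the device names assumed:

  # Capture the complete SMART report for the two suspect drives. Note smartctl
  # is normally pointed at the whole device (/dev/sdo), not the partition.
  for d in /dev/sdo /dev/sdg; do
      smartctl -a "$d" > "smart-${d##*/}.txt"
  done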
* Re: Need urgent help in fixing raid5 array
  2009-01-02 18:46 ` Mike Myers
@ 2009-01-02 18:57 ` Justin Piszcz
  2009-01-02 20:46   ` Mike Myers
  ` (3 more replies)
  0 siblings, 4 replies; 46+ messages in thread
From: Justin Piszcz @ 2009-01-02 18:57 UTC (permalink / raw)
To: Mike Myers; +Cc: linux-raid, john lists

On Fri, 2 Jan 2009, Mike Myers wrote:

> So you've seen a backplane fail in a way that the disks come up fine at boot but have corrupted data transfers across them? I wonder about the SATA cables in that case as well. I could hook up a pair of PMPs to my SI3132s and bypass the SFF-8087 cables as well.

1. Try bypassing the backplane.
2. Bad cables will usually cause the SMART identifier UDMA_CRC_Error_Count to climb quite high; if it is 0 or close to it, the cable is unlikely to be the issue.
3. I have seen all kinds of weirdness with bad backplanes: drives dropping out of the array, drives producing I/O errors, etc.

Justin.
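Justin's point 2 can be checked across every drive in one loop; a sketch, with the device glob as an assumption:

  # Print the raw UDMA CRC error count for each drive; a large or climbing value
  # implicates the cable/backplane path rather than the drive itself.
  for d in /dev/sd[a-o]; do
      printf '%-10s ' "$d"
      smartctl -A "$d" | awk '/UDMA_CRC_Error_Count/ {print $NF}'
  done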
* Re: Need urgent help in fixing raid5 array
  2009-01-02 18:57 ` Justin Piszcz
@ 2009-01-02 20:46 ` Mike Myers
  2009-01-02 20:56   ` Mike Myers
  ` (2 subsequent siblings)
  3 siblings, 0 replies; 46+ messages in thread
From: Mike Myers @ 2009-01-02 20:46 UTC (permalink / raw)
To: Justin Piszcz; +Cc: linux-raid, john lists

Well, it looks like (maybe) you could be right about the backplane. Shortly after replying to you, md2 went off and threw two drives. So this is too much of a coincidence, or I am having a really bad time with a bunch of disks!

I had 3 5-in-3 backplanes around from a previous incarnation of the server, so I moved all the disks from the new system into the old backplanes, and hooked up power and cables etc. They are now all online in those backplanes.

Md1 looks like it's still in the same state: it can't assemble from 5 drives. Md2, when it came up, said it couldn't assemble from 3 drives (it was working fine when I booted it in the old backplane). I told it to assemble using the --force option, and it adjusted two drives' event counts, so now it complains that it can't assemble from 5 drives too.

If I were taking hits due to a bad backplane, could it be responsible for putting these arrays in such a bad state, even after I removed the bad backplane? I'll probe around using the smart tools to see if I have a bad cable. Meanwhile I have two new 8 port controllers on order, to see if I am having more controller-related grief.

Any ideas as to how to try reassembling these guys? I really don't want to try the create --assume-clean approach.

Thx
mike
* Re: Need urgent help in fixing raid5 array
  2009-01-02 18:57 ` Justin Piszcz
  2009-01-02 20:46   ` Mike Myers
@ 2009-01-02 20:56 ` Mike Myers
  2009-01-02 21:37   ` Mike Myers
  2009-01-03  4:19   ` Mike Myers
  3 siblings, 0 replies; 46+ messages in thread
From: Mike Myers @ 2009-01-02 20:56 UTC (permalink / raw)
To: Justin Piszcz; +Cc: linux-raid, john lists

BTW, several of the disks that md didn't want to assemble have perfectly valid UUID info on them. If I manually specify the devices on the --assemble line and add the ones that are missing (but have a valid UUID for the array), even with the force option, md refuses to assemble them.

The disks can't all be bad, can they? That's 4 drives out of 14 that would have all had to go bad at once.

thx
Mike
* Re: Need urgent help in fixing raid5 array
  2009-01-02 18:57 ` Justin Piszcz
  2009-01-02 20:46   ` Mike Myers
  2009-01-02 20:56   ` Mike Myers
@ 2009-01-02 21:37 ` Mike Myers
  2009-01-03  4:19   ` Mike Myers
  3 siblings, 0 replies; 46+ messages in thread
From: Mike Myers @ 2009-01-02 21:37 UTC (permalink / raw)
To: Justin Piszcz; +Cc: linux-raid, john lists

BTW, here is the smart error listing for one of the devices that md seems to refuse to add:

smartctl -l error /dev/sdb1
smartctl 5.39 2008-05-08 21:56 [x86_64-unknown-linux-gnu] (local build)
Copyright (C) 2002-8 by Bruce Allen, http://smartmontools.sourceforge.net

=== START OF READ SMART DATA SECTION ===
SMART Error Log Version: 1
ATA Error Count: 1
  CR = Command Register [HEX]
  FR = Features Register [HEX]
  SC = Sector Count Register [HEX]
  SN = Sector Number Register [HEX]
  CL = Cylinder Low Register [HEX]
  CH = Cylinder High Register [HEX]
  DH = Device/Head Register [HEX]
  DC = Device Command Register [HEX]
  ER = Error register [HEX]
  ST = Status register [HEX]
Powered_Up_Time is measured from power on, and printed as
DDd+hh:mm:SS.sss where DD=days, hh=hours, mm=minutes,
SS=sec, and sss=millisec. It "wraps" after 49.710 days.

Error 1 occurred at disk power-on lifetime: 6388 hours (266 days + 4 hours)
When the command that caused the error occurred, the device was active or idle.

After command completion occurred, registers were:
ER ST SC SN CL CH DH
-- -- -- -- -- -- --
84 51 00 00 00 00 a0

Commands leading to the command that caused the error were:
CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
-- -- -- -- -- -- -- --   ---------------  --------------------
ec 00 00 00 00 00 a0 08   6d+23:19:44.200  IDENTIFY DEVICE
25 00 01 01 00 00 00 04   6d+23:19:44.000  READ DMA EXT
25 00 80 be 1b ba ef ff   6d+23:19:42.500  READ DMA EXT
25 00 c0 7f 1b ba e0 08   6d+23:19:42.500  READ DMA EXT
25 00 40 3f 1b ba e0 08   6d+23:19:30.300  READ DMA EXT

It looks like a good disk.

thx
mike
* Re: Need urgent help in fixing raid5 array
  2009-01-02 18:57 ` Justin Piszcz
  ` (2 preceding siblings ...)
  2009-01-02 21:37   ` Mike Myers
@ 2009-01-03  4:19 ` Mike Myers
  2009-01-03  4:43   ` Guy Watkins
  3 siblings, 1 reply; 46+ messages in thread
From: Mike Myers @ 2009-01-03 4:19 UTC (permalink / raw)
To: Justin Piszcz; +Cc: linux-raid, john lists

Ok, good news and bad news. I finally got all the disks connected and bypassed the backplane. Md2 starts with 6 members in a degraded mode. Md1 is still having the same problem. In doing an examine on each member disk, I discovered that 8 disks had a superblock referencing md2's UUID. The other thing is that only 6 had the UUID of md1, which is supposed to have 7 members. One of the two (sdf1) that has the superblock of md2 (but is not active in that array) is also a Hitachi, which it shouldn't be (md2 is a Seagate 7200.11 array). This appears to be the missing md1 disk. I don't understand how it got the other raid array's info, but things are weird here.

That was the good news. The bad news is that when I tried to assemble md1 with all the md1 members plus sdf1 (the disk that thinks it's part of md2), I mistakenly used it as the target for the mdadm assemble command. Ugh.

So I typed:

mdadm /dev/sdf1 --assemble /dev/sdb1 /dev/sdc1 /dev/sdd1 /dev/sde1 /dev/sdf1 /dev/sdi1 /dev/sdj1 --force

So now sdf1, instead of having the wrong superblock, has no superblock. Am I completely hosed at this point? I probably needed to figure out a way to get this disk a new superblock anyway, but I suspect things are even harder to fix now.

Any ideas as to how to fix this? Is there another superblock somewhere else on the disk that I can recover the proper info from?

Thanks,
mike
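Before experimenting further, it would be prudent to snapshot what every superblock currently says, so the pre-experiment state can still be consulted after any mistaken command; a sketch, with the partition glob assumed:

  # Save each candidate member's md metadata to a file for later reference.
  for p in /dev/sd[b-j]1; do
      mdadm --examine "$p" > "examine-${p##*/}.txt" 2>&1
  done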
* RE: Need urgent help in fixing raid5 array
  2009-01-03  4:19 ` Mike Myers
@ 2009-01-03  4:43 ` Guy Watkins
  2009-01-03  5:02   ` Mike Myers
  0 siblings, 1 reply; 46+ messages in thread
From: Guy Watkins @ 2009-01-03 4:43 UTC (permalink / raw)
To: 'Mike Myers', 'Justin Piszcz'; Cc: linux-raid, 'john lists'

} On Behalf Of Mike Myers
} Sent: Friday, January 02, 2009 11:20 PM
}
} Any ideas as to how to fix this? Is there another superblock somewhere
} else on the disk that I can recover the proper info from?

I don't consider myself an expert, however...

I think you should only assemble it with 6 of 7 disks. Leave out the one that you think has the most wrong data. If this works, the array will not try to sync anything, so no data is damaged. Then test the data. Once you are really sure the data is as good as it can be, add the missing disk; it will resync at that time. However, 1 bad block on any of the 6 disks will cause a failure.

Then, switch to RAID6 ASAP! :)

Guy
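Concretely, Guy's suggestion amounts to something like the following; a hedged sketch, where the device names are taken from earlier in the thread and the choice of sdf1 as the omitted member is an assumption:

  # Assemble degraded with 6 of the 7 members, leaving out the most suspect
  # disk; a degraded start triggers no resync, so nothing gets rewritten.
  mdadm --assemble /dev/md1 --force --run \
        /dev/sdb1 /dev/sdc1 /dev/sdd1 /dev/sde1 /dev/sdi1 /dev/sdj1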
* Re: Need urgent help in fixing raid5 array
  2009-01-03  4:43 ` Guy Watkins
@ 2009-01-03  5:02 ` Mike Myers
  2009-01-03 12:46   ` John Robinson
  0 siblings, 1 reply; 46+ messages in thread
From: Mike Myers @ 2009-01-03 5:02 UTC (permalink / raw)
To: Guy Watkins, Justin Piszcz; +Cc: linux-raid, john lists

I have tried that. It still complains about only having 4 disks to start the array (if I don't tell it to use sdf1).

I have been unable to work out why md refuses to use some of the members even though they have good superblock info on them, even with the force option. There are two members of md1 that are online and seem to have proper superblock info, but md doesn't assemble md1 with them.

Is there a place (besides the code) where the specifics of how md assembles members are documented?

Thx
mike
* Re: Need urgent help in fixing raid5 array
  2009-01-03  5:02 ` Mike Myers
@ 2009-01-03 12:46 ` John Robinson
  2009-01-03 15:49   ` Mike Myers
  0 siblings, 1 reply; 46+ messages in thread
From: John Robinson @ 2009-01-03 12:46 UTC (permalink / raw)
To: Mike Myers; +Cc: linux-raid

On 03/01/2009 05:02, Mike Myers wrote:
> Is there a place (besides the code) where the specifics of how md assembles members are documented?

I'm absolutely no expert here, but I vaguely recall one of the developers recently noting that there was a minor bug in `mdadm --assemble --force` whereby if you didn't mention the broken member(s) first on the command line, and the early member(s) were good, the later member(s) didn't get forced. So in your case, you might try mentioning your array members in a different order, as long as you don't blame me when it eats your cat, or whatever.

Aha, here it is: http://marc.info/?l=linux-raid&m=122938233431234&w=2

Not quite what I said, but not a zillion miles off :-) Good luck.

Cheers,

John.
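Given the bug John describes, a retry would name the stale members first on the command line; a rough sketch, where which members count as stale here is an assumption:

  # Put the out-of-date members first so --force actually updates their metadata.
  mdadm --assemble /dev/md1 --force --verbose \
        /dev/sdi1 /dev/sdj1 /dev/sdb1 /dev/sdc1 /dev/sdd1 /dev/sde1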
* Re: Need urgent help in fixing raid5 array
  2009-01-03 12:46 ` John Robinson
@ 2009-01-03 15:49 ` Mike Myers
  2009-01-03 16:14   ` John Robinson
  0 siblings, 1 reply; 46+ messages in thread
From: Mike Myers @ 2009-01-03 15:49 UTC (permalink / raw)
To: John Robinson; +Cc: linux-raid

Ugh. This would explain a lot! I'll try this out and see if it can help get md1 back online.

Is the only way to regenerate a superblock on a member doing a create with the assume-clean option?

thx
mike
* Re: Need urgent help in fixing raid5 array
  2009-01-03 15:49 ` Mike Myers
@ 2009-01-03 16:14 ` John Robinson
  2009-01-03 16:47   ` Mike Myers
  2009-01-03 19:03   ` Mike Myers
  0 siblings, 2 replies; 46+ messages in thread
From: John Robinson @ 2009-01-03 16:14 UTC (permalink / raw)
To: Mike Myers; +Cc: linux-raid

On 03/01/2009 15:49, Mike Myers wrote:
> Ugh. This would explain a lot! I'll try this out and see if it can help get md1 back online.

Good luck; the rest of that thread is probably worth a read before starting in too, just to see whether you need to mention your two dirty members first or last.

As Guy suggested earlier in this thread, you might try doing your reassemble while missing out the one with the apparently completely hosed superblock, to at least get the thing up in degraded mode, then test fsck it (e.g. `e2fsck -n`) and mount it read-only to see if you've still got any data. And perhaps take a backup then.

> Is the only way to regenerate a superblock on a member doing a create with the assume-clean option?

I'm not sure, but I expect so. Seconding what Justin said much earlier in this thread, personally I'd wait until one of the gurus arrives, in their shining armour and on their white charger, before trying this.

Cheers,

John.
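John's `e2fsck -n` example assumes ext; since this particular array carries LVM and XFS, the equivalent read-only checks would look roughly like this (the volume group and logical volume names are made up for illustration):

  # Activate the volume group, run a no-modify filesystem check, then mount
  # read-only without log replay; none of these steps should write to the array.
  vgchange -ay vg_media
  xfs_repair -n /dev/vg_media/lv_media        # -n: check only, change nothing
  mount -o ro,norecovery /dev/vg_media/lv_media /mnt/check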
* Re: Need urgent help in fixing raid5 array
  2009-01-03 16:14 ` John Robinson
@ 2009-01-03 16:47 ` Mike Myers
  0 siblings, 0 replies; 46+ messages in thread
From: Mike Myers @ 2009-01-03 16:47 UTC (permalink / raw)
To: John Robinson; +Cc: linux-raid

I tried doing the assemble without the one with the trashed superblock, but got the same result. It may be that the order of devices is the problem; I'll play around with this, now knowing about the bug, and see how I fare. I am running LVM and XFS on top of the raid set, so knowing whether it's working is somewhat more complicated, but I will continue to hold off on the assume-clean option for a while more.

I have about 4 TB of storage on other servers that I can dump a lot of the data to for backup once I get this thing running again; then I'll reconfigure everything from scratch.

Thanks so much for the help.

thx
mike
* Re: Need urgent help in fixing raid5 array
  2009-01-03 16:14 ` John Robinson
  2009-01-03 16:47   ` Mike Myers
@ 2009-01-03 19:03 ` Mike Myers
  1 sibling, 0 replies; 46+ messages in thread
From: Mike Myers @ 2009-01-03 19:03 UTC (permalink / raw)
To: John Robinson; +Cc: linux-raid

Ok, all the devices are marked clean, so that doesn't appear to be the problem. But thanks to your link I was reminded that the assemble command has a verbose option. It gives us a much better clue:

# mdadm /dev/md1 --assemble /dev/sdb1 /dev/sdc1 /dev/sdd1 /dev/sde1 /dev/sdi1 /dev/sdj1 --force --verbose
md: md1 stopped.
md: unbind<sdi1>
md: export_rdev(sdi1)
md: unbind<sde1>
md: export_rdev(sde1)
md: unbind<sdd1>
md: export_rdev(sdd1)
md: unbind<sdc1>
md: export_rdev(sdc1)
md: unbind<sdf1>
md: export_rdev(sdf1)
mdadm: looking for devices for /dev/md1
mdadm: /dev/sdb1 is identified as a member of /dev/md1, slot 0
mdadm: /dev/sdc1 is identified as a member of /dev/md1, slot 4
mdadm: /dev/sdd1 is identified as a member of /dev/md1, slot 5
mdadm: /dev/sde1 is identified as a member of /dev/md1, slot 6
mdadm: /dev/sdi1 is identified as a member of /dev/md1, slot 0
mdadm: /dev/sdj1 is identified as a member of /dev/md1, slot 0
mdadm: no uptodate device for slot 1 of /dev/md1
mdadm: no uptodate device for slot 2 of /dev/md1
mdadm: no uptodate device for slot 3 of /dev/md1
md: bind<sdc1>
mdadm: added /dev/sdc1 to /dev/md1 as 4
md: bind<sdd1>
mdadm: added /dev/sdd1 to /dev/md1 as 5
md: bind<sdc1>
mdadm: added /dev/sde1 to /dev/md1 as 6
md: bind<sdi1>
mdadm: added /dev/sdi1 to /dev/md1 as 0
mdadm: /dev/md1 assembled from 4 drives - not enough to start the array.

Any ideas as to what the issue here is? How did the slot info get corrupted? How can I tell which slot these drives belong to? I have backups of the system, so if this is mentioned in a log file I can probably get it, but the device names are all different now because of the controller failure and the bypassing of the backplane.

thx
mike
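Rather than inferring slots from the assemble output alone, what each member's superblock claims can be dumped in one pass; a sketch, with the partition list assumed:

  # Show which array each member claims, its event count, and its slot/role view.
  for p in /dev/sdb1 /dev/sdc1 /dev/sdd1 /dev/sde1 /dev/sdf1 /dev/sdi1 /dev/sdj1; do
      echo "== $p =="
      mdadm --examine "$p" | grep -E 'UUID|Events|Array Slot|Array State'
  done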
* Re: Need urgent help in fixing raid5 array
  2009-01-02  6:19 ` Mike Myers
  2009-01-02 12:10   ` Justin Piszcz
@ 2009-01-05 22:11 ` Neil Brown
  2009-01-05 22:22   ` Mike Myers
  1 sibling, 1 reply; 46+ messages in thread
From: Neil Brown @ 2009-01-05 22:11 UTC (permalink / raw)
To: Mike Myers; +Cc: Justin Piszcz, linux-raid, john lists

On Thursday January 1, mikesm559@yahoo.com wrote:
>
> (md1 is comprised of sdd1 sdf1 sdg1 sdh1 sdj1 sdk1 sdo1)
>
> (none):~> mdadm --examine /dev/sdd1 /dev/sdf1 /dev/sdg1 /dev/sdh1
....
> (none):~> mdadm --examine /dev/sdj1 /dev/sdk1 /dev/sdo1
....

So, of these 7 devices, 5 of them (d1 f1 h1 j1 k1) think they are good members of the array with current metadata, and the other two (g1 and o1) have old/wrong metadata.

This means that you need to recreate the array. The questions: which of g1 and o1 has more recent valid data, and which should be device '1' and which should be device '2'?

Depending on the answers to these two questions, the command for you will be one of the following 4 possibilities:

mdadm -C /dev/md1 -e1.0 -l5 -n7 -c128 -b internal /dev/sdh1 /dev/sdg1 missing /dev/sdk1 /dev/sdf1 /dev/sdd1 /dev/sdj1
mdadm -C /dev/md1 -e1.0 -l5 -n7 -c128 -b internal /dev/sdh1 /dev/sdo1 missing /dev/sdk1 /dev/sdf1 /dev/sdd1 /dev/sdj1
mdadm -C /dev/md1 -e1.0 -l5 -n7 -c128 -b internal /dev/sdh1 missing /dev/sdg1 /dev/sdk1 /dev/sdf1 /dev/sdd1 /dev/sdj1
mdadm -C /dev/md1 -e1.0 -l5 -n7 -c128 -b internal /dev/sdh1 missing /dev/sdo1 /dev/sdk1 /dev/sdf1 /dev/sdd1 /dev/sdj1

These commands will not change the data on the devices, just the metadata. Once you have created the array, you try to validate the data (fsck or similar). If it looks bad, stop the array and try a different command.

Note: the metadata on g1 and o1 is very strange. It looks like an old copy of the metadata from sdh1, so it could be that one of g1 or o1 is really the first drive in the array, and h1 one of the two 'missing' devices. So if none of the 4 commands I gave work, try other permutations with o1 or g1 first, and h1 second or third.

NeilBrown
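Each candidate ordering can be tried and checked the same way; a hedged sketch of one round, using the LVM physical volume as the sanity check since LVM sits directly on md1:

  # Trial-create one permutation (metadata only, per Neil), see whether LVM
  # recognises its physical volume, then stop the array before the next try.
  # mdadm will ask for confirmation because the members carry old metadata.
  mdadm -C /dev/md1 -e1.0 -l5 -n7 -c128 -b internal \
        /dev/sdh1 /dev/sdg1 missing /dev/sdk1 /dev/sdf1 /dev/sdd1 /dev/sdj1
  pvscan | grep /dev/md1 && echo "this ordering looks plausible"
  mdadm --stop /dev/md1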
* Re: Need urgent help in fixing raid5 array
  2009-01-05 22:11 ` Neil Brown
@ 2009-01-05 22:22 ` Mike Myers
  2009-01-05 22:53   ` NeilBrown
  0 siblings, 1 reply; 46+ messages in thread
From: Mike Myers @ 2009-01-05 22:22 UTC (permalink / raw)
To: Neil Brown; +Cc: Justin Piszcz, linux-raid, john lists

Thanks! I see what you are doing here. So since none of these commands actually change the underlying data, if I get the order right, the array will come up and the LVM superblock will be visible, and then I can try and bring the filesystem online? If I get the order wrong, I can just try it again with another combination. Do I have that right?

I should probably print out all the existing metadata and save it, since it will be wiped out by the create.

How could the drives get the bad metadata on them? I've played with software raid for about 4 years and have never seen something this strange.

thx
mike
* Re: Need urgent help in fixing raid5 array
  2009-01-05 22:22 ` Mike Myers
@ 2009-01-05 22:53 ` NeilBrown
  2009-01-06  2:46   ` Mike Myers
  0 siblings, 1 reply; 46+ messages in thread
From: NeilBrown @ 2009-01-05 22:53 UTC (permalink / raw)
To: Mike Myers; +Cc: Justin Piszcz, linux-raid, john lists

On Tue, January 6, 2009 9:22 am, Mike Myers wrote:
> Thanks! I see what you are doing here. So since none of these commands
> actually change the underlying data, if I get the order right, the array
> will come up and the LVM superblock will be visible, and then I can try
> and bring the filesystem online? If I get the order wrong, I can just try
> it again with another combination. Do I have that right?

Exactly, yes.

> I should probably print out all the existing metadata and save it, since
> it will be wiped out by the create.

Certainly a good idea.

> How could the drives get the bad metadata on them? I've played with
> software raid for about 4 years and have never seen something this strange.

I really cannot think how they could get the particular bad metadata that they did.

"mdadm --add" will change the metadata, but only to make the device appear to be a spare, which isn't the case here. "mdadm --assemble --force" can re-write the metadata, but again I cannot imagine it making the particular change that has been made. Maybe there is a bug in there somewhere.

NeilBrown
* Re: Need urgent help in fixing raid5 array
  2009-01-05 22:53 ` NeilBrown
@ 2009-01-06  2:46 ` Mike Myers
  2009-01-06  4:00   ` NeilBrown
  0 siblings, 1 reply; 46+ messages in thread
From: Mike Myers @ 2009-01-06 2:46 UTC (permalink / raw)
To: NeilBrown; +Cc: Justin Piszcz, linux-raid, john lists

BTW, don't I need to use the --assume-clean option in the create operation to have this work right?

Thanks,
Mike
* Re: Need urgent help in fixing raid5 array
  2009-01-06  2:46 ` Mike Myers
@ 2009-01-06  4:00 ` NeilBrown
  2009-01-06  5:55   ` Mike Myers
  2009-01-06  6:24   ` Mike Myers
  0 siblings, 2 replies; 46+ messages in thread
From: NeilBrown @ 2009-01-06 4:00 UTC (permalink / raw)
To: Mike Myers; +Cc: Justin Piszcz, linux-raid, john lists

On Tue, January 6, 2009 1:46 pm, Mike Myers wrote:
> BTW, don't I need to use the --assume-clean option in the create operation
> to have this work right?

No. When you create a degraded raid5, it is always assumed to be clean, because it doesn't make any sense for it to be dirty. It wouldn't hurt to use --assume-clean, but it won't make any difference.

NeilBrown
* Re: Need urgent help in fixing raid5 array
  2009-01-06  4:00 ` NeilBrown
@ 2009-01-06  5:55 ` Mike Myers
  2009-01-06 23:23   ` Neil Brown
  1 sibling, 1 reply; 46+ messages in thread
From: Mike Myers @ 2009-01-06 5:55 UTC (permalink / raw)
To: NeilBrown; +Cc: Justin Piszcz, linux-raid, john lists

Neil, the devices have moved around a bit since my post with the --examine results of each drive, as an attempt to bypass a possibly bad SATA backplane and a known bad SATA controller. So I have to recheck the array positions of the drives. A few questions about how md acts:

1) When I grow an array, do the existing members of the array maintain their slot positions? So if I had 4 drives sda1, sdb1, sdc1 and sdd1 as part of a RAID5 array, and then added sde1, would sde1 take slot 4 of the array, or do the array slots get reset during a reshape operation?

The reason I ask is that from the smartctl -a data for each drive, I can get the total powered-on hours for the drive. If a drive has a lot of hours on it, it will have been added earlier than another drive, and if the array positions are constant, then I can roughly reconstruct the order of an array that's been built incrementally by looking at drive power-on times.

2) If a drive goes bad and is replaced by a spare, does the spare take the original array slot of the faulty drive?

3) It appears the slot number minus 1 is the member number? That is, if I do an examine on /dev/sdc1, it tells me it's got slot 5 of md1, but when I do an assemble operation with the --verbose flag, it says /dev/sdc1 was "added as 4". The reason I ask: if that's true, what would a slot number of 0 mean in terms of what --assemble is supposed to add it as? When I do the assemble, it's added as 0, which I don't understand if the slot number is supposed to be one higher.

4) /dev/sdf1 (the new device name) thinks it's part of md2 (when I do an examine), but can't be, because md2 is all Seagate and already has 7 members in it (the right number of drives). So it must be part of md1, which is missing a member. When I first tried to reassemble md1, it said it only found 5 good drives and couldn't start; now it says it only finds 4 good drives. So I assume sdf1 is one of the 5 good ones but got a weird superblock written to it. Other than the drive-hours trick I thought of earlier, is there any way to determine what its slot number should have been, since I am missing slots 1, 2, and 3, and have 3 candidates for slot 0?

Lastly, it REALLY would make life a LOT easier if the device names wouldn't change every time a drive is plugged into a different controller slot, or if controller slots wouldn't change based on boot order etc. It is a pain in the rear, when you have a hardware outage or a disk that isn't detected properly on boot and is then hot-added, to have its /dev/sdX1 label change. I know it's not md's fault that it works this way, but in a hot-swap world it makes it very hard to document drive configurations and map devices under Linux to physical drives.

Thanks a lot Neil.

Mike
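The power-on-hours comparison Mike describes in point 1 can be pulled for every drive in one pass; a sketch, with the device glob as an assumption:

  # Raw value of SMART attribute 9 (Power_On_Hours) per drive; members added to
  # the array earlier should show more hours, hinting at the original ordering.
  for d in /dev/sd[a-o]; do
      printf '%-10s ' "$d"
      smartctl -A "$d" | awk '/Power_On_Hours/ {print $NF}'
  done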
* Re: Need urgent help in fixing raid5 array
  2009-01-06  5:55 ` Mike Myers
@ 2009-01-06 23:23 ` Neil Brown
  0 siblings, 0 replies; 46+ messages in thread
From: Neil Brown @ 2009-01-06 23:23 UTC (permalink / raw)
To: Mike Myers; +Cc: Justin Piszcz, linux-raid, john lists

On Monday January 5, mikesm559@yahoo.com wrote:
> Neil, the devices have moved around a bit since my post with the
> --examine results of each drive, as an attempt to bypass a possibly bad
> SATA backplane and a known bad SATA controller. So I have to recheck
> the array positions of the drives. A few questions about how md acts:
>
> 1) When I grow an array, do the existing members of the array
> maintain their slot positions? So if I had 4 drives sda1, sdb1,
> sdc1 and sdd1 as part of a RAID5 array, and then added sde1, would
> sde1 take slot 4 of the array, or do the array slots get reset
> during a reshape operation?

Slots do not get reset.

> The reason I ask is that from the smartctl -a data for each drive, I
> can get the total powered-on hours for the drive. If a drive has
> a lot of hours on it, it will have been added earlier than another
> drive, and if the array positions are constant, then I can roughly
> reconstruct the order of an array that's been built incrementally
> by looking at drive power-on times.
>
> 2) If a drive goes bad and is replaced by a spare, does the spare
> take the original array slot of the faulty drive?

With v1.x metadata you need to be careful of the difference between the 'slot' and the 'role' of each drive.

The 'slot' is an arbitrary number that is assigned when the device is added to the array, whether as a full member or as a spare. It doesn't change.

The 'role' is how the drive is functioning in the array. When a device is added to an array, its role is set to 'spare'. When a drive fails and this spare is used to recover that drive, the 'role' is changed to match the role of the original drive. But the slot stays the same.

If you look at the "Array Slot" line in the "mdadm --examine" output you will see something like

  Array Slot : 8 (0, failed, failed, failed, 3, 4, failed, 5, 6)

That means that this device occupies slot 8, and that for the whole array:
  The device in slot 0 is being used in role 0 (first live device in array)
  The devices in slots 1, 2, 3 have all failed
  The device in slot 4 has role 3
  The device in slot 5 has role 4
  The device in slot 6 has failed
  The devices in slots 7 and 8 have roles 5 and 6

So this device is in role 6. This can be confirmed by looking at the "Array State" line:

  Array State : u__uuuU 4 failed

There is no device ('_') with role 1 or 2. This device ('U') is in the last role, role 6, and roles 0, 3, 4, 5 are filled by other working devices in the array ('u').

> 3) It appears the slot number minus 1 is the member number? That is,
> if I do an examine on /dev/sdc1, it tells me it's got slot 5 of md1,
> but when I do an assemble operation with the --verbose flag, it says
> /dev/sdc1 was "added as 4". If that's true, what would a slot number
> of 0 mean in terms of what --assemble is supposed to add it as?

There is no simple arithmetical relationship between slot number and member number (aka 'role'). They are assigned independently.

> 4) /dev/sdf1 (the new device name) thinks it's part of md2 (when I
> do an examine), but can't be, because md2 is all Seagate and
> already has 7 members in it (the right number of drives). So it
> must be part of md1, which is missing a member. When I first
> tried to reassemble md1, it said it only found 5 good drives
> and couldn't start; now it says it only finds 4 good drives.
> So I assume sdf1 is one of the 5 good ones but got a weird
> superblock written to it. Other than the drive-hours trick I
> thought of earlier, is there any way to determine what its slot
> number should have been, since I am missing slots 1, 2, and 3,
> and have 3 candidates for slot 0?

No. That information lives in the superblock and is displayed by --examine. If the superblock has been corrupted somehow, then that information is gone.

> Lastly, it REALLY would make life a LOT easier if the device names
> wouldn't change every time a drive is plugged into a different
> controller slot, or if controller slots wouldn't change based on
> boot order etc. I know it's not md's fault that it works this way,
> but in a hot-swap world it makes it very hard to document drive
> configurations and map devices under Linux to physical drives.

Yes, it can be a pain. The various links in e.g. /dev/disk/by-uuid are supposed to make that a bit more manageable. Whether they succeed is less clear.

NeilBrown
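On the naming point, udev keeps persistent symlinks that map the wandering /dev/sdX names back to physical drives; for example (the link shown is illustrative, built from a serial number quoted earlier in the thread):

  # Each by-id entry names a drive by model and serial and points at whatever
  # /dev/sdX it currently is, so drives can be tracked across reboots and slots.
  ls -l /dev/disk/by-id/
  # e.g. ata-Hitachi_HDS721010KLA330_GTJ000PAG552VC -> ../../sdo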
* Re: Need urgent help in fixing raid5 array
  2009-01-06  4:00 ` NeilBrown
  2009-01-06  5:55   ` Mike Myers
@ 2009-01-06  6:24 ` Mike Myers
  2009-01-06 23:31   ` Neil Brown
  1 sibling, 1 reply; 46+ messages in thread
From: Mike Myers @ 2009-01-06 6:24 UTC (permalink / raw)
To: NeilBrown; +Cc: Justin Piszcz, linux-raid, john lists

BTW, in the original email I sent that had the --examine info for each of these array members, three devices have the same device UUID and array slot; two of them share an older event count, and one has a slightly newer event count. Which of these should be the real array slot 0? And I notice that one of the members in that email had a device UUID that I can't find anymore (I suspect it's the current sdf1 that thinks it's part of md2). In that email, it had array slot 4, which is one of the missing devices in the current family (that I assume --assemble would add as "3"). It also has 9663 hours on it, which makes it part of the original set of 4 members of this raid5 array. The drive in slot 5 only has 7630 hours on it, so it should have been added later as part of a --grow operation.

Does all that make sense? If so, then sdb1 (which says it's slot 0), sdi1 (at 9671 hours, which also thinks it's slot 0), sdj1 (at 9194 hours, which also says it's 0), and sdf1 (at 9663 hours, which used to apparently think it was slot 4) should be the original 4 drives of the array. How can I figure out which is the real slot 0, and which are slots 1 and 2, if sdi1 and sdj1 both have the same event count and array slot id (0) and the same device UUID?

This is way harder work than should be needed to fix a problem. :-) But I am sure glad you gurus know how this stuff is supposed to work!

Thx
Mike
* Re: Need urgent help in fixing raid5 array
  2009-01-06  6:24             ` Mike Myers
@ 2009-01-06 23:31               ` Neil Brown
  2009-01-06 23:54                 ` Mike Myers
  2009-01-13  5:38                 ` Mike Myers
  0 siblings, 2 replies; 46+ messages in thread
From: Neil Brown @ 2009-01-06 23:31 UTC (permalink / raw)
To: Mike Myers; +Cc: Justin Piszcz, linux-raid, john lists

On Monday January 5, mikesm559@yahoo.com wrote:
> BTW, in the original email I sent that had the --examine info for
> each of these array members, three devices have the same device UUID
> and array slot; two of them share an older event count, and one has
> a slightly newer event count.  Which of these should be the real
> array slot 0?  And I notice that one of the members in that email
> had a device UUID that I can't find anymore (I suspect it's the
> current sdf1 that thinks it's part of md2).  In that email, it had
> array slot 4, which is one of the missing devices in the current
> family (that I assume --assemble would add as "3").  It also has
> 9663 hours on it, which makes it part of the original set of 4
> members for this raid5 array.  The drive in slot 5 only has 7630
> hours on it, so it should have been added later as part of a --grow
> operation.
>
> Does all that make sense?  If so, then sdb1 (which says it's slot
> 0), sdi1 (at 9671 hours, which also thinks it's slot 0), sdj1 (at
> 9194 hours, which also says it's 0), and sdf1 (at 9663 hours, which
> apparently used to think it was slot 4) should be the original 4
> drives of the array.  How can I figure out which is the real slot 0,
> and which are slots 1 and 2, if sdi1 and sdj1 all have the same
> event count, array slot id (0) and device UUID?

I had noticed the slot number was repeated.  I hadn't noticed the
device uuid was the same, though I guess that makes sense.  Somehow
the superblock for one device has been written to the other devices.

It is not really possible to be sure which is the original without
knowing how this happened, though I suspect that the one with the
higher event count is more likely to be the original.

Being a software guy, I tend to like to blame hardware, and I wonder
if your problematic backplane managed to send write requests to the
wrong drive somehow.  If it did, then my expectation of your success
just went down a few notches. :-(

The only option for you to try to find out which device is which is
to try various combinations and see what gives you access to the most
consistent data.

> This is way harder work than should be needed to fix a problem. :-)
> But I am sure glad you gurus know how this stuff is supposed to
> work!

I'm happy to help as much as I can...
I just hope your hardware hasn't done too much damage...

NeilBrown

^ permalink raw reply	[flat|nested] 46+ messages in thread
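["Try various combinations" can be made systematic. A sketch with heavy caveats: each --create rewrites the md superblocks (the data blocks themselves are untouched), the volume group and logical volume names are hypothetical since this filesystem sits on LVM across two arrays, and a read-only xfs_repair pass is only a rough consistency signal:

    # Try one candidate ordering: recreate the array degraded, activate
    # the volume group on top, and check the filesystem read-only.
    try_order () {
        mdadm --stop /dev/md1 2>/dev/null
        mdadm --create /dev/md1 --run --level=5 --raid-devices=7 \
              --assume-clean "$@"
        vgchange -ay vg0                         # hypothetical VG name
        xfs_repair -n /dev/vg0/data \
            && echo "order $*: no obvious corruption"
        vgchange -an vg0
    }

    try_order /dev/sdb1 missing /dev/sdi1 /dev/sdj1 /dev/sdf1 /dev/sdg1 /dev/sdh1
    try_order /dev/sdi1 missing /dev/sdb1 /dev/sdj1 /dev/sdf1 /dev/sdg1 /dev/sdh1

The ordering that lets xfs_repair -n run clean, or nearly so, is the best candidate for the true slot assignment.]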
* Re: Need urgent help in fixing raid5 array
  2009-01-06 23:31               ` Neil Brown
@ 2009-01-06 23:54                 ` Mike Myers
  2009-01-07  0:19                   ` NeilBrown
  1 sibling, 1 reply; 46+ messages in thread
From: Mike Myers @ 2009-01-06 23:54 UTC (permalink / raw)
To: Neil Brown; +Cc: Justin Piszcz, linux-raid, john lists

Thanks for this and the previous explanation of how roles and slots
work.  I should be able to try a few combinations and see.

At this point, I am not sure if the issue was caused by a bad
backplane, a bad controller or a bad disk.  I can't tell for sure the
backplane was bad, but I have a replacement sitting at my desk now,
so I can go ahead and replace it just to be sure.  The LSI MPT
controller that failed was connected only to drives in md2, but that
array is up and running fine, so I don't think it broke something
when it failed.

I had seen two smart alerts indicating a drive was failing, which is
what caused me to try to replace the kicked drive with a new one and
do a rebuild, which was the event that started this chain of events.
I swapped the drive (part of md1), but the OS did not indicate the
SATA port went down and did not init the new drive.  When I rebooted
the system (suspecting a temporary problem with the controller),
everything went to hell.  I suspect this initial failure was due to
the backplane problem, but there may have been some corruption on the
disks as well.  I may have fat-fingered something after the reboot
that caused the bad superblock to be written to sdf1: the device
names may have changed on boot and I didn't catch that (I may have
done a hotswap a month ago when I had my first near-death experience
with md2), leading me to use the wrong device in an mdadm command,
but it's hard to tell that now.

With 15 hotswap drives in the system, I can tell you that device name
changing is fraught with peril.  I am unfamiliar with the
/dev/disk/by-uuid functionality.  Is that documented in a howto
somewhere?  How is that supposed to work?

thx
mike

^ permalink raw reply	[flat|nested] 46+ messages in thread
* Re: Need urgent help in fixing raid5 array
  2009-01-06 23:54                 ` Mike Myers
@ 2009-01-07  0:19                   ` NeilBrown
  0 siblings, 0 replies; 46+ messages in thread
From: NeilBrown @ 2009-01-07  0:19 UTC (permalink / raw)
To: Mike Myers; +Cc: Justin Piszcz, linux-raid, john lists

On Wed, January 7, 2009 10:54 am, Mike Myers wrote:
>
> With 15 hotswap drives in the system, I can tell you that device
> name changing is fraught with peril.  I am unfamiliar with the
> /dev/disk/by-uuid functionality.  Is that documented in a howto
> somewhere?  How is that supposed to work?
>

/dev/disk/by-id is probably the one you want.
I don't know about documentation, but if you

   ls -l /dev/disk/by-id | grep -v part

and have a look it should make sense.  I get:

lrwxrwxrwx 1 root root  9 2009-01-07 11:10 ata-ST3160812AS_4LS5YYHQ -> ../../sda
lrwxrwxrwx 1 root root  9 2009-01-07 11:10 ata-ST3160812AS_4LS5YZDG -> ../../sdf
lrwxrwxrwx 1 root root  9 2009-01-07 11:10 ata-ST3160812AS_4LS5YZJQ -> ../../sdc
lrwxrwxrwx 1 root root  9 2009-01-07 11:10 ata-ST3160812AS_4LS5Z05D -> ../../sdd
lrwxrwxrwx 1 root root  9 2009-01-07 11:10 ata-ST3160812AS_4LS5Z0B6 -> ../../sde
lrwxrwxrwx 1 root root  9 2009-01-07 11:10 ata-ST3160812AS_4LS5Z0BH -> ../../sdb
lrwxrwxrwx 1 root root 11 2009-01-07 11:10 md-uuid-8fd0af3f:4fbb94ea:12cc2127:f9855db5 -> ../../md_d6
lrwxrwxrwx 1 root root  9 2009-01-07 11:10 scsi-SATA_ST3160812AS_4LS5YYHQ -> ../../sda
lrwxrwxrwx 1 root root  9 2009-01-07 11:10 scsi-SATA_ST3160812AS_4LS5YZDG -> ../../sdf
lrwxrwxrwx 1 root root  9 2009-01-07 11:10 scsi-SATA_ST3160812AS_4LS5YZJQ -> ../../sdc
lrwxrwxrwx 1 root root  9 2009-01-07 11:10 scsi-SATA_ST3160812AS_4LS5Z05D -> ../../sdd
lrwxrwxrwx 1 root root  9 2009-01-07 11:10 scsi-SATA_ST3160812AS_4LS5Z0B6 -> ../../sde
lrwxrwxrwx 1 root root  9 2009-01-07 11:10 scsi-SATA_ST3160812AS_4LS5Z0BH -> ../../sdb

which shows you my 6 drives (each listed twice) with their model
numbers and serial numbers.  If I move the drives around, they will
get different 'sdX' names, but if you look at the entries in
disk/by-id you can be sure that a given name will always refer to the
same physical device.  This is all managed by 'udev'.

NeilBrown

^ permalink raw reply	[flat|nested] 46+ messages in thread
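[Those stable names can also be fed straight to mdadm, which sidesteps the wandering /dev/sdX problem entirely. A sketch of an mdadm.conf based on the listing above; the ARRAY line reuses the md-uuid shown there, and you would substitute your own arrays:

    # /etc/mdadm.conf -- scan only the stable by-id names, so a
    # reshuffled /dev/sdX mapping cannot point mdadm at the wrong disk
    DEVICE /dev/disk/by-id/ata-*
    ARRAY /dev/md_d6 UUID=8fd0af3f:4fbb94ea:12cc2127:f9855db5

The same by-id paths work anywhere a device name does, e.g. mdadm --examine /dev/disk/by-id/ata-ST3160812AS_4LS5YYHQ.]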
* Re: Need urgent help in fixing raid5 array
  2009-01-06 23:31               ` Neil Brown
  2009-01-06 23:54                 ` Mike Myers
@ 2009-01-13  5:38                 ` Mike Myers
  2009-01-13  5:57                   ` Mike Myers
  1 sibling, 1 reply; 46+ messages in thread
From: Mike Myers @ 2009-01-13  5:38 UTC (permalink / raw)
To: Neil Brown; +Cc: Justin Piszcz, linux-raid, john lists

Ok, still more help needed.  I finally got enough time scheduled
tonight to be able to try recreating the raid array as per our
conversation before.  When doing the create as you outlined in your
earlier post, mdadm -C says the first two disks are part of an
existing raid array (I assume this is a normal "error" for this sort
of situation and will be ignored in the end), but for each of the
last 4 devices I specify on the command line, it says:

   mdadm: cannot open /dev/sdc1: Device or resource busy

(and gives this error for each of the 4 devices).  The devices are
online though.  I can do an mdadm --examine on them, dd from them,
and do smartctl operations on them.  Why would md think they were
busy?

Thx
Mike

^ permalink raw reply	[flat|nested] 46+ messages in thread
* Re: Need urgent help in fixing raid5 array
  2009-01-13  5:38                 ` Mike Myers
@ 2009-01-13  5:57                   ` Mike Myers
  0 siblings, 0 replies; 46+ messages in thread
From: Mike Myers @ 2009-01-13  5:57 UTC (permalink / raw)
To: Mike Myers, Neil Brown; +Cc: Justin Piszcz, linux-raid, john lists

Figured this out.  I had to stop md1 even though md couldn't assemble
it.  The "good" devices were still running.  Will let you know how it
goes.

thx
mike

^ permalink raw reply	[flat|nested] 46+ messages in thread
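[Spelling out the fix for the record: a failed assemble can leave md1 inactive but still holding its member devices open, which is exactly what produces the "Device or resource busy" errors. Something like:

    cat /proc/mdstat        # an inactive md1 still lists its member devices
    mdadm --stop /dev/md1   # releases them
    # the devices are now free for mdadm --create or --assemble

Stopping the inactive array is safe; it only releases the devices, it does not write to them.]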
end of thread, other threads:[~2009-01-13 5:57 UTC | newest]
Thread overview: 46+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2008-12-05 17:03 Need urgent help in fixing raid5 array Mike Myers
2008-12-06 0:18 ` Mike Myers
2008-12-06 0:24 ` Justin Piszcz
2008-12-06 0:47 ` Mike Myers
2008-12-06 0:51 ` Justin Piszcz
2008-12-06 0:58 ` Mike Myers
2008-12-06 19:02 ` Mike Myers
2008-12-06 19:30 ` Mike Myers
2008-12-06 20:14 ` Mike Myers
2008-12-06 0:52 ` David Lethe
-- strict thread matches above, loose matches on Subject: below --
2009-01-01 15:31 Mike Myers
[not found] <451872.61166.qm@web30802.mail.mud.yahoo.com>
2009-01-01 15:40 ` Justin Piszcz
2009-01-01 17:51 ` Mike Myers
2009-01-01 18:29 ` Justin Piszcz
2009-01-01 18:40 ` Jon Nelson
2009-01-01 20:38 ` Mike Myers
2009-01-02 6:19 ` Mike Myers
2009-01-02 12:10 ` Justin Piszcz
2009-01-02 18:12 ` Mike Myers
2009-01-02 18:22 ` Justin Piszcz
2009-01-02 18:46 ` Mike Myers
2009-01-02 18:57 ` Justin Piszcz
2009-01-02 20:46 ` Mike Myers
2009-01-02 20:56 ` Mike Myers
2009-01-02 21:37 ` Mike Myers
2009-01-03 4:19 ` Mike Myers
2009-01-03 4:43 ` Guy Watkins
2009-01-03 5:02 ` Mike Myers
2009-01-03 12:46 ` John Robinson
2009-01-03 15:49 ` Mike Myers
2009-01-03 16:14 ` John Robinson
2009-01-03 16:47 ` Mike Myers
2009-01-03 19:03 ` Mike Myers
2009-01-05 22:11 ` Neil Brown
2009-01-05 22:22 ` Mike Myers
2009-01-05 22:53 ` NeilBrown
2009-01-06 2:46 ` Mike Myers
2009-01-06 4:00 ` NeilBrown
2009-01-06 5:55 ` Mike Myers
2009-01-06 23:23 ` Neil Brown
2009-01-06 6:24 ` Mike Myers
2009-01-06 23:31 ` Neil Brown
2009-01-06 23:54 ` Mike Myers
2009-01-07 0:19 ` NeilBrown
2009-01-13 5:38 ` Mike Myers
2009-01-13 5:57 ` Mike Myers