linux-raid.vger.kernel.org archive mirror
* Need urgent help in fixing raid5 array
@ 2008-12-05 17:03 Mike Myers
  2008-12-06  0:18 ` Mike Myers
  0 siblings, 1 reply; 46+ messages in thread
From: Mike Myers @ 2008-12-05 17:03 UTC
  To: linux-raid

I have a problem with repairing a raid5 array I really need some help with.  I must be missing something here.

I have 2 raid5 arrays combined with LVM into a common logical volume, with XFS running on top of that.  Both arrays have 7 1 TB disks in them.  I moved a controller card around so that I could install a new Intel gigabit ethernet card in one of the PCI-E slots.  That went fine, except one of the SATA cables got knocked loose, so one of the disks in /dev/md2 went offline.  Linux booted fine, started md2 with 6 elements in it, and everything was fine with md2 in a degraded state.  I fixed the cable problem and hot-added that drive to the array, but since it was now out of sync, md began a rebuild.  No problem.

Around 60% through the resync, smartd started reporting problems with one of the other drives in the array.  Then that drive was ejected from the degraded array, which caused the raid to stop and the LVM volume to go offline.  Ugh...

OK, so it looks from the SMART data that that disk had been having a lot of problems and was failing.  As it happens, I had a new 1 TB disk arrive the same day, and I pressed it into service here.  I used sfdisk -d olddisk | sfdisk newdisk to copy the partition table from the old drive to the new one, and then used ddrescue to copy the data from the old partition (/dev/sdo1) to the new one (/dev/sdp1).  That worked pretty well; just 12 kB couldn't be recovered.
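For reference, the clone step described above can be sketched as a small dry-run helper that only builds the two command lines and prints them. The whole-disk names /dev/sdo and /dev/sdp are inferred from the partition names in the post, and the mapfile name rescue.map is hypothetical. The -f option is ddrescue's switch for allowing the output to be an existing block device. Nothing here is executed against real disks.

```python
import shlex


def clone_commands(old_disk, new_disk, old_part, new_part, mapfile):
    """Return the shell command lines for the clone; nothing is executed."""
    q = shlex.quote
    return [
        # Dump the old disk's partition table and replay it onto the new disk.
        f"sfdisk -d {q(old_disk)} | sfdisk {q(new_disk)}",
        # Copy the partition contents; the mapfile records which areas
        # could not be read, so the copy can be resumed or retried later.
        f"ddrescue -f {q(old_part)} {q(new_part)} {q(mapfile)}",
    ]


if __name__ == "__main__":
    # Disk names are assumptions inferred from /dev/sdo1 and /dev/sdp1;
    # rescue.map is a hypothetical mapfile name.
    for line in clone_commands("/dev/sdo", "/dev/sdp",
                               "/dev/sdo1", "/dev/sdp1", "rescue.map"):
        print(line)
```

Printing the commands first, instead of running them, leaves a chance to double-check which device is the source and which is the destination before anything is overwritten.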

So I remove the old disk, re-add the new disk, and attempt to start the array with the new (cloned) 1 TB disk in the old disk's stead.  Even though the UUIDs, magic numbers and events fields are all the same, md thinks the cloned disk is a spare and doesn't start the array.  What am I missing here?  Why doesn't it view the new disk as a member, like the old one, and just start the array?
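One way to check a claim like "all the fields are the same" is to diff every field of two superblock dumps, not just the three mentioned. Below is a minimal sketch, assuming the dumps are plain "Key : Value" text such as saved output from mdadm --examine. The sample strings are made-up placeholders in that layout, not captured mdadm output, and the "Note" field is invented purely to show a mismatch.

```python
def parse_examine(text):
    """Parse 'Key : Value' lines into a dict; other lines are skipped."""
    fields = {}
    for line in text.splitlines():
        key, sep, value = line.partition(" : ")
        if sep:
            fields[key.strip()] = value.strip()
    return fields


def differing_fields(a, b):
    """Return {key: (value_a, value_b)} for every key whose values differ."""
    keys = sorted(set(a) | set(b))
    return {k: (a.get(k), b.get(k)) for k in keys if a.get(k) != b.get(k)}


if __name__ == "__main__":
    # Placeholder text only: the values and the "Note" field are invented.
    old_dump = "  UUID : u1\n  Events : 10\n  Note : x\n"
    new_dump = "  UUID : u1\n  Events : 10\n  Note : y\n"
    diff = differing_fields(parse_examine(old_dump), parse_examine(new_dump))
    for key, (old, new) in diff.items():
        print(f"{key}: old={old!r} new={new!r}")
```

Any field that shows up in the diff is a candidate for why md treats the clone differently from the original; the sketch makes no claim about which field that is in this case.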

thx
mike


      

* Re: Need urgent help in fixing raid5 array
@ 2009-01-01 15:31 Mike Myers
  0 siblings, 0 replies; 46+ messages in thread
From: Mike Myers @ 2009-01-01 15:31 UTC
  To: linux-raid

Well, thanks for all your help last month.  As I posted, things came
back up and I survived the failure.  Now, I have yet another problem.
:(  After 5 years of running a Linux server as a dedicated NAS, I am
hitting some very weird problems.  This server started as a
single-processor AMD system with 4 320 GB drives, and has been upgraded
multiple times so that it is now a quad-core Intel rackmounted 4U
system with 14 1 TB drives.  I have never lost data in any of the
upgrades of CPU, motherboard, disk controller hardware and disk
drives.  Now, after last month's near-death experience, I am faced with
another serious problem in less than a month.  Any help you guys could
give me would be most appreciated.  This is a sucky way to start the
new year.

The array I had problems with last month (md2, made up of 7 1 TB drives
in a RAID5 config) is running just fine.  md1, which is built of 7 1 TB
Hitachi 7K1000 drives, is now having problems.  We returned from a
10-day family visit with everything running just fine.  There was a
brief power outage today, about 3 minutes, but I can't see how that
could be related, as the server is on a high-quality rackmount 3U APC
UPS that handled the outage just fine.  I was working on the system,
getting X to work again after an nvidia driver update, and when that was
working fine, I checked the disks and discovered that md1 was in a
degraded state, with /dev/sdl1 kicked out of the array (removed).  I
tried to do a dd from the drive to verify its location in the rack, but
I got an I/O error.  This was most odd, so I went to the rack, pulled
the disk and reinserted it.  No system log entries recorded the device
being pulled or re-installed, so I am thinking that a cable has somehow
come loose.  I power the system down, pull it out of the rack, and look
at the cable that goes to the drive; everything looks fine.

So I reboot the system, and now the array won't come online because, in
addition to the drive that shows as (removed), one of the other drives
shows as a faulty spare.  Well, learning from the last go-around, I
reassemble the array with the --force option, and the array comes back
up.  But LVM won't come back up because it sees the physical volume
that maps to md1 as missing.  Now I am very concerned.  After trying a
bunch of things, I do a pvcreate with the missing UUID on md1, restart
the vg, and the logical volume comes back up.  I was thinking I may have
told LVM to use an array of bad data, but to my surprise, I mounted the
filesystem and everything looked intact!  OK, sometimes you win.  So I
do one more reboot to get the system back up in multiuser so I can back
up some of the more important media stored on the volume to another
server that has some space on it while I figure out what has been
happening.  (The volume has about 10 TB used; most of that is PVR
recordings, but there is a lot of ripped music and DVDs that I really
don't want to rerip.)

The reboot again fails because of a problem with md1.  This time, another
one of the drives shows as removed (/dev/sdm1), and I can't reassemble
the array with a --force option.  It is acting like /dev/sdl1 (the
other removed unit), and even though I can read from the drives fine,
their UUID is fine, etc..., md does not consider them as part of the
array.  /dev/sdo1 (which was the drive that looked like a faulty spare)
seems OK when trying to do the assemble.  sdm1 seemed just fine before
the reboot, and was showing no problems before.  They are not hooked up
on the same controller cable (a SAS-to-SATA fanout), and the LSI MPT
controller card seems to talk to the other disks just fine.  

Anyways, I have no idea what's going on.  When I try to add sdm1 or sdl1
back into the array, md complains the device is busy, which is very odd
because it's not part of another array or doing anything else in the
system.
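One common reason for a "device busy" error on a disk that looks idle is that it is already listed under another md array, for example an inactive, partially assembled one. Whether that is what happened here is not established by anything in this message. A minimal sketch for checking it is below: it scans text in the layout of /proc/mdstat and reports which array lists each component. The sample text is a hypothetical illustration, not output from this system.

```python
import re


def holders(mdstat_text):
    """Map each component device name to the md array that lists it."""
    claimed = {}
    for line in mdstat_text.splitlines():
        # Array lines look like "md1 : inactive sdl1[7](S) sdm1[3](S)".
        m = re.match(r"^(md\S*) : (.*)$", line)
        if not m:
            continue
        array, rest = m.groups()
        # Components appear as name[slot], optionally followed by flags.
        for dev in re.findall(r"(\S+?)\[\d+\]", rest):
            claimed[dev] = array
    return claimed


if __name__ == "__main__":
    # Hypothetical sample in /proc/mdstat layout, for illustration only.
    sample = (
        "Personalities : [raid5]\n"
        "md1 : inactive sdl1[7](S) sdm1[3](S)\n"
        "unused devices: <none>\n"
    )
    for dev, array in holders(sample).items():
        print(f"{dev} is listed under {array}")
```

If a device does turn up under an inactive array, stopping that array with mdadm --stop releases its members, after which an add or assemble can be retried.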

Any idea as to what could be happening here?  I am beyond frustrated.

thanks,
Mike


      

[parent not found: <451872.61166.qm@web30802.mail.mud.yahoo.com>]

end of thread, other threads:[~2009-01-13  5:57 UTC | newest]

Thread overview: 46+ messages
2008-12-05 17:03 Need urgent help in fixing raid5 array Mike Myers
2008-12-06  0:18 ` Mike Myers
2008-12-06  0:24   ` Justin Piszcz
2008-12-06  0:47     ` Mike Myers
2008-12-06  0:51       ` Justin Piszcz
2008-12-06  0:58         ` Mike Myers
2008-12-06 19:02         ` Mike Myers
2008-12-06 19:30           ` Mike Myers
2008-12-06 20:14             ` Mike Myers
2008-12-06  0:52     ` David Lethe
  -- strict thread matches above, loose matches on Subject: below --
2009-01-01 15:31 Mike Myers
     [not found] <451872.61166.qm@web30802.mail.mud.yahoo.com>
2009-01-01 15:40 ` Justin Piszcz
2009-01-01 17:51   ` Mike Myers
2009-01-01 18:29     ` Justin Piszcz
2009-01-01 18:40       ` Jon Nelson
2009-01-01 20:38         ` Mike Myers
2009-01-02  6:19       ` Mike Myers
2009-01-02 12:10         ` Justin Piszcz
2009-01-02 18:12           ` Mike Myers
2009-01-02 18:22             ` Justin Piszcz
2009-01-02 18:46               ` Mike Myers
2009-01-02 18:57                 ` Justin Piszcz
2009-01-02 20:46                   ` Mike Myers
2009-01-02 20:56                   ` Mike Myers
2009-01-02 21:37                   ` Mike Myers
2009-01-03  4:19                   ` Mike Myers
2009-01-03  4:43                     ` Guy Watkins
2009-01-03  5:02                       ` Mike Myers
2009-01-03 12:46                         ` John Robinson
2009-01-03 15:49                           ` Mike Myers
2009-01-03 16:14                             ` John Robinson
2009-01-03 16:47                               ` Mike Myers
2009-01-03 19:03                               ` Mike Myers
2009-01-05 22:11         ` Neil Brown
2009-01-05 22:22           ` Mike Myers
2009-01-05 22:53             ` NeilBrown
2009-01-06  2:46               ` Mike Myers
2009-01-06  4:00                 ` NeilBrown
2009-01-06  5:55                   ` Mike Myers
2009-01-06 23:23                     ` Neil Brown
2009-01-06  6:24                   ` Mike Myers
2009-01-06 23:31                     ` Neil Brown
2009-01-06 23:54                       ` Mike Myers
2009-01-07  0:19                         ` NeilBrown
2009-01-13  5:38                       ` Mike Myers
2009-01-13  5:57                         ` Mike Myers
