From mboxrd@z Thu Jan 1 00:00:00 1970
From: Eli Stair
Subject: Re: RAID5 refuses to accept replacement drive.
Date: Wed, 25 Oct 2006 10:33:11 -0700
Message-ID: <453F9FD7.3060503@ilm.com>
References: <200610251652.k9PGq5tt032608@wind.enjellic.com>
Mime-Version: 1.0
Content-Type: text/plain; charset=ISO-8859-1; format=flowed
Content-Transfer-Encoding: 7bit
Return-path:
In-Reply-To: <200610251652.k9PGq5tt032608@wind.enjellic.com>
Sender: linux-raid-owner@vger.kernel.org
To: greg@enjellic.com
Cc: linux-raid@vger.kernel.org
List-Id: linux-raid.ids

A tangentially-related suggestion: if you layer dm-multipath on top of the
raw block (SCSI/FC) layer, you add some complexity but also gain periodic
readsector0 path checks... so if your spindle powers down unexpectedly while
the controller still thinks it's alive, you will still get a drive disconnect
issued from below MD: device-mapper fails the path automatically and MD sees
the device as faulty.
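For reference, the relevant multipath.conf pieces would look something like
the sketch below; the numbers are illustrative defaults rather than values
pulled from any particular box, and the array is then assembled from the
/dev/mapper/* devices instead of the raw /dev/sd* nodes, so a failed path
check reaches MD as an I/O error:

defaults {
        # how often, in seconds, multipathd runs its path checker
        polling_interval    5

        # read sector 0 of every path to confirm the spindle still answers
        path_checker        readsector0

        # fail I/O as soon as no working path is left, rather than queueing,
        # so MD sees an error instead of hanging
        no_path_retry       fail
}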
Sorry, no useful suggestion on the recovery task...

/eli


greg@enjellic.com wrote:
> Good morning to everyone, hope everyone's day is going well.
>
> Neil, I sent this to your SUSE address a week ago but it may have
> gotten trapped in a SPAM filter or lost in the shuffle.
>
> I've used MD based RAID since it first existed. First time I've run
> into a situation like this.
>
> Environment:
>   Kernel: 2.4.33.3
>   MDADM:  2.4.1/2.5.3
>   MD:     Three drive RAID5 (md3)
>
> A 'silent' disk failure was experienced in a SCSI hot-swap chassis
> during a yearly system upgrade. Machine failed to boot until 'nobd'
> directive was given to LILO. Drive was mechanically dead but
> electrically alive.
>
> Drives were shuffled to get the machine operational. The machine came
> up with md3 degraded. The md3 device refuses to accept a replacement
> partition using the following syntax:
>
> mdadm --manage /dev/md3 -a /dev/sde1
>
> No output from mdadm, nothing in the logfiles. Tail end of strace is
> as follows:
>
> open("/dev/md3", O_RDWR)         = 3
> fstat64(0x3, 0xbffff8fc)         = 0
> ioctl(3, 0x800c0910, 0xbffff9f8) = 0
> _exit(0)                         = ?
>
> I 'zeroed' the superblock on /dev/sde1 to make sure there was nothing
> to interfere. No change in behavior.
>
> I know the 2.4 kernels are not in vogue but this is from a group of
> machines which are expected to run a year at a time. Stability and
> known behavior are the foremost goals.
>
> Details on the MD device and component drives are included below.
>
> We've handled a lot of MD failures, first time anything like this has
> happened. I feel like there is probably a 'brown paper bag' solution
> to this but I can't see it.
>
> Thoughts?
>
> Greg
>
> ---------------------------------------------------------------------------
> /dev/md3:
>         Version : 00.90.00
>   Creation Time : Fri Jun 23 19:51:43 2006
>      Raid Level : raid5
>      Array Size : 5269120 (5.03 GiB 5.40 GB)
>     Device Size : 2634560 (2.51 GiB 2.70 GB)
>    Raid Devices : 3
>   Total Devices : 3
> Preferred Minor : 3
>     Persistence : Superblock is persistent
>
>     Update Time : Wed Oct 11 04:33:06 2006
>           State : active, degraded
>  Active Devices : 2
> Working Devices : 2
>  Failed Devices : 1
>   Spare Devices : 0
>
>          Layout : left-symmetric
>      Chunk Size : 64K
>
>            UUID : cdd418a1:4bc3da6b:1ec17a15:e73ecadd
>          Events : 0.25
>
>     Number   Major   Minor   RaidDevice   State
>        0       8       49        0        active sync   /dev/sdd1
>        1       0        0        1        removed
>        2       8       33        2        active sync   /dev/sdc1
> ---------------------------------------------------------------------------
>
>
> Details for raid device 0:
>
> ---------------------------------------------------------------------------
> /dev/sdd1:
>           Magic : a92b4efc
>         Version : 00.90.00
>            UUID : cdd418a1:4bc3da6b:1ec17a15:e73ecadd
>   Creation Time : Fri Jun 23 19:51:43 2006
>      Raid Level : raid5
>     Device Size : 2634560 (2.51 GiB 2.70 GB)
>      Array Size : 5269120 (5.03 GiB 5.40 GB)
>    Raid Devices : 3
>   Total Devices : 3
> Preferred Minor : 3
>
>     Update Time : Wed Oct 11 04:33:06 2006
>           State : active
>  Active Devices : 2
> Working Devices : 2
>  Failed Devices : 1
>   Spare Devices : 0
>        Checksum : 52b602d5 - correct
>          Events : 0.25
>
>          Layout : left-symmetric
>      Chunk Size : 64K
>
>       Number   Major   Minor   RaidDevice   State
> this     0       8       49        0        active sync   /dev/sdd1
>
>    0     0       8       49        0        active sync   /dev/sdd1
>    1     1       0        0        1        faulty removed
>    2     2       8       33        2        active sync   /dev/sdc1
> ---------------------------------------------------------------------------
>
>
> Details for RAID device 2:
>
> ---------------------------------------------------------------------------
> /dev/sdc1:
>           Magic : a92b4efc
>         Version : 00.90.00
>            UUID : cdd418a1:4bc3da6b:1ec17a15:e73ecadd
>   Creation Time : Fri Jun 23 19:51:43 2006
>      Raid Level : raid5
>     Device Size : 2634560 (2.51 GiB 2.70 GB)
>      Array Size : 5269120 (5.03 GiB 5.40 GB)
>    Raid Devices : 3
>   Total Devices : 3
> Preferred Minor : 3
>
>     Update Time : Wed Oct 11 04:33:06 2006
>           State : active
>  Active Devices : 2
> Working Devices : 2
>  Failed Devices : 1
>   Spare Devices : 0
>        Checksum : 52b602c9 - correct
>          Events : 0.25
>
>          Layout : left-symmetric
>      Chunk Size : 64K
>
>       Number   Major   Minor   RaidDevice   State
> this     2       8       33        2        active sync   /dev/sdc1
>
>    0     0       8       49        0        active sync   /dev/sdd1
>    1     1       0        0        1        faulty removed
>    2     2       8       33        2        active sync   /dev/sdc1
> ---------------------------------------------------------------------------
>
> As always,
> Dr. G.W. Wettstein, Ph.D.   Enjellic Systems Development, LLC.
> 4206 N. 19th Ave.           Specializing in information infra-structure
> Fargo, ND  58102            development.
> PH: 701-281-1686
> FAX: 701-281-3949           EMAIL: greg@enjellic.com
> ------------------------------------------------------------------------------
> "We restored the user's real .pinerc from backup but another of our users
>  must still be missing those cows."
>                                     -- Malcolm Beattie
> -
> To unsubscribe from this list: send the line "unsubscribe linux-raid" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at http://vger.kernel.org/majordomo-info.html
>