linux-raid.vger.kernel.org archive mirror
From: "B. J. Zolp" <bjzolp@wisc.edu>
To: Guy <bugzilla@watkins-home.com>
Cc: 'BERNARD JOHN ZOLP' <bjzolp@students.wisc.edu>,
	linux-raid@vger.kernel.org
Subject: Re: RAID5 Not coming back up after crash
Date: Mon, 29 Nov 2004 23:38:08 -0600	[thread overview]
Message-ID: <41AC0740.9060508@wisc.edu> (raw)
In-Reply-To: <200411292229.iATMTpN06996@www.watkins-home.com>

I found a spare new drive that I copied hdj1 onto, and put the new drive
on the proper IDE cable for hdj.  Then I tried running:

    mdadm -A /dev/md0 --force /dev/hda1 /dev/hdc1 /dev/hdd1 /dev/hdi1 /dev/hdj1

But it segfaults.  Should I try mkraid --force --dangerous-no-resync
/dev/md0 instead?
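
In case it helps before I force anything: if I am reading the mdadm man
page right, examining the superblock on each member should show the
event counters, something like:

    mdadm --examine /dev/hda1 | grep -i events

(and the same for hdc1, hdd1, hdi1, and hdj1)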

Thanks,
bjz



Guy wrote:

>If you are sure you can overwrite the correct bad sectors, then do it.
>
>mdadm is much better than raidtools.  From what I have read, yes it is
>compatible.
>
>The info below is not required:
>Who makes your 6 disk drives?  And how old are they?  Any bets, anyone?
>
>Guy
>
>-----Original Message-----
>From: BERNARD JOHN ZOLP [mailto:bjzolp@students.wisc.edu] 
>Sent: Monday, November 29, 2004 3:57 PM
>To: Guy
>Cc: linux-raid@vger.kernel.org
>Subject: Re: RE: RAID5 Not coming back up after crash
>
>Just a few follow-up questions before I dive into this.  Will mdadm work
>with a RAID setup created with the older raidtools package that came
>with my SuSE installation?
>Assuming the drive with bad blocks is not getting worse -- I don't think
>it is, but you never know -- could I map them out by writing to those
>sectors with dd and then running the command to bring the array back
>online?  Or should I wait for the RMA of the flaky drive, dd_rescue to
>the new one, and bring that up?
>
>Thanks again,
>bjz
>
>----- Original Message -----
>From: Guy <bugzilla@watkins-home.com>
>Date: Monday, November 29, 2004 11:40 am
>Subject: RE: RAID5 Not coming back up after crash
>
>  
>
>>You can recover, but not with bad blocks.
>>
>>This command should get your array back on-line:
>>mdadm -A /dev/md0 --force /dev/hda1 /dev/hdc1 /dev/hdd1 /dev/hdi1 
>>/dev/hdj1
>>But, as soon as md reads a bad block it will fail the disk and your
>>array will be off-line.
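>>
>>A quick way to see whether the assembly held:
>>
>>    cat /proc/mdstat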
>>
>>If you have an extra disk, you could attempt to copy the disk 
>>first, then
>>replace the disk with the read error with the copy.
>>
>>dd_rescue can copy a disk with read errors.
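>>
>>Something like this, assuming the copy target is a same-size-or-larger
>>disk showing up as hdk (the name is just an example):
>>
>>    dd_rescue /dev/hdj1 /dev/hdk1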
>>
>>Also, it is common for a disk to grow bad spots over time.  These bad
>>spots (sectors) can be re-mapped by the drive to a spare sector.  This
>>re-mapping will occur when an attempt is made to write to the bad
>>sector.  So, you can repair your disk by writing to the bad sectors.
>>But, be careful not to overwrite good data.  I have done this using dd.
>>First I found the bad sector with dd, then I wrote to the 1 bad sector
>>with dd.  I would need to refer to the man page to do it again, so I
>>can't explain it here at this time.  It is not really hard, but 1 small
>>mistake, and "that's it man, game over man, game over".
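>>
>>From memory, the idea is roughly this (BADBLOCK is a placeholder for
>>the block number badblocks reported; bs matches badblocks' default
>>1024-byte block size -- triple-check the number before the write):
>>
>>    # confirm the block is unreadable (this read should fail)
>>    dd if=/dev/hdj1 of=/dev/null bs=1024 skip=BADBLOCK count=1
>>    # overwrite just that block so the drive can remap the sector
>>    # (this destroys the 1024 bytes at that offset!)
>>    dd if=/dev/zero of=/dev/hdj1 bs=1024 seek=BADBLOCK count=1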
>>
>>Guy
>>
>>
>>-----Original Message-----
>>From: linux-raid-owner@vger.kernel.org
>>[mailto:linux-raid-owner@vger.kernel.org] On Behalf Of B. J. Zolp
>>Sent: Monday, November 29, 2004 11:33 AM
>>To: linux-raid@vger.kernel.org
>>Subject: RAID5 Not coming back up after crash
>>
>>I have a RAID5 setup on my fileserver using disks hda1, hdb1, hdc1,
>>hdd1, hdi1, and hdj1.  Yesterday I started moving a large chunk of
>>files (~80GB) from this array to a stand-alone drive in the system,
>>and about halfway through the mv I got a ton of PERMISSION DENIED
>>errors on some of the remaining files, and the move process quit.  I
>>did an ls of the raid directory and got PERMISSION DENIED on the same
>>files that errored out on the mv, while some of the other files looked
>>fine.  I figured it might be a good idea to take the raid down and
>>back up again (probably a mistake), and I could not reboot the machine
>>without physically turning it off, as some processes were hung.  Upon
>>booting back up, the raid did not come online, stating that hdj1 was
>>kicked due to inconsistency.  Additionally, hdb1 is listed as offline
>>too.  So I have 2 drives that are not cooperating.  I have a hunch
>>hdb1 might not have been working for some time.
>>
>>I found some info stating that if you mark the drive that failed first
>>as "failed-disk" and try a "mkraid --force --dangerous-no-resync
>>/dev/md0", then I might have some luck getting my files back.  From my
>>logs I can see that all the working drives have event counter
>>00000022, hdj1 has event counter 00000021, and hdb1 has event counter
>>00000001.  Does this mean that hdb1 failed a long time ago, or is the
>>difference in event counters likely within a few minutes of each
>>other?
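>>
>>If I am reading the raidtools docs right, marking hdb1 as failed would
>>mean a raidtab along these lines (a sketch only, not my actual file):
>>
>>    raiddev /dev/md0
>>        raid-level              5
>>        nr-raid-disks           6
>>        persistent-superblock   1
>>        device                  /dev/hda1
>>        raid-disk               0
>>        device                  /dev/hdb1
>>        failed-disk             1
>>        # hdc1, hdd1, hdi1, hdj1 continue as raid-disk 2 through 5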
>>I just ran badblocks on both hdb1 and hdj1 and found 1 bad block on
>>hdb1 and about 15 on hdj1; would that be enough to cause my raid to
>>get this out of whack?  In any case I plan to replace those drives,
>>but would the method above be the best route once I have copied the
>>raw data to the new drives, in order to bring my raid back up?
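>>
>>(For the record, those were plain read-only scans, along the lines of:
>>
>>    badblocks -sv /dev/hdb1
>>    badblocks -sv /dev/hdj1
>>
>>with -s for progress and -v to list each bad block found.)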
>>
>>
>>Thanks,
>>
>>bjz
>>
>>Here is my log from when I run raidstart /dev/md0:
>>
>>Nov 29 10:10:19 orion kernel:  [events: 00000022]
>>Nov 29 10:10:19 orion last message repeated 3 times
>>Nov 29 10:10:19 orion kernel:  [events: 00000021]
>>Nov 29 10:10:19 orion kernel: md: autorun ...
>>Nov 29 10:10:19 orion kernel: md: considering hdj1 ...
>>Nov 29 10:10:19 orion kernel: md:  adding hdj1 ...
>>Nov 29 10:10:19 orion kernel: md:  adding hdi1 ...
>>Nov 29 10:10:19 orion kernel: md:  adding hdd1 ...
>>Nov 29 10:10:19 orion kernel: md:  adding hdc1 ...
>>Nov 29 10:10:19 orion kernel: md:  adding hda1 ...
>>Nov 29 10:10:19 orion kernel: md: created md0
>>Nov 29 10:10:19 orion kernel: md: bind<hda1,1>
>>Nov 29 10:10:19 orion kernel: md: bind<hdc1,2>
>>Nov 29 10:10:19 orion kernel: md: bind<hdd1,3>
>>Nov 29 10:10:19 orion kernel: md: bind<hdi1,4>
>>Nov 29 10:10:19 orion kernel: md: bind<hdj1,5>
>>Nov 29 10:10:19 orion kernel: md: running: <hdj1><hdi1><hdd1><hdc1><hda1>
>>Nov 29 10:10:19 orion kernel: md: hdj1's event counter: 00000021
>>Nov 29 10:10:19 orion kernel: md: hdi1's event counter: 00000022
>>Nov 29 10:10:19 orion kernel: md: hdd1's event counter: 00000022
>>Nov 29 10:10:19 orion kernel: md: hdc1's event counter: 00000022
>>Nov 29 10:10:19 orion kernel: md: hda1's event counter: 00000022
>>Nov 29 10:10:19 orion kernel: md: superblock update time inconsistency -- using the most recent one
>>Nov 29 10:10:19 orion kernel: md: freshest: hdi1
>>Nov 29 10:10:19 orion kernel: md0: kicking faulty hdj1!
>>Nov 29 10:10:19 orion kernel: md: unbind<hdj1,4>
>>Nov 29 10:10:19 orion kernel: md: export_rdev(hdj1)
>>Nov 29 10:10:19 orion kernel: md: md0: raid array is not clean -- starting background reconstruction
>>Nov 29 10:10:19 orion kernel: md0: max total readahead window set to 2560k
>>Nov 29 10:10:19 orion kernel: md0: 5 data-disks, max readahead per data-disk: 512k
>>Nov 29 10:10:19 orion kernel: raid5: device hdi1 operational as raid disk 4
>>Nov 29 10:10:19 orion kernel: raid5: device hdd1 operational as raid disk 3
>>Nov 29 10:10:19 orion kernel: raid5: device hdc1 operational as raid disk 2
>>Nov 29 10:10:19 orion kernel: raid5: device hda1 operational as raid disk 0
>>Nov 29 10:10:19 orion kernel: raid5: not enough operational devices for md0 (2/6 failed)
>>Nov 29 10:10:19 orion kernel: RAID5 conf printout:
>>Nov 29 10:10:19 orion kernel:  --- rd:6 wd:4 fd:2
>>Nov 29 10:10:19 orion kernel:  disk 0, s:0, o:1, n:0 rd:0 us:1 dev:hda1
>>Nov 29 10:10:19 orion kernel:  disk 1, s:0, o:0, n:1 rd:1 us:1 dev:[dev 00:00]
>>Nov 29 10:10:19 orion kernel:  disk 2, s:0, o:1, n:2 rd:2 us:1 dev:hdc1
>>Nov 29 10:10:19 orion kernel:  disk 3, s:0, o:1, n:3 rd:3 us:1 dev:hdd1
>>Nov 29 10:10:19 orion kernel:  disk 4, s:0, o:1, n:4 rd:4 us:1 dev:hdi1
>>Nov 29 10:10:19 orion kernel:  disk 5, s:0, o:0, n:5 rd:5 us:1 dev:[dev 00:00]
>>Nov 29 10:10:19 orion kernel: raid5: failed to run raid set md0
>>Nov 29 10:10:19 orion kernel: md: pers->run() failed ...
>>Nov 29 10:10:19 orion kernel: md: do_md_run() returned -22
>>Nov 29 10:10:19 orion kernel: md: md0 stopped.
>>Nov 29 10:10:19 orion kernel: md: unbind<hdi1,3>
>>Nov 29 10:10:19 orion kernel: md: export_rdev(hdi1)
>>Nov 29 10:10:19 orion kernel: md: unbind<hdd1,2>
>>Nov 29 10:10:19 orion kernel: md: export_rdev(hdd1)
>>Nov 29 10:10:19 orion kernel: md: unbind<hdc1,1>
>>Nov 29 10:10:19 orion kernel: md: export_rdev(hdc1)
>>Nov 29 10:10:19 orion kernel: md: unbind<hda1,0>
>>Nov 29 10:10:19 orion kernel: md: export_rdev(hda1)
>>Nov 29 10:10:19 orion kernel: md: ... autorun DONE.


Thread overview: 10+ messages
2004-11-29 20:56 RE: RAID5 Not coming back up after crash BERNARD JOHN ZOLP
2004-11-29 22:29 ` Guy
2004-11-30  5:38   ` B. J. Zolp [this message]
2004-11-30  5:45     ` Neil Brown
2004-11-30  5:48       ` B. J. Zolp
2004-11-30  5:54         ` Neil Brown
2004-11-30  6:33           ` THANKS!! was:Re: " B. J. Zolp
  -- strict thread matches above, loose matches on Subject: below --
2004-11-29 16:33 B. J. Zolp
2004-11-29 17:40 ` Guy
2004-11-30 21:29   ` Frank van Maarseveen
