The right way to recover from md partition failure?

linux-raid.vger.kernel.org archive mirror

* The right way to recover from md partition failure?
@ 2004-08-30 19:38 Jonathan Baker-Bates
  2004-08-30 20:14 ` Guy
  0 siblings, 1 reply; 8+ messages in thread
From: Jonathan Baker-Bates @ 2004-08-30 19:38 UTC (permalink / raw)
  To: linux-raid

I've been reading various FAQs and HOWTOs, but for some reason can't really
get an answer to what I assume is a simple question about how best to get a
failed md RAID 1 partition back into an array.

After a power-outage, I see that cat /proc/mdstat shows:

Personalities : [raid1]
read_ahead 1024 sectors
Event: 3
md1 : active raid1 hdg3[1]
      178787264 blocks [2/1] [_U]

md0 : active raid1 hde2[0] hdg2[1]
      2048192 blocks [2/2] [UU]

md2 : active raid1 hde1[0] hdg1[1]
      104320 blocks [2/2] [UU]

unused devices: <none>

So it looks like /dev/hde3 is down. I'm not sure exactly why this is, but
there were some console messages about a bad block or something. So,
assuming hdg3 is OK (which it seems to be) can I just do the following?

Copy good partition to bad one:

dd if=/dev/hdg3 of=/dev/hde3

Add the resulting copy to the raid:

raidhotadd /dev/md1 /dev/hde3

fsck /dev/md1 to make sure all is well.

Is there a better way?

Jonathan


* RE: The right way to recover from md partition failure?
  2004-08-30 19:38 The right way to recover from md partition failure? Jonathan Baker-Bates
@ 2004-08-30 20:14 ` Guy
  2004-08-30 21:33   ` David Greaves
  2004-08-30 21:44   ` Jonathan Baker-Bates
  0 siblings, 2 replies; 8+ messages in thread
From: Guy @ 2004-08-30 20:14 UTC (permalink / raw)
  To: 'Jonathan Baker-Bates', linux-raid

No need to copy, that's what md does.

Check whether the disk is still listed in the array:
mdadm -D /dev/md1

I bet you will find the disk is there, but failed.
So, raidhotremove it, then raidhotadd it.

mdadm is the preferred tool.  The old raidtools are not supported.
For details:
man mdadm

You may need to install mdadm.

mdadm --manage /dev/md1 -r /dev/hde3
mdadm --manage /dev/md1 -a /dev/hde3

or short form:
mdadm /dev/md1 -r /dev/hde3
mdadm /dev/md1 -a /dev/hde3
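
The remove-then-add sequence above can be wrapped in a small shell function. This is only a sketch: the function name readd_member and the DRY_RUN variable are made up for illustration, and the device names are the ones from this thread. With DRY_RUN set it prints the commands instead of running them, which is a cheap way to double-check the device names first.

```shell
# Sketch only: readd_member and DRY_RUN are hypothetical names.
# Removes a member from an md array, then adds it back so md resyncs it.
readd_member() {
    md=$1      # array, e.g. /dev/md1
    part=$2    # member partition, e.g. /dev/hde3
    # When DRY_RUN is non-empty, prefix each command with echo.
    run=${DRY_RUN:+echo}
    # The remove can fail if the member has already dropped out of the array;
    # report that and carry on with the add.
    $run mdadm "$md" -r "$part" ||
        echo "remove failed (member may already be out of the array)" >&2
    $run mdadm "$md" -a "$part"
}

# Example: DRY_RUN=1 readd_member /dev/md1 /dev/hde3
```

The function returns the exit status of the add step, so a caller can tell whether the member actually went back in.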

It should start to re-sync.  Monitor the status with:
cat /proc/mdstat
and/or
mdadm -D /dev/md1
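
For a quick check of which arrays are degraded, the /proc/mdstat text can be filtered with awk. The helper below is a sketch under the assumption that the file looks like the output quoted in this thread (a "mdN :" line followed by a "blocks" line with a status map such as [_U]); the name degraded_arrays is hypothetical.

```shell
# Sketch only: degraded_arrays is a hypothetical helper name.
# Prints the name of every md array whose status map (e.g. [_U]) shows a
# missing member. Reads /proc/mdstat unless a file is given as $1.
degraded_arrays() {
    awk '
        /^md[0-9]+ :/ { name = $1 }                   # remember the current array
        /blocks/ && /\[[U_]*_[U_]*\]/ { print name }  # "_" marks a missing member
    ' "${1:-/proc/mdstat}"
}
```

Run against the mdstat output at the top of this thread, it would list md1 only.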

Guy

-
To unsubscribe from this list: send the line "unsubscribe linux-raid" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


* Re: The right way to recover from md partition failure?
  2004-08-30 20:14 ` Guy
@ 2004-08-30 21:33   ` David Greaves
  2004-08-30 21:50     ` Jonathan Baker-Bates
  2004-08-30 22:17     ` Philip Molter
  2004-08-30 21:44   ` Jonathan Baker-Bates
  1 sibling, 2 replies; 8+ messages in thread
From: David Greaves @ 2004-08-30 21:33 UTC (permalink / raw)
  To: Guy; +Cc: 'Jonathan Baker-Bates', linux-raid

I think a better approach might be:

mdadm /dev/md1 -r /dev/hde3
dd if=/dev/hde3 of=/dev/null
check logs for nasty errors and only continue if there weren't any :)
mdadm /dev/md1 -a /dev/hde3
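
The read test in the middle of that sequence can be sketched as a shell function that only reports success when dd finishes without a read error. The name read_test is hypothetical, and a clean exit from dd is not proof of a healthy disk, so the logs still need a look afterwards.

```shell
# Sketch only: read_test is a hypothetical helper name.
# Reads every block of a device (or file) and discards the data. It never
# writes to the device. Returns non-zero if dd reports a failure.
read_test() {
    dev=$1
    if dd if="$dev" of=/dev/null bs=64k 2>/dev/null; then
        echo "read of $dev completed; still check the kernel logs"
    else
        echo "read of $dev FAILED; do not re-add it yet" >&2
        return 1
    fi
}

# Example, with the devices from this thread:
#   mdadm /dev/md1 -r /dev/hde3
#   read_test /dev/hde3 && mdadm /dev/md1 -a /dev/hde3
```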

Having done this very thing this afternoon!!

If you have "some console messages about a bad block or something" then 
I'd make damn sure your disk is good before putting it back.
If you end up doing lots of retries during the resync and an error 
occurs on a remaining drive you'll be sorry!

In general a raid failure means you should suspect a disk failure.

I just wish Jeff G would get off his backside and make SMART work with 
libata - doesn't the man work on bank holidays? ;)

David



* RE: The right way to recover from md partition failure?
  2004-08-30 20:14 ` Guy
  2004-08-30 21:33   ` David Greaves
@ 2004-08-30 21:44   ` Jonathan Baker-Bates
  1 sibling, 0 replies; 8+ messages in thread
From: Jonathan Baker-Bates @ 2004-08-30 21:44 UTC (permalink / raw)
  To: linux-raid

> -----Original Message-----
> From: linux-raid-owner@vger.kernel.org
> [mailto:linux-raid-owner@vger.kernel.org]On Behalf Of Guy
> Sent: 30 August 2004 21:14
> To: 'Jonathan Baker-Bates'; linux-raid@vger.kernel.org
> Subject: RE: The right way to recover from md partition failure?
>
>
> No need to copy, that's what md does.
>
> Verify that the disk is not part of the array:
> mdadm -D /dev/md1
>
> I bet you will find the disk is there, but failed.
> So, raidhotremove it, then raidhotadd it.
>
> mdadm is the preferred tool.  The old raidtools are not supported.
> For details:
> man mdadm
>
> You may need to install mdadm.
>
> mdadm --manage /dev/md1 -r /dev/hde3
> mdadm --manage /dev/md1 -a /dev/hde3
>
> or short form:
> mdadm /dev/md1 -r /dev/hde3
> mdadm /dev/md1 -a /dev/hde3
>
> It should start to re-sync.  Monitor the status with:
> cat /proc/mdstat
> and/or
> mdadm -D /dev/md1
>

Ah, thanks. I'll need to do a backup just in case before I try that though.
One question meanwhile: If there are bad blocks on the drive, and assuming
mdadm adds that disk to the array OK, can I fsck /dev/md1 in the normal way
and repair or mark them as bad? I'm a bit confused about using fsck and
RAID.

Jonathan




* RE: The right way to recover from md partition failure?
  2004-08-30 21:33   ` David Greaves
@ 2004-08-30 21:50     ` Jonathan Baker-Bates
  2004-08-30 22:11       ` David Greaves
  2004-08-30 22:17     ` Philip Molter
  1 sibling, 1 reply; 8+ messages in thread
From: Jonathan Baker-Bates @ 2004-08-30 21:50 UTC (permalink / raw)
  To: linux-raid

> -----Original Message-----
> From: David Greaves [mailto:david@dgreaves.com]
> Sent: 30 August 2004 22:33
> To: Guy
> Cc: 'Jonathan Baker-Bates'; linux-raid@vger.kernel.org
> Subject: Re: The right way to recover from md partition failure?
>
>
> I think a better approach might be:
>
> mdadm /dev/md1 -r /dev/hde3
> dd if=/dev/hde3 of=/dev/null

Why the /dev/null-ing?

> check logs for nasty errors and only continue if there weren't any :)
> mdadm /dev/md1 -a /dev/hde3
>
> Having done this very thing this afternoon!!
>
> If you have "some console messages about a bad block or something" then
> I'd make damn sure your disk is good before putting it back.
> If you end up doing lots of retries during the resync and an error
> occurs on a remaining drive you'll be sorry!
>
> In general a raid failure means you should suspect a disk failure.
>

Now it's the issue of making sure the disk is good that was worrying me. How
do I make sure? Hence my question to Guy about fsck.

Jonathan




* Re: The right way to recover from md partition failure?
  2004-08-30 21:50     ` Jonathan Baker-Bates
@ 2004-08-30 22:11       ` David Greaves
  0 siblings, 0 replies; 8+ messages in thread
From: David Greaves @ 2004-08-30 22:11 UTC (permalink / raw)
  To: Jonathan Baker-Bates; +Cc: linux-raid

Jonathan Baker-Bates wrote:

>>-----Original Message-----
>>From: David Greaves [mailto:david@dgreaves.com]
>>Sent: 30 August 2004 22:33
>>To: Guy
>>Cc: 'Jonathan Baker-Bates'; linux-raid@vger.kernel.org
>>Subject: Re: The right way to recover from md partition failure?
>>
>>
>>I think a better approach might be:
>>
>>mdadm /dev/md1 -r /dev/hde3
>>dd if=/dev/hde3 of=/dev/null
>>    
>>
>
>Why the /dev/null-ing?
>  
>
Since you ask I guess you're new at this?
First off, be careful: check the dd syntax closely, because a mistake can
ruin your whole day.
In this case dd reads straight from the hard disk device and sends the
data to /dev/null.
The objective is to make the disk read every sector in the partition
so that the OS flags any low-level read errors.
Even if the dd command doesn't report any errors - CHECK THE LOGS
If a read only succeeds on a 'retry' then I'd suspect the disk - if you have
*any* errors - suspect the disk.

>>check logs for nasty errors and only continue if there weren't any :)
>>    
>>
check /var/log/messages and /var/log/kernel
Let us know what they say.
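
One rough way to do that log check from the shell is to grep for lines that name the disk together with the word "error". This is a sketch with a deliberately loose pattern, not a list of the messages the kernel actually prints; the helper name disk_errors is hypothetical, and the log path in the example is the one mentioned above.

```shell
# Sketch only: disk_errors is a hypothetical helper, and the pattern is a
# loose guess (any line naming the disk followed by "error", any case).
# Exit status is 0 if something matched, 1 if nothing did.
disk_errors() {
    disk=$1     # short device name, e.g. hde
    shift       # remaining arguments are the log files to search
    grep -i -e "$disk.*error" -- "$@"
}

# Example: disk_errors hde /var/log/messages
```

No output is only weak evidence, since a problem may be logged in wording the pattern does not catch.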

>>mdadm /dev/md1 -a /dev/hde3
>>
>>Having done this very thing this afternoon!!
>>
>>If you have "some console messages about a bad block or something" then
>>I'd make damn sure your disk is good before putting it back.
>>If you end up doing lots of retries during the resync and an error
>>occurs on a remaining drive you'll be sorry!
>>
>>In general a raid failure means you should suspect a disk failure.
>>
>>    
>>
>
>Now it's the issue of making sure the disk is good that was worrying me. How
>do I make sure? Hence my question to Guy about fsck.
>  
>
No
fsck will check to see if the *filesystem* is good - it will be.
To be honest you shouldn't have noticed any problems - the disk failed - 
it happens - that's why you have RAID.
Smile - right now your system would be toast without it.

[Aside: FYI, disk systems are 'layered'.
In your case data (files) lives 'on top' of the filesystem which lives 
on top of the md1 device which lives on top of the /dev/hd?? devices.
The md1 is designed to keep working if either /dev/hd?? fails - so the 
filesystem and your files should never notice.
]

Anyway, of course disks sometimes have glitches (eg if it gets too hot etc).
You should probably go and get smartmontools (it reads your
disk's SMART health status)

If you do have errors then shut down if you can and check your cables 
and make sure all your fans are OK.
Reboot and try the dd again.
If you get errors again then you can try changing the IDE cable.
If you *still* have errors then get yourself online and dig out the 
credit-card for a new disk.

David



* Re: The right way to recover from md partition failure?
  2004-08-30 21:33   ` David Greaves
  2004-08-30 21:50     ` Jonathan Baker-Bates
@ 2004-08-30 22:17     ` Philip Molter
  2004-08-30 23:27       ` Guy
  1 sibling, 1 reply; 8+ messages in thread
From: Philip Molter @ 2004-08-30 22:17 UTC (permalink / raw)
  To: linux-raid

David Greaves wrote:
> I think a better approach might be:
> 
> mdadm /dev/md1 -r /dev/hde3
> dd if=/dev/hde3 of=/dev/null
> check logs for nasty errors and only continue if there weren't any :)
> mdadm /dev/md1 -a /dev/hde3

Normally, for this I:

dd if=/dev/zero of=/dev/hde3
dd if=/dev/hde3 of=/dev/null

The write will usually make the hard drive remap any bad sectors
internally, and bad sectors are usually what causes RAID failures on
IDE drives (in my experience).
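
The write-then-read pass above can be sketched as one shell function. Everything here is an assumption layered on the two dd commands: the name zero_then_read is hypothetical, blockdev (from util-linux) is assumed to be installed, and an explicit count= is added so that dd stops at the end of the target. It DESTROYS the contents of the target, so it only makes sense on a partition that has already been removed from the array.

```shell
# Sketch only: zero_then_read is a hypothetical helper name.
# DESTROYS all data on the target: overwrites it with zeros, then reads it back.
zero_then_read() {
    target=$1
    if [ -b "$target" ]; then
        size=$(blockdev --getsize64 "$target")   # block device: size in bytes
    else
        size=$(wc -c < "$target")                # regular file (for a trial run)
    fi
    blocks=$((size / 512))
    # count= stops dd at the end of the target instead of running into
    # "No space left on device", so a non-zero status points at a real problem.
    # bs=512 keeps the arithmetic simple but is slow on a large partition.
    dd if=/dev/zero of="$target" bs=512 count="$blocks" conv=notrunc 2>/dev/null || return 1
    sync    # flush cached writes to the disk before reading back
    dd if="$target" of=/dev/null bs=512 2>/dev/null
}
```

Trying it on a scratch file first is a low-risk way to confirm the argument order before pointing it at a real partition.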


* RE: The right way to recover from md partition failure?
  2004-08-30 22:17     ` Philip Molter
@ 2004-08-30 23:27       ` Guy
  0 siblings, 0 replies; 8+ messages in thread
From: Guy @ 2004-08-30 23:27 UTC (permalink / raw)
  To: 'Philip Molter', linux-raid

Yes! That was my plan, I just did not take the time to explain.
When md re-syncs the disk, the write to the "bad" disk should fix/re-locate
the bad blocks.

or

What he said (Philip), but be very careful!
dd if=/dev/zero of=/dev/hde3
dd if=/dev/hde3 of=/dev/null

SCSI disks have the same bad block issue.  md does not support bad blocks. :(

You should test the disks once per day!  In my opinion!

Guy

