* Failed RAID-5 with 4 disks
@ 2005-07-26 17:03 Frank Blendinger
2005-07-26 21:58 ` Tyler
0 siblings, 1 reply; 9+ messages in thread
From: Frank Blendinger @ 2005-07-26 17:03 UTC (permalink / raw)
To: linux-raid
Hi,
I have a RAID-5 set up with the following raidtab:
raiddev /dev/md0
    raid-level            5
    nr-raid-disks         4
    nr-spare-disks        0
    persistent-superblock 1
    parity-algorithm      left-symmetric
    chunk-size            256
    device                /dev/hde
    raid-disk             0
    device                /dev/hdg
    raid-disk             1
    device                /dev/hdi
    raid-disk             2
    device                /dev/hdk
    raid-disk             3
My hde failed some time ago, leaving some
hde: dma_intr: status=0x51 { DriveReady SeekComplete Error }
hde: dma_intr: error=0x84 { DriveStatusError BadCRC }
messages in the syslog.
I wanted to make sure it really was damaged, so I did a badblocks
(read-only) scan on /dev/hde. It did indeed find a bad sector on the
disk.
I wanted to take the disk out and replace it with a new one, but
unfortunately my hdg now seems to have run into trouble as well. I see
the same SeekComplete/BadCRC errors in my log for that disk.
Furthermore, I got this:
Jul 25 10:35:49 blackbox kernel: ide: failed opcode was: unknown
Jul 25 10:35:49 blackbox kernel: hdg: DMA disabled
Jul 25 10:35:49 blackbox kernel: PDC202XX: Secondary channel reset.
Jul 25 10:35:49 blackbox kernel: PDC202XX: Primary channel reset.
Jul 25 10:35:49 blackbox kernel: hde: lost interrupt
Jul 25 10:35:49 blackbox kernel: ide3: reset: master: error (0x00?)
Jul 25 10:35:49 blackbox kernel: end_request: I/O error, dev hdg, sector 488396928
Jul 25 10:35:49 blackbox kernel: end_request: I/O error, dev hdg, sector 159368976
Jul 25 10:35:49 blackbox kernel: end_request: I/O error, dev hdg, sector 159368984
Jul 25 10:35:49 blackbox kernel: end_request: I/O error, dev hdg, sector 159368992
Jul 25 10:35:49 blackbox kernel: end_request: I/O error, dev hdg, sector 159369000
Jul 25 10:35:49 blackbox kernel: end_request: I/O error, dev hdg, sector 159369008
Jul 25 10:35:49 blackbox kernel: end_request: I/O error, dev hdg, sector 159369016
Jul 25 10:35:49 blackbox kernel: end_request: I/O error, dev hdg, sector 159369024
Jul 25 10:35:49 blackbox kernel: end_request: I/O error, dev hdg, sector 159369032
Jul 25 10:35:49 blackbox kernel: md: write_disk_sb failed for device hdg
Jul 25 10:35:49 blackbox kernel: lost page write due to I/O error on md0
Jul 25 10:35:49 blackbox kernel: lost page write due to I/O error on md0
Jul 25 10:35:49 blackbox kernel: RAID5 conf printout:
Jul 25 10:35:49 blackbox kernel: --- rd:4 wd:2 fd:2
Jul 25 10:35:49 blackbox kernel: disk 0, o:1, dev:hdk
Jul 25 10:35:49 blackbox kernel: disk 1, o:1, dev:hdi
Jul 25 10:35:49 blackbox kernel: disk 2, o:0, dev:hdg
Jul 25 10:35:49 blackbox kernel: RAID5 conf printout:
Jul 25 10:35:49 blackbox kernel: --- rd:4 wd:2 fd:2
Jul 25 10:35:49 blackbox kernel: disk 0, o:1, dev:hdk
Jul 25 10:35:49 blackbox kernel: disk 1, o:1, dev:hdi
Jul 25 10:35:49 blackbox kernel: lost page write due to I/O error on md0
Well, now it seems I have two failed disks in my RAID-5, which of course
would be fatal. I am still hoping to somehow rescue the data on the
array, but I am not sure what the best approach would be. I don't want
to cause any more damage.
When booting my system with all four disks connected, hde and hdg as
expected won't get added:
Jul 26 18:07:59 blackbox kernel: md: hdg has invalid sb, not importing!
Jul 26 18:07:59 blackbox kernel: md: autorun ...
Jul 26 18:07:59 blackbox kernel: md: considering hdi ...
Jul 26 18:07:59 blackbox kernel: md: adding hdi ...
Jul 26 18:07:59 blackbox kernel: md: adding hdk ...
Jul 26 18:07:59 blackbox kernel: md: adding hde ...
Jul 26 18:07:59 blackbox kernel: md: created md0
Jul 26 18:07:59 blackbox kernel: md: bind<hde>
Jul 26 18:07:59 blackbox kernel: md: bind<hdk>
Jul 26 18:07:59 blackbox kernel: md: bind<hdi>
Jul 26 18:07:59 blackbox kernel: md: running: <hdi><hdk><hde>
Jul 26 18:07:59 blackbox kernel: md: kicking non-fresh hde from array!
Jul 26 18:07:59 blackbox kernel: md: unbind<hde>
Jul 26 18:07:59 blackbox kernel: md: export_rdev(hde)
Jul 26 18:07:59 blackbox kernel: raid5: device hdi operational as raid disk 1
Jul 26 18:07:59 blackbox kernel: raid5: device hdk operational as raid disk 0
Jul 26 18:07:59 blackbox kernel: RAID5 conf printout:
Jul 26 18:07:59 blackbox kernel: --- rd:4 wd:2 fd:2
Jul 26 18:07:59 blackbox kernel: disk 0, o:1, dev:hdk
Jul 26 18:07:59 blackbox kernel: disk 1, o:1, dev:hdi
Jul 26 18:07:59 blackbox kernel: md: do_md_run() returned -22
Jul 26 18:07:59 blackbox kernel: md: md0 stopped.
Jul 26 18:07:59 blackbox kernel: md: unbind<hdi>
Jul 26 18:07:59 blackbox kernel: md: export_rdev(hdi)
Jul 26 18:07:59 blackbox kernel: md: unbind<hdk>
Jul 26 18:07:59 blackbox kernel: md: export_rdev(hdk)
Jul 26 18:07:59 blackbox kernel: md: ... autorun DONE.
So hde is not fresh (it has been removed from the array for quite some
time now) and hdg has an invalid superblock.
Any advice on what I should do now? Would it be better to try to rebuild
the array with hde or with hdg?
Greetings,
Frank
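
As a first step, the state of the array and the suspect drives can be
captured with something like the following (a sketch, not commands from
the thread; it uses the device names from the raidtab above and assumes
smartmontools is installed):

  # the kernel's current view of the md arrays
  cat /proc/mdstat

  # examine the md superblock on each member; the event counters and
  # update times show which disks md still considers fresh
  mdadm -E /dev/hde /dev/hdg /dev/hdi /dev/hdk

  # read-only SMART health summary for the two suspect drives
  smartctl -a /dev/hde
  smartctl -a /dev/hdg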
* Re: Failed RAID-5 with 4 disks
  2005-07-26 17:03 Failed RAID-5 with 4 disks Frank Blendinger
@ 2005-07-26 21:58 ` Tyler
  2005-07-26 22:36   ` Dan Stromberg
                     ` (2 more replies)
  0 siblings, 3 replies; 9+ messages in thread
From: Tyler @ 2005-07-26 21:58 UTC (permalink / raw)
To: Frank Blendinger; +Cc: linux-raid

My suggestion would be to buy two new drives, and DD (or dd rescue) the
two bad drives onto the new drives, and then plug the new drive that has
the most recent failure on it (HDG?) in, and try running a forced
assemble including the HDG drive, then, in readonly mode, run an fsck to
check the file system, and see if it thinks most things are okay. *IF*
it checks out okay (for the most part.. you will probably lose some
data), then plug the second new disk in, and add it to the array as a
spare, and it would then start a resync of the array. Otherwise, if the
fsck found that the entire filesystem was fubar... then I would try the
above steps, but force the assemble with the original failed disk.. but
depending on how long in between the two failures its been, and if any
data was written to the array after the first failure, this is probably
not going to be a good thing.. but could still be useful if you were
trying to recover specific files that were not touched in between the
two failures.

I would also suggest googling raid manual recovery procedures, some info
is outdated, but some of it describes what I just described above.

Tyler.

Frank Blendinger wrote:
> [original message quoted in full -- trimmed; see above]
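
Put as commands, the sequence Tyler describes might look roughly like
this (a sketch only, not taken from the thread: it assumes the copies
are installed in the failed drives' positions so they reappear as
/dev/hde and /dev/hdg, and that dd_rescue is available):

  # 1. Copy each failing drive onto a new drive of at least the same
  #    size; dd_rescue keeps every readable sector at its original
  #    offset even when it hits read errors.
  dd_rescue /dev/hdg /dev/hdX   # copy of the most recent failure
  dd_rescue /dev/hde /dev/hdY   # copy of the first failure, as fallback
  # (/dev/hdX and /dev/hdY are placeholders for the new blank drives)

  # 2. With the copy of hdg installed in hdg's place, force-assemble
  #    from it plus the two good members.
  mdadm --assemble --force /dev/md0 /dev/hdg /dev/hdi /dev/hdk

  # 3. Check the filesystem without writing anything to it.
  fsck -n /dev/md0

  # 4. If it looks mostly intact, add the remaining new disk and let md
  #    resync onto it.
  mdadm /dev/md0 --add /dev/hde

The -n flag keeps fsck read-only, so if the forced assembly turns out to
be the wrong choice, the other copy can still be tried.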
* Re: Failed RAID-5 with 4 disks
  2005-07-26 21:58 ` Tyler
@ 2005-07-26 22:36   ` Dan Stromberg
  2005-07-26 23:12   ` Tyler
  2005-09-16 11:36   ` Frank Blendinger
  2 siblings, 0 replies; 9+ messages in thread
From: Dan Stromberg @ 2005-07-26 22:36 UTC (permalink / raw)
To: Tyler; +Cc: Frank Blendinger, linux-raid, strombrg

It seems theoretically possible, but might take some semi-substantial
coding effort, to build a program that would sort of combine fsck and
RAID information to get back as much as is feasible.

For example, this program might track down all the inodes it can, and
perhaps even hunt for inode magic numbers to augment the inode list
(using fuzzy logic - if the inode points to a reasonable series of
blocks, for example, has a possible file length, and the files are owned
by an existing user, each of these strengthens the program's belief that
it has heuristically found a real inode), and then look down the block
pointers in the inodes for blocks that are known still good, ignoring
the ones that aren't.

I suppose then you'd need some sort of table, built out of lots of range
arithmetic, indicating that N files were 100% recovered, and M files
were recovered only in ranges such and so.

Might actually be kind of a fun project once you get really immersed
into it.

HTH. :)

On Tue, 2005-07-26 at 14:58 -0700, Tyler wrote:
> [Tyler's suggestion and the quoted original message trimmed; see above]
* Re: Failed RAID-5 with 4 disks
  2005-07-26 21:58 ` Tyler
  2005-07-26 22:36   ` Dan Stromberg
@ 2005-07-26 23:12   ` Tyler
  2005-09-16 11:36   ` Frank Blendinger
  2 siblings, 0 replies; 9+ messages in thread
From: Tyler @ 2005-07-26 23:12 UTC (permalink / raw)
Cc: Frank Blendinger, linux-raid

I had a typo in my original email, near the end, where I say "was
fubar... then I would try the above steps, but force the assemble with
the original failed disk". I actually meant to say "but force the
assemble with the newly DD'd copy of the original/first drive that
failed."

Tyler.

Tyler wrote:
> [previous message quoted in full -- trimmed; see above]
* Re: Failed RAID-5 with 4 disks
  2005-07-26 21:58 ` Tyler
  2005-07-26 22:36   ` Dan Stromberg
  2005-07-26 23:12   ` Tyler
@ 2005-09-16 11:36   ` Frank Blendinger
  [not found]         ` <432AFA95.3040709@h3c.com>
  2 siblings, 1 reply; 9+ messages in thread
From: Frank Blendinger @ 2005-09-16 11:36 UTC (permalink / raw)
To: linux-raid

On Tue, Jul 26, 2005 at 02:58:43PM -0700, Tyler wrote:
> [suggestion to dd/dd_rescue the two bad drives onto new disks and
> force-assemble from the copies -- trimmed; see above]

Thanks for your suggestions. This is what I have done so far: I got one
of the two bad drives (the one that failed first) replaced with a new
one. I copied the other bad drive to the new one with dd. I suspect not
everything could be copied correctly: I got 10 "Buffer I/O error on
device hdg, logical sector ..." and about 35 "end_request: I/O error,
hdg, sector ..." error messages in my syslog.

Now I'm stuck re-activating the array with the dd'ed hde and the working
hdi and hdk. I tried "mdadm --assemble --scan /dev/md0", which told me
"mdadm: /dev/md0 assembled from 2 drives - not enough to start the
array."

I then tried hot-adding hde with "mdadm --add /dev/hde [--force]
/dev/md0", but that only got me "mdadm: /dev/hde does not appear to be
an md device".

Any suggestions?

Greets,
Frank
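
Two quick checks apply at this point (a sketch, using the same device
names as in the thread): whether the copied disk actually carries an md
superblock, and the argument order mdadm expects in manage mode:

  mdadm -E /dev/hde     # superblock present on the dd'ed copy?
  mdadm -E /dev/hdi     # compare with a member that still assembles fine

  # manage mode takes the array first, then the disk, and only works
  # once /dev/md0 is actually running:
  mdadm /dev/md0 --add /dev/hde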
[parent not found: <432AFA95.3040709@h3c.com>]
* Re: Failed RAID-5 with 4 disks
  [not found]         ` <432AFA95.3040709@h3c.com>
@ 2005-09-16 19:09     ` Frank Blendinger
  2005-09-16 19:52       ` Mike Hardy
  2005-09-17  9:31       ` Burkhard Carstens
  0 siblings, 2 replies; 9+ messages in thread
From: Frank Blendinger @ 2005-09-16 19:09 UTC (permalink / raw)
To: linux-raid

On Fri, Sep 16, 2005 at 10:02:13AM -0700, Mike Hardy wrote:
> Frank Blendinger wrote:
> > [description of the dd copy and the failed "mdadm --assemble --scan"
> > attempt -- trimmed; see above]
>
> Does an mdadm -E on the dd'd /dev/hde show that it has the superblock
> and knows about the array? That would confirm that it has the data and
> is ready to go.

mdadm -E /dev/hde tells me:
mdadm: No super block found on /dev/hde (Expected magic a92b4efc, got
00000000)

> Given that, you want to do a version of the create/assemble that
> forcibly uses all three drives, even though one of them is out of date
> from the raid set's perspective.
>
> I believe its possible to issue a create line that has a 'missing' entry
> for the drive missing, but order is important. Luckily since one drive
> is missing, md won't resync or anything so you should get multiple tries.
>
> Something like 'mdadm --create --force -level 5 -n 4 /dev/md0 /dev/hda
> /dev/hdb /dev/hde missing' is what you're looking for.
>
> Obviously I don't know what your disk names are, so put the correct ones
> in there, not the ones I used. If you don't get a valid raid set from
> that, you could try moving the order around.

OK, sounds good. I tried this:

$ mdadm --create --force --level 5 -n 4 /dev/md0 /dev/hdi /dev/hdk /dev/hde missing
mdadm: /dev/hdi appears to be part of a raid array:
    level=5 devices=4 ctime=Mon Apr 18 21:05:23 2005
mdadm: /dev/hdk appears to contain an ext2fs file system
    size=732595200K  mtime=Sun Jul 24 03:08:46 2005
mdadm: /dev/hdk appears to be part of a raid array:
    level=5 devices=4 ctime=Mon Apr 18 21:05:23 2005
Continue creating array?

I'm not quite sure about the output: hdk gets listed twice (once falsely
as an ext2 filesystem) and hde (this is the dd'ed disk) not at all.
Should I continue here?

> Each time you do that, you'll be creating a brand new raid set with new
> superblocks, but the layout will hopefully match, and it won't update
> the data because a drive is missing. After the raid is created the right
> way, you should find your data.
>
> Then you can hot-add a new drive to the array, to get your redundancy
> back. I'd definitely use smartctl -t long on /dev/hdg to find the blocks
> that are bad, and use the BadBlockHowTo (google for that) so you can
> clear the bad blocks.

Of course I don't want the second broken hard drive as spare. I just
used it to dd its content to the new disk. I am going to get a new drive
for the second failed one once I have the array back up and running
(without redundancy).

Should I check for bad blocks on hdg and then repeat the dd to the new
disk?

> Alternatively you could forcibly assemble the array as it was with
> Neil's new faulty-read-correction patch, and the blocks will probably
> get auto-cleared.

I am still using mdadm 1.9.0 (the package that came with Debian sarge).
Would you suggest manually upgrading to a 2.0 version?

> > I then tried hot-adding hde with "mdadm --add /dev/hde [--force] /dev/md0"
> > but that only got me "mdadm: /dev/hde does not appear to be an md
> > device".
>
> You got the array and the drive in the wrong positions here, thus the
> error message, and you can't hot-add to an array that isn't started.
> hot-add is to add redundancy to an array that is already running - for
> instance after a drive has failed you hot-remove it, then after you've
> cleared bad blocks, you hot-add it.

I see, I completely misunderstood the manpage there.

Greets,
Frank
* Re: Failed RAID-5 with 4 disks
  2005-09-16 19:09 ` Frank Blendinger
@ 2005-09-16 19:52   ` Mike Hardy
  2005-09-17  9:31   ` Burkhard Carstens
  1 sibling, 0 replies; 9+ messages in thread
From: Mike Hardy @ 2005-09-16 19:52 UTC (permalink / raw)
To: linux-raid

Frank Blendinger wrote:
> On Fri, Sep 16, 2005 at 10:02:13AM -0700, Mike Hardy wrote:
> > Does an mdadm -E on the dd'd /dev/hde show that it has the superblock
> > and knows about the array? That would confirm that it has the data
> > and is ready to go.
>
> mdadm -E /dev/hde tells me:
> mdadm: No super block found on /dev/hde (Expected magic a92b4efc, got
> 00000000)

This is bad - it appears that your dd either did not work or was not
complete or something, but /dev/hde does not contain a copy of an array
component. You can not continue until you've got one of the two failed
disks' data.

> OK, sounds good. I tried this:
>
> $ mdadm --create --force --level 5 -n 4 /dev/md0 /dev/hdi /dev/hdk /dev/hde missing
> [mdadm output trimmed; see above]
>
> I'm not quite sure about the output: hdk gets listed twice (once
> falsely as an ext2 filesystem) and hde (this is the dd'ed disk) not at
> all. Should I continue here?

Not sure why hdk gets listed twice, but it's probably not a huge deal.
The missing hde is the big problem though. You don't have enough
components together yet.

> Of course I don't want the second broken hard drive as spare. I just
> used it to dd its content to the new disk. I am going to get a new
> drive for the second failed one once I have the array back up and
> running (without redundancy).

The 'missing' slot and '-n 4' tell mdadm that you are creating a 4-disk
array, but one of the slots is empty at this point. The array will be
created and initially run in degraded mode. When you add a disk to the
array later, it won't be a spare, it will give you the normal
redundancy.

> Should I check for bad blocks on hdg and then repeat the dd to the new
> disk?

I'm not going to comment on specific disks, I don't really know your
complete situation. The process is the point though. If you have a disk
that failed, for any reason, you should run a long SMART test on it
('smartctl -t long <disk>'). If it has bad blocks, you should fix them.
If it has data on it that is not redundant, you should try to copy it
elsewhere first. Once the disks pass the long SMART test, they're
capable of being used without problems in a raid array.

> > Alternatively you could forcibly assemble the array as it was with
> > Neil's new faulty-read-correction patch, and the blocks will probably
> > get auto-cleared.
>
> I am still using mdadm 1.9.0 (the package that came with Debian sarge).
> Would you suggest manually upgrading to a 2.0 version?

I'd get your data back first. It's clear you haven't used the tools
much, so I wouldn't throw an attempt at upgrading them into the mix.
Getting the data back is hard enough, even if the tools are old hat.

> I see, I completely misunderstood the manpage there.

Given this, and your questions about how to add the drives, and the
redundancy etc., the main thing I'd recommend is to practice this stuff
in a safe environment. If your data is important, and you plan on
running a raid for a while, it will really pay off in the long run, and
it's so much more relaxing to run this stuff when you're confident in
the tools you have to use when problems crop up.

How to do that? I'd use loop devices. You create a number of files of
the same size, you export them as loopback-mounted devices, and build a
raid out of them. Open a second terminal so you can
`watch cat /proc/mdstat`. Open a third so you can
`tail -f /var/log/messages`, then start playing around with creating new
arrays out of the loop files, hot-removing, hot-adding etc. All in a
safe way.

I wrote a script that facilitates this a while back and posted it to the
list: http://www.spinics.net/lists/raid/msg07564.html

You should be able to simulate what you need to do with your real disks
by setting up 4 loop devices and failing two, then attempting to
recover.

-Mike
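
A throwaway practice array along the lines Mike describes can be built
entirely from files (a sketch, assuming root privileges, roughly 400 MB
of free space, and that /dev/loop0 through /dev/loop3 are unused):

  # create four 100 MB backing files
  for i in 0 1 2 3; do dd if=/dev/zero of=/tmp/raidfile$i bs=1M count=100; done

  # attach them to loop devices
  for i in 0 1 2 3; do losetup /dev/loop$i /tmp/raidfile$i; done

  # build a 4-disk RAID-5 out of the loop devices
  mdadm --create /dev/md1 --level=5 --raid-devices=4 \
        /dev/loop0 /dev/loop1 /dev/loop2 /dev/loop3

  # practice failure handling without risking real data
  mdadm /dev/md1 --fail /dev/loop2
  mdadm /dev/md1 --remove /dev/loop2
  mdadm /dev/md1 --add /dev/loop2

  # tear down when done
  mdadm --stop /dev/md1
  for i in 0 1 2 3; do losetup -d /dev/loop$i; done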
* Re: Failed RAID-5 with 4 disks
  2005-09-16 19:09 ` Frank Blendinger
  2005-09-16 19:52   ` Mike Hardy
@ 2005-09-17  9:31   ` Burkhard Carstens
  2005-09-17 16:46     ` Frank Blendinger
  1 sibling, 1 reply; 9+ messages in thread
From: Burkhard Carstens @ 2005-09-17 9:31 UTC (permalink / raw)
To: linux-raid

On Friday, 16 September 2005 21:09, Frank Blendinger wrote:
> On Fri, Sep 16, 2005 at 10:02:13AM -0700, Mike Hardy wrote:
> > Frank Blendinger wrote:
> > > [description of copying the failing hdg to the new disk with dd,
> > > with read errors in the syslog -- trimmed; see above]

I am unable to find the beginning of this thread, so please excuse me if
this has already been said, but: you did use ddrescue, didn't you?
Because when dd fails to read a block, it won't write that block. That's
why your superblock might have "moved" about 45 sectors towards the
beginning of the drive. ddrescue writes a block of zeros when reading a
block fails, so every readable sector gets copied to its corresponding
sector on the new drive.

Burkhard
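
The difference shows up in the copy command itself (a sketch, with
/dev/hdg as the failing source and /dev/hde as the new target, as in
this thread):

  # plain dd with conv=noerror alone skips unreadable blocks without
  # writing anything in their place, so all later data shifts; adding
  # sync pads the bad blocks with zeros and keeps every offset intact
  dd if=/dev/hdg of=/dev/hde bs=4k conv=noerror,sync

  # dd_rescue / GNU ddrescue also preserve offsets, retry bad areas in
  # smaller steps, and can record which areas could not be read
  dd_rescue /dev/hdg /dev/hde
  # or: ddrescue /dev/hdg /dev/hde /root/hdg-rescue.log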
* Re: Failed RAID-5 with 4 disks
  2005-09-17  9:31 ` Burkhard Carstens
@ 2005-09-17 16:46   ` Frank Blendinger
  0 siblings, 0 replies; 9+ messages in thread
From: Frank Blendinger @ 2005-09-17 16:46 UTC (permalink / raw)
To: linux-raid

On Sat, Sep 17, 2005 at 11:31:12AM +0200, Burkhard Carstens wrote:
> You did use ddrescue, didn't you? Because when dd fails to read a
> block, it won't write that block. That's why your superblock might
> have "moved" about 45 sectors towards the beginning of the drive.
> ddrescue writes a block of zeros when reading a block fails, so every
> readable sector gets copied to its corresponding sector on the new
> drive.

OK, thanks for the note. I am going to do a bad-block scan with SMART
and then try again with ddrescue. I hope this will restore the
superblock.

Greets,
Frank
end of thread, other threads:[~2005-09-17 16:46 UTC | newest]
Thread overview: 9+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2005-07-26 17:03 Failed RAID-5 with 4 disks Frank Blendinger
2005-07-26 21:58 ` Tyler
2005-07-26 22:36 ` Dan Stromberg
2005-07-26 23:12 ` Tyler
2005-09-16 11:36 ` Frank Blendinger
[not found] ` <432AFA95.3040709@h3c.com>
2005-09-16 19:09 ` Frank Blendinger
2005-09-16 19:52 ` Mike Hardy
2005-09-17 9:31 ` Burkhard Carstens
2005-09-17 16:46 ` Frank Blendinger