* Problem with 5disk RAID5 array - two drives lost
From: Tim Bostrom @ 2006-04-19 19:39 UTC
To: linux-raid
Good day,
I'm running FC4 (kernel 2.6.11-1.1369) with a 5-disk RAID5 array.
This past weekend, after rebooting the machine, /dev/md0 would no
longer mount; Fedora aborts the boot and forces me to fix the
filesystem. On further investigation, it looks like I lost two drives
within a few weeks of each other. I'll get this out of the way up
front: I'm an idiot and never set up mdadm -F (monitor mode) to mail
me about RAID problems.
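For reference, something along these lines is what I should have had
in place - a sketch only; the address is a placeholder and it assumes
a working local mail setup:
# echo "MAILADDR root@localhost" >> /etc/mdadm.conf
# mdadm --monitor --scan --daemonise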
It appears that /dev/hdf1 failed this past week and /dev/hdh1 failed
back in February. I tried a mdadm --assemble --force and was able to
get the following:
==========================
mdadm: forcing event count in /dev/hdf1(1) from 777532 upto 777535
mdadm: clearing FAULTY flag for device 2 in /dev/md0 for /dev/hdf1
raid5: raid level 5 set md0 active with 4 out of 5 devices, algorithm 2
mdadm: /dev/md0 has been started with 4 drives (out of 5).
==========================
I then tried to mount /dev/md0 and received the following:
====================
raid5: Disk failure on hdf1, disabling device. Operation continuing
on drives
mount: wrong fs type, bad option, bad superblock on /dev/md0,
missing codepage or other error
In some cases useful info is found in syslog - try dmesg | tail
=====================
In checking dmesg, I find:
==================================
raid5: device hde1 operational as raid disk 0
raid5: device hdc1 operational as raid disk 4
raid5: device hdg1 operational as raid disk 2
raid5: device hdf1 operational as raid disk 1
raid5: allocated 5254kB for md0
raid5: raid level 5 set md0 active with 4 out of 5 devices, algorithm 2
RAID5 conf printout:
--- rd:5 wd:4 fd:1
disk 0, o:1, dev:hde1
disk 1, o:1, dev:hdf1
disk 2, o:1, dev:hdg1
disk 4, o:1, dev:hdc1
usb 1-2: USB disconnect, address 2
usb 1-2: new full speed USB device using uhci_hcd and address 3
usb 1-2: not running at top speed; connect to a high speed hub
scsi1 : SCSI emulation for USB Mass Storage devices
usb-storage: device found at 3
usb-storage: waiting for device to settle before scanning
Vendor: SanDisk Model: Cruzer Mini Rev: 0.1
Type: Direct-Access ANSI SCSI revision: 02
SCSI device sda: 1000944 512-byte hdwr sectors (512 MB)
sda: Write Protect is off
sda: Mode Sense: 03 00 00 00
sda: assuming drive cache: write through
SCSI device sda: 1000944 512-byte hdwr sectors (512 MB)
sda: Write Protect is off
sda: Mode Sense: 03 00 00 00
sda: assuming drive cache: write through
sda: sda1
Attached scsi removable disk sda at scsi1, channel 0, id 0, lun 0
usb-storage: device scan complete
spurious 8259A interrupt: IRQ7.
hdf: dma_intr: status=0x51 { DriveReady SeekComplete Error }
hdf: dma_intr: error=0x40 { UncorrectableError }, LBAsect=6720,
high=0, low=6720, sector=6719
ide: failed opcode was: unknown
end_request: I/O error, dev hdf, sector 6719
raid5: Disk failure on hdf1, disabling device. Operation continuing
on 3 devices
hdf: dma_intr: status=0x51 { DriveReady SeekComplete Error }
hdf: dma_intr: error=0x40 { UncorrectableError }, LBAsect=6731,
high=0, low=6731, sector=6727
ide: failed opcode was: unknown
end_request: I/O error, dev hdf, sector 6727
hdf: dma_intr: status=0x51 { DriveReady SeekComplete Error }
hdf: dma_intr: error=0x40 { UncorrectableError }, LBAsect=6735,
high=0, low=6735, sector=6735
ide: failed opcode was: unknown
end_request: I/O error, dev hdf, sector 6735
hdf: dma_intr: status=0x51 { DriveReady SeekComplete Error }
hdf: dma_intr: error=0x40 { UncorrectableError }, LBAsect=6743,
high=0, low=6743, sector=6743
ide: failed opcode was: unknown
end_request: I/O error, dev hdf, sector 6743
hdf: dma_intr: status=0x51 { DriveReady SeekComplete Error }
hdf: dma_intr: error=0x40 { UncorrectableError }, LBAsect=6753,
high=0, low=6753, sector=6751
ide: failed opcode was: unknown
end_request: I/O error, dev hdf, sector 6751
hdf: dma_intr: status=0x51 { DriveReady SeekComplete Error }
hdf: dma_intr: error=0x40 { UncorrectableError }, LBAsect=6763,
high=0, low=6763, sector=6759
ide: failed opcode was: unknown
end_request: I/O error, dev hdf, sector 6759
hdf: dma_intr: status=0x51 { DriveReady SeekComplete Error }
hdf: dma_intr: error=0x40 { UncorrectableError }, LBAsect=6770,
high=0, low=6770, sector=6767
ide: failed opcode was: unknown
end_request: I/O error, dev hdf, sector 6767
hdf: dma_intr: status=0x51 { DriveReady SeekComplete Error }
hdf: dma_intr: error=0x40 { UncorrectableError }, LBAsect=6776,
high=0, low=6776, sector=6775
ide: failed opcode was: unknown
end_request: I/O error, dev hdf, sector 6775
hdf: dma_intr: status=0x51 { DriveReady SeekComplete Error }
hdf: dma_intr: error=0x40 { UncorrectableError }, LBAsect=6803,
high=0, low=6803, sector=6783
ide: failed opcode was: unknown
end_request: I/O error, dev hdf, sector 6783
hdf: dma_intr: status=0x51 { DriveReady SeekComplete Error }
hdf: dma_intr: error=0x40 { UncorrectableError }, LBAsect=6803,
high=0, low=6803, sector=6791
ide: failed opcode was: unknown
end_request: I/O error, dev hdf, sector 6791
JBD: Failed to read block at offset 1794
JBD: recovery failed
EXT3-fs: error loading journal.
hdf: dma_intr: status=0x51 { DriveReady SeekComplete Error }
hdf: dma_intr: error=0x40 { UncorrectableError }, LBAsect=6803,
high=0, low=6803, sector=6799
ide: failed opcode was: unknown
end_request: I/O error, dev hdf, sector 6799
Buffer I/O error on device md0, logical block 1604
lost page write due to I/O error on md0
hdf: dma_intr: status=0x51 { DriveReady SeekComplete Error }
hdf: dma_intr: error=0x40 { UncorrectableError }, LBAsect=6807,
high=0, low=6807, sector=6807
ide: failed opcode was: unknown
end_request: I/O error, dev hdf, sector 6807
hdf: dma_intr: status=0x51 { DriveReady SeekComplete Error }
hdf: dma_intr: error=0x40 { UncorrectableError }, LBAsect=6815,
high=0, low=6815, sector=6815
ide: failed opcode was: unknown
end_request: I/O error, dev hdf, sector 6815
hdf: dma_intr: status=0x51 { DriveReady SeekComplete Error }
hdf: dma_intr: error=0x40 { UncorrectableError }, LBAsect=6823,
high=0, low=6823, sector=6823
ide: failed opcode was: unknown
end_request: I/O error, dev hdf, sector 6823
hdf: dma_intr: status=0x51 { DriveReady SeekComplete Error }
hdf: dma_intr: error=0x40 { UncorrectableError }, LBAsect=6831,
high=0, low=6831, sector=6831
ide: failed opcode was: unknown
end_request: I/O error, dev hdf, sector 6831
hdf: dma_intr: status=0x51 { DriveReady SeekComplete Error }
hdf: dma_intr: error=0x40 { UncorrectableError }, LBAsect=6841,
high=0, low=6841, sector=6839
ide: failed opcode was: unknown
end_request: I/O error, dev hdf, sector 6839
hdf: dma_intr: status=0x51 { DriveReady SeekComplete Error }
hdf: dma_intr: error=0x40 { UncorrectableError }, LBAsect=6851,
high=0, low=6851, sector=6847
ide: failed opcode was: unknown
end_request: I/O error, dev hdf, sector 6847
RAID5 conf printout:
--- rd:5 wd:3 fd:2
disk 0, o:1, dev:hde1
disk 1, o:0, dev:hdf1
disk 2, o:1, dev:hdg1
disk 4, o:1, dev:hdc1
RAID5 conf printout:
--- rd:5 wd:3 fd:2
disk 0, o:1, dev:hde1
disk 2, o:1, dev:hdg1
disk 4, o:1, dev:hdc1
================================
I'm guessing /dev/hdf is shot. I haven't tried an fsck though.
Would this be advisable? I don't want to bork all the data. It's
about 700 GB of data. I'm open to losing any data that was added
since the February drive failure. Is there a way that I can try and
build the array again with /dev/hdh instead of /dev/hdf with some
possible data corruption on files that were added since Feb?
Any advice would be great. I'm at a loss and I don't want to lose
all of the data if I don't have to. I might end up visiting one of
those data recovery shops if I can't fix this on my own.
Thank you,
Tim
mdadm -E outputs below:
=================================
/dev/hdc1:
Magic : a92b4efc
Version : 00.90.01
UUID : 2d1d58c2:23357cca:12b8e65a:a80cdebe
Creation Time : Tue Jul 26 17:20:10 2005
Raid Level : raid5
Raid Devices : 5
Total Devices : 4
Preferred Minor : 0
Update Time : Sun Apr 16 09:10:28 2006
State : clean
Active Devices : 3
Working Devices : 3
Failed Devices : 3
Spare Devices : 0
Checksum : 4a150769 - correct
Events : 0.777535
Layout : left-symmetric
Chunk Size : 128K
        Number   Major   Minor   RaidDevice   State
  this     4       22       1        4        active sync   /dev/hdc1
     0     0       33       1        0        active sync   /dev/hde1
     1     1        0       0        1        faulty removed
     2     2       34       1        2        active sync   /dev/hdg1
     3     3        0       0        3        faulty removed
     4     4       22       1        4        active sync   /dev/hdc1
/dev/hde1:
Magic : a92b4efc
Version : 00.90.01
UUID : 2d1d58c2:23357cca:12b8e65a:a80cdebe
Creation Time : Tue Jul 26 17:20:10 2005
Raid Level : raid5
Raid Devices : 5
Total Devices : 4
Preferred Minor : 0
Update Time : Sun Apr 16 09:10:28 2006
State : clean
Active Devices : 3
Working Devices : 3
Failed Devices : 3
Spare Devices : 0
Checksum : 4a15076c - correct
Events : 0.777535
Layout : left-symmetric
Chunk Size : 128K
        Number   Major   Minor   RaidDevice   State
  this     0       33       1        0        active sync   /dev/hde1
     0     0       33       1        0        active sync   /dev/hde1
     1     1        0       0        1        faulty removed
     2     2       34       1        2        active sync   /dev/hdg1
     3     3        0       0        3        faulty removed
     4     4       22       1        4        active sync   /dev/hdc1
/dev/hdf1:
Magic : a92b4efc
Version : 00.90.01
UUID : 2d1d58c2:23357cca:12b8e65a:a80cdebe
Creation Time : Tue Jul 26 17:20:10 2005
Raid Level : raid5
Raid Devices : 5
Total Devices : 5
Preferred Minor : 0
Update Time : Fri Apr 14 13:46:06 2006
State : clean
Active Devices : 4
Working Devices : 4
Failed Devices : 2
Spare Devices : 0
Checksum : 4a06c868 - correct
Events : 0.777532
Layout : left-symmetric
Chunk Size : 128K
        Number   Major   Minor   RaidDevice   State
  this     1       33      65        1        active sync   /dev/hdf1
     0     0       33       1        0        active sync   /dev/hde1
     1     1       33      65        1        active sync   /dev/hdf1
     2     2       34       1        2        active sync   /dev/hdg1
     3     3        0       0        3        faulty removed
     4     4       22       1        4        active sync   /dev/hdc1
/dev/hdh1:
Magic : a92b4efc
Version : 00.90.01
UUID : 2d1d58c2:23357cca:12b8e65a:a80cdebe
Creation Time : Tue Jul 26 17:20:10 2005
Raid Level : raid5
Raid Devices : 5
Total Devices : 5
Preferred Minor : 0
Update Time : Tue Feb 21 07:47:51 2006
State : active
Active Devices : 5
Working Devices : 5
Failed Devices : 0
Spare Devices : 0
Checksum : 49c0be2c - correct
Events : 0.698097
Layout : left-symmetric
Chunk Size : 128K
        Number   Major   Minor   RaidDevice   State
  this     3       34      65        3        active sync   /dev/hdh1
     0     0       33       1        0        active sync   /dev/hde1
     1     1       33      65        1        active sync   /dev/hdf1
     2     2       34       1        2        active sync   /dev/hdg1
     3     3       34      65        3        active sync   /dev/hdh1
     4     4       22       1        4        active sync   /dev/hdc1
/dev/hdg1:
Magic : a92b4efc
Version : 00.90.01
UUID : 2d1d58c2:23357cca:12b8e65a:a80cdebe
Creation Time : Tue Jul 26 17:20:10 2005
Raid Level : raid5
Raid Devices : 5
Total Devices : 4
Preferred Minor : 0
Update Time : Sun Apr 16 09:10:28 2006
State : clean
Active Devices : 3
Working Devices : 3
Failed Devices : 3
Spare Devices : 0
Checksum : 4a150771 - correct
Events : 0.777535
Layout : left-symmetric
Chunk Size : 128K
        Number   Major   Minor   RaidDevice   State
  this     2       34       1        2        active sync   /dev/hdg1
     0     0       33       1        0        active sync   /dev/hde1
     1     1        0       0        1        faulty removed
     2     2       34       1        2        active sync   /dev/hdg1
     3     3        0       0        3        faulty removed
     4     4       22       1        4        active sync   /dev/hdc1
* Re: Problem with 5disk RAID5 array - two drives lost
From: Molle Bestefich @ 2006-04-22 3:54 UTC
To: Tim Bostrom; +Cc: linux-raid
Tim Bostrom wrote:
> It appears that /dev/hdf1 failed this past week and /dev/hdh1 failed back in February.
An obvious question would be, how much have you been altering the
contents of the array since February?
> I tried a mdadm --assemble --force and was able to get the following:
> ==========================
> mdadm: forcing event count in /dev/hdf1(1) from 777532 upto 777535
> mdadm: clearing FAULTY flag for device 2 in /dev/md0 for /dev/hdf1
> raid5: raid level 5 set md0 active with 4 out of 5 devices, algorithm 2
> mdadm: /dev/md0 has been started with 4 drives (out of 5).
> ==========================
Looks good.
> I then tried to mount /dev/md0
A bit premature, I'd say.
> ====================
> raid5: Disk failure on hdf1, disabling device.
MD doesn't like to find errors when it's rebuilding.
It will kick that disk off the array, which will cause MD to return
crap (instead of stopping the array and removing the device - I
wonder), again causing 'mount' etc. to fail.
Quite unfortunate for you, since you have absolutely no redundancy
with 4/5 drives, and you really can't afford to have the 4th disk
kicked just because there's a bad block on it.
This is something that MD could probably handle much better than it does now.
In your case, you probably want to try to reconstruct from all 5
disks, but without losing the information in their event counters -
you want MD to use as much data as it can from the 4 fresh disks
(assuming they're at least 99% readable), and only when there's a
rare bad block on one of them should it fall back to data from the 5th.
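For what it's worth, you can compare what the superblocks think
happened, and when, with a couple of greps over the -E output you
already posted - something like:
# mdadm --examine /dev/hd[cefgh]1 | grep -E '^/dev/|Events|Update Time'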
Seeing as
1) MD doesn't automatically check your array unless you ask it to, and
2) modern disks have a habit of developing lots of bad blocks,
it would be very nice if MD could help out in this kind of situation.
Unfortunately the implementation is tricky as I see it, and currently
MD can do no such thing.
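It can at least be told to check on a schedule, though - a rough
sketch of a cron entry using the sysfs knob I mention further down,
assuming your kernel is new enough to expose it:
# cat /etc/cron.d/md-check
30 2 1 * * root echo check > /sys/block/md0/md/sync_action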
> spurious 8259A interrupt: IRQ7.
Oops.
I'd look into that; I think it's a known bug.
(Then again, maybe it's just the IDE drivers - I've experienced really
bad IRQ handling both with old-style IDE and with libata.)
> hdf: dma_intr: error=0x40 { UncorrectableError }, LBAsect=6720
Hey, it's telling you where your data used to be. Cute.
> raid5: Disk failure on hdf1, disabling device.
> Operation continuing on 3 devices
Haha! Real bright there, MD, continuing raid5 operation with 3/5 devices.
Still not a bug, eh? :-)
*poke, poke*
> I'm guessing /dev/hdf is shot.
Actually, there's a lot of sequential sector numbers in the output you posted.
I think it's unusual for a drive to develop that many bad blocks in a row.
I could be wrong, and it could be a head crash or something (have you
been moving the system around much?).
But if I had to guess, I'd say that there's a real likelihood that
it's a loose cable or a controller problem or a driver issue.
Could you try and run:
# dd if=/dev/hdf of=/dev/null bs=1M count=100 skip=1234567
You can play around with different random numbers instead of 1234567.
If it craps out *immediately*, then I'd say it's a cable problem or
so, and not a problem with what's on the platters.
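Or the same idea in a loop, if you want to sample a handful of spots
in one go - a quick sketch, the offsets are arbitrary:
# for skip in 1000 50000 120000 200000 238000; do
    dd if=/dev/hdf of=/dev/null bs=1M count=10 skip=$skip || echo "read failed around ${skip} MB"
  done
Then watch dmesg while it runs.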
> I haven't tried an fsck though.
> Would this be advisable?
No, get the array running first, then fix the filesystem.
You can initiate array checks and repairs like this:
# cd /sys/block/md0/md/
# echo check > sync_action
or
# echo repair > sync_action
Or something like that.
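If you run a 'check', the number of inconsistent stripes it found
should show up afterwards in the mismatch count - again assuming your
kernel exposes these files:
# cat /sys/block/md0/md/mismatch_cnt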
> Is there a way that I can try and build the array again with /dev/hdh
> instead of /dev/hdf with some possible data corruption on files that
> were added since Feb?
Let's first see if we can't get hdf online.
* Re: Problem with 5disk RAID5 array - two drives lost
From: Carlos Carvalho @ 2006-04-22 19:42 UTC
To: linux-raid
Molle Bestefich (molle.bestefich@gmail.com) wrote on 22 April 2006 05:54:
>Tim Bostrom wrote:
>> raid5: Disk failure on hdf1, disabling device.
>
>MD doesn't like to find errors when it's rebuilding.
>It will kick that disk off the array, which will cause MD to return
>crap (instead of stopping the array and removing the device - I
>wonder), again causing 'mount' etc. to fail.
>
>Quite unfortunate for you, since you have absolutely no redundancy
>with 4/5 drives, and you really can't afford to have the 4th disk
>kicked just because there's a bad block on it.
Yes...
As Molle says, you have a chance that it's a driver/cable problem.
What you can also do is dd the disk to another one and try to rebuild
the array with the new disk so that you won't get errors during the
reconstruction. If you get errors during the copy you'll have to
decide what to do with the bad blocks. Some people prefer to use
ddrescue instead of dd; I've never tried it.
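If you go the plain-dd route, something like this (a sketch - hdX is
whatever the new disk shows up as; conv=noerror,sync skips unreadable
blocks and pads them with zeros so the offsets stay lined up, and a
smallish block size limits how much gets zeroed per error):
# dd if=/dev/hdf of=/dev/hdX bs=4k conv=noerror,sync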
* Re: Problem with 5disk RAID5 array - two drives lost
From: Molle Bestefich @ 2006-04-22 19:52 UTC
To: linux-raid
Carlos Carvalho wrote:
> What you can also do is dd the disk to another one and try to rebuild
> the array with the new disk so that you won't get errors during the
> reconstruction.
Right, neat hack.
> Some people prefer to use ddrescue instead of dd; I've never tried it.
I can definitely recommend dd_rescue/dd-rescue/ddrescue.
Much better than dd for this kind of thing.
Very delicious. Yummy yummy.
* Re: Problem with 5disk RAID5 array - two drives lost
From: David Greaves @ 2006-04-22 19:54 UTC
To: Carlos Carvalho; +Cc: linux-raid, tbostrom
Carlos Carvalho wrote:
> Molle Bestefich (molle.bestefich@gmail.com) wrote on 22 April 2006 05:54:
> >Tim Bostrom wrote:
> >> raid5: Disk failure on hdf1, disabling device.
> >
> >MD doesn't like to find errors when it's rebuilding.
> >It will kick that disk off the array, which will cause MD to return
> >crap (instead of stopping the array and removing the device - I
> >wonder), again causing 'mount' etc. to fail.
> >
> >Quite unfortunate for you, since you have absolutely no redundancy
> >with 4/5 drives, and you really can't afford to have the 4th disk
> >kicked just because there's a bad block on it.
>
> Yes...
>
> As Molle says, you have a chance that it's a driver/cable problem.
> What you can also do is dd the disk to another one and try to rebuild
> the array with the new disk so that you won't get errors during the
> reconstruction. If you get errors during the copy you'll have to
> decide what to do with the bad blocks. Some people prefer to use
> ddrescue instead of dd; I've never tried it.
I've used ddrescue and would *highly* recommend it. Use the GNU
version, not the other one (dd_rescue?).
It handles errors very well indeed and has a good display that shows
what's happening.
It also seems faster than dd (possibly threaded, so it streams both
drives rather than reading one drive, then writing the other).
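A typical invocation is something like this (a sketch - hdX is the
new disk, and the logfile is what lets you stop and resume):
# ddrescue -f -n /dev/hdf /dev/hdX rescue.log
# ddrescue -f -r3 /dev/hdf /dev/hdX rescue.log
The first pass skips the slow per-sector retries; the second goes
back over the bad areas.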
David
* Re: Problem with 5disk RAID5 array - two drives lost
From: Tim Bostrom @ 2006-04-24 0:17 UTC
To: Molle Bestefich; +Cc: linux-raid
First let me say - thank you for responding. I'm still trying to
figure out this problem.
On Apr 21, 2006, at 8:54 PM, Molle Bestefich wrote:
> Tim Bostrom wrote:
>> It appears that /dev/hdf1 failed this past week and /dev/hdh1
>> failed back in February.
>
> An obvious question would be, how much have you been altering the
> contents of the array since February?
>
This is a video backup drive for my MythTV system - I just back up
old TV shows and movies from that system. There's maybe 3-4 GB of
data that's been stored there since February, and no other data has
been moved or deleted. Pretty much once something is backed up here,
it stays. I'm willing to lose the February-April data.
>
>> hdf: dma_intr: error=0x40 { UncorrectableError }, LBAsect=6720
>
>
>
> Actually, there's a lot of sequential sector numbers in the output
> you posted.
> I think it's unusual for a drive to develop that many bad blocks in
> a row.
> I could be wrong, and it could be a head crash or something (have you
> been moving the system around much?).
>
> But if I had to guess, I'd say that there's a real likelihood that
> it's a loose cable or a controller problem or a driver issue.
>
> Could you try and run:
> # dd if=/dev/hdf of=/dev/null bs=1M count=100 skip=1234567
> You can play around with different random numbers instead of 1234567.
> If it craps out *immediately*, then I'd say it's a cable problem or
> so, and not a problem with what's on the platters.
>
Tried running this. It doesn't crap out immediately - it goes along,
but while it runs I see a bunch of the same {UncorrectableError}
{DriveReady SeekComplete} errors in dmesg that I posted before.
I bought two extra 250GB drives - I'll try using dd_rescue as
recommended and see if I can get a "good" copy of hdf online.
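Probably just something on the order of this (hdX standing in for
whichever device the new drive comes up as):
# dd_rescue -v /dev/hdf /dev/hdX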
>
> No, get the array running first, then fix the filesystem.
>
> You can initiate array checks and repairs like this:
> # cd /sys/block/md0/md/
> # echo check > sync_action
> or
> # echo repair > sync_action
>
>
* Re: Problem with 5disk RAID5 array - two drives lost
From: Arthur Britto @ 2006-04-24 2:00 UTC
To: Tim Bostrom; +Cc: linux-raid
On Sun, 2006-04-23 at 17:17 -0700, Tim Bostrom wrote:
> I bought two extra 250GB drives - I'll try using dd_rescue as
> recommended and see if I can get a "good" copy of hdf online.
You might want to use dd_rhelp:
http://www.kalysto.org/utilities/dd_rhelp/index.en.html
-Arthur
* Re: Problem with 5disk RAID5 array - two drives lost
From: David Greaves @ 2006-04-24 14:01 UTC
To: Arthur Britto; +Cc: Tim Bostrom, linux-raid
Arthur Britto wrote:
> On Sun, 2006-04-23 at 17:17 -0700, Tim Bostrom wrote:
>> I bought two extra 250GB drives - I'll try using dd_rescue as
>> recommended and see if I can get a "good" copy of hdf online.
>
> You might want to use dd_rhelp:
> http://www.kalysto.org/utilities/dd_rhelp/index.en.html
Having used both dd_rescue/dd_rhelp and the gnu ddrescue in anger, I'd
suggest gnu ddrescue.
http://www.gnu.org/software/ddrescue/ddrescue.html
David
--
* Re: Problem with 5disk RAID5 array - two drives lost
From: Tim Bostrom @ 2006-04-25 14:55 UTC
To: David Greaves; +Cc: Arthur Britto, linux-raid
OK, so 952 errors (about 450k) and 25+ hours later, I have a copy of
the hdf drive on a brand new 250GB drive thanks to dd_rescue.
I haven't tried swapping it to the array. That's the next step. I
imagine, I'll be able to mdadm --assemble --force and have it take
the 4 drives into the array. Dare I mount it after this or is there
some troubleshooting I should do before a mount?
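Roughly this, I assume, once the copy is sitting where hdf used to
be (otherwise I'd substitute whatever name it ends up with):
# mdadm --assemble --force /dev/md0 /dev/hde1 /dev/hdf1 /dev/hdg1 /dev/hdc1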
-Tim
On Apr 24, 2006, at 7:01 AM, David Greaves wrote:
> Arthur Britto wrote:
>> On Sun, 2006-04-23 at 17:17 -0700, Tim Bostrom wrote:
>>> I bought two extra 250GB drives - I'll try using dd_rescue as
>>> recommended and see if I can get a "good" copy of hdf online.
>>
>> You might want to use dd_rhelp:
>> http://www.kalysto.org/utilities/dd_rhelp/index.en.html
> Having used both dd_rescue/dd_rhelp and the gnu ddrescue in anger, I'd
> suggest gnu ddrescue.
>
> http://www.gnu.org/software/ddrescue/ddrescue.html
>
> David
>
* Re: Problem with 5disk RAID5 array - two drives lost
From: Tim Bostrom @ 2006-04-26 6:19 UTC
To: Tim Bostrom; +Cc: David Greaves, Arthur Britto, linux-raid
I just wanted to thank all of you who helped me with this problem.
dd_rescue was the ticket. I booted Knoppix and used dd_rescue to copy
the entire /dev/hdf drive to a brand-new drive; it took almost 36
hours to copy 250GB. After that, I replaced hdf with the new drive
and rebooted the machine back into FC4. I then ran mdadm --assemble
--force and mounted the array. All the data seems to be there (or at
least the data I cared about). Next, I installed a new drive to
replace the first failed drive (/dev/hdh) and started the rebuild.
Everything seems OK for now.
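For the archives, the whole thing boiled down to roughly this (the
device names and mount point below are placeholders from memory - hdX
was the dd_rescue target that then took hdf's place, hdY the new disk
standing in for the old hdh):
# dd_rescue /dev/hdf /dev/hdX
# mdadm --assemble --force /dev/md0 /dev/hde1 /dev/hdf1 /dev/hdg1 /dev/hdc1
# mount /dev/md0 /mnt/backup
# mdadm /dev/md0 --add /dev/hdY1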
Thank you all, again.
On Apr 25, 2006, at 7:55 AM, Tim Bostrom wrote:
> OK, so 952 errors (about 450k) and 25+ hours later, I have a copy
> of the hdf drive on a brand new 250GB drive thanks to dd_rescue.
>
> I haven't tried swapping it to the array. That's the next step. I
> imagine, I'll be able to mdadm --assemble --force and have it take
> the 4 drives into the array. Dare I mount it after this or is
> there some troubleshooting I should do before a mount?
>
> -Tim
>
>
>
> On Apr 24, 2006, at 7:01 AM, David Greaves wrote:
>
>> Arthur Britto wrote:
>>> On Sun, 2006-04-23 at 17:17 -0700, Tim Bostrom wrote:
>>>> I bought two extra 250GB drives - I'll try using dd_rescue as
>>>> recommended and see if I can get a "good" copy of hdf online.
>>>
>>> You might want to use dd_rhelp:
>>> http://www.kalysto.org/utilities/dd_rhelp/index.en.html
>> Having used both dd_rescue/dd_rhelp and the gnu ddrescue in anger,
>> I'd
>> suggest gnu ddrescue.
>>
>> http://www.gnu.org/software/ddrescue/ddrescue.html
>>
>> David
>>