* Failed RAID-5 with 4 disks
@ 2005-07-26 17:03 Frank Blendinger
2005-07-26 21:58 ` Tyler
0 siblings, 1 reply; 9+ messages in thread
From: Frank Blendinger @ 2005-07-26 17:03 UTC (permalink / raw)
To: linux-raid
Hi,
I have a RAID-5 set up with the following raidtab:
raiddev /dev/md0
    raid-level            5
    nr-raid-disks         4
    nr-spare-disks        0
    persistent-superblock 1
    parity-algorithm      left-symmetric
    chunk-size            256
    device                /dev/hde
    raid-disk             0
    device                /dev/hdg
    raid-disk             1
    device                /dev/hdi
    raid-disk             2
    device                /dev/hdk
    raid-disk             3
My hde failed some time ago, leaving some
hde: dma_intr: status=0x51 { DriveReady SeekComplete Error }
hde: dma_intr: error=0x84 { DriveStatusError BadCRC }
messages in the syslog.
I wanted to make sure it really was damaged, so I did a badblocks
(read-only) scan on /dev/hde. It did indeed find a bad sector on the
disk.
I wanted to take the disk out and replace it with a new one, but
unfortunately my hdg now seems to have run into trouble as well. I see
the same SeekComplete/BadCRC errors in my log for that disk.
Furthermore, I got this:
Jul 25 10:35:49 blackbox kernel: ide: failed opcode was: unknown
Jul 25 10:35:49 blackbox kernel: hdg: DMA disabled
Jul 25 10:35:49 blackbox kernel: PDC202XX: Secondary channel reset.
Jul 25 10:35:49 blackbox kernel: PDC202XX: Primary channel reset.
Jul 25 10:35:49 blackbox kernel: hde: lost interrupt
Jul 25 10:35:49 blackbox kernel: ide3: reset: master: error (0x00?)
Jul 25 10:35:49 blackbox kernel: end_request: I/O error, dev hdg, sector 488396928
Jul 25 10:35:49 blackbox kernel: end_request: I/O error, dev hdg, sector 159368976
Jul 25 10:35:49 blackbox kernel: end_request: I/O error, dev hdg, sector 159368984
Jul 25 10:35:49 blackbox kernel: end_request: I/O error, dev hdg, sector 159368992
Jul 25 10:35:49 blackbox kernel: end_request: I/O error, dev hdg, sector 159369000
Jul 25 10:35:49 blackbox kernel: end_request: I/O error, dev hdg, sector 159369008
Jul 25 10:35:49 blackbox kernel: end_request: I/O error, dev hdg, sector 159369016
Jul 25 10:35:49 blackbox kernel: end_request: I/O error, dev hdg, sector 159369024
Jul 25 10:35:49 blackbox kernel: end_request: I/O error, dev hdg, sector 159369032
Jul 25 10:35:49 blackbox kernel: md: write_disk_sb failed for device hdg
Jul 25 10:35:49 blackbox kernel: lost page write due to I/O error on md0
Jul 25 10:35:49 blackbox kernel: lost page write due to I/O error on md0
Jul 25 10:35:49 blackbox kernel: RAID5 conf printout:
Jul 25 10:35:49 blackbox kernel: --- rd:4 wd:2 fd:2
Jul 25 10:35:49 blackbox kernel: disk 0, o:1, dev:hdk
Jul 25 10:35:49 blackbox kernel: disk 1, o:1, dev:hdi
Jul 25 10:35:49 blackbox kernel: disk 2, o:0, dev:hdg
Jul 25 10:35:49 blackbox kernel: RAID5 conf printout:
Jul 25 10:35:49 blackbox kernel: --- rd:4 wd:2 fd:2
Jul 25 10:35:49 blackbox kernel: disk 0, o:1, dev:hdk
Jul 25 10:35:49 blackbox kernel: disk 1, o:1, dev:hdi
Jul 25 10:35:49 blackbox kernel: lost page write due to I/O error on md0
Well, now it seems I have two failed disks in my RAID-5, which of course
would be fatal. I am still hoping to somehow rescue the data on the
array, but I am not sure what the best approach would be. I don't want
to cause any more damage.
When booting my system with all four disks connected, hde and hdg as
expected won't get added:
Jul 26 18:07:59 blackbox kernel: md: hdg has invalid sb, not importing!
Jul 26 18:07:59 blackbox kernel: md: autorun ...
Jul 26 18:07:59 blackbox kernel: md: considering hdi ...
Jul 26 18:07:59 blackbox kernel: md: adding hdi ...
Jul 26 18:07:59 blackbox kernel: md: adding hdk ...
Jul 26 18:07:59 blackbox kernel: md: adding hde ...
Jul 26 18:07:59 blackbox kernel: md: created md0
Jul 26 18:07:59 blackbox kernel: md: bind<hde>
Jul 26 18:07:59 blackbox kernel: md: bind<hdk>
Jul 26 18:07:59 blackbox kernel: md: bind<hdi>
Jul 26 18:07:59 blackbox kernel: md: running: <hdi><hdk><hde>
Jul 26 18:07:59 blackbox kernel: md: kicking non-fresh hde from array!
Jul 26 18:07:59 blackbox kernel: md: unbind<hde>
Jul 26 18:07:59 blackbox kernel: md: export_rdev(hde)
Jul 26 18:07:59 blackbox kernel: raid5: device hdi operational as raid disk 1
Jul 26 18:07:59 blackbox kernel: raid5: device hdk operational as raid disk 0
Jul 26 18:07:59 blackbox kernel: RAID5 conf printout:
Jul 26 18:07:59 blackbox kernel: --- rd:4 wd:2 fd:2
Jul 26 18:07:59 blackbox kernel: disk 0, o:1, dev:hdk
Jul 26 18:07:59 blackbox kernel: disk 1, o:1, dev:hdi
Jul 26 18:07:59 blackbox kernel: md: do_md_run() returned -22
Jul 26 18:07:59 blackbox kernel: md: md0 stopped.
Jul 26 18:07:59 blackbox kernel: md: unbind<hdi>
Jul 26 18:07:59 blackbox kernel: md: export_rdev(hdi)
Jul 26 18:07:59 blackbox kernel: md: unbind<hdk>
Jul 26 18:07:59 blackbox kernel: md: export_rdev(hdk)
Jul 26 18:07:59 blackbox kernel: md: ... autorun DONE.
So hde is not fresh (it has been removed from the array for quite some
time now) and hdg has an invalid superblock.
Any advice on what I should do now? Would it be better to try to rebuild
the array with hde or with hdg?
Greetings,
Frank
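
As a first step, the state of the array and the suspect drives can be
captured with something like the following (a sketch, not commands from
the thread; it uses the device names from the raidtab above and assumes
smartmontools is installed):

  # the kernel's current view of the md arrays
  cat /proc/mdstat

  # examine the md superblock on each member; the event counters and
  # update times show which disks md still considers fresh
  mdadm -E /dev/hde /dev/hdg /dev/hdi /dev/hdk

  # read-only SMART health summary for the two suspect drives
  smartctl -a /dev/hde
  smartctl -a /dev/hdg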
* Re: Failed RAID-5 with 4 disks
  2005-07-26 17:03 Failed RAID-5 with 4 disks Frank Blendinger
@ 2005-07-26 21:58 ` Tyler
  2005-07-26 22:36   ` Dan Stromberg
                     ` (2 more replies)
  0 siblings, 3 replies; 9+ messages in thread
From: Tyler @ 2005-07-26 21:58 UTC (permalink / raw)
To: Frank Blendinger; +Cc: linux-raid

My suggestion would be to buy two new drives, and DD (or dd rescue) the
two bad drives onto the new drives, and then plug the new drive that has
the most recent failure on it (HDG?) in, and try running a forced
assemble including the HDG drive, then, in readonly mode, run an fsck to
check the file system, and see if it thinks most things are okay. *IF*
it checks out okay (for the most part.. you will probably lose some
data), then plug the second new disk in, and add it to the array as a
spare, and it would then start a resync of the array. Otherwise, if the
fsck found that the entire filesystem was fubar... then I would try the
above steps, but force the assemble with the original failed disk.. but
depending on how long in between the two failures its been, and if any
data was written to the array after the first failure, this is probably
not going to be a good thing.. but could still be useful if you were
trying to recover specific files that were not touched in between the
two failures.

I would also suggest googling raid manual recovery procedures, some info
is outdated, but some of it describes what I just described above.

Tyler.

Frank Blendinger wrote:
> [original message quoted in full -- trimmed; see above]
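
Put as commands, the sequence Tyler describes might look roughly like
this (a sketch only, not taken from the thread: it assumes the copies
are installed in the failed drives' positions so they reappear as
/dev/hde and /dev/hdg, and that dd_rescue is available):

  # 1. Copy each failing drive onto a new drive of at least the same
  #    size; dd_rescue keeps every readable sector at its original
  #    offset even when it hits read errors.
  dd_rescue /dev/hdg /dev/hdX   # copy of the most recent failure
  dd_rescue /dev/hde /dev/hdY   # copy of the first failure, as fallback
  # (/dev/hdX and /dev/hdY are placeholders for the new blank drives)

  # 2. With the copy of hdg installed in hdg's place, force-assemble
  #    from it plus the two good members.
  mdadm --assemble --force /dev/md0 /dev/hdg /dev/hdi /dev/hdk

  # 3. Check the filesystem without writing anything to it.
  fsck -n /dev/md0

  # 4. If it looks mostly intact, add the remaining new disk and let md
  #    resync onto it.
  mdadm /dev/md0 --add /dev/hde

The -n flag keeps fsck read-only, so if the forced assembly turns out to
be the wrong choice, the other copy can still be tried.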
* Re: Failed RAID-5 with 4 disks
  2005-07-26 21:58 ` Tyler
@ 2005-07-26 22:36   ` Dan Stromberg
  2005-07-26 23:12   ` Tyler
  2005-09-16 11:36   ` Frank Blendinger
  2 siblings, 0 replies; 9+ messages in thread
From: Dan Stromberg @ 2005-07-26 22:36 UTC (permalink / raw)
To: Tyler; +Cc: Frank Blendinger, linux-raid, strombrg

It seems theoretically possible, but might take some semi-substantial
coding effort, to build a program that would sort of combine fsck and
RAID information to get back as much as is feasible.

For example, this program might track down all the inodes it can, and
perhaps even hunt for inode magic numbers to augment the inode list
(using fuzzy logic - if the inode points to a reasonable series of
blocks, for example, has a possible file length, and the files are owned
by an existing user, each of these strengthens the program's belief that
it has heuristically found a real inode), and then look down the block
pointers in the inodes for blocks that are known still good, ignoring
the ones that aren't.

I suppose then you'd need some sort of table, built out of lots of range
arithmetic, indicating that N files were 100% recovered, and M files
were recovered only in ranges such and so.

Might actually be kind of a fun project once you get really immersed
into it.

HTH. :)

On Tue, 2005-07-26 at 14:58 -0700, Tyler wrote:
> [Tyler's suggestion and the quoted original message trimmed; see above]
* Re: Failed RAID-5 with 4 disks
  2005-07-26 21:58 ` Tyler
  2005-07-26 22:36   ` Dan Stromberg
@ 2005-07-26 23:12   ` Tyler
  2005-09-16 11:36   ` Frank Blendinger
  2 siblings, 0 replies; 9+ messages in thread
From: Tyler @ 2005-07-26 23:12 UTC (permalink / raw)
Cc: Frank Blendinger, linux-raid

I had a typo in my original email, near the end, where I say "was
fubar... then I would try the above steps, but force the assemble with
the original failed disk". I actually meant to say "but force the
assemble with the newly DD'd copy of the original/first drive that
failed."

Tyler.

Tyler wrote:
> [previous message quoted in full -- trimmed; see above]
* Re: Failed RAID-5 with 4 disks
  2005-07-26 21:58 ` Tyler
  2005-07-26 22:36   ` Dan Stromberg
  2005-07-26 23:12   ` Tyler
@ 2005-09-16 11:36   ` Frank Blendinger
  [not found]         ` <432AFA95.3040709@h3c.com>
  2 siblings, 1 reply; 9+ messages in thread
From: Frank Blendinger @ 2005-09-16 11:36 UTC (permalink / raw)
To: linux-raid

On Tue, Jul 26, 2005 at 02:58:43PM -0700, Tyler wrote:
> [suggestion to dd/dd_rescue the two bad drives onto new disks and
> force-assemble from the copies -- trimmed; see above]

Thanks for your suggestions. This is what I have done so far: I got one
of the two bad drives (the one that failed first) replaced with a new
one. I copied the other bad drive to the new one with dd. I suspect not
everything could be copied correctly: I got 10 "Buffer I/O error on
device hdg, logical sector ..." and about 35 "end_request: I/O error,
hdg, sector ..." error messages in my syslog.

Now I'm stuck re-activating the array with the dd'ed hde and the working
hdi and hdk. I tried "mdadm --assemble --scan /dev/md0", which told me
"mdadm: /dev/md0 assembled from 2 drives - not enough to start the
array."

I then tried hot-adding hde with "mdadm --add /dev/hde [--force]
/dev/md0", but that only got me "mdadm: /dev/hde does not appear to be
an md device".

Any suggestions?

Greets,
Frank
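
Two quick checks apply at this point (a sketch, using the same device
names as in the thread): whether the copied disk actually carries an md
superblock, and the argument order mdadm expects in manage mode:

  mdadm -E /dev/hde     # superblock present on the dd'ed copy?
  mdadm -E /dev/hdi     # compare with a member that still assembles fine

  # manage mode takes the array first, then the disk, and only works
  # once /dev/md0 is actually running:
  mdadm /dev/md0 --add /dev/hde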
[parent not found: <432AFA95.3040709@h3c.com>]
* Re: Failed RAID-5 with 4 disks
  [not found]         ` <432AFA95.3040709@h3c.com>
@ 2005-09-16 19:09     ` Frank Blendinger
  2005-09-16 19:52       ` Mike Hardy
  2005-09-17  9:31       ` Burkhard Carstens
  0 siblings, 2 replies; 9+ messages in thread
From: Frank Blendinger @ 2005-09-16 19:09 UTC (permalink / raw)
To: linux-raid

On Fri, Sep 16, 2005 at 10:02:13AM -0700, Mike Hardy wrote:
> Frank Blendinger wrote:
> > [description of the dd copy and the failed "mdadm --assemble --scan"
> > attempt -- trimmed; see above]
>
> Does an mdadm -E on the dd'd /dev/hde show that it has the superblock
> and knows about the array? That would confirm that it has the data and
> is ready to go.

mdadm -E /dev/hde tells me:
mdadm: No super block found on /dev/hde (Expected magic a92b4efc, got
00000000)

> Given that, you want to do a version of the create/assemble that
> forcibly uses all three drives, even though one of them is out of date
> from the raid set's perspective.
>
> I believe its possible to issue a create line that has a 'missing' entry
> for the drive missing, but order is important. Luckily since one drive
> is missing, md won't resync or anything so you should get multiple tries.
>
> Something like 'mdadm --create --force -level 5 -n 4 /dev/md0 /dev/hda
> /dev/hdb /dev/hde missing' is what you're looking for.
>
> Obviously I don't know what your disk names are, so put the correct ones
> in there, not the ones I used. If you don't get a valid raid set from
> that, you could try moving the order around.

OK, sounds good. I tried this:

$ mdadm --create --force --level 5 -n 4 /dev/md0 /dev/hdi /dev/hdk /dev/hde missing
mdadm: /dev/hdi appears to be part of a raid array:
    level=5 devices=4 ctime=Mon Apr 18 21:05:23 2005
mdadm: /dev/hdk appears to contain an ext2fs file system
    size=732595200K  mtime=Sun Jul 24 03:08:46 2005
mdadm: /dev/hdk appears to be part of a raid array:
    level=5 devices=4 ctime=Mon Apr 18 21:05:23 2005
Continue creating array?

I'm not quite sure about the output: hdk gets listed twice (once falsely
as an ext2 filesystem) and hde (this is the dd'ed disk) not at all.
Should I continue here?

> Each time you do that, you'll be creating a brand new raid set with new
> superblocks, but the layout will hopefully match, and it won't update
> the data because a drive is missing. After the raid is created the right
> way, you should find your data.
>
> Then you can hot-add a new drive to the array, to get your redundancy
> back. I'd definitely use smartctl -t long on /dev/hdg to find the blocks
> that are bad, and use the BadBlockHowTo (google for that) so you can
> clear the bad blocks.

Of course I don't want the second broken hard drive as spare. I just
used it to dd its content to the new disk. I am going to get a new drive
for the second failed one once I have the array back up and running
(without redundancy).

Should I check for bad blocks on hdg and then repeat the dd to the new
disk?

> Alternatively you could forcibly assemble the array as it was with
> Neil's new faulty-read-correction patch, and the blocks will probably
> get auto-cleared.

I am still using mdadm 1.9.0 (the package that came with Debian sarge).
Would you suggest manually upgrading to a 2.0 version?

> > I then tried hot-adding hde with "mdadm --add /dev/hde [--force] /dev/md0"
> > but that only got me "mdadm: /dev/hde does not appear to be an md
> > device".
>
> You got the array and the drive in the wrong positions here, thus the
> error message, and you can't hot-add to an array that isn't started.
> hot-add is to add redundancy to an array that is already running - for
> instance after a drive has failed you hot-remove it, then after you've
> cleared bad blocks, you hot-add it.

I see, I completely misunderstood the manpage there.

Greets,
Frank
* Re: Failed RAID-5 with 4 disks
  2005-09-16 19:09 ` Frank Blendinger
@ 2005-09-16 19:52   ` Mike Hardy
  2005-09-17  9:31   ` Burkhard Carstens
  1 sibling, 0 replies; 9+ messages in thread
From: Mike Hardy @ 2005-09-16 19:52 UTC (permalink / raw)
To: linux-raid

Frank Blendinger wrote:
> On Fri, Sep 16, 2005 at 10:02:13AM -0700, Mike Hardy wrote:
> > Does an mdadm -E on the dd'd /dev/hde show that it has the superblock
> > and knows about the array? That would confirm that it has the data
> > and is ready to go.
>
> mdadm -E /dev/hde tells me:
> mdadm: No super block found on /dev/hde (Expected magic a92b4efc, got
> 00000000)

This is bad - it appears that your dd either did not work or was not
complete or something, but /dev/hde does not contain a copy of an array
component. You can not continue until you've got one of the two failed
disks' data.

> OK, sounds good. I tried this:
>
> $ mdadm --create --force --level 5 -n 4 /dev/md0 /dev/hdi /dev/hdk /dev/hde missing
> [mdadm output trimmed; see above]
>
> I'm not quite sure about the output: hdk gets listed twice (once
> falsely as an ext2 filesystem) and hde (this is the dd'ed disk) not at
> all. Should I continue here?

Not sure why hdk gets listed twice, but it's probably not a huge deal.
The missing hde is the big problem though. You don't have enough
components together yet.

> Of course I don't want the second broken hard drive as spare. I just
> used it to dd its content to the new disk. I am going to get a new
> drive for the second failed one once I have the array back up and
> running (without redundancy).

The 'missing' slot and '-n 4' tell mdadm that you are creating a 4-disk
array, but one of the slots is empty at this point. The array will be
created and initially run in degraded mode. When you add a disk to the
array later, it won't be a spare, it will give you the normal
redundancy.

> Should I check for bad blocks on hdg and then repeat the dd to the new
> disk?

I'm not going to comment on specific disks, I don't really know your
complete situation. The process is the point though. If you have a disk
that failed, for any reason, you should run a long SMART test on it
('smartctl -t long <disk>'). If it has bad blocks, you should fix them.
If it has data on it that is not redundant, you should try to copy it
elsewhere first. Once the disks pass the long SMART test, they're
capable of being used without problems in a raid array.

> > Alternatively you could forcibly assemble the array as it was with
> > Neil's new faulty-read-correction patch, and the blocks will probably
> > get auto-cleared.
>
> I am still using mdadm 1.9.0 (the package that came with Debian sarge).
> Would you suggest manually upgrading to a 2.0 version?

I'd get your data back first. It's clear you haven't used the tools
much, so I wouldn't throw an attempt at upgrading them into the mix.
Getting the data back is hard enough, even if the tools are old hat.

> I see, I completely misunderstood the manpage there.

Given this, and your questions about how to add the drives, and the
redundancy etc., the main thing I'd recommend is to practice this stuff
in a safe environment. If your data is important, and you plan on
running a raid for a while, it will really pay off in the long run, and
it's so much more relaxing to run this stuff when you're confident in
the tools you have to use when problems crop up.

How to do that? I'd use loop devices. You create a number of files of
the same size, you export them as loopback-mounted devices, and build a
raid out of them. Open a second terminal so you can
`watch cat /proc/mdstat`. Open a third so you can
`tail -f /var/log/messages`, then start playing around with creating new
arrays out of the loop files, hot-removing, hot-adding etc. All in a
safe way.

I wrote a script that facilitates this a while back and posted it to the
list: http://www.spinics.net/lists/raid/msg07564.html

You should be able to simulate what you need to do with your real disks
by setting up 4 loop devices and failing two, then attempting to
recover.

-Mike
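
A throwaway practice array along the lines Mike describes can be built
entirely from files (a sketch, assuming root privileges, roughly 400 MB
of free space, and that /dev/loop0 through /dev/loop3 are unused):

  # create four 100 MB backing files
  for i in 0 1 2 3; do dd if=/dev/zero of=/tmp/raidfile$i bs=1M count=100; done

  # attach them to loop devices
  for i in 0 1 2 3; do losetup /dev/loop$i /tmp/raidfile$i; done

  # build a 4-disk RAID-5 out of the loop devices
  mdadm --create /dev/md1 --level=5 --raid-devices=4 \
        /dev/loop0 /dev/loop1 /dev/loop2 /dev/loop3

  # practice failure handling without risking real data
  mdadm /dev/md1 --fail /dev/loop2
  mdadm /dev/md1 --remove /dev/loop2
  mdadm /dev/md1 --add /dev/loop2

  # tear down when done
  mdadm --stop /dev/md1
  for i in 0 1 2 3; do losetup -d /dev/loop$i; done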
* Re: Failed RAID-5 with 4 disks
  2005-09-16 19:09 ` Frank Blendinger
  2005-09-16 19:52   ` Mike Hardy
@ 2005-09-17  9:31   ` Burkhard Carstens
  2005-09-17 16:46     ` Frank Blendinger
  1 sibling, 1 reply; 9+ messages in thread
From: Burkhard Carstens @ 2005-09-17 9:31 UTC (permalink / raw)
To: linux-raid

On Friday, 16 September 2005 21:09, Frank Blendinger wrote:
> On Fri, Sep 16, 2005 at 10:02:13AM -0700, Mike Hardy wrote:
> > Frank Blendinger wrote:
> > > [description of copying the failing hdg to the new disk with dd,
> > > with read errors in the syslog -- trimmed; see above]

I am unable to find the beginning of this thread, so please excuse me if
this has already been said, but: you did use ddrescue, didn't you?
Because when dd fails to read a block, it won't write that block. That's
why your superblock might have "moved" about 45 sectors towards the
beginning of the drive. ddrescue writes a block of zeros when reading a
block fails, so every readable sector gets copied to its corresponding
sector on the new drive.

Burkhard
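
The difference shows up in the copy command itself (a sketch, with
/dev/hdg as the failing source and /dev/hde as the new target, as in
this thread):

  # plain dd with conv=noerror alone skips unreadable blocks without
  # writing anything in their place, so all later data shifts; adding
  # sync pads the bad blocks with zeros and keeps every offset intact
  dd if=/dev/hdg of=/dev/hde bs=4k conv=noerror,sync

  # dd_rescue / GNU ddrescue also preserve offsets, retry bad areas in
  # smaller steps, and can record which areas could not be read
  dd_rescue /dev/hdg /dev/hde
  # or: ddrescue /dev/hdg /dev/hde /root/hdg-rescue.log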
* Re: Failed RAID-5 with 4 disks
  2005-09-17  9:31 ` Burkhard Carstens
@ 2005-09-17 16:46   ` Frank Blendinger
  0 siblings, 0 replies; 9+ messages in thread
From: Frank Blendinger @ 2005-09-17 16:46 UTC (permalink / raw)
To: linux-raid

On Sat, Sep 17, 2005 at 11:31:12AM +0200, Burkhard Carstens wrote:
> You did use ddrescue, didn't you? Because when dd fails to read a
> block, it won't write that block. That's why your superblock might
> have "moved" about 45 sectors towards the beginning of the drive.
> ddrescue writes a block of zeros when reading a block fails, so every
> readable sector gets copied to its corresponding sector on the new
> drive.

OK, thanks for the note. I am going to do a bad-block scan with SMART
and then try again with ddrescue. I hope this will restore the
superblock.

Greets,
Frank
end of thread, other threads:[~2005-09-17 16:46 UTC | newest]
Thread overview: 9+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2005-07-26 17:03 Failed RAID-5 with 4 disks Frank Blendinger
2005-07-26 21:58 ` Tyler
2005-07-26 22:36 ` Dan Stromberg
2005-07-26 23:12 ` Tyler
2005-09-16 11:36 ` Frank Blendinger
[not found] ` <432AFA95.3040709@h3c.com>
2005-09-16 19:09 ` Frank Blendinger
2005-09-16 19:52 ` Mike Hardy
2005-09-17 9:31 ` Burkhard Carstens
2005-09-17 16:46 ` Frank Blendinger