linux-raid.vger.kernel.org archive mirror
* apparent but not real raid1 failure. what happened? still confused. Gurus Please help...
@ 2005-05-02 16:24 Mitchell Laks
  2005-05-02 18:20 ` Peter T. Breuer
  2005-06-02 15:23 ` Is there a drive error "retry" parameter? Carlos Knowlton
  0 siblings, 2 replies; 10+ messages in thread
From: Mitchell Laks @ 2005-05-02 16:24 UTC (permalink / raw)
  To: linux-raid

I recently had what appeared to be a raid1 failure and was wondering what
insight I should draw from it. The kernel diagnostics suggested a dual-drive
failure, but the data turned out to still be there. What does this mean?

I described what happened in an earlier post, but I really don't understand 
and would be very grateful for insight from the gurus on the list.

Is it a bug in the kernel? In software raid? Is it my stupidity?

My system:

ASUS K8V-X motherboard with an AMD64 processor,
uname -a
Linux A2 2.6.8-1-386 #1 Mon Jan 24 03:01:58 EST 2005 i686 GNU/Linux
(this was the debian stock 2.6.8 kernel circa January)

mdadm-v1.9.0

All hard drives are 250GB parallel ATA (IDE) drives,
WD2500JB (3 year warranty)

Initially, one raid failed:
/dev/md0, a mirror of /dev/hda1 and /dev/hdg1,
with /dev/hdg1 on a HighPoint Rocket 133 controller.

From reading the log files I see that /dev/hda1 died first:

Apr 21 07:36:01 A2 kernel: hda: dma_intr: status=0x51 { DriveReady SeekComplete Error }
Apr 21 07:36:01 A2 kernel: hda: dma_intr: error=0x40 { UncorrectableError }, LBAsect=209715335, high=12, low=8388743, sector=209715335
Apr 21 07:36:01 A2 kernel: end_request: I/O error, dev hda, sector 209715335
Apr 21 07:36:01 A2 kernel: raid1: Disk failure on hda1, disabling device.
Apr 21 07:36:01 A2 kernel: ^IOperation continuing on 1 devices
Apr 21 07:36:01 A2 kernel: raid1: hda1: rescheduling sector 209715272
Apr 21 07:36:01 A2 kernel: RAID1 conf printout:
Apr 21 07:36:01 A2 kernel:  --- wd:1 rd:2
Apr 21 07:36:01 A2 kernel:  disk 0, wo:1, o:0, dev:hda1
Apr 21 07:36:01 A2 kernel:  disk 1, wo:0, o:1, dev:hdg1
Apr 21 07:36:01 A2 kernel: RAID1 conf printout:
Apr 21 07:36:01 A2 kernel:  --- wd:1 rd:2
Apr 21 07:36:01 A2 kernel:  disk 1, wo:0, o:1, dev:hdg1
Apr 21 07:36:01 A2 kernel: raid1: hdg1: redirecting sector 209715272 to another mirror
Apr 21 07:36:21 A2 kernel: hdg: dma_timer_expiry: dma status == 0x20
Apr 21 07:36:21 A2 kernel: hdg: DMA timeout retry
Apr 21 07:36:21 A2 kernel: hdg: timeout waiting for DMA
Apr 21 07:36:21 A2 kernel: hdg: status error: status=0x58 { DriveReady SeekComplete DataRequest }
Apr 21 07:36:21 A2 kernel:
Apr 21 07:36:21 A2 kernel: hdg: drive not ready for command
Apr 21 07:36:21 A2 kernel: raid1: hdg1: rescheduling sector 209715272
Apr 21 07:36:21 A2 kernel: raid1: hdg1: redirecting sector 209715272 to another mirror
Apr 21 07:36:21 A2 kernel: hdg: status error: status=0x58 { DriveReady SeekComplete DataRequest }
Apr 21 07:36:21 A2 kernel:
Apr 21 07:36:21 A2 kernel: hdg: drive not ready for command
Apr 21 07:36:41 A2 kernel: hdg: dma_timer_expiry: dma status == 0x20
Apr 21 07:36:41 A2 kernel: hdg: DMA timeout retry
Apr 21 07:36:41 A2 kernel: hdg: timeout waiting for DMA
Apr 21 07:36:41 A2 kernel: hdg: status error: status=0x58 { DriveReady SeekComplete DataRequest }
Apr 21 07:36:41 A2 kernel:
Apr 21 07:36:41 A2 kernel: hdg: drive not ready for command
Apr 21 07:36:41 A2 kernel: raid1: hdg1: rescheduling sector 209715272
Apr 21 07:36:41 A2 kernel: raid1: hdg1: redirecting sector 209715272 to another mirror
Apr 21 07:36:41 A2 kernel: hdg: status error: status=0x58 { DriveReady SeekComplete DataRequest }
Apr 21 07:36:41 A2 kernel:

and then /dev/hdg1 immediately began to spew forth error messages of the
following sort 

Apr 22 22:29:21 A2 kernel: hdg: status error: status=0x58 { DriveReady SeekComplete DataRequest }
Apr 22 22:29:21 A2 kernel:
Apr 22 22:29:21 A2 kernel: hdg: drive not ready for command
Apr 22 22:29:21 A2 kernel: raid1: hdg1: rescheduling sector 209715272
Apr 22 22:29:21 A2 kernel: raid1: hdg1: redirecting sector 209715272 to another mirror
Apr 22 22:29:21 A2 kernel: hdg: status error: status=0x58 { DriveReady SeekComplete DataRequest }
Apr 22 22:29:21 A2 kernel:
Apr 22 22:29:21 A2 kernel: hdg: drive not ready for command
Apr 22 22:29:21 A2 kernel: raid1: hdg1: rescheduling sector 209715272
Apr 22 22:29:21 A2 kernel: raid1: hdg1: redirecting sector 209715272 to other
 ....
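
The sector numbers in these logs are consistent with one another, which can
be checked with a little arithmetic. The following is only a sketch: it
assumes that the driver's low= field holds the lower 24 bits of the LBA and
high= the bits above them (the log does not say so), and the helper names
lba_from_high_low and sector_offset are made up for illustration.

```shell
#!/bin/sh
# Recombine the split LBA printed by the IDE driver as high=/low=.
# Assumption: "low" is the lower 24 bits, "high" is everything above.
lba_from_high_low() {
    echo $(( ($1 << 24) + $2 ))
}

# Difference between the whole-disk sector reported for hda and the
# sector raid1 reports for hda1.
sector_offset() {
    echo $(( $1 - $2 ))
}

lba_from_high_low 12 8388743        # prints 209715335, matching LBAsect=
sector_offset 209715335 209715272   # prints 63
```

The difference of 63 sectors would fit a partition that starts at sector 63
of the disk, but that is an inference from the numbers and not something the
logs confirm. If it holds, the hda I/O error and the sector that raid1 keeps
redirecting to hdg1 refer to the same place in the mirror.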

These errors continued nonstop, day and night, until they had filled the
6GB /var partition and /var ran out of space.

2.6GB of /var/log/kern.log and
2.6GB of /var/log/syslog and
1GB of /var/log/messages
were filled by these errors.

I then pulled the two drives out of the system,
put a pair of new drives in for /dev/hda1 and /dev/hdg1,
created /dev/md0 anew, and restored the data to my servers from backups.


I then took the two old drives (/dev/hda1 and /dev/hdg1) to another machine
and ran the Western Digital drive diagnostics on both of them. They
are both fine. No errors.

On the new system I then took /dev/hda1 and ran:
modprobe raid1
mknod /dev/md0 b 9 0
mdadm -A /dev/md0  /dev/hda1 

After that I can mount /dev/md0 /mnt and I see my data, which looks intact.
Similarly, if I do the same with /dev/hdg1 I see the same data.

(Note: if I then try mdadm -A /dev/md0 /dev/hda1 /dev/hdc1,
where /dev/hdc1 is what was /dev/hdg1 on the other machine, I get a message
saying, in effect, that the two are not up to date with each other ...)
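
One way to see why the two halves are reported as out of date with respect
to each other is to look at what each superblock records. A minimal sketch,
assuming mdadm --examine prints "Update Time" and "Events" lines for these
members; compare_superblocks is a hypothetical helper name, and the device
names in the usage line are the ones from the second machine described here.

```shell
# Print the update time and event counter stored in each member's
# md superblock, so the two halves can be compared side by side.
# Usage (as root): compare_superblocks /dev/hda1 /dev/hdc1
compare_superblocks() {
    for dev in "$@"; do
        echo "== $dev"
        mdadm --examine "$dev" | grep -E 'Update Time|Events'
    done
}
```

If hda1 was kicked out of the array first, as the log suggests, its values
would be expected to be older than those on the former hdg1, but that is an
expectation to check, not a result.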
 
Has anyone else had this trouble? Could someone explain what happened?

What should I have done when my system failed and I found these errors?

Is it safe for me to continue to use raid1?

Thanks,

Mitchell



Thread overview: 10+ messages
2005-05-02 16:24 apparent but not real raid1 failure. what happened? still confused. Gurus Please help Mitchell Laks
2005-05-02 18:20 ` Peter T. Breuer
2005-06-02 15:23 ` Is there a drive error "retry" parameter? Carlos Knowlton
2005-06-02 17:16   ` Michael Tokarev
2005-06-03  9:21     ` danci
2005-06-14 21:53     ` Carlos Knowlton
2005-06-14 22:46       ` Michael Tokarev
2005-06-15 21:40         ` Carlos Knowlton
2005-06-16  0:20         ` Paul Clements
2005-06-16 16:23           ` Michael Tokarev
