apparent but not real raid1 failure. what happened? still confused. Gurus Please help...

All of lore.kernel.org
 help / color / mirror / Atom feed

From: Mitchell Laks <mlaks@verizon.net>
To: linux-raid@vger.kernel.org
Subject: apparent but not real raid1 failure. what happened? still confused. Gurus Please help...
Date: Mon, 02 May 2005 12:24:35 -0400	[thread overview]
Message-ID: <200505021224.35396.mlaks@verizon.net> (raw)

I recently had what appeared to be 
a raid1 failure and was wondering what insight I should draw. The kernel 
diagnostics suggested a dual drive failure - but the data turned out to still 
be there. What does this mean?

I described what happened in an earlier post, but I really don't understand 
and would be very grateful for insight from the gurus on the list.

Is it a bug in the kernel? in software raid? Is it my stupidity?

my system: 

asus K8v-x motherboard with amd64 processor,
uname -a
Linux A2 2.6.8-1-386 #1 Mon Jan 24 03:01:58 EST 2005 i686 GNU/Linux
(this was the debian stock 2.6.8 kernel circa January)

mdadm-v1.9.0

All harddrives are 250GB  parallel ata ide drives, 
WD2500 JB drives (3 year warranty)

Initially, one raid failed:
/dev/md0 between /dev/hda1 and
/dev/hdg1 with the /dev/hdg1 on a highpoint rocket 133 controller.

From reading the log files I see that initially /dev/hda1 died

Apr 21 07:36:01 A2 kernel: hda: dma_intr: status=0x51 { DriveReady
SeekComplete Error }
Apr 21 07:36:01 A2 kernel: hda: dma_intr: error=0x40 { UncorrectableError },
LBAsect=209715335, high=12, low=8388743, sector=209
715335
Apr 21 07:36:01 A2 kernel: end_request: I/O error, dev hda, sector 209715335
Apr 21 07:36:01 A2 kernel: raid1: Disk failure on hda1, disabling device.
Apr 21 07:36:01 A2 kernel: ^IOperation continuing on 1 devices
Apr 21 07:36:01 A2 kernel: raid1: hda1: rescheduling sector 209715272
Apr 21 07:36:01 A2 kernel: RAID1 conf printout:
Apr 21 07:36:01 A2 kernel:  --- wd:1 rd:2
Apr 21 07:36:01 A2 kernel:  disk 0, wo:1, o:0, dev:hda1
Apr 21 07:36:01 A2 kernel:  disk 1, wo:0, o:1, dev:hdg1
Apr 21 07:36:01 A2 kernel: RAID1 conf printout:
Apr 21 07:36:01 A2 kernel:  --- wd:1 rd:2
Apr 21 07:36:01 A2 kernel:  disk 1, wo:0, o:1, dev:hdg1
Apr 21 07:36:01 A2 kernel: raid1: hdg1: redirecting sector 209715272 to
another mirror
Apr 21 07:36:21 A2 kernel: hdg: dma_timer_expiry: dma status == 0x20
Apr 21 07:36:21 A2 kernel: hdg: DMA timeout retry
Apr 21 07:36:21 A2 kernel: hdg: timeout waiting for DMA
Apr 21 07:36:21 A2 kernel: hdg: status error: status=0x58 { DriveReady
SeekComplete DataRequest }
Apr 21 07:36:21 A2 kernel:
Apr 21 07:36:21 A2 kernel: hdg: drive not ready for command
Apr 21 07:36:21 A2 kernel: raid1: hdg1: rescheduling sector 209715272
Apr 21 07:36:21 A2 kernel: raid1: hdg1: redirecting sector 209715272 to
another mirror
Apr 21 07:36:21 A2 kernel: hdg: status error: status=0x58 { DriveReady
SeekComplete DataRequest }
Apr 21 07:36:21 A2 kernel:
Apr 21 07:36:21 A2 kernel: hdg: drive not ready for command
Apr 21 07:36:41 A2 kernel: hdg: dma_timer_expiry: dma status == 0x20
Apr 21 07:36:41 A2 kernel: hdg: DMA timeout retry
Apr 21 07:36:41 A2 kernel: hdg: timeout waiting for DMA
Apr 21 07:36:41 A2 kernel: hdg: status error: status=0x58 { DriveReady
SeekComplete DataRequest }
Apr 21 07:36:41 A2 kernel:
Apr 21 07:36:41 A2 kernel: hdg: drive not ready for command
Apr 21 07:36:41 A2 kernel: raid1: hdg1: rescheduling sector 209715272
Apr 21 07:36:41 A2 kernel: raid1: hdg1: redirecting sector 209715272 to
another mirror
Apr 21 07:36:41 A2 kernel: hdg: status error: status=0x58 { DriveReady
SeekComplete DataRequest }
Apr 21 07:36:41 A2 kernel:

and then /dev/hdg1 immediately began to spew forth error messages of the
following sort 

Apr 22 22:29:21 A2 kernel: hdg: status error: status=0x58 { DriveReady
SeekComplete DataRequest }
Apr 22 22:29:21 A2 kernel:
Apr 22 22:29:21 A2 kernel: hdg: drive not ready for command
Apr 22 22:29:21 A2 kernel: raid1: hdg1: rescheduling sector 209715272
Apr 22 22:29:21 A2 kernel: raid1: hdg1: redirecting sector 209715272 to
another
mirror
Apr 22 22:29:21 A2 kernel: hdg: status error: status=0x58 { DriveReady
SeekCompl
ete DataRequest }
Apr 22 22:29:21 A2 kernel:
Apr 22 22:29:21 A2 kernel: hdg: drive not ready for command
Apr 22 22:29:21 A2 kernel: raid1: hdg1: rescheduling sector 209715272
Apr 22 22:29:21 A2 kernel: raid1: hdg1: redirecting sector 209715272 to other
 ....

These errors continued nonstop all day/night until /
var ran out of space and errors filled the 6GB /var partition.

2.6GB of         /var/log/kern.log and
2.6GB of        /var/log/syslog and
1GB of          /var/log/messages
were filled by these errors.

I then pulled the two drives out of the system and 
put a pair of new drives in for /dev/hda1 and /dev/hdg1 and
and created /dev/md0 anew, and restored the data to my servers from backups.

I then took the two drives /dev/hda1 and /dev/hdg1 to another machine

and ran the Western Digital drive diagnostics on both of them and they 
are both fine. No errors.

I then took /dev/hda1 on the new system and did
modprobe raid1
mknod /dev/md0 b 9 0
mdadm -A /dev/md0  /dev/hda1 

and then mount /dev/md0 /mnt and i see my data which looks intact.
Similarly if I do that with /dev/hdg1 i see the same data.

(note if i then try to do mdadm -A /dev/md0 /dev/hda1 /dev/hdc1 
(where /dev/hdc1 was /dev/hdg1 on the other machine) then i get a message
saying effectively that they are not up to date to each other ...)

Has anyone else had this trouble? Could someone explain what happened?

What should I have done when I found the errors when my system failed?

Is it safe for me to continue to use raid1?

Thanks,

Mitchell

next             reply	other threads:[~2005-05-02 16:24 UTC|newest]

Thread overview: 10+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2005-05-02 16:24 Mitchell Laks [this message]
2005-05-02 18:20 ` apparent but not real raid1 failure. what happened? still confused. Gurus Please help Peter T. Breuer
2005-06-02 15:23 ` Is there a drive error "retry" parameter? Carlos Knowlton
2005-06-02 17:16   ` Michael Tokarev
2005-06-03  9:21     ` danci
2005-06-14 21:53     ` Carlos Knowlton
2005-06-14 22:46       ` Michael Tokarev
2005-06-15 21:40         ` Carlos Knowlton
2005-06-16  0:20         ` Paul Clements
2005-06-16 16:23           ` Michael Tokarev

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=200505021224.35396.mlaks@verizon.net \
    --to=mlaks@verizon.net \
    --cc=linux-raid@vger.kernel.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

Be sure your reply has a Subject: header at the top and a blank line before the message body.

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.