* Fwd: Error on /dev/sda, but takes down RAID-1
From: Martin Seebach @ 2008-01-23 15:05 UTC
To: linux-raid
Hi,
I'm not sure this is completely linux-raid related, but I can't figure out where to start:
A few days ago, my server died. I was able to log in and salvage this content of dmesg:
http://pastebin.com/m4af616df
I talked to my hosting-people and they said it was an io-error on /dev/sda, and replaced that drive.
After this, I was able to boot into a PXE-image and re-build the two RAID-1 devices with no problems - indicating that sdb was fine.
I expected RAID-1 to be able to stomach exactly this kind of error - one drive dying. What did I do wrong?
Regards,
Martin Seebach
* Re: Fwd: Error on /dev/sda, but takes down RAID-1
From: Michael Tokarev @ 2008-01-23 18:35 UTC
To: Martin Seebach; +Cc: linux-raid
Martin Seebach wrote:
> Hi,
>
> I'm not sure this is completely linux-raid related, but I can't figure out where to start:
>
> A few days ago, my server died. I was able to log in and salvage this content of dmesg:
> http://pastebin.com/m4af616df
>
> I talked to my hosting-people and they said it was an io-error on /dev/sda, and replaced that drive.
> After this, I was able to boot into a PXE-image and re-build the two RAID-1 devices with no problems - indicating that sdb was fine.
>
> I expected RAID-1 to be able to stomach exactly this kind of error - one drive dying. What did I do wrong?
Going by that pastebin page:
First, sdb has failed for whatever reason:
ata2.00: qc timeout (cmd 0xec)
ata2.00: failed to IDENTIFY (I/O error, err_mask=0x4)
ata2.00: revalidation failed (errno=-5)
ata2.00: disabled
ata2: EH complete
sd 1:0:0:0: SCSI error: return code = 0x00040000
end_request: I/O error, dev sdb, sector 80324865
raid1: Disk failure on sdb1, disabling device.
Operation continuing on 1 devices
RAID1 conf printout:
--- wd:1 rd:2
disk 0, wo:0, o:1, dev:sda1
disk 1, wo:1, o:0, dev:sdb1
RAID1 conf printout:
--- wd:1 rd:2
disk 0, wo:0, o:1, dev:sda1
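The conf printouts above can also be read programmatically. The following is a minimal Python sketch, assuming that `wd` is the number of working disks, `rd` the number of configured raid disks, `wo` a "write-only" (not in sync) flag and `o` an "operational" flag; that reading of the abbreviations is an assumption, not something stated in this thread. The function name `parse_conf_printout` is hypothetical.

```python
import re

# Assumed field meanings (not confirmed by the thread):
#   wd = working disks, rd = configured raid disks,
#   wo = write-only (not in sync), o = operational.
CONF_RE = re.compile(r"---\s*wd:(\d+)\s+rd:(\d+)")
DISK_RE = re.compile(r"disk (\d+), wo:(\d), o:(\d), dev:(\w+)")

def parse_conf_printout(text):
    """Parse one 'RAID1 conf printout' block into a dictionary."""
    conf = CONF_RE.search(text)
    if conf is None:
        raise ValueError("no 'wd:/rd:' line found")
    working, total = int(conf.group(1)), int(conf.group(2))
    disks = [
        {
            "slot": int(slot),
            "write_only": wo == "1",
            "operational": o == "1",
            "dev": dev,
        }
        for slot, wo, o, dev in DISK_RE.findall(text)
    ]
    return {
        "working": working,
        "total": total,
        "degraded": working < total,
        "disks": disks,
    }

# First printout quoted in this message.
SAMPLE_CONF = """\
RAID1 conf printout:
--- wd:1 rd:2
disk 0, wo:0, o:1, dev:sda1
disk 1, wo:1, o:0, dev:sdb1
"""

if __name__ == "__main__":
    info = parse_conf_printout(SAMPLE_CONF)
    print("degraded:", info["degraded"])
    for disk in info["disks"]:
        print(disk["dev"], "operational" if disk["operational"] else "failed")
```

Under those assumptions, the first printout describes an array with one of two disks working and sdb1 marked non-operational, which matches the "Disk failure on sdb1" message just before it.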
At this point, it started to (re)sync the other(?) arrays
for some reason:
md: syncing RAID array md0
md: minimum _guaranteed_ reconstruction speed: 1000 KB/sec/disc.
md: using maximum available idle IO bandwidth (but not more than 200000 KB/sec) for reconstruction.
md: using 128k window, over a total of 40162432 blocks.
md: md0: sync done.
RAID1 conf printout:
--- wd:1 rd:2
disk 0, wo:0, o:1, dev:sda1
md: syncing RAID array md1
md: minimum _guaranteed_ reconstruction speed: 1000 KB/sec/disc.
md: using maximum available idle IO bandwidth (but not more than 200000 KB/sec) for reconstruction.
md: using 128k window, over a total of 100060736 blocks.
Note again, errors on sdb:
sd 1:0:0:0: SCSI error: return code = 0x00040000
end_request: I/O error, dev sdb, sector 112455000
sd 1:0:0:0: SCSI error: return code = 0x00040000
end_request: I/O error, dev sdb, sector 112455256
sd 1:0:0:0: SCSI error: return code = 0x00040000
end_request: I/O error, dev sdb, sector 112455512
...
raid1: Disk failure on sdb3, disabling device.
Operation continuing on 1 devices
So another md array detected the sdb failure, and we're
left with sda only. And voilà, sda fails too, some time
later:
ata1: EH complete
sd 0:0:0:0: SCSI error: return code = 0x00040000
end_request: I/O error, dev sda, sector 80324865
sd 0:0:0:0: SCSI error: return code = 0x00040000
end_request: I/O error, dev sda, sector 115481
...
At this point, the arrays are hosed: all disks
of each array have failed, so there's no data left
to read from or write to.
Since sda was later replaced, and sdb recovered
from the errors (it still contains valid superblocks,
but with somewhat stale information), everything
went OK.
But the original problem is that BOTH of your disks
failed, not only one. What caused THIS problem is
another question. Maybe overheating, a power supply
problem, or some such; I don't know. But the md code
did the best it could here.
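The sequence described above can be checked mechanically against a saved dmesg dump. The following is a minimal Python sketch, not part of the original analysis, that collects `end_request: I/O error` lines per device and preserves the order in which devices first reported an error. The function name `summarize_io_errors` is hypothetical, the sample text contains only the error lines quoted in this message, and the regular expression assumes the message format shown in this thread.

```python
import re
from collections import OrderedDict

# Matches lines such as: "end_request: I/O error, dev sdb, sector 80324865"
IO_ERROR_RE = re.compile(r"end_request: I/O error, dev (\w+), sector (\d+)")

def summarize_io_errors(dmesg_text):
    """Return an ordered mapping of device -> list of failed sectors.

    Devices appear in the order of their first error in the log.
    """
    errors = OrderedDict()
    for line in dmesg_text.splitlines():
        match = IO_ERROR_RE.search(line)
        if match:
            dev, sector = match.group(1), int(match.group(2))
            errors.setdefault(dev, []).append(sector)
    return errors

# Only the error lines quoted in this message; the full log has more.
SAMPLE = """\
end_request: I/O error, dev sdb, sector 80324865
end_request: I/O error, dev sdb, sector 112455000
end_request: I/O error, dev sdb, sector 112455256
end_request: I/O error, dev sdb, sector 112455512
end_request: I/O error, dev sda, sector 80324865
end_request: I/O error, dev sda, sector 115481
"""

if __name__ == "__main__":
    for dev, sectors in summarize_io_errors(SAMPLE).items():
        print(f"{dev}: {len(sectors)} error(s), first at sector {sectors[0]}")
```

If more than one device shows up in the result, the array did not see a single-disk failure, which is the situation described here.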
/mjt
* Re: Fwd: Error on /dev/sda, but takes down RAID-1
From: Neil Brown @ 2008-01-23 20:55 UTC
To: Martin Seebach; +Cc: linux-raid
On Wednesday January 23, mail@martinseebach.dk wrote:
> Hi,
>
> I'm not sure this is completely linux-raid related, but I can't figure out where to start:
>
> A few days ago, my server died. I was able to log in and salvage this content of dmesg:
> http://pastebin.com/m4af616df
At line 194:
end_request: I/O error, dev sdb, sector 80324865
then at line 384
end_request: I/O error, dev sda, sector 80324865
>
> I talked to my hosting-people and they said it was an io-error on /dev/sda, and replaced that drive.
> After this, I was able to boot into a PXE-image and re-build the two RAID-1 devices with no problems - indicating that sdb was fine.
>
> I expected RAID-1 to be able to stomach exactly this kind of error - one drive dying. What did I do wrong?
Trouble is it wasn't "one drive dying". You got errors from two
drives, at almost exactly the same time. So maybe the controller
died. Or maybe when one drive died, the controller or the driver got
confused and couldn't work with the other drive any more.
Certainly the "blk: request botched" message (line 233 onwards)
suggests some confusion in the driver.
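The coincidence noted earlier in this message (the same sector failing on both sdb and sda) can be searched for in a log automatically. This is a minimal Python sketch under the assumption that the log uses the `end_request: I/O error, dev X, sector N` format quoted in this thread; the function name `sectors_failing_on_multiple_devices` is hypothetical.

```python
import re
from collections import defaultdict

# Matches lines such as: "end_request: I/O error, dev sda, sector 80324865"
IO_ERROR_RE = re.compile(r"end_request: I/O error, dev (\w+), sector (\d+)")

def sectors_failing_on_multiple_devices(dmesg_text):
    """Return {sector: [devices]} for sectors reported on more than one device."""
    devices_by_sector = defaultdict(set)
    for match in IO_ERROR_RE.finditer(dmesg_text):
        devices_by_sector[int(match.group(2))].add(match.group(1))
    return {
        sector: sorted(devices)
        for sector, devices in devices_by_sector.items()
        if len(devices) > 1
    }

# The two lines quoted in this message, plus one unrelated error line.
LOG_EXCERPT = """\
end_request: I/O error, dev sdb, sector 80324865
end_request: I/O error, dev sdb, sector 112455000
end_request: I/O error, dev sda, sector 80324865
"""

if __name__ == "__main__":
    for sector, devices in sectors_failing_on_multiple_devices(LOG_EXCERPT).items():
        print(f"sector {sector} failed on: {', '.join(devices)}")
```

A sector number shared by two independent drives is a hint to look at something they have in common, such as the controller or the driver, rather than at either disk alone.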
Maybe post to linux-ide@vger.kernel.org - that is where issues with
SATA drivers and controllers can be discussed.
NeilBrown