* RAID 1 failure on single disk causes disk subsystem to lock up
From: Robert L Mathews @ 2008-03-30 23:22 UTC
To: linux-raid
I'm using a two-disk SATA RAID 1 array on a number of identical servers,
currently running kernel 2.6.8 (I know that's outdated; we use security
backports and will soon be upgrading to 2.6.18).
Over the last year, a disk has failed on three different servers (with
different brands of disks).
What I'd hope to happen in such situations is that the bad disk would be
dropped from the RAID array automatically, and the machine would
continue running with a degraded array.
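For reference, the graceful path I'd expect looks roughly like the
following -- illustrative mdadm/mdstat output, not captured from these
machines:

  # a failed member is marked (F) and the array keeps running degraded
  $ cat /proc/mdstat
  md0 : active raid1 sdb1[1](F) sda1[0]
        487759872 blocks [2/1] [U_]

  # the dead member can then be removed and a replacement added later
  $ mdadm /dev/md0 --remove /dev/sdb1
  $ mdadm /dev/md0 --add /dev/sdb1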
However, in all three cases, that's not what happened. Instead,
something like the following is printed to dmesg:
ata2: command 0x35 timeout, stat 0xd0 host_stat 0x20
scsi1: ERROR on channel 0, id 0, lun 0, CDB: Write (10) 00 07 b2 c7 80 00 00 10 00
Current sdb: sense key Medium Error
Additional sense: Write error - auto reallocation failed
end_request: I/O error, dev sdb, sector 129156992
ATA: abnormal status 0xD0 on port 0xE407
Once this happens, all disk reads and writes fail to complete. "top" and
"ps" show many processes stuck in the "D" state, from which they never
recover. Using "kill -9" on them has no effect.
If I run a new program that requires disk access, that program hangs the
terminal and can't be killed.
Using "iostat" shows no reads or writes occurring either at the md layer
or on the underlying /dev/sda and /dev/sdb devices, although the "%util"
column, oddly, shows 100% usage for the failed disk.
Running any mdadm command doesn't work. I don't see anything on the
screen and that terminal hangs, presumably because mdadm tries doing
disk access and gets hung in the "D" state, too.
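For anyone who wants to poke at a hang like this, here's roughly how I
inspect it -- standard tools, nothing specific to our setup, and the
sysrq trick assumes /proc/sys/kernel/sysrq is enabled:

  # list processes stuck in uninterruptible sleep, with the kernel
  # function each one is blocked in
  $ ps -eo pid,stat,wchan:30,cmd | awk '$2 ~ /^D/'

  # dump stack traces of all tasks to dmesg via magic sysrq
  $ echo t > /proc/sysrq-trigger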
I've waited several minutes to see if the machine will recover, and it
doesn't. I eventually have to power cycle it.
Shouldn't the write error cause the bad disk to be gracefully removed
from the array? Is this something that's likely to work better when we
upgrade to a newer kernel version?
--
Robert L Mathews
"In the beginning, the universe was created. This has made a lot of
people very angry and has been widely regarded as a bad move."
-- Douglas Adams
* Re: RAID 1 failure on single disk causes disk subsystem to lock up
From: Justin Piszcz @ 2008-03-31 10:01 UTC
To: Robert L Mathews; +Cc: linux-raid
On Sun, 30 Mar 2008, Robert L Mathews wrote:
> I'm using a two-disk SATA RAID 1 array on a number of identical servers,
> currently running kernel 2.6.8 (I know that's outdated; we use security
> backports and will soon be upgrading to 2.6.18).
>
> Over the last year, a disk has failed on three different servers (with
> different brands of disks).
>
> What I'd hope to happen in such situations is that the bad disk would be
> dropped from the RAID array automatically, and the machine would continue
> running with a degraded array.
>
> However, in all three cases, that's not what happened. Instead, something
> like the following is printed to dmesg:
>
> ata2: command 0x35 timeout, stat 0xd0 host_stat 0x20
> scsi1: ERROR on channel 0, id 0, lun 0, CDB: Write (10) 00 07 b2 c7 80 00 00 10 00
> Current sdb: sense key Medium Error
> Additional sense: Write error - auto reallocation failed
> end_request: I/O error, dev sdb, sector 129156992
> ATA: abnormal status 0xD0 on port 0xE407
>
> Once this happens, all disk reads and writes fail to complete. "top" and "ps"
> show many processes stuck in the "D" state, from which they never recover.
> Using "kill -9" on them has no effect.
>
> If I run a new program that requires disk access, that program hangs the
> terminal and can't be killed.
>
> Using "iostat" shows no reads or writes occurring either at the md layer or
> on the underlying /dev/sda and /dev/sdb devices, although the "%util" column,
> oddly, shows 100% usage for the failed disk.
>
> Running any mdadm command doesn't work. I don't see anything on the screen
> and that terminal hangs, presumably because mdadm tries doing disk access and
> gets hung in the "D" state, too.
>
> I've waited several minutes to see if the machine will recover, and it
> doesn't. I eventually have to power cycle it.
>
> Shouldn't the write error cause the bad disk to be gracefully removed from
> the array? Is this something that's likely to work better when we upgrade to
> a newer kernel version?
Did you have swap on the RAID1 as well?
I am trying to remember... when my host had a disk failure in a
situation similar to yours, it turned out that it did not kick out the
bad disk until I rebooted the host.
Justin.
* Re: RAID 1 failure on single disk causes disk subsystem to lock up
From: Robert L Mathews @ 2008-03-31 17:27 UTC
To: linux-raid
Maurice Hilarius wrote:
> How old are the controllers/motherboards?
>
> Is the controller ON the motherboard?
They're SuperMicro 6013A-T servers with this motherboard:
http://www.supermicro.com/products/motherboard/Xeon/E7501/X5DPA-TGM+.cfm
It appears to use an "Adaptec ICH5R SATA controller" on the motherboard
(there's no separate SATA card or anything like that). Although that
controller apparently has an optional RAID feature, I'm not using it;
it's just in standard JBOD mode.
> What you describe sounds suspiciously like an IDE to SATA bridge chip.
> Or, in other words, ATA behaviour.
Here's part of the output from "lshw" on one of these machines:
*-ide:1
description: IDE interface
product: 82801EB (ICH5) Serial ATA 150 Storage Controller
vendor: Intel Corp.
physical id: 1f.2
bus info: pci@00:1f.2
logical name: scsi0
logical name: scsi1
version: 02
width: 32 bits
clock: 66MHz
capabilities: ide bus_master emulated scsi-host
configuration: driver=ata_piix
resources: ioport:ec00-ec07 ioport:e800-e803 ioport:e400-e407
ioport:e000-e003 ioport:dc00-dc0f irq:185
*-disk:0
description: SCSI Disk
product: Maxtor 7H500F0
vendor: ATA
physical id: 0
bus info: scsi@0.0:0.0
logical name: /dev/sda
version: HA43
size: 465GB
configuration: ansiversion=5
*-disk:1
description: SCSI Disk
product: SAMSUNG HD501LJ
vendor: ATA
physical id: 1
bus info: scsi@1.0:0.0
logical name: /dev/sdb
version: CR10
size: 465GB
configuration: ansiversion=5
I do see that both disks are under "ide:1". Is that what you mean?
> This is not something from mdadm, anyway.
> Once the disk "dies" you are losing the disk bus, and that is "all she
> wrote".
So mdadm can't protect against disk failures on these machines? Whenever
a disk returns a write error, the machine will lock up?
--
Robert L Mathews
"In the beginning, the universe was created. This has made a lot of
people very angry and has been widely regarded as a bad move."
-- Douglas Adams
* Re: RAID 1 failure on single disk causes disk subsystem to lock up
From: Robert L Mathews @ 2008-03-31 17:30 UTC
To: linux-raid
Justin Piszcz wrote:
> Did you have swap on the RAID1 as well?
Yes, although swap usage on these machines is generally zero (they have
4 GB of RAM and there's rarely anything in swap at all).
Even if there was some swap usage, though, that shouldn't cause this
problem, should it? The point of putting swap on RAID is to avoid
exactly this kind of issue.
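For what it's worth, the swap arrays are set up the usual way --
roughly the following, with illustrative device names:

  # swap lives on its own RAID1, so one dead disk can't take it out
  mdadm --create /dev/md1 --level=1 --raid-devices=2 /dev/sda2 /dev/sdb2
  mkswap /dev/md1
  swapon /dev/md1

  # /etc/fstab entry
  /dev/md1  none  swap  sw  0  0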
--
Robert L Mathews
* Re: RAID 1 failure on single disk causes disk subsystem to lock up
From: Justin Piszcz @ 2008-03-31 19:12 UTC
To: Robert L Mathews; +Cc: linux-raid
On Mon, 31 Mar 2008, Robert L Mathews wrote:
> Justin Piszcz wrote:
>
>> Did you have swap on the RAID1 as well?
>
> Yes, although swap usage on these machines is generally zero (they have 4 GB
> of RAM and there's rarely anything in swap at all).
>
> Even if there was some swap usage, though, that shouldn't cause this problem,
> should it? The point of putting swap on RAID is to avoid exactly this kind of
> issue.
Trying to recall... I don't know if my host froze up or not, but it was
lagging and unresponsive. I have had this happen twice, once with a
74 GB Raptor and once with a 150 GB Raptor; I know the second time I
had to reboot to make it kick out the bad disk. The first time, though,
I do not recall whether the host froze up, but it was acting very
strange.
The 74 GB Raptor was on a PCI system (no PCIe) (i875P chipset, I
believe); the 150 GB Raptor was on a 965 chipset.
* Re: RAID 1 failure on single disk causes disk subsystem to lock up
From: Peter Grandi @ 2008-03-31 19:54 UTC
To: Linux RAID
>>> On Mon, 31 Mar 2008 10:27:46 -0700, Robert L Mathews
>>> <lists@tigertech.com> said:
[ ... ]
> I do see that both disks are under "ide:1". Is that what you
> mean?
Indeed the symptoms reported are likely to be from drives on the
same channel.
>> This is not something from mdadm, anyway. Once the disk "dies"
>> you are losing the disk bus, and that is "all she wrote".
That happens when the disk dies badly, but it is common enough.
> So mdadm can't protect against disk failures on these machines?
You can expect the Linux IO and RAID subsystems to only handle
reported, clean errors, after which the state of the whole machine
is well defined and known.
If you have high availability requirements perhaps you should buy
from an established storage vendor a storage system designed by
integration engineers and guaranteed by the vendor for some high
availability level.
> Whenever a disk returns a write error, the machine will lock
> up?
Perhaps without realizing it you have engaged in storage system
design and integration and there are many, many, many, many subtle
pitfalls in that (as the archives of this list show abundantly).
You cannot just slap things together and it all works. Have you
done even sketchy common mode failure analysis?
Also putting two drives belonging to a RAID set on the same
IDE/ATA channel is usually a bad idea for performance too.
* Re: RAID 1 failure on single disk causes disk subsystem to lock up
From: Richard Scobie @ 2008-04-01 19:33 UTC
To: Linux RAID
Peter Grandi wrote:
> Also putting two drives belonging to a RAID set on the same
> IDE/ATA channel is usually a bad idea for performance too.
>
True in general, but in this case the OP has SATA drives:
http://marc.info/?l=linux-raid&m=120692062718510&w=2
Very much a guess on my part, but I would suspect the SATA support in
the 2.6.8 kernel you are using is not helping. There was a major rewrite
of the libata error handling code in later kernels, along with significant
other SATA and RAID improvements.
Regards,
Richard
* Re: RAID 1 failure on single disk causes disk subsystem to lock up
From: Bill Davidsen @ 2008-04-02 17:43 UTC
To: Robert L Mathews; +Cc: linux-raid
Robert L Mathews wrote:
> I'm using a two-disk SATA RAID 1 array on a number of identical
> servers, currently running kernel 2.6.8 (I know that's outdated; we
> use security backports and will soon be upgrading to 2.6.18).
If you look at the change logs for versions between 2.6.8 and 2.6.24 you
will see a lot of improvements in disk error handling. The kernel you
are running is probably incapable of handling the particular failure you
are getting, at least for values of "handle" which include "cleanly
report failure back to the md layer." Unless you have some very good
reason for upgrading to another old kernel, you really should consider
moving to something more recent. Security backports seem to be aimed at
intrusion prevention and not data security in this case.
--
Bill Davidsen <davidsen@tmr.com>
"Woe unto the statesman who makes war without a reason that will still
be valid when the war is over..." Otto von Bismarck
* Re: RAID 1 failure on single disk causes disk subsystem to lock up
From: Robert L Mathews @ 2008-04-02 18:33 UTC
To: Linux RAID
Peter Grandi wrote:
> If you have high availability requirements perhaps you should buy
> from an established storage vendor a storage system designed by
> integration engineers and guaranteed by the vendor for some high
> availability level.
Actually, I don't trust such systems. That's our main reason for using
software RAID 1: if all else fails with regard to RAID, we can take one
of the disks and mount it as a non-RAID ext3 file system. No
"guaranteed" proprietary system can offer that.
(And other than this one perplexing problem, we've been extremely happy
with software RAID for many years -- thanks, Neil and everyone else
involved.)
> Perhaps without realizing it you have engaged in storage system
> design and integration and there are many, many, many, many subtle
> pitfalls in that (as the archives of this list show abundantly).
>
> You cannot just slap things together and it all works. Have you
> done even sketchy common mode failure analysis?
Ouch! :-)
Just for the record, this isn't "slapped together" hardware. They're
off-the-shelf, server-grade, currently sold, genuine Intel, etc.
SuperMicro servers, with no modifications, specifically chosen because
they're widely used. The only storage system design we've done is
connect a SATA drive to each of the two motherboard SATA ports and use
software RAID 1 (yeah, I know that's "design", and we did think about it
and test it, but still).
We've done many stress/failure tests for data storage, all of which pass
as expected. What I unfortunately can't test in advance is how they
behave when a working hard disk suddenly has a mechanical failure, which
is the only time we've seen a problem.
I could sacrifice a working disk by opening it up while running and
poking the platters with a screwdriver (I've seriously considered this),
but repeating the test more than a few times would get expensive.
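The closest software-only approximation I know of is to yank the device
out from under the kernel, something like the sketch below (paths and
device names are illustrative, and I can't say how faithfully this
mimics a mechanical death -- or whether the sysfs knob behaves the same
on a kernel as old as ours):

  # take the device offline at the SCSI layer...
  echo offline > /sys/block/sdb/device/state

  # ...then any write through the array should hit the dead member
  dd if=/dev/zero of=/mnt/scratch/testfile bs=1M count=10 conv=fsync

  # or fail it at the md layer (tests md's handling, not the driver's)
  mdadm /dev/md0 --fail /dev/sdb1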
> Also putting two drives belonging to a RAID set on the same
> IDE/ATA channel is usually a bad idea for performance too.
They're SATA drives. There's no actual IDE hardware involved.
--
Robert L Mathews
* Re: RAID 1 failure on single disk causes disk subsystem to lock up
From: Robert L Mathews @ 2008-04-02 18:42 UTC
To: linux-raid
Bill Davidsen wrote:
> If you look at the change logs for versions between 2.6.8 and 2.6.24 you
> will see a lot of improvements in disk error handling. The kernel you
> are running is probably incapable of handling the particular failure you
> are getting, at least for values of "handle" which include "cleanly
> report failure back to the md layer."
Thanks -- yep, that makes sense. I appreciate the advice. I'll see what
happens now that we're using 2.6.18 instead of 2.6.8.
> Unless you have some very good
> reason for upgrading to another old kernel
> ...
> Security backports seem to be aimed at intrusion prevention and not
> data security in this case.
Yep. Unfortunately, we have to stick to Debian versions that still
receive security support. The machines are Web servers that
allow strangers (aka customers) shell access, and intrusion prevention
is paramount :-( In a few months, the next version of Debian will
allow us to upgrade to 2.6.24.
Again, thanks for the comments, everyone; much appreciated.
--
Robert L Mathews
* Re: RAID 1 failure on single disk causes disk subsystem to lock up
From: Peter Grandi @ 2008-04-03 21:01 UTC
To: Linux RAID
>>> On Wed, 02 Apr 2008 11:33:23 -0700, Robert L Mathews
>>> <lists@tigertech.com> said:
[ ... transfer errors on one disk affect transfers on a disk
attached to the same chip ... ]
> Just for the record, this isn't "slapped together"
> hardware. They're off-the-shelf, server-grade, currently
> sold, genuine Intel, etc. SuperMicro servers, [ ... ]
Yes, but system integration engineers spend time and effort
qualifying even good quality stuff like SuperMicro motherboards
in the specific configurations designed, because there are
*lots* of potential pitfalls.
One amusing example I heard about recently is that at CERN a
whole batch of storage servers built using excellent bits was
running much slower than expected because some cooling fans were
making drives vibrate a bit, thus making arm seek stabilization
a lot more difficult and causing drives to fail much sooner than
expected.
> [ ... ] The only storage system design we've done is connect a
> SATA drive to each of the two motherboard SATA ports and use
> software RAID 1 (yeah, I know that's "design", and we did
> think about it and test it, but still).
But have you checked whether the two ports use shared circuitry
and in effect the two ports are on the same channel? Because my
impression and that of another poster is that your drives are
sharing the same transfer logic, as if two IDE drives on the
same ribbon.
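A quick way to check how the kernel has wired the ports up (a rough
sketch; the exact sysfs layout varies between kernel versions):

  # one PCI function drives both ports on this chip
  lspci | grep -i 'IDE\|SATA'

  # which SCSI host each disk hangs off
  readlink /sys/block/sda/device
  readlink /sys/block/sdb/device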
The ICH5R chipset was the first Intel one with SATA support and
it has extensive IDE/ATA compatibility:
http://en.Wikipedia.org/wiki/I/O_Controller_Hub#ICH5
http://WWW.Intel.com/design/chipsets/manuals/25267102.pdf
Perhaps corners were cut and the chip does not actually operate
the SATA channels independently. At least, that is what it looks
like given the errors that you are seeing.
For example this guy noticed the exact same problem you are
seeing with a slightly different (one drive going offline)
cause:
http://www.mail-archive.com/linux-ide%40vger.kernel.org/msg07691.html
Note also the driver developer's response...
[ ... ]
* Re: RAID 1 failure on single disk causes disk subsystem to lock up
From: Robert L Mathews @ 2008-04-04 19:11 UTC
To: Linux RAID
Peter Grandi wrote:
> But have you checked whether the two ports use shared circuitry
> and in effect the two ports are on the same channel?
Unfortunately, I haven't been able to find that information at a level
of detail that would tell me whether I should expect this to fail (and
I have searched).
> Because my
> impression and that of another poster is that your drives are
> sharing the same transfer logic, as if two IDE drives on the
> same ribbon.
> ...
> For example this guy noticed the exact same problem you are
> seeing with a slightly different (one drive going offline)
> cause:
If that's the same thing, though, it doesn't seem to be a literal "the
hardware can't handle this" issue, because the poster mentioned that the
same test worked when running Windows instead of Linux on the same box.
It does sound like perhaps the kernel version I was using just doesn't
quite handle this situation, and that newer versions of the kernel have
a rewritten version of the applicable code that could conceivably make a
difference. If (when) a disk hardware failure happens on one of these
machines again, I'll report if anything different happens with newer
kernels.
Thanks again!
--
Robert L Mathews
"In the beginning, the universe was created. This has made a lot of
people very angry and has been widely regarded as a bad move."
-- Douglas Adams