* RAID 1 failure on single disk causes disk subsystem to lock up
From: Robert L Mathews @ 2008-03-30 23:22 UTC
To: linux-raid
I'm using a two-disk SATA RAID 1 array on a number of identical servers,
currently running kernel 2.6.8 (I know that's outdated; we use security
backports and will soon be upgrading to 2.6.18).
Over the last year, a disk has failed on three different servers (with
different brands of disks).
What I'd hope to happen in such situations is that the bad disk would be
dropped from the RAID array automatically, and the machine would
continue running with a degraded array.
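For reference, the graceful path I'd expect looks roughly like the
following -- illustrative mdadm/mdstat output, not captured from these
machines:

  # a failed member is marked (F) and the array keeps running degraded
  $ cat /proc/mdstat
  md0 : active raid1 sdb1[1](F) sda1[0]
        487759872 blocks [2/1] [U_]

  # the dead member can then be removed and a replacement added later
  $ mdadm /dev/md0 --remove /dev/sdb1
  $ mdadm /dev/md0 --add /dev/sdb1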
However, in all three cases, that's not what happened. Instead,
something like the following is printed to dmesg:
ata2: command 0x35 timeout, stat 0xd0 host_stat 0x20
scsi1: ERROR on channel 0, id 0, lun 0, CDB: Write (10) 00 07 b2 c7 80 00 00 10 00
Current sdb: sense key Medium Error
Additional sense: Write error - auto reallocation failed
end_request: I/O error, dev sdb, sector 129156992
ATA: abnormal status 0xD0 on port 0xE407
Once this happens, all disk reads and writes fail to complete. "top" and
"ps" show many processes stuck in the "D" state, from which they never
recover. Using "kill -9" on them has no effect.
If I run a new program that requires disk access, that program hangs the
terminal and can't be killed.
Using "iostat" shows no reads or writes occurring either at the md layer
or on the underlying /dev/sda and /dev/sdb devices, although the "%util"
column, oddly, shows 100% usage for the failed disk.
Running any mdadm command doesn't work. I don't see anything on the
screen and that terminal hangs, presumably because mdadm tries doing
disk access and gets hung in the "D" state, too.
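For anyone who wants to poke at a hang like this, here's roughly how I
inspect it -- standard tools, nothing specific to our setup, and the
sysrq trick assumes /proc/sys/kernel/sysrq is enabled:

  # list processes stuck in uninterruptible sleep, with the kernel
  # function each one is blocked in
  $ ps -eo pid,stat,wchan:30,cmd | awk '$2 ~ /^D/'

  # dump stack traces of all tasks to dmesg via magic sysrq
  $ echo t > /proc/sysrq-trigger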
I've waited several minutes to see if the machine will recover, and it
doesn't. I eventually have to power cycle it.
Shouldn't the write error cause the bad disk to be gracefully removed
from the array? Is this something that's likely to work better when we
upgrade to a newer kernel version?
--
Robert L Mathews
"In the beginning, the universe was created. This has made a lot of
people very angry and has been widely regarded as a bad move."
-- Douglas Adams
* Re: RAID 1 failure on single disk causes disk subsystem to lock up
From: Justin Piszcz @ 2008-03-31 10:01 UTC
To: Robert L Mathews; +Cc: linux-raid
On Sun, 30 Mar 2008, Robert L Mathews wrote:
> I'm using a two-disk SATA RAID 1 array on a number of identical servers,
> currently running kernel 2.6.8 (I know that's outdated; we use security
> backports and will soon be upgrading to 2.6.18).
>
> Over the last year, a disk has failed on three different servers (with
> different brands of disks).
>
> What I'd hope to happen in such situations is that the bad disk would be
> dropped from the RAID array automatically, and the machine would continue
> running with a degraded array.
>
> However, in all three cases, that's not what happened. Instead, something
> like the following is printed to dmesg:
>
> ata2: command 0x35 timeout, stat 0xd0 host_stat 0x20
> scsi1: ERROR on channel 0, id 0, lun 0, CDB: Write (10) 00 07 b2 c7 80 00 00 10 00
> Current sdb: sense key Medium Error
> Additional sense: Write error - auto reallocation failed
> end_request: I/O error, dev sdb, sector 129156992
> ATA: abnormal status 0xD0 on port 0xE407
>
> Once this happens, all disk reads and writes fail to complete. "top" and "ps"
> show many processes stuck in the "D" state, from which they never recover.
> Using "kill -9" on them has no effect.
>
> If I run a new program that requires disk access, that program hangs the
> terminal and can't be killed.
>
> Using "iostat" shows no reads or writes occurring either at the md layer or
> on the underlying /dev/sda and /dev/sdb devices, although the "%util" column,
> oddly, shows 100% usage for the failed disk.
>
> Running any mdadm command doesn't work. I don't see anything on the screen
> and that terminal hangs, presumably because mdadm tries doing disk access and
> gets hung in the "D" state, too.
>
> I've waited several minutes to see if the machine will recover, and it
> doesn't. I eventually have to power cycle it.
>
> Shouldn't the write error cause the bad disk to be gracefully removed from
> the array? Is this something that's likely to work better when we upgrade to
> a newer kernel version?
Did you have swap on the RAID1 as well?
I am trying to remember... when my host had a disk failure in a
situation similar to yours, it turned out that it did not kick out the
bad disk until I rebooted the host.
Justin.
* Re: RAID 1 failure on single disk causes disk subsystem to lock up
From: Robert L Mathews @ 2008-03-31 17:27 UTC
To: linux-raid
Maurice Hilarius wrote:
> How old are the controllers/motherboards?
>
> Is the controller ON the motherboard?
They're SuperMicro 6013A-T servers with this motherboard:
http://www.supermicro.com/products/motherboard/Xeon/E7501/X5DPA-TGM+.cfm
It appears to use an "Adaptec ICH5R SATA controller" on the motherboard
(there's no separate SATA card or anything like that). Although that
controller apparently has an optional RAID feature, I'm not using it;
it's just in standard JBOD mode.
> What you describe sounds suspiciously like an IDE to SATA bridge chip.
> Or, in other words, ATA behaviour.
Here's part of the output from "lshw" on one of these machines:
*-ide:1
description: IDE interface
product: 82801EB (ICH5) Serial ATA 150 Storage Controller
vendor: Intel Corp.
physical id: 1f.2
bus info: pci@00:1f.2
logical name: scsi0
logical name: scsi1
version: 02
width: 32 bits
clock: 66MHz
capabilities: ide bus_master emulated scsi-host
configuration: driver=ata_piix
resources: ioport:ec00-ec07 ioport:e800-e803 ioport:e400-e407
ioport:e000-e003 ioport:dc00-dc0f irq:185
*-disk:0
description: SCSI Disk
product: Maxtor 7H500F0
vendor: ATA
physical id: 0
bus info: scsi@0.0:0.0
logical name: /dev/sda
version: HA43
size: 465GB
configuration: ansiversion=5
*-disk:1
description: SCSI Disk
product: SAMSUNG HD501LJ
vendor: ATA
physical id: 1
bus info: scsi@1.0:0.0
logical name: /dev/sdb
version: CR10
size: 465GB
configuration: ansiversion=5
I do see that both disks are under "ide:1". Is that what you mean?
> This is not something from mdadm, anyway.
> Once the disk "dies" you are losing the disk bus, and that is "all she
> wrote".
So mdadm can't protect against disk failures on these machines? Whenever
a disk returns a write error, the machine will lock up?
--
Robert L Mathews
"In the beginning, the universe was created. This has made a lot of
people very angry and has been widely regarded as a bad move."
-- Douglas Adams
* Re: RAID 1 failure on single disk causes disk subsystem to lock up
From: Robert L Mathews @ 2008-03-31 17:30 UTC
To: linux-raid
Justin Piszcz wrote:
> Did you have swap on the RAID1 as well?
Yes, although swap usage on these machines is generally zero (they have
4 GB of RAM and there's rarely anything in swap at all).
Even if there was some swap usage, though, that shouldn't cause this
problem, should it? The point of putting swap on RAID is to avoid
exactly this kind of issue.
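For what it's worth, the swap arrays are set up the usual way --
roughly the following, with illustrative device names:

  # swap lives on its own RAID1, so one dead disk can't take it out
  mdadm --create /dev/md1 --level=1 --raid-devices=2 /dev/sda2 /dev/sdb2
  mkswap /dev/md1
  swapon /dev/md1

  # /etc/fstab entry
  /dev/md1  none  swap  sw  0  0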
--
Robert L Mathews
* Re: RAID 1 failure on single disk causes disk subsystem to lock up
From: Justin Piszcz @ 2008-03-31 19:12 UTC
To: Robert L Mathews; +Cc: linux-raid
On Mon, 31 Mar 2008, Robert L Mathews wrote:
> Justin Piszcz wrote:
>
>> Did you have swap on the RAID1 as well?
>
> Yes, although swap usage on these machines is generally zero (they have 4 GB
> of RAM and there's rarely anything in swap at all).
>
> Even if there was some swap usage, though, that shouldn't cause this problem,
> should it? The point of putting swap on RAID is to avoid exactly this kind of
> issue.
Trying to recall... I don't know if my host froze up or not, but it was
lagging and unresponsive. I have had this happen twice, once with a
74 GB Raptor and once with a 150 GB Raptor; I know the second time I
had to reboot to make it kick out the bad disk. The first time, though,
I do not recall whether the host froze up, but it was acting very
strange.
The 74 GB Raptor was on a PCI system (no PCIe) (i875P chipset, I
believe); the 150 GB Raptor was on a 965 chipset.
* Re: RAID 1 failure on single disk causes disk subsystem to lock up
From: Peter Grandi @ 2008-03-31 19:54 UTC
To: Linux RAID
>>> On Mon, 31 Mar 2008 10:27:46 -0700, Robert L Mathews
>>> <lists@tigertech.com> said:
[ ... ]
> I do see that both disks are under "ide:1". Is that what you
> mean?
Indeed the symptoms reported are likely to be from drives on the
same channel.
>> This is not something from mdadm, anyway. Once the disk "dies"
>> you are losing the disk bus, and that is "all she wrote".
That happens when the disk dies badly, but it is common enough.
> So mdadm can't protect against disk failures on these machines?
You can expect the Linux IO and RAID subsystems to only handle
reported, clean errors, after which the state of the whole machine
is well defined and known.
If you have high availability requirements perhaps you should buy
from an established storage vendor a storage system designed by
integration engineers and guaranteed by the vendor for some high
availability level.
> Whenever a disk returns a write error, the machine will lock
> up?
Perhaps without realizing it you have engaged in storage system
design and integration and there are many, many, many, many subtle
pitfalls in that (as the archives of this list show abundantly).
You cannot just slap things together and it all works. Have you
done even sketchy common mode failure analysis?
Also putting two drives belonging to a RAID set on the same
IDE/ATA channel is usually a bad idea for performance too.
* Re: RAID 1 failure on single disk causes disk subsystem to lock up
From: Richard Scobie @ 2008-04-01 19:33 UTC
To: Linux RAID
Peter Grandi wrote:
> Also putting two drives belonging to a RAID set on the same
> IDE/ATA channel is usually a bad idea for performance too.
>
True in general, but in this case the OP has SATA drives:
http://marc.info/?l=linux-raid&m=120692062718510&w=2
Very much a guess on my part, but I would suspect the SATA support in
the 2.6.8 kernel you are using is not helping. There was a major rewrite
of the libata error handling code in later kernels, along with significant
other SATA and RAID improvements.
Regards,
Richard
* Re: RAID 1 failure on single disk causes disk subsystem to lock up
From: Bill Davidsen @ 2008-04-02 17:43 UTC
To: Robert L Mathews; +Cc: linux-raid
Robert L Mathews wrote:
> I'm using a two-disk SATA RAID 1 array on a number of identical
> servers, currently running kernel 2.6.8 (I know that's outdated; we
> use security backports and will soon be upgrading to 2.6.18).
If you look at the change logs for versions between 2.6.8 and 2.6.24 you
will see a lot of improvements in disk error handling. The kernel you
are running is probably incapable of handling the particular failure you
are getting, at least for values of "handle" which include "cleanly
report failure back to the md layer." Unless you have some very good
reason for upgrading to another old kernel, you really should consider
moving to something more recent. Security backports seem to be aimed at
intrusion prevention and not data security in this case.
--
Bill Davidsen <davidsen@tmr.com>
"Woe unto the statesman who makes war without a reason that will still
be valid when the war is over..." Otto von Bismarck
* Re: RAID 1 failure on single disk causes disk subsystem to lock up
From: Robert L Mathews @ 2008-04-02 18:33 UTC
To: Linux RAID
Peter Grandi wrote:
> If you have high availability requirements perhaps you should buy
> from an established storage vendor a storage system designed by
> integration engineers and guaranteed by the vendor for some high
> availability level.
Actually, I don't trust such systems. That's our main reason for using
software RAID 1: if all else fails with regard to RAID, we can take one
of the disks and mount it as a non-RAID ext3 file system. No
"guaranteed" proprietary system can offer that.
(And other than this one perplexing problem, we've been extremely happy
with software RAID for many years -- thanks, Neil and everyone else
involved.)
> Perhaps without realizing it you have engaged in storage system
> design and integration and there are many, many, many, many subtle
> pitfalls in that (as the archives of this list show abundantly).
>
> You cannot just slap things together and it all works. Have you
> done even sketchy common mode failure analysis?
Ouch! :-)
Just for the record, this isn't "slapped together" hardware. They're
off-the-shelf, server-grade, currently sold, genuine Intel, etc.
SuperMicro servers, with no modifications, specifically chosen because
they're widely used. The only storage system design we've done is
connect a SATA drive to each of the two motherboard SATA ports and use
software RAID 1 (yeah, I know that's "design", and we did think about it
and test it, but still).
We've done many stress/failure tests for data storage, all of which pass
as expected. What I unfortunately can't test in advance is how they
behave when a working hard disk suddenly has a mechanical failure, which
is the only time we've seen a problem.
I could sacrifice a working disk by opening it up while running and
poking the platters with a screwdriver (I've seriously considered this),
but repeating the test more than a few times would get expensive.
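The closest software-only approximation I know of is to yank the device
out from under the kernel, something like the sketch below (paths and
device names are illustrative, and I can't say how faithfully this
mimics a mechanical death -- or whether the sysfs knob behaves the same
on a kernel as old as ours):

  # take the device offline at the SCSI layer...
  echo offline > /sys/block/sdb/device/state

  # ...then any write through the array should hit the dead member
  dd if=/dev/zero of=/mnt/scratch/testfile bs=1M count=10 conv=fsync

  # or fail it at the md layer (tests md's handling, not the driver's)
  mdadm /dev/md0 --fail /dev/sdb1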
> Also putting two drives belonging to a RAID set on the same
> IDE/ATA channel is usually a bad idea for performance too.
They're SATA drives. There's no actual IDE hardware involved.
--
Robert L Mathews
* Re: RAID 1 failure on single disk causes disk subsystem to lock up
From: Robert L Mathews @ 2008-04-02 18:42 UTC
To: linux-raid
Bill Davidsen wrote:
> If you look at the change logs for versions between 2.6.8 and 2.6.24 you
> will see a lot of improvements in disk error handling. The kernel you
> are running is probably incapable of handling the particular failure you
> are getting, at least for values of "handle" which include "cleanly
> report failure back to the md layer."
Thanks -- yep, that makes sense. I appreciate the advice. I'll see what
happens now that we're using 2.6.18 instead of 2.6.8.
> Unless you have some very good
> reason for upgrading to another old kernel
> ...
> Security backports seem to be aimed at intrusion prevention and not
> data security in this case.
Yep. Unfortunately, we have to stick to Debian versions that still
receive security support. The machines are Web servers that
allow strangers (aka customers) shell access, and intrusion prevention
is paramount :-( In a few months, the next version of Debian will
allow us to upgrade to 2.6.24.
Again, thanks for the comments, everyone; much appreciated.
--
Robert L Mathews
* Re: RAID 1 failure on single disk causes disk subsystem to lock up
From: Peter Grandi @ 2008-04-03 21:01 UTC
To: Linux RAID
>>> On Wed, 02 Apr 2008 11:33:23 -0700, Robert L Mathews
>>> <lists@tigertech.com> said:
[ ... transfer errors on one disk affect transfers on a disk
attached to the same chip ... ]
> Just for the record, this isn't "slapped together"
> hardware. They're off-the-shelf, server-grade, currently
> sold, genuine Intel, etc. SuperMicro servers, [ ... ]
Yes, but system integration engineers spend time and effort
qualifying even good quality stuff like SuperMicro motherboards
in the specific configurations designed, because there are
*lots* of potential pitfalls.
One amusing example I heard about recently is that at CERN a
whole batch of storage servers built using excellent bits was
running much slower than expected because some cooling fans were
making drives vibrate a bit, thus making arm seek stabilization
a lot more difficult and causing drives to fail much sooner than
expected.
> [ ... ] The only storage system design we've done is connect a
> SATA drive to each of the two motherboard SATA ports and use
> software RAID 1 (yeah, I know that's "design", and we did
> think about it and test it, but still).
But have you checked whether the two ports use shared circuitry
and in effect the two ports are on the same channel? Because my
impression and that of another poster is that your drives are
sharing the same transfer logic, as if two IDE drives on the
same ribbon.
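A quick way to check how the kernel has wired the ports up (a rough
sketch; the exact sysfs layout varies between kernel versions):

  # one PCI function drives both ports on this chip
  lspci | grep -i 'IDE\|SATA'

  # which SCSI host each disk hangs off
  readlink /sys/block/sda/device
  readlink /sys/block/sdb/device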
The ICH5R chipset was the first Intel one with SATA support and
it has extensive IDE/ATA compatibility:
http://en.Wikipedia.org/wiki/I/O_Controller_Hub#ICH5
http://WWW.Intel.com/design/chipsets/manuals/25267102.pdf
Perhaps corners were cut and the chip does not actually operate
the SATA channels independently. At least, that is what it looks
like given the errors that you are seeing.
For example this guy noticed the exact same problem you are
seeing with a slightly different (one drive going offline)
cause:
http://www.mail-archive.com/linux-ide%40vger.kernel.org/msg07691.html
Note also the driver developer's response...
[ ... ]
* Re: RAID 1 failure on single disk causes disk subsystem to lock up
From: Robert L Mathews @ 2008-04-04 19:11 UTC
To: Linux RAID
Peter Grandi wrote:
> But have you checked whether the two ports use shared circuitry
> and in effect the two ports are on the same channel?
Unfortunately, I haven't been able to find that information at a level
of detail that would tell me whether I should expect this to fail (and
I have searched).
> Because my
> impression and that of another poster is that your drives are
> sharing the same transfer logic, as if two IDE drives on the
> same ribbon.
> ...
> For example this guy noticed the exact same problem you are
> seeing with a slightly different (one drive going offline)
> cause:
If that's the same thing, though, it doesn't seem to be a literal "the
hardware can't handle this" issue, because the poster mentioned that the
same test worked when running Windows instead of Linux on the same box.
It does sound like perhaps the kernel version I was using just doesn't
quite handle this situation, and that newer versions of the kernel have
a rewritten version of the applicable code that could conceivably make a
difference. If (when) a disk hardware failure happens on one of these
machines again, I'll report if anything different happens with newer
kernels.
Thanks again!
--
Robert L Mathews
"In the beginning, the universe was created. This has made a lot of
people very angry and has been widely regarded as a bad move."
-- Douglas Adams