Megaraid lockup on 2.6.[7-8]

* Megaraid lockup on 2.6.[7-8]
@ 2004-08-07  1:55 Dan Merillat
       [not found] ` <1091882205l.7352l.1l@serve.riede.org>
  0 siblings, 1 reply; 5+ messages in thread
From: Dan Merillat @ 2004-08-07  1:55 UTC
  To: linux-scsi

[-- Attachment #1: Type: text/plain, Size: 3767 bytes --]

This dump is from 2.6.7-rc3.  I'm using the in-kernel megaraid driver
on this Express 500.

0000:00:09.0 RAID bus controller: American Megatrends Inc. MegaRAID (rev 02)
        Subsystem: American Megatrends Inc. MegaRAID 475 Express 500/500LC RAID Controller

At this point, I'm really unsure what's going on.  It went from "working"
to "not working" overnight, after a 2-month uptime with no problems.  Now
it lasts between 5 minutes and 3 hours before it locks up.  Sometimes the
lockups are 'soft', in that all disk I/O hangs in wait.  Other times
they're hard (kernel panic and reload).

Due to the sudden nature of the problem, I suspected hardware.  Starting
with the SCSI cable, then the RAID card, RAM, PCI riser card, and
motherboard: it's all been replaced.  The megaraid itself doesn't complain
about problems accessing the drives (the state is always Optimal on
reboot, with no media or "other" errors on any drive).

Software-wise, I've tried 2.6.7-rc3, 2.6.7, 2.6.8-rc2 and 2.6.8-rc3.

I tried the 2.20 series drivers, but they don't appear to recognize older
cards.  Are they for U320 only, or did I do something wrong?  (2.20 works
fine on another box with a U320 controller.)

I'm at a loss.  I guess I could replace all the drives, since that's the
last thing left to try, but A) it's a serious PITA and B) they "appear"
fine, even when doing a full consistency check via the Ctrl-M BIOS.

This is made even more fun by the fact that it's a LIVE server, or was
until two days ago.  It's also housed in a colo facility about an hour
from my house.  I'm going back tomorrow to try updating the firmware on
the megaraid, but that's about the only thing I can think of that's left
to try.  Perhaps there's some PCI setting I should try?

Any help would be greatly appreciated!

(Different PCI Bus/Slot due to being in a different MB when I captured
this backtrace)

megaraid: found 0x101e:0x1960:bus 1:slot 2:func 0
scsi0:Found MegaRAID controller at 0xf8846000, IRQ:27
megaraid: [C170:3.13] detected 1 logical drives.
megaraid: supports extended CDBs.
megaraid: channel[0] is raid.
scsi0 : LSI Logic MegaRAID C170 254 commands 16 targs 4 chans 7 luns
scsi0: scanning scsi channel 0 for logical drives.
  Vendor: MegaRAID  Model: LD 0 RAID5  210G  Rev: C170
  Type:   Direct-Access                      ANSI SCSI revision: 02
SCSI device sda: 430116864 512-byte hdwr sectors (220220 MB)
sda: asking for cache data failed
sda: assuming drive cache: write through
 /dev/scsi/host0/bus0/target0/lun0: p1
Attached scsi disk sda at scsi0, channel 0, id 0, lun 0
Attached scsi generic sg0 at scsi0, channel 0, id 0, lun 0,  type 0
scsi0: scanning scsi channel 1 for logical drives.
scsi0: scanning scsi channel 2 for logical drives.
scsi0: scanning scsi channel 4 [P0] for physical devices.
...
megaraid: ABORTING-12793 cmd=2a <c=0 t=0 l=0>
megaraid: ABORTING-12796 cmd=2a <c=0 t=0 l=0>
megaraid: ABORTING-12797 cmd=2a <c=0 t=0 l=0>
megaraid: ABORTING-12798 cmd=2a <c=0 t=0 l=0>
megaraid: reservation reset failed.
megaraid: RESET-12793 cmd=2a <c=0 t=0 l=0>
megaraid: reservation reset failed.
megaraid: RESET-12793 cmd=2a <c=0 t=0 l=0>
megaraid: reservation reset failed.
megaraid: RESET-12793 cmd=2a <c=0 t=0 l=0>
scsi: Device offlined - not ready after error recovery: host 0 channel 0 id 0 lun 0
scsi: Device offlined - not ready after error recovery: host 0 channel 0 id 0 lun 0
scsi: Device offlined - not ready after error recovery: host 0 channel 0 id 0 lun 0
scsi: Device offlined - not ready after error recovery: host 0 channel 0 id 0 lun 0
SCSI error : <0 0 0 0> return code = 0x6000000
end_request: I/O error, dev sda, sector 197231
Buffer I/O error on device dm-64, logical block 24598
lost page write due to I/O error on dm-64
scsi0 (0:0): rejecting I/O to offline device
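
For anyone decoding the log above, here is a minimal Python sketch; it is
not part of the original post.  It assumes the 2.6-era SCSI result layout
(driver byte in bits 24-31, host byte in 16-23, message byte in 8-15,
status in 0-7) and the standard SCSI opcode 0x2a = WRITE(10).  The helper
names parse_event and decode_result are hypothetical.

```python
import re

# Standard SCSI opcodes; only the ones relevant to this log are listed.
OPCODES = {0x28: "READ(10)", 0x2A: "WRITE(10)"}

# Driver-byte values as I recall them from the 2.6-era include/scsi/scsi.h
# (assumption: only the two values needed here are listed).
DRIVER_BYTES = {0x00: "DRIVER_OK", 0x06: "DRIVER_TIMEOUT"}

# Matches both the old driver's "ABORTING-n"/"RESET-n" lines and the
# lowercase "aborting-n" lines printed by the 2.20 driver.
EVENT_RE = re.compile(
    r"megaraid: (ABORTING|RESET)-(\d+) cmd=([0-9a-f]+) "
    r"<c=(\d+) t=(\d+) l=(\d+)>",
    re.IGNORECASE,
)


def parse_event(line):
    """Return a dict describing one abort/reset line, or None if no match."""
    m = EVENT_RE.search(line)
    if m is None:
        return None
    kind, serial, opcode, chan, target, lun = m.groups()
    op = int(opcode, 16)
    return {
        "kind": kind.upper(),
        "serial": int(serial),
        "opcode": op,
        "opcode_name": OPCODES.get(op, "unknown"),
        "channel": int(chan),
        "target": int(target),
        "lun": int(lun),
    }


def decode_result(code):
    """Split a SCSI mid-layer return code into its four bytes."""
    driver = (code >> 24) & 0xFF
    return {
        "status": code & 0xFF,
        "msg": (code >> 8) & 0xFF,
        "host": (code >> 16) & 0xFF,
        "driver": driver,
        "driver_name": DRIVER_BYTES.get(driver, "unknown"),
    }


if __name__ == "__main__":
    print(parse_event("megaraid: ABORTING-12793 cmd=2a <c=0 t=0 l=0>"))
    print(decode_result(0x6000000))
```

Under those assumptions, cmd=2a marks the aborted commands as writes, and
0x6000000 has only the driver byte set (0x06), which would be consistent
with commands timing out rather than the device returning an error status.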

[-- Attachment #2: typhoon-config.gz --]
[-- Type: application/gzip, Size: 7202 bytes --]

* Re: Megaraid lockup on 2.6.[7-8]
       [not found] ` <1091882205l.7352l.1l@serve.riede.org>
@ 2004-08-07 16:26   ` Dan Merillat
  0 siblings, 0 replies; 5+ messages in thread
From: Dan Merillat @ 2004-08-07 16:26 UTC
  To: linux-scsi

> Did you consider power supply instability? Temperature problems?

Given that it's kept in a 70-degree machine room, I don't think it's
temperature-related.

It's a 450-watt redundant power supply.  It's possible that it has
failed, but unlikely (both halves are independent).

Wouldn't power supply/thermal problems manifest as random errors, though?
Memory generally fails the same way each time if a specific address is
flaking out (and that address is in kernel space), but I've already
replaced it.

Also, I checked the archives and I'm not the only one seeing this.
Apparently it's a 2.6-specific problem; 2.4 did not have it.

What is the latest driver for an LSI MegaRAID Express 500?  As I said,
2.20 seems to be only for the U320/SATA cards, or at least it doesn't
detect mine.

* RE: Megaraid lockup on 2.6.[7-8]
@ 2004-08-09 14:11 Mukker, Atul
  2004-08-09 22:33 ` Dan Merillat
  0 siblings, 1 reply; 5+ messages in thread
From: Mukker, Atul @ 2004-08-09 14:11 UTC
  To: 'Dan Merillat', linux-scsi

Dan,

I speculate your drives are good. BTW, what RAID level are they in? Also,
have you considered your drive enclosure as a possible source of errors? I
would highly recommend trying another box to see if it changes anything.
I can suggest FW trace collection, but let's wait a bit for that.

The latest 2.20 series of drivers
(ftp://ftp.lsil.com/pub/linux-megaraid/drivers/version-2.20.2.0/) should
support your card. This driver does have more extensive error reporting
capabilities.

-Atul Mukker
LSI Logic Corporation

* Re: Megaraid lockup on 2.6.[7-8]
  2004-08-09 14:11 Megaraid lockup on 2.6.[7-8] Mukker, Atul
@ 2004-08-09 22:33 ` Dan Merillat
  2004-08-09 23:02   ` Dan Merillat
  0 siblings, 1 reply; 5+ messages in thread
From: Dan Merillat @ 2004-08-09 22:33 UTC
  To: Mukker, Atul; +Cc: linux-scsi

On Mon, 9 Aug 2004 10:11:42 -0400, Mukker, Atul <atulm@lsil.com> wrote:
> Dan,
>
> I speculate your drives are good. BTW, what RAID level are they in? Also,
> have you considered your drive enclosure as a possible source of errors? I
> would highly recommend trying another box to see if it changes anything.
> I can suggest FW trace collection, but let's wait a bit for that.

RAID 5.  I've used AMI/LSI cards for quite a while (4 years?) and
normally if there's any SCSI/drive problem, the internal alarm sounds and
the array goes into degraded or offline mode.  In this case, a full
consistency check/rebuild of the drives succeeds, but running in Linux
for 4-5 minutes errors out.

I would try a different enclosure, but I don't have one available.
(SCA drives in hotswap enclosures).

> The latest 2.20 series of drivers
> (ftp://ftp.lsil.com/pub/linux-megaraid/drivers/version-2.20.2.0/) should
> support your card. This driver does have more extensive error reporting
> capabilities.

Actually, they don't.  They work once you include the PCI ID in the
table, though.  Since you said they 'should work', I went ahead, dug
around, and patched them.

Here's the PCI ID for this card; you may want to add the rest of the
megaraid series IDs to the driver:

0000:01:02.0 Class 0104: 101e:1960 (rev 02)
        Subsystem: 101e:0475
        Flags: bus master, medium devsel, latency 64, IRQ 27
        Memory at fc1f0000 (32-bit, prefetchable) [size=febf8000]
        Expansion ROM at 00008000 [disabled]
        Capabilities: <available only to root>

So far, so good.  When I stress-tested it I had a device-mapper related lockup,
but no scsi/Megaraid problems.   I'll report back in a few days if I
get any further errors.
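
The vendor/device pair in the lspci listing above (101e:1960) is what a
driver's PCI ID table has to match, and it agrees with the "found
0x101e:0x1960" line in the original boot log.  As a minimal sketch, not
part of the original message, the IDs can be pulled out of that listing
with Python.  The function name parse_lspci_ids is hypothetical, and the
pattern assumes the numeric "Class xxxx: vvvv:dddd" format shown above.

```python
import re

# The first two lines of the listing quoted in this message.
LSPCI_SAMPLE = """\
0000:01:02.0 Class 0104: 101e:1960 (rev 02)
        Subsystem: 101e:0475
"""


def parse_lspci_ids(text):
    """Extract class, vendor/device and subsystem IDs from numeric lspci output."""
    dev = re.search(
        r"Class ([0-9a-f]{4}): ([0-9a-f]{4}):([0-9a-f]{4})", text, re.IGNORECASE
    )
    if dev is None:
        raise ValueError("no 'Class xxxx: vvvv:dddd' line found")
    ids = {
        "class": int(dev.group(1), 16),
        "vendor": int(dev.group(2), 16),
        "device": int(dev.group(3), 16),
    }
    # The subsystem line is optional; leave the keys out if it is absent.
    sub = re.search(
        r"Subsystem: ([0-9a-f]{4}):([0-9a-f]{4})", text, re.IGNORECASE
    )
    if sub is not None:
        ids["subvendor"] = int(sub.group(1), 16)
        ids["subdevice"] = int(sub.group(2), 16)
    return ids


if __name__ == "__main__":
    for name, value in parse_lspci_ids(LSPCI_SAMPLE).items():
        print(f"{name}: 0x{value:04x}")
```

How the 2.20 driver's table is laid out is not shown in this thread, so
the sketch stops at extracting the numbers a table entry would need.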

--Dan

* Re: Megaraid lockup on 2.6.[7-8]
  2004-08-09 22:33 ` Dan Merillat
@ 2004-08-09 23:02   ` Dan Merillat
  0 siblings, 0 replies; 5+ messages in thread
From: Dan Merillat @ 2004-08-09 23:02 UTC
  To: Mukker, Atul; +Cc: linux-scsi

On Mon, 9 Aug 2004 15:33:23 -0700, Dan Merillat <harik.attar@gmail.com> wrote:

> So far, so good.  When I stress-tested it I had a device-mapper related lockup,
> but no scsi/Megaraid problems.   I'll report back in a few days if I
> get any further errors.

First lockup, but 2.20.2 recovered gracefully:

megaraid: aborting-37816 cmd=2a <c=1 t=0 l=0>
megaraid abort: 37816:53[255:0], fw owner
megaraid: aborting-37817 cmd=2a <c=1 t=0 l=0>
megaraid abort: 37817:34[255:0], fw owner
megaraid: aborting-37818 cmd=2a <c=1 t=0 l=0>
megaraid abort: 37818:26[255:0], fw owner
megaraid: aborting-37821 cmd=2a <c=1 t=0 l=0>
megaraid abort: 37821:18[255:0], fw owner
megaraid: aborting-37824 cmd=2a <c=1 t=0 l=0>
megaraid abort: 37824:7[255:0], fw owner
megaraid: reseting the host...
megaraid: 5 outstanding commands. Max wait 180 sec
megaraid mbox: Wait for 5 commands to complete:180
megaraid mbox: Wait for 5 commands to complete:175
megaraid mbox: Wait for 5 commands to complete:170
megaraid mbox: Wait for 5 commands to complete:165
megaraid mbox: reset sequence completed sucessfully

This time, it appears that we have 3 media errors on one of the drives.
(Since this is the first time it's survived a timeout, it's the first
time I can actually get into megamgr without rebooting.)

If so, mystery solved: drive failure plus bad error-handling code leads
to an undiagnosed lockup.  I KNEW it was too sudden to be random
software bitrot.
