* RE: Megaraid and Dell PERC 4 controllers
@ 2005-08-29 20:25 Ju, Seokmann
  2005-08-30  7:43 ` Steve Sutphen
  0 siblings, 1 reply; 3+ messages in thread
From: Ju, Seokmann @ 2005-08-29 20:25 UTC
  To: linux-scsi, 'Jonathan Fischer'; +Cc: Kolli, Neela Syam

FYI - resending because the previous attempt failed to send.

> -----Original Message-----
> From: Ju, Seokmann 
> Sent: Friday, August 26, 2005 11:00 AM
> To: 'Jonathan Fischer'
> Cc: Kolli, Neela Syam
> Subject: RE: Megaraid and Dell PERC 4 controllers
> 
> Hi Jonathan,
> 
> On Tuesday, August 23, 2005 4:52 PM, Jonathan Fischer wrote:
> > I think next up I'm trying writethru mode, instead of write back, but
> > has anyone seen anything like this, or have any insight they might
> > offer?  I'm quickly getting to the point of being stumped.
> Can you please specify the detailed system configuration (memory
> size, number of CPUs)?
> And what kind of load are you putting on the system when it locks up?
> Also, I assume that the system doesn't have any monitoring
> applications running for those PERC controllers. Please confirm this.
> From the messages, the controller is taking more than 3 minutes to
> return certain I/O requests, which leads the system to lock up.
> 
> Thank you.
> 
> Seokmann
> 

* Megaraid and Dell PERC 4 controllers
@ 2005-08-23 20:51 Jonathan Fischer
  0 siblings, 0 replies; 3+ messages in thread
From: Jonathan Fischer @ 2005-08-23 20:51 UTC
  To: linux-scsi

I apologize if this is the wrong list to ask this kind of question on;
I've posted on Dell's PowerEdge list and Red Hat's lists as well, but I
figure the people here might know better what to try for this problem.

I have 2 Dell PowerEdge 2850s, one with a PERC 4e/DC RAID controller,
and the other with a PERC 4e/Di.  On both of these systems, I can
reliably cause the controllers to lock up under heavy load.  This is
using a fully up-to-date Red Hat Enterprise Linux 4 (non-x86_64)
installation on both computers.  The controllers use the megaraid_mbox
driver.

During a period of high load, the controller suddenly seems to stop
responding to the driver, causing the driver to go into a waiting loop
for it.  It waits 3 minutes for the controller to respond, which it
never does, and then takes the controller offline, pretty much yanking
the filesystem out from underneath the OS.

Some things keep running alright, so (working with Red Hat's support) I
got the thing set up to netdump to another server to see if we could
figure out what was going wrong.  The kernel never actually crashes, so
netdump doesn't produce a vmcore to look through, but syslog keeps
spouting out information, so I've got that.

Every time this lockup occurs, the log file looks like this:

megaraid: aborting-29762 cmd=2a <c=2 t=0 l=0>
megaraid abort: 29762:21[255:128], fw owner
megaraid: aborting-29763 cmd=2a <c=2 t=0 l=0>
megaraid abort: 29763:39[255:128], fw owner
megaraid: aborting-29764 cmd=2a <c=2 t=0 l=0>
megaraid abort: 29764:16[255:128], fw owner
megaraid: aborting-29768 cmd=2a <c=2 t=0 l=0>
megaraid abort: 29768:53[255:128], fw owner

	This part repeats 64 times, then...

megaraid: aborting-29831 cmd=2a <c=2 t=0 l=0>
megaraid abort: 29831:8[255:128], fw owner
megaraid: resetting the host...
megaraid: 64 outstanding commands. Max wait 180 sec
megaraid mbox: Wait for 64 commands to complete:180
megaraid mbox: Wait for 64 commands to complete:175
	
	megaraid mbox counts down to 0, and then...

megaraid mbox: critical hardware error!
megaraid: resetting the host...
megaraid: hw error, cannot reset
megaraid: resetting the host...
megaraid: hw error, cannot reset
SCSI error : <0 2 0 0> return code = 0x6000000
end_request: I/O error, dev sda, sector 242938701
Buffer I/O error on device dm-4, logical block 9893952 lost page write
due to I/O error on dm-4
scsi0 (0:0): rejecting I/O to offline device
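
For anyone decoding the "return code" in the log above, here is a
minimal sketch. It assumes the 2.6-era Linux SCSI mid-layer convention
in which the 32-bit result word packs the status, message, host, and
driver bytes from low to high. Under that assumption, 0x6000000 carries
a driver byte of 0x06, which kernels of that period define as
DRIVER_TIMEOUT, and that would be consistent with the 180-second wait
expiring. The function name decode_scsi_result is hypothetical and only
used for illustration.

```python
def decode_scsi_result(result: int) -> dict:
    """Split a 32-bit SCSI result word into its four byte-wide fields.

    Assumed layout (low to high): status, message, host, driver.
    """
    return {
        "status": result & 0xFF,
        "msg_byte": (result >> 8) & 0xFF,
        "host_byte": (result >> 16) & 0xFF,
        "driver_byte": (result >> 24) & 0xFF,
    }


# Value taken from the "SCSI error" line in the log.
fields = decode_scsi_result(0x6000000)
print(fields)
```

Only the top byte is non-zero for this value, so the failure is being
reported by the driver layer rather than as a SCSI status from the
device itself.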

The commands that the driver is waiting for are always the same, except
for the sequence number (the number right after "aborting-" and
"abort: ").  And there are always 64 commands backed up that the driver
is waiting for.
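
The observation above (same command every time, only the sequence
number changing) can be checked mechanically. The following is a rough
sketch, assuming the syslog lines look exactly like the excerpt in this
message; the regular expression and the helper name summarize_aborts
are illustrative and not part of any megaraid tooling.

```python
import re
from collections import Counter

# Matches lines such as: megaraid: aborting-29762 cmd=2a <c=2 t=0 l=0>
ABORT_RE = re.compile(
    r"megaraid: aborting-(?P<seq>\d+) cmd=(?P<cmd>[0-9a-f]+) "
    r"<c=(?P<c>\d+) t=(?P<t>\d+) l=(?P<l>\d+)>"
)


def summarize_aborts(lines):
    """Return the aborted sequence numbers and a count per (cmd, c, t, l)."""
    seqs = []
    targets = Counter()
    for line in lines:
        m = ABORT_RE.search(line)
        if m:
            seqs.append(int(m.group("seq")))
            key = (m.group("cmd"), m.group("c"), m.group("t"), m.group("l"))
            targets[key] += 1
    return seqs, targets


# Four lines copied from the log excerpt; "abort:" lines are ignored.
sample = [
    "megaraid: aborting-29762 cmd=2a <c=2 t=0 l=0>",
    "megaraid abort: 29762:21[255:128], fw owner",
    "megaraid: aborting-29763 cmd=2a <c=2 t=0 l=0>",
    "megaraid abort: 29763:39[255:128], fw owner",
]
seqs, targets = summarize_aborts(sample)
print(len(seqs), dict(targets))
```

If every abort in a full log falls under a single key, that supports
the claim that only one command type against one logical drive is
stalling. Opcode 2a is the SCSI WRITE(10) command, so the stuck
requests here are writes.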

Both machines in question pass memtest86 and Dell's diagnostic sets, and
since the failure is identical in both I don't believe it's bad
hardware.  We've got the latest BIOS, RAID firmware, and backplane
firmware on the machines.

I've also tried:
- the RHEL 4 Update 2 Beta kernel (at Red Hat's suggestion)
- RHEL 4 x86_64
- RHEL 3 x86_64
- Fedora Core 4 x86
- disabling Patrol Read in the RAID BIOS
- disabling read-ahead in the RAID BIOS
- changing the writeback cache flush to every 2 seconds, instead of the
default 4

I think next up I'm trying writethru mode, instead of write back, but
has anyone seen anything like this, or have any insight they might
offer?  I'm quickly getting to the point of being stumped.

Jonathan Fischer
Operating Systems Analyst - CSU San Marcos
jfischer@csusm.edu

