public inbox for linux-scsi@vger.kernel.org
* Re: megaraid_sas waiting for command and then offline
@ 2006-12-12  3:04 Joe Malicki
  2006-12-12  5:24 ` Brett G. Durrett
  0 siblings, 1 reply; 6+ messages in thread
From: Joe Malicki @ 2006-12-12  3:04 UTC
  To: Brett G. Durrett, "linux-scsi ", David N. Welton,
	linux-poweredge, Sumant Patro
  Cc: Marc A. Meadows, Keith R. Baker

>  I have the same or a similar issue running 2.6.17 SMP x86_64 - the
> megaraid_sas driver hangs waiting for commands and then the filesystem
> unmounts, leaving the machine in an unusable state until there is a hard
> reboot (the machine is responsive but any access, shell or otherwise, is
> impossible without the filesystem). While I do not have much debugging
> information available, this happens to me about once every 6-7 days in
> my pool of seven machines, so I can probably get debugging info. Since
> the disk is offline and I can't get remote console, I don't have any
> details except something similar to Dave Lloyd's post, below.

Brett, is this still happening to you?  We're seeing this very
sporadically, but it does concern us.  We've seen driver updates in
2.6.19 (v00.00.03.05) and a new Dell PERC 5/i firmware:

Package Version - 5.0.2-0003
Firmware Version - 1.00.01-0157
SASBIOS Version - MT23
Ctrl-R Version - 1.02-007
MPT Version - 00.06.71.00-IT

and haven't been able to reproduce it, but we have no test case that
reliably reproduces the problem, so we can't tell whether anything was
actually fixed.  This is across 31 identically configured Dell 2950s
with the PERC 5/i RAID controller, each configured with six 300GB SAS
drives in a RAID 5 (most, possibly all, of them Maxtor Atlas 10k; no
hot spare).  Our 2950s do have 16GB of RAM each, so the firmware update
(which mentions that it fixes DMA beyond 8GB) sounds promising, but I
would think that if that were the problem we were experiencing, we
would reproduce it much more often.  We are certainly using the RAM for
cache and memory; it's not as if we've never touched memory beyond 8GB.

Does anyone have a test case to reproduce this problem reliably, or a
detailed description of what actually happens (on low levels) when this
problem occurs that can help to make a test?  We are more interested in
making this reproducible now than in finding a workaround... if anyone
has any tips on how to make this *more* likely to happen we'd like to
know (so far, I know to try to use XFS and enable ReadAhead).
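
One starting point for a synthetic test is a sustained write load with
forced flushes, since reports quoted in this thread say the affected
machines do almost exclusively writes. The sketch below is only an
assumption about what might stress the controller; nothing in the thread
confirms that it triggers the hang. The function name `write_stress`,
its parameters, and the target path are all illustrative.

```python
import os


def write_stress(path, block_size=1 << 20, blocks=64, passes=1):
    """Rewrite a file `passes` times and fsync after each pass.

    Returns the total number of bytes handed to write().
    This is a generic load generator, not a known reproducer.
    """
    buf = os.urandom(block_size)  # incompressible data
    total = 0
    for _ in range(passes):
        # "wb" truncates, so every pass rewrites the file from the start.
        with open(path, "wb") as f:
            for _ in range(blocks):
                total += f.write(buf)
            f.flush()
            os.fsync(f.fileno())  # push the data past the page cache
    return total
```

A call such as `write_stress("/mnt/raid/stress.bin", passes=1000)` (the
path is hypothetical) would keep rewriting a 64MB file on the array.
Running several copies in parallel with different paths is one way to
raise the number of outstanding commands.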

We have seen this correlated with Patrol Reads going on at the same
time, but aren't sure if this is a red herring, and haven't been able to
force the issue to happen by enabling Patrol Reads.

We've only ever seen these on two machines - one machine reproduces the
problem in a little over a week, and the other has reproduced it a small
number of times.  The machines that reproduce it run an experimental
demo workload, but we have not found a test case so far to reproduce the
problem on demand to find or verify solutions.  We're currently swapping
out machines to rule out hardware problems, but the machines pass their
own diagnostics, and the workload they run is different enough that we
suspect the trigger is something about the workload that we can't yet
synthesize into a test case.

Thank you!
Joe Malicki
Software Engineer
Metacarta, Inc.
email: jmalicki@metacarta.com

> The only thing that the machines with these failures seem to have in
> common is the fact that their load is almost exclusively writes - they
> are slave database machines with large memory and pretty much just
> replicate. The read/write machines seem to have fewer failures.
> 
> I am happy to help provide debugging information in any reasonable way.
> In the meantime, if there are any known suggestions or workarounds for
> the problem, I would be grateful for the guidance.
> 
> Here are the details on the controller. If you want additional info,
> let me know exactly what you need and I will do what I can to get it to
> you:
> 
> Product Name : PERC 5/i Integrated
> Serial No : 12345
> FW Package Build: 5.0.1-0030
> FW Version : 1.00.01-0088
> BIOS Version : MT23
> Ctrl-R Version :1.02-007
> 
> B-



* Re: megaraid_sas waiting for command and then offline
  2006-12-12  3:04 megaraid_sas waiting for command and then offline Joe Malicki
@ 2006-12-12  5:24 ` Brett G. Durrett
  2006-12-12  5:53   ` Joseph Malicki
  0 siblings, 1 reply; 6+ messages in thread
From: Brett G. Durrett @ 2006-12-12  5:24 UTC
  To: Joe Malicki
  Cc: "linux-scsi ", David N. Welton, linux-poweredge,
	Sumant Patro, Marc A. Meadows, Keith R. Baker


I am still seeing this and we have between 2 and 5 failures per week 
(across almost 20 machines).  I am seeing it on ext3 (we migrated all of 
the machines from XFS) and with ReadAhead disabled.

You mention a firmware update but I don't see any new PERC 5 firmware 
packages on Dell's site... can you give me a pointer to the firmware update?

Also, has anybody had this problem on RHEL?  Dell does not support Linux
unless it is RHEL... I would be surprised if somehow RHEL did not have
this problem.

B-





* Re: megaraid_sas waiting for command and then offline
  2006-12-12  5:24 ` Brett G. Durrett
@ 2006-12-12  5:53   ` Joseph Malicki
  2006-12-12 12:30     ` Greg Dickie
  2006-12-20  1:03     ` Brett G. Durrett
  0 siblings, 2 replies; 6+ messages in thread
From: Joseph Malicki @ 2006-12-12  5:53 UTC
  To: Brett G. Durrett
  Cc: "linux-scsi ", David N. Welton, linux-poweredge,
	Sumant Patro, Marc A. Meadows, Keith R. Baker

Hi Brett!

Thanks for the response; hopefully we can gather enough data points to
help solve the problem.

The new PERC 5/i integrated firmware dated 11/21/2006 is at:
http://support.dell.com/support/downloads/format.aspx?c=us&l=en&s=gen&SystemID=PWE_2950&os=LIN4&osl=en&deviceid=9182&typecnt=2&libid=46&releaseid=R139225&vercnt=3
PERC 5/E adapter: 
http://support.dell.com/support/downloads/format.aspx?c=us&l=en&s=gen&SystemID=PWE_2950&os=LIN4&osl=en&deviceid=9181&typecnt=2&libid=46&releaseid=R139227&vercnt=2

The release notes describe very similar symptoms, but I am not ready to
believe they describe our problem yet, as I can't reproduce it reliably
enough to be confident of a fix, though it sounds like you might be
able to.  Unfortunately we're using Debian at the moment, but if I can
reproduce it I can switch to RHEL in a heartbeat to duplicate it for
support (for now I'm trying to minimize variables).

Also, which driver version are you running?  I noticed you were using 
some patches from Sumant Patro@LSI - is your driver identical to the one 
in 2.6.19?  If not, what does it look like?

Have you noticed any correlations with patrol reads at the times of the 
failures? You can tell by running MegaCli -FwTermLog -Dsply -aALL
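
To compare patrol-read activity against failure times, the output of
the command above can be filtered for the relevant entries. This is a
minimal sketch, assuming the firmware log is plain text and that the
entries of interest contain the words "patrol read"; the exact log
wording is an assumption, and the helper names are hypothetical.

```python
import subprocess


def fetch_fw_term_log(megacli="MegaCli"):
    """Run the MegaCli command quoted in the message and return its output."""
    result = subprocess.run(
        [megacli, "-FwTermLog", "-Dsply", "-aALL"],
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout


def patrol_read_lines(log_text):
    """Return the log lines that mention patrol read (case-insensitive).

    The matching string is an assumption about the log format.
    """
    return [
        line for line in log_text.splitlines()
        if "patrol read" in line.lower()
    ]
```

Calling `patrol_read_lines(fetch_fw_term_log())` on an affected machine
gives the candidate entries, which can then be lined up by hand with
the time of the hang.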

What hardware are you running (CPUs, RAM, disk configuration)?

Have you noticed any correlation with heavy network I/O (as well as disk 
I/O)?  Some of our systems may have experienced this when running more 
network load than typical.


Thanks!
Joe




* Re: megaraid_sas waiting for command and then offline
  2006-12-12  5:53   ` Joseph Malicki
@ 2006-12-12 12:30     ` Greg Dickie
  2006-12-12 18:40       ` Joe Malicki
  2006-12-20  1:03     ` Brett G. Durrett
  1 sibling, 1 reply; 6+ messages in thread
From: Greg Dickie @ 2006-12-12 12:30 UTC
  To: Joseph Malicki
  Cc: Brett G. Durrett, linux-scsi, linux-poweredge, David N. Welton,
	Sumant Patro, Marc A. Meadows, Keith R. Baker

We've never had lockups like this, but we did notice that the
megaraid_sas module defaults to a much higher commands-per-LUN setting
than the hardware seems to be able to handle. IIRC the default is 128,
and we lowered it to 16 for the PERC 5/i and 32 for the PERC 5/E.
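
The message does not say how the setting was lowered. One possible
runtime route is the per-device `queue_depth` attribute that Linux
exposes in sysfs for SCSI devices. The sketch below assumes that
attribute is present and writable for this driver version, which the
thread does not confirm; the device name and helper names are
illustrative.

```python
from pathlib import Path


def queue_depth_path(device, sysfs_root="/sys"):
    """Build the sysfs path of a SCSI block device's queue_depth attribute."""
    return Path(sysfs_root) / "block" / device / "device" / "queue_depth"


def read_queue_depth(device, sysfs_root="/sys"):
    """Return the current queue depth as an integer."""
    return int(queue_depth_path(device, sysfs_root).read_text().strip())


def set_queue_depth(device, depth, sysfs_root="/sys"):
    """Write a new queue depth (needs root on a real system)."""
    if depth < 1:
        raise ValueError("queue depth must be at least 1")
    queue_depth_path(device, sysfs_root).write_text(f"{depth}\n")
```

For example, `set_queue_depth("sda", 16)` followed by
`read_queue_depth("sda")` would show whether the driver accepted the
new value. The `sysfs_root` parameter exists only so the helpers can be
exercised against a scratch directory.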

HTH,
Greg


-- 
Greg Dickie
just a guy
Maximum Throughput



* Re: megaraid_sas waiting for command and then offline
  2006-12-12 12:30     ` Greg Dickie
@ 2006-12-12 18:40       ` Joe Malicki
  0 siblings, 0 replies; 6+ messages in thread
From: Joe Malicki @ 2006-12-12 18:40 UTC
  To: Greg Dickie
  Cc: Brett G. Durrett, linux-scsi, linux-poweredge, David N. Welton,
	Sumant Patro, Marc A. Meadows, Keith R. Baker

Thanks Greg,

Is there documentation or tests of the number of commands per LUN that
the hardware can handle?  The driver is clearly reading the value out of
a register on the card.

thanks,
joe

Greg Dickie wrote:
> We've never had lockups like this but we did notice that the
> megaraid_sas modules defaults to a much higher commands per lun setting
> than the hardware seems to be able to handle. IIRC the default is 128
> and we lowered it to 16 for the 5i and 32 for the 5E.
> 
> HTH,
> Greg
> 
> 



* Re: megaraid_sas waiting for command and then offline
  2006-12-12  5:53   ` Joseph Malicki
  2006-12-12 12:30     ` Greg Dickie
@ 2006-12-20  1:03     ` Brett G. Durrett
  1 sibling, 0 replies; 6+ messages in thread
From: Brett G. Durrett @ 2006-12-20  1:03 UTC
  To: Joseph Malicki
  Cc: "linux-scsi ", David N. Welton, linux-poweredge,
	Sumant Patro, Marc A. Meadows, Keith R. Baker


Just replied to another poster but wanted to respond to this thread as 
well...

Joe,

Huge thanks for the pointer to the new firmware... I had a page 
bookmarked for the 2950 firmware but the bookmark went to an old page. 

We are running 16GB machines, dual-core, dual-CPU, RAID 5 on PERC 5/i.
The kernel is 2.6.18 and I think all of Sumant's changes are in 2.6.19.
The patrol reads did not seem to correlate with the failures.

Some possibly good news: It is probably too early to say for sure, but I 
upgraded the firmware and have not had a failure on any of the machines 
with the new firmware.  I will not feel this is "fixed" until I go 
another two weeks with no failures.
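
As a rough check on how much a quiet period says about a fix: assuming
failures arrive independently at a steady fleet-wide rate (a Poisson
model, which is an assumption and not something established in this
thread), the chance of seeing zero failures purely by luck is
exp(-rate * weeks). The 2 to 5 failures per week figure comes from
earlier in the thread; the function name and the `fraction_upgraded`
parameter are illustrative.

```python
import math


def prob_no_failures(rate_per_week, weeks, fraction_upgraded=1.0):
    """Probability of zero failures by chance under a Poisson model.

    rate_per_week     -- fleet-wide failure rate before the change
    weeks             -- length of the failure-free observation window
    fraction_upgraded -- share of the fleet being observed (assumed to
                         carry the same share of the failures)
    """
    expected = rate_per_week * weeks * fraction_upgraded
    return math.exp(-expected)


if __name__ == "__main__":
    for rate in (2, 5):
        print(rate, prob_no_failures(rate, weeks=2))
```

At 2 failures per week, two clean weeks across the whole fleet would
happen by chance with probability exp(-4), about 1.8%. If only half the
machines had been upgraded (a hypothetical figure), the same window
gives exp(-2), about 13.5%, so the share of upgraded machines matters
when deciding how long to wait.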

The release notes for the firmware update describe a fix for a problem
that is consistent with our failures:

4.0 Fixes

Addresses potential issue with PERC 5 controllers that may become 
unresponsive on systems with 8GB of memory or more.  This fix corrects 
an issue on systems with 8+ GB of memory PERC 5 controllers may become 
unresponsive. If the affected controller is the boot device this would 
cause an OS crash, hang, or bluescreen. If not the boot controller the 
system would experience timeouts (event 129 and 9 in windows, IO aborts 
in Linux). Once the controller is in this state it will not return to 
operation until the system has rebooted, any storage connected to the 
controller will not be accessible until the reboot.
This has now been corrected.

B-


