RE: megaraid_sas waiting for command and then offline

public inbox for linux-kernel@vger.kernel.org

* RE: megaraid_sas waiting for command and then offline
@ 2006-09-06 17:14 Patro, Sumant
  2006-09-06 20:44 ` Brett G. Durrett
  0 siblings, 1 reply; 9+ messages in thread
From: Patro, Sumant @ 2006-09-06 17:14 UTC
  To: Brett G. Durrett, Dave Lloyd
  Cc: Bagalkote, Sreenivas, lkml, Berkley Shands, Kolli, Neela,
	Yang, Bo

Hello Brett,

	A DMA related bug was fixed in FW ver *.0095 that was causing
the FW to stop responding. 
	
	Please upgrade the FW version to >= *.0095 and let me know if
you still see the issue.
 
Regards,

Sumant


-----Original Message-----
From: Brett G. Durrett [mailto:brett@imvu.com] 
Sent: Wednesday, September 06, 2006 9:04 AM
To: Dave Lloyd
Cc: Patro, Sumant; Bagalkote, Sreenivas; lkml; Berkley Shands
Subject: Re: megaraid_sas waiting for command and then offline


The machines are Dell 2900s, so the mobo is custom.  From a Dell SE, 
"Dell uses a custom mobo that is Dell branded with the Intel chipset 
Greencreek.".

B-




Dave Lloyd wrote:

> Brett G. Durrett wrote:
> >
> > I have the same or a similar issue running 2.6.17 SMP x86_64 - the
> > megaraid_sas driver hangs waiting for commands and then the filesystem
> > unmounts, leaving the machine in an unusable state until there is a hard
> > reboot (the machine is responsive but any access, shell or otherwise, is
> > impossible without the filesystem).  While I do not have much debugging
> > information available, this happens to me about once every 6-7 days in
> > my pool of seven machines, so I can probably get debugging info.  Since
> > the disk is offline and I can't get remote console, I don't have any
> > details except something similar to Dave Lloyd's post, below.
> >
> > The only thing that the machines with these failures seem to have in
> > common is the fact that they are almost exclusively writes - they are
> > slave database machines with large memory and pretty much just
> > replicate.  The read/write machines seem to have fewer failures.
> >
> > I am happy to help provide debugging information in any reasonable way.
> > In the meantime, if there are any known suggestions or workarounds for
> > the problem, I would be grateful for the guidance.
> >
> > Here are the details on the controller.  If you want additional info,
> > let me know exactly what you need and I will do what I can to get it to
> > you:
> >
> > Product Name    : PERC 5/i Integrated
> > Serial No       : 12345
> > FW Package Build: 5.0.1-0030
> > FW Version      : 1.00.01-0088
> > BIOS Version    : MT23
> > Ctrl-R Version  :1.02-007
> >
> > B-
>
> Which motherboard are you using?  We believe that this may be a
> motherboard specific issue.  It appears to happen on a SuperMicro
> motherboard but not a Tyan motherboard.
>


* Re: megaraid_sas waiting for command and then offline
  2006-09-06 17:14 megaraid_sas waiting for command and then offline Patro, Sumant
@ 2006-09-06 20:44 ` Brett G. Durrett
  0 siblings, 0 replies; 9+ messages in thread
From: Brett G. Durrett @ 2006-09-06 20:44 UTC
  To: Patro, Sumant
  Cc: Dave Lloyd, Bagalkote, Sreenivas, lkml, Berkley Shands,
	Kolli, Neela, Yang, Bo


Sumant,

Not sure if I am missing something - I appear to be running the latest 
FW available:

                Versions
                ================
Product Name    : PERC 5/i Integrated
Serial No       : 12345
FW Package Build: 5.0.1-0030
FW Version      : 1.00.01-0088
BIOS Version    : MT23
Ctrl-R Version  :1.02-007
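
The comparison Sumant asks for can be sketched as follows. This is a hypothetical helper, not part of MegaCli or the driver. It assumes the firmware build is the numeric suffix after the last hyphen of the "FW Version" line, and that "*.0095" means any version whose suffix is at least 95.

```python
import re

FIXED_BUILD = 95  # from Sumant's note: the DMA fix is in FW ver *.0095


def fw_build(adapter_info: str) -> int:
    """Return the numeric build suffix of the 'FW Version' line."""
    match = re.search(r"^FW Version\s*:\s*\S+-(\d+)\s*$", adapter_info, re.MULTILINE)
    if match is None:
        raise ValueError("no 'FW Version' line found")
    return int(match.group(1))


def has_dma_fix(adapter_info: str) -> bool:
    """True if the reported build is at or above the build carrying the fix."""
    return fw_build(adapter_info) >= FIXED_BUILD


# Adapter info as posted in this thread.
SAMPLE = """\
Product Name    : PERC 5/i Integrated
FW Package Build: 5.0.1-0030
FW Version      : 1.00.01-0088
"""

if __name__ == "__main__":
    print(fw_build(SAMPLE), has_dma_fix(SAMPLE))
```

With the values above the build is 88, which is below the 95 that Sumant names.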



Patro, Sumant wrote:

>Hello Brett,
>
>	A DMA related bug was fixed in FW ver *.0095 that was causing
>the FW to stop responding. 
>	
>	Please upgrade the FW version to >= *.0095 and let me know if
>you still see the issue.
> 
>Regards,
>
>Sumant


* megaraid_sas waiting for command and then offline
@ 2006-10-25  8:46 David N. Welton
  2006-10-25 22:48 ` Brett G. Durrett
  0 siblings, 1 reply; 9+ messages in thread
From: David N. Welton @ 2006-10-25  8:46 UTC
  To: bdurrett; +Cc: linux-kernel

Hi,

I found someone corresponding to your name writing about a problem with
the megaraid sas driver/hardware on the LKML:

http://lkml.org/lkml/2006/9/6/12

We have a Dell (2950, running 2.6.18 #1 SMP) as well, and the way I
managed to kill the thing dead in its tracks (symptoms basically what
you describe) is with smartctl:

root@salgari:~# smartctl --all /dev/sda
smartctl version 5.34 [i686-pc-linux-gnu] Copyright (C) 2002-5 Bruce Allen
Home page is http://smartmontools.sourceforge.net/

Device: DELL     PERC 5/i         Version: 1.00
Device type: disk
Local Time is: Wed Oct 25 10:14:40 2006 CEST
Device does not support SMART

Error Counter logging not supported


Device does not support Self Test logging

----

[61101.681857] sd 0:2:0:0: rejecting I/O to offline device
[61101.681944] EXT3-fs error (device sda1): ext3_readdir: directory
#7553069 contains a hole at offset 0
[61103.944794] sd 0:2:0:0: rejecting I/O to offline device
[61103.944879] EXT3-fs error (device sda1): ext3_readdir: directory
#7553069 contains a hole at offset 0
[61104.672212] sd 0:2:0:0: rejecting I/O to offline device
[61104.672295] EXT3-fs error (device sda1): ext3_readdir: directory
#7553069 contains a hole at offset 0
[61105.255981] sd 0:2:0:0: rejecting I/O to offline device
[61105.256066] EXT3-fs error (device sda1): ext3_readdir: directory
#7553069 contains a hole at offset 0

----

Dead in the water.  We suspect that in any case there are some disk
problems, which is why we were trying to use smartctl in the first place.
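
The "rejecting I/O to offline device" messages above mean the SCSI layer has marked the disk offline. A minimal sketch of checking that state from sysfs follows. The function names and the `sysfs_root` parameter are assumptions made for illustration, and on a machine whose root filesystem lives on the offlined disk the check is only useful if it runs from something already in memory or on another device.

```python
from pathlib import Path


def scsi_device_state(disk: str, sysfs_root: str = "/sys") -> str:
    """Read the SCSI device state ('running', 'offline', ...) for a disk such as 'sda'."""
    state_file = Path(sysfs_root) / "block" / disk / "device" / "state"
    return state_file.read_text().strip()


def is_offline(disk: str, sysfs_root: str = "/sys") -> bool:
    """True if the SCSI layer reports the disk as offline."""
    return scsi_device_state(disk, sysfs_root) == "offline"


if __name__ == "__main__":
    # 'sda' matches the device in the log excerpt; adjust for other systems.
    print(scsi_device_state("sda"))
```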

I was just curious if you managed to figure anything out...

Thanks,
Dave Welton
-- 
Webster srl
Registered office:
Via del Seminario, 3 35122 Padova
Operational office:
Via S. Breda, 28 35010 Limena (PD)

Tel. +39 049 8842188
Email: d.welton@webster.it

Visit www.webster.it


* Re: megaraid_sas waiting for command and then offline
  2006-10-25  8:46 David N. Welton
@ 2006-10-25 22:48 ` Brett G. Durrett
  2006-10-25 23:03   ` Alan Cox
  2006-11-13 21:40   ` Brett G. Durrett
  0 siblings, 2 replies; 9+ messages in thread
From: Brett G. Durrett @ 2006-10-25 22:48 UTC
  To: David N. Welton; +Cc: linux-kernel

David,

We switched to 2.6.18 (SMP) and applied the latest patches from LSI (got 
them directly from Sumant Patro).  Also, he told me to make sure "read 
ahead" was set to "off".  This seems to have reduced the frequency of 
the failures to about once per week (across 10+ machines), down from 
several times per week.

After I reported an additional failure, Sumant said they were able to 
reproduce the problems with XFS but they have not seen it with EXT3.  I 
prefer XFS but I prefer to have reliable databases even more...

I now have a couple of systems running in the new configuration and I am 
slowly migrating others to it as well.  I have not seen a failure with 
EXT3, but statistically one would have been unlikely so far... I won't 
declare victory until I have more systems converted with a few weeks of 
reliable use.

Hope this helps... if anybody solves the root cause I will happily offer 
them a small gift to show my gratitude.

B-



* Re: megaraid_sas waiting for command and then offline
  2006-10-25 22:48 ` Brett G. Durrett
@ 2006-10-25 23:03   ` Alan Cox
  2006-11-13 21:40   ` Brett G. Durrett
  1 sibling, 0 replies; 9+ messages in thread
From: Alan Cox @ 2006-10-25 23:03 UTC
  To: Brett G. Durrett; +Cc: David N. Welton, linux-kernel

On Wed, 2006-10-25 at 15:48 -0700, Brett G. Durrett wrote:
> After I reported an additional failure, Sumant said they were able to 
> reproduce the problems with XFS but they have not seen it with EXT3. 

I've seen precisely that pattern with a couple of IDE controllers. In
both cases they had problems with very large I/O requests. XFS was
generating extremely long linear reads and writes while ext3 tended to
generate nice I/O patterns but never really huge ones.

(The IDE drivers in question have since been fixed except for IT821x
where some firmware versions in raid mode still barf)

Alan



* Re: megaraid_sas waiting for command and then offline
  2006-10-25 22:48 ` Brett G. Durrett
  2006-10-25 23:03   ` Alan Cox
@ 2006-11-13 21:40   ` Brett G. Durrett
  1 sibling, 0 replies; 9+ messages in thread
From: Brett G. Durrett @ 2006-11-13 21:40 UTC
  To: linux-kernel; +Cc: Brett G. Durrett, David N. Welton


Bad news - I just reproduced the failure using EXT3 on a system that had 
a complete install 4 days ago, so it looks like the megaraid_sas driver 
fails with both XFS and EXT3 (although EXT3 seems more reliable).

I was running EXT3 with no read ahead:
# ./MegaCli -LDGetProp -Cache -L0 -A0
Adapter 0-VD 0: Cache Policy:WriteBack, ReadAheadNone, Direct
# mount
/dev/sda1 on / type ext3 (rw,errors=remount-ro)
# uname -a
Linux AF001158 2.6.18-imvuamd64smpmsastest #1 SMP Mon Oct 9 21:26:46 PDT 2006 x86_64 GNU/Linux
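
For anyone checking the same setting across many machines, the MegaCli output line above can be parsed as sketched below. The helper names are hypothetical, and the only output format assumed is the single "Cache Policy" line shown in this message; no other flag names are assumed.

```python
import re

POLICY_RE = re.compile(r"Adapter (\d+)-VD (\d+): Cache Policy:(.*)$")


def parse_cache_policy(line: str) -> dict:
    """Split a MegaCli 'Cache Policy' line into adapter, virtual disk and flags."""
    match = POLICY_RE.match(line.strip())
    if match is None:
        raise ValueError(f"unrecognised cache policy line: {line!r}")
    flags = [flag.strip() for flag in match.group(3).split(",") if flag.strip()]
    return {"adapter": int(match.group(1)), "vd": int(match.group(2)), "flags": flags}


def read_ahead_disabled(line: str) -> bool:
    """True if the line reports ReadAheadNone, as in the output above."""
    return "ReadAheadNone" in parse_cache_policy(line)["flags"]


# Output line as posted in this message.
SAMPLE_LINE = "Adapter 0-VD 0: Cache Policy:WriteBack, ReadAheadNone, Direct"

if __name__ == "__main__":
    print(parse_cache_policy(SAMPLE_LINE), read_ahead_disabled(SAMPLE_LINE))
```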

Here are the megaraid entries from syslog (newest first):

FACILITY 	DATE TIME 	MESSAGE
kern-warning 	2006-11-13 12:56:25 	kernel: megasas[0]: 64 bit SGLs were sent to FW
kern-err 	2006-11-13 12:56:25 	kernel: megasas[0]: Pending OS cmds in FW :
kern-err 	2006-11-13 12:56:25 	kernel: megasas[0]: Frame addr :0x15351800 : <3>megasas[0]: frame count : 0x8, Cmd : 0x2, Tgt id : 0x0, lba lo : 0xe238b77, lba_hi : 0x0, sense_buf addr : 0x1534d900,sge count : 0x47
kern-err 	2006-11-13 12:56:25 	kernel: megasas[0]: Frame addr :0x1535c800 : <3>megasas[0]: frame count : 0x8, Cmd : 0x2, Tgt id : 0x0, lba lo : 0xe23991f, lba_hi : 0x0, sense_buf addr : 0x15356d00,sge count : 0x50
kern-err 	2006-11-13 12:56:25 	kernel: megasas[0]: Frame addr :0x15375000 : <3>megasas[0]: frame count : 0x6, Cmd : 0x2, Tgt id : 0x0, lba lo : 0xe23aaaf, lba_hi : 0x0, sense_buf addr : 0x15371800,sge count : 0x1a
kern-err 	2006-11-13 12:56:25 	kernel: megasas[0]: Frame addr :0x15377c00 : <3>megasas[0]: frame count : 0x1, Cmd : 0x2, Tgt id : 0x0, lba lo : 0xae0005f, lba_hi : 0x0, sense_buf addr : 0x15371d80,sge count : 0x2
kern-err 	2006-11-13 12:56:25 	kernel: megasas[0]: Frame addr :0x1537b400 : <3>megasas[0]: frame count : 0x1, Cmd : 0x2, Tgt id : 0x0, lba lo : 0xe208367, lba_hi : 0x0, sense_buf addr : 0x1537a280,sge count : 0x1
kern-err 	2006-11-13 12:56:25 	kernel: megasas[0]: Frame addr :0x1537d400 : <3>megasas[0]: frame count : 0x1, Cmd : 0x2, Tgt id : 0x0, lba lo : 0xe239697, lba_hi : 0x0, sense_buf addr : 0x1537a680,sge count : 0x1
kern-err 	2006-11-13 12:56:25 	kernel: megasas[0]: Frame addr :0xcff00000 : <3>megasas[0]: frame count : 0x8, Cmd : 0x2, Tgt id : 0x0, lba lo : 0xe238f17, lba_hi : 0x0, sense_buf addr : 0x1537ac00,sge count : 0x45
kern-err 	2006-11-13 12:56:25 	kernel: megasas[0]: Frame addr :0xcff01400 : <3>megasas[0]: frame count : 0x7, Cmd : 0x2, Tgt id : 0x0, lba lo : 0xe238df7, lba_hi : 0x0, sense_buf addr : 0x1537ae80,sge count : 0x22
kern-err 	2006-11-13 12:56:25 	kernel: megasas[0]: Frame addr :0xcff06400 : <3>megasas[0]: frame count : 0x1, Cmd : 0x2, Tgt id : 0x0, lba lo : 0xa68d66f, lba_hi : 0x0, sense_buf addr : 0xcff03680,sge count : 0x1
kern-err 	2006-11-13 12:56:25 	kernel: megasas[0]: Frame addr :0xcff18400 : <3>megasas[0]: frame count : 0x8, Cmd : 0x2, Tgt id : 0x0, lba lo : 0xe239e27, lba_hi : 0x0, sense_buf addr : 0xcff15680,sge count : 0x50
kern-err 	2006-11-13 12:56:25 	kernel: megasas[0]: Frame addr :0xcff1f000 : <3>megasas[0]: frame count : 0x8, Cmd : 0x2, Tgt id : 0x0, lba lo : 0xe239b9f, lba_hi : 0x0, sense_buf addr : 0xcff1e200,sge count : 0x50
kern-err 	2006-11-13 12:56:25 	kernel: megasas[0]: Frame addr :0xcff20000 : <3>megasas[0]: frame count : 0x4, Cmd : 0x2, Tgt id : 0x0, lba lo : 0xe23c41f, lba_hi : 0x0, sense_buf addr : 0xcff1e400,sge count : 0xf
kern-err 	2006-11-13 12:56:25 	kernel: megasas[0]: Frame addr :0xcff2b000 : <3>megasas[0]: frame count : 0x3, Cmd : 0x2, Tgt id : 0x0, lba lo : 0xe23a377, lba_hi : 0x0, sense_buf addr : 0xcff27800,sge count : 0xa
kern-err 	2006-11-13 12:56:25 	kernel: megasas[0]: Frame addr :0xcff35c00 : <3>megasas[0]: frame count : 0x1, Cmd : 0x2, Tgt id : 0x0, lba lo : 0xa601697, lba_hi : 0x0, sense_buf addr : 0xcff30b80,sge count : 0x1
kern-err 	2006-11-13 12:56:25 	kernel: megasas[0]: Frame addr :0xcff44400 : <3>megasas[0]: frame count : 0x1, Cmd : 0x2, Tgt id : 0x0, lba lo : 0xe238b6f, lba_hi : 0x0, sense_buf addr : 0xcff42480,sge count : 0x1
kern-err 	2006-11-13 12:56:25 	kernel: megasas[0]: Frame addr :0xcff4cc00 : <3>megasas[0]: frame count : 0x1, Cmd : 0x2, Tgt id : 0x0, lba lo : 0xe20a287, lba_hi : 0x0, sense_buf addr : 0xcff4b380,sge count : 0x1
kern-err 	2006-11-13 12:56:25 	kernel: megasas[0]: Frame addr :0xcff4f800 : <3>megasas[0]: frame count : 0x8, Cmd : 0x2, Tgt id : 0x0, lba lo : 0xe23a0f7, lba_hi : 0x0, sense_buf addr : 0xcff4b900,sge count : 0x38
kern-err 	2006-11-13 12:56:25 	kernel: megasas[0]: Frame addr :0xcff52400 : <3>megasas[0]: frame count : 0x1, Cmd : 0x2, Tgt id : 0x0, lba lo : 0x5f4009f, lba_hi : 0x0, sense_buf addr : 0xcff4be80,sge count : 0x1
kern-err 	2006-11-13 12:56:25 	kernel: megasas[0]: Frame addr :0xcff5fc00 : <3>megasas[0]: frame count : 0x1, Cmd : 0x2, Tgt id : 0x0, lba lo : 0xe238f0f, lba_hi : 0x0, sense_buf addr : 0xcff5d580,sge count : 0x1
kern-err 	2006-11-13 12:56:25 	kernel: megasas[0]: Frame addr :0xcff60000 : <3>megasas[0]: frame count : 0x1, Cmd : 0x2, Tgt id : 0x0, lba lo : 0xa6000df, lba_hi : 0x0, sense_buf addr : 0xcff5d600,sge count : 0x1
kern-err 	2006-11-13 12:56:25 	kernel: megasas[0]: Frame addr :0xcff6bc00 : <3>megasas[0]: frame count : 0x1, Cmd : 0x2, Tgt id : 0x0, lba lo : 0xe239e1f, lba_hi : 0x0, sense_buf addr : 0xcff66b80,sge count : 0x1
kern-err 	2006-11-13 12:56:25 	kernel: megasas[0]: Frame addr :0xcff75800 : <3>megasas[0]: frame count : 0x8, Cmd : 0x2, Tgt id : 0x0, lba lo : 0xe239197, lba_hi : 0x0, sense_buf addr : 0xcff6fd00,sge count : 0x50
kern-err 	2006-11-13 12:56:25 	kernel: megasas[0]: Frame addr :0xcff76400 : <3>megasas[0]: frame count : 0x3, Cmd : 0x2, Tgt id : 0x0, lba lo : 0xe23a0a7, lba_hi : 0x0, sense_buf addr : 0xcff6fe80,sge count : 0xa
kern-err 	2006-11-13 12:56:25 	kernel: megasas[0]: Frame addr :0xcff7b400 : <3>megasas[0]: frame count : 0x8, Cmd : 0x2, Tgt id : 0x0, lba lo : 0xe23969f, lba_hi : 0x0, sense_buf addr : 0xcff78680,sge count : 0x50
kern-err 	2006-11-13 12:56:25 	kernel: megasas[0]: Frame addr :0xcff7e400 : <3>megasas[0]: frame count : 0x1, Cmd : 0x2, Tgt id : 0x0, lba lo : 0xe23aaa7, lba_hi : 0x0, sense_buf addr : 0xcff78c80,sge count : 0x1
kern-err 	2006-11-13 12:56:25 	kernel: megasas[0]: Frame addr :0x15391400 : <3>megasas[0]: frame count : 0x2, Cmd : 0x2, Tgt id : 0x0, lba lo : 0xd0c004f, lba_hi : 0x0, sense_buf addr : 0x1538ae80,sge count : 0x3
kern-err 	2006-11-13 12:56:25 	kernel: megasas[0]: Frame addr :0x153a3000 : <3>megasas[0]: frame count : 0x1, Cmd : 0x2, Tgt id : 0x0, lba lo : 0x5f40217, lba_hi : 0x0, sense_buf addr : 0x1539ce00,sge count : 0x1
kern-err 	2006-11-13 12:56:25 	kernel: megasas[0]: Frame addr :0x153adc00 : <3>megasas[0]: frame count : 0x1, Cmd : 0x2, Tgt id : 0x0, lba lo : 0xe2343e7, lba_hi : 0x0, sense_buf addr : 0x153ae180,sge count : 0x1
kern-err 	2006-11-13 12:56:25 	kernel: megasas[0]: Frame addr :0x153bdc00 : <3>megasas[0]: frame count : 0x1, Cmd : 0x2, Tgt id : 0x0, lba lo : 0xa601657, lba_hi : 0x0, sense_buf addr : 0x153b7d80,sge count : 0x1
kern-err 	2006-11-13 12:56:25 	kernel: megasas[0]: Frame addr :0x153c3000 : <3>megasas[0]: frame count : 0x1, Cmd : 0x2, Tgt id : 0x0, lba lo : 0xae00057, lba_hi : 0x0, sense_buf addr : 0x153c0600,sge count : 0x1
kern-err 	2006-11-13 12:56:25 	kernel: megasas[0]: Frame addr :0x153c4000 : <3>megasas[0]: frame count : 0x1, Cmd : 0x2, Tgt id : 0x0, lba lo : 0xe2324af, lba_hi : 0x0, sense_buf addr : 0x153c0800,sge count : 0x1
kern-err 	2006-11-13 12:56:25 	kernel: megasas[0]: Frame addr :0x153c7400 : <3>megasas[0]: frame count : 0x8, Cmd : 0x2, Tgt id : 0x0, lba lo : 0xe239417, lba_hi : 0x0, sense_buf addr : 0x153c0e80,sge count : 0x50
kern-warning 	2006-11-13 12:56:25 	kernel: megasas[0]: Pending Internal cmds in FW :
kern-err 	2006-11-13 12:56:25 	kernel: megasas[0]: Dumping Done.
kern-err 	2006-11-13 12:56:25 	kernel: megasas: failed to do reset
kern-notice 	2006-11-13 12:56:25 	kernel: sd 0:2:0:0: megasas: RESET -20487153 cmd=2a
kern-err 	2006-11-13 12:56:25 	kernel: megasas: cannot recover from previous reset failures
kern-notice 	2006-11-13 12:56:25 	kernel: sd 0:2:0:0: megasas: RESET -20487153 cmd=2a
kern-err 	2006-11-13 12:56:25 	kernel: megasas: cannot recover from previous reset failures
kern-notice 	2006-11-13 12:56:24 	kernel: megasas: [100]waiting for 32 commands to complete
kern-notice 	2006-11-13 12:56:24 	kernel: megasas: [105]waiting for 32 commands to complete
kern-notice 	2006-11-13 12:56:24 	kernel: megasas: [110]waiting for 32 commands to complete
kern-notice 	2006-11-13 12:56:24 	kernel: megasas: [115]waiting for 32 commands to complete
kern-notice 	2006-11-13 12:56:24 	kernel: megasas: [120]waiting for 32 commands to complete
kern-notice 	2006-11-13 12:56:24 	kernel: megasas: [125]waiting for 32 commands to complete
kern-notice 	2006-11-13 12:56:24 	kernel: megasas: [130]waiting for 32 commands to complete
kern-notice 	2006-11-13 12:56:24 	kernel: megasas: [135]waiting for 32 commands to complete
kern-notice 	2006-11-13 12:56:24 	kernel: megasas: [140]waiting for 32 commands to complete
kern-notice 	2006-11-13 12:56:24 	kernel: megasas: [145]waiting for 32 commands to complete
kern-notice 	2006-11-13 12:56:24 	kernel: megasas: [150]waiting for 32 commands to complete
kern-notice 	2006-11-13 12:56:24 	kernel: megasas: [155]waiting for 32 commands to complete
kern-notice 	2006-11-13 12:56:24 	kernel: megasas: [160]waiting for 32 commands to complete
kern-notice 	2006-11-13 12:56:24 	kernel: megasas: [165]waiting for 32 commands to complete
kern-notice 	2006-11-13 12:56:24 	kernel: megasas: [170]waiting for 32 commands to complete
kern-notice 	2006-11-13 12:56:24 	kernel: megasas: [175]waiting for 32 commands to complete
kern-warning 	2006-11-13 12:56:24 	kernel: megasas[0]: Dumping Frame Phys Address of all pending cmds in FW
kern-err 	2006-11-13 12:56:24 	kernel: megasas[0]: Total OS Pending cmds : 32
kern-notice 	2006-11-13 12:54:59 	kernel: megasas: [95]waiting for 32 commands to complete
kern-notice 	2006-11-13 12:54:54 	kernel: megasas: [90]waiting for 32 commands to complete
kern-notice 	2006-11-13 12:54:49 	kernel: megasas: [85]waiting for 32 commands to complete
kern-notice 	2006-11-13 12:54:44 	kernel: megasas: [80]waiting for 32 commands to complete
kern-notice 	2006-11-13 12:54:39 	kernel: megasas: [75]waiting for 32 commands to complete
kern-notice 	2006-11-13 12:54:34 	kernel: megasas: [70]waiting for 32 commands to complete
kern-notice 	2006-11-13 12:54:29 	kernel: megasas: [65]waiting for 32 commands to complete
kern-notice 	2006-11-13 12:54:24 	kernel: megasas: [60]waiting for 32 commands to complete
kern-notice 	2006-11-13 12:54:19 	kernel: megasas: [55]waiting for 32 commands to complete
kern-notice 	2006-11-13 12:54:14 	kernel: megasas: [50]waiting for 32 commands to complete
kern-notice 	2006-11-13 12:54:09 	kernel: megasas: [45]waiting for 32 commands to complete
kern-notice 	2006-11-13 12:54:04 	kernel: megasas: [40]waiting for 32 commands to complete
kern-notice 	2006-11-13 12:53:59 	kernel: megasas: [35]waiting for 32 commands to complete
kern-notice 	2006-11-13 12:53:54 	kernel: megasas: [30]waiting for 32 commands to complete
kern-notice 	2006-11-13 12:53:49 	kernel: megasas: [25]waiting for 32 commands to complete
kern-notice 	2006-11-13 12:53:44 	kernel: megasas: [20]waiting for 32 commands to complete
kern-notice 	2006-11-13 12:53:39 	kernel: megasas: [15]waiting for 32 commands to complete
kern-notice 	2006-11-13 12:53:34 	kernel: megasas: [10]waiting for 32 commands to complete
kern-notice 	2006-11-13 12:53:29 	kernel: megasas: [ 5]waiting for 32 commands to complete
kern-notice 	2006-11-13 12:53:24 	kernel: sd 0:2:0:0: megasas: RESET -20487153 cmd=2a
kern-notice 	2006-11-13 12:53:24 	kernel: megasas: [ 0]waiting for 32 commands to complete
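
A dump like the one above is easier to compare across failures once the pending-frame entries are reduced to numbers. The sketch below is a hypothetical parser, not a tool from LSI or the driver. It assumes only the field names that appear in these log lines, and it tolerates the line wrapping that mail clients add.

```python
import re
from collections import Counter

# Field names exactly as they appear in the megasas frame dump, in order.
FIELDS = ["frame count", "Cmd", "Tgt id", "lba lo", "lba_hi", "sense_buf addr", "sge count"]

# One hex value per field; \s also matches newlines, so wrapped entries still parse.
FRAME_RE = re.compile(
    r",\s*".join(name.replace(" ", r"\s+") + r"\s*:\s*(0x[0-9a-fA-F]+)" for name in FIELDS)
)


def parse_frames(log_text: str) -> list:
    """Return one dict per pending frame, with every field converted to int."""
    keys = [name.replace(" ", "_") for name in FIELDS]
    return [
        dict(zip(keys, (int(value, 16) for value in match.groups())))
        for match in FRAME_RE.finditer(log_text)
    ]


def summarize(frames: list) -> dict:
    """Count the frames and report the spread of scatter-gather list sizes."""
    sge = [frame["sge_count"] for frame in frames]
    return {
        "pending": len(frames),
        "max_sge": max(sge) if sge else 0,
        "frames_by_frame_count": Counter(frame["frame_count"] for frame in frames),
    }


# Two entries copied from the dump above, wrapped the way the mail client wrapped them.
SAMPLE = """\
megasas[0]: Frame addr :0x15351800 : <3>megasas[0]: frame count : 0x8, Cmd : 0x2, Tgt id : 0x0,
lba lo : 0xe238b77, lba_hi : 0x0, sense_buf addr : 0x1534d900,sge count
: 0x47
megasas[0]: Frame addr :0x1537b400 : <3>megasas[0]: frame count : 0x1, Cmd : 0x2, Tgt id : 0x0,
lba lo : 0xe208367, lba_hi : 0x0, sense_buf addr : 0x1537a280,sge count
: 0x1
"""

if __name__ == "__main__":
    print(summarize(parse_frames(SAMPLE)))
```

Run against the full dump, the `pending` value should match the "Total OS Pending cmds" line that the driver prints.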





Brett G. Durrett wrote:

>
> David,
>
> We switched to 2.6.18 (SMP) and applied the latest patches from LSI 
> (got them directly from Sumant Patro).  Also, he told me to make sure 
> "read ahead" was set to "off".  This seems to have reduced the 
> frequency of the failures to about once per week (across 10+ 
> machines), down from several times per week.
>
> After I reported an additional failure, Sumant said they were able to 
> reproduce the problems with XFS but they have not seen it with EXT3.  
> I prefer XFS but I prefer to have reliable databases even more...
>
> I now have a couple of systems running in the new configuration and I 
> am slowly migrating others to it as well.  I have not seen a failure 
> with EXT3 but I statistically it would have been unlikely... I won't 
> declare victory until I have more systems converted with a few weeks 
> of reliable use.
>
> Hope this helps... if anybody solves the root cause I will happily 
> offer them a small gift to show my gratitude.
>
> B-
>
>
>
> David N. Welton wrote:
>
>> Hi,
>>
>> I found someone corresponding to your name writing about a problem with
>> the megaraid sas driver/hardware on the LKML:
>>
>> http://lkml.org/lkml/2006/9/6/12
>>
>> We have a Dell (2950, running 2.6.18 #1 SMP) as well, and the way I
>> managed to kill the thing dead in its tracks (symptoms basically what
>> you you describe) is with smartctl:
>>
>> root@salgari:~# smartctl --all /dev/sda
>> smartctl version 5.34 [i686-pc-linux-gnu] Copyright (C) 2002-5 Bruce 
>> Allen
>> Home page is http://smartmontools.sourceforge.net/
>>
>> Device: DELL     PERC 5/i         Version: 1.00
>> Device type: disk
>> Local Time is: Wed Oct 25 10:14:40 2006 CEST
>> Device does not support SMART
>>
>> Error Counter logging not supported
>>
>>
>> Device does not support Self Test logging
>>
>> ----
>>
>> [61101.681857] sd 0:2:0:0: rejecting I/O to offline device
>> [61101.681944] EXT3-fs error (device sda1): ext3_readdir: directory
>> #7553069 contains a hole at offset 0
>> [61103.944794] sd 0:2:0:0: rejecting I/O to offline device
>> [61103.944879] EXT3-fs error (device sda1): ext3_readdir: directory
>> #7553069 contains a hole at offset 0
>> [61104.672212] sd 0:2:0:0: rejecting I/O to offline device
>> [61104.672295] EXT3-fs error (device sda1): ext3_readdir: directory
>> #7553069 contains a hole at offset 0
>> [61105.255981] sd 0:2:0:0: rejecting I/O to offline device
>> [61105.256066] EXT3-fs error (device sda1): ext3_readdir: directory
>> #7553069 contains a hole at offset 0
>>
>> ----
>>
>> Dead in the water.  We suspect that in any case there are some disk
>> problems, which is why we were trying to use smartctl in the first 
>> place.
>>
>> I was just curious if you managed to figure anything out...
>>
>> Thanks,
>> Dave Welton
>>  
>>
> -
> To unsubscribe from this list: send the line "unsubscribe 
> linux-kernel" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> Please read the FAQ at  http://www.tux.org/lkml/



* megaraid_sas waiting for command and then offline
@ 2006-09-06  4:49 Brett G. Durrett
  2006-09-06 14:11 ` Dave Lloyd
  0 siblings, 1 reply; 9+ messages in thread
From: Brett G. Durrett @ 2006-09-06  4:49 UTC
  To: Sumant.Patro, Sreenivas.Bagalkote; +Cc: lkml, dlloyd

I have the same or a similar issue running 2.6.17 SMP x86_64 - the 
megaraid_sas driver hangs waiting for commands and then the filesystem 
unmounts, leaving the machine in an unusable state until there is a hard 
reboot (the machine is responsive but any access, shell or otherwise, is 
impossible without the filesystem).  While I do not have much debugging 
information available, this happens to me about once every 6-7 days in 
my pool of seven machines, so I can probably get debugging info.  Since 
the disk is offline and I can't get remote console, I don't have any 
details except something similar to Dave Lloyd's post, below.

The only thing that the machines with these failures seem to have in 
common is the fact that they are almost exclusively writes - they are 
slave database machines with large memory and pretty much just 
replicate.  The read/write machines seem to have fewer failures.

I am happy to help provide debugging information in any reasonable way.  
In the meantime, if there are any known suggestions or workarounds for 
the problem, I would be grateful for the guidance.

Here are the details on the controller.  If you want additional info, 
let me know exactly what you need and I will do what I can to get it to 
you:

Product Name    : PERC 5/i Integrated
Serial No       : 12345
FW Package Build: 5.0.1-0030
FW Version      : 1.00.01-0088
BIOS Version    : MT23
Ctrl-R Version  :1.02-007

B-

Subject: RE: MegaRaid 8408E goes out to lunch with nr_requests > 8
Date: Thu, 13 Jul 2006 09:25:09 -0600
From: "Patro, Sumant"

Hello Dave,

	I tried to duplicate the issue with 2.6.18rc1 but did not see
the issue. From the message it looks like the Firmware has stopped
processing cmds. Could you please let us know the Firmware version of
the controller ? 

Thanks,

Sumant

-----Original Message----- 
From: linux-kernel-owner@vger.kernel.org
[mailto:linux-kernel-owner@vger.kernel.org] On Behalf Of Dave Lloyd
Sent: Wednesday, July 12, 2006 7:47 AM
To: linux-kernel@vger.kernel.org; Berkley Shands
Subject: MegaRaid 8408E goes out to lunch with nr_requests > 8

This happens both on 2.6.17 and 2.6.18rc1 using the megaraid, mptsas and
mptscsih drivers supplied with the kernel.

While writing data to raid0 devs on an LSI MegaRaid 8408E controller, the
devices will hang after somewhere between 4-7 GB of data written.  If I
dial the nr_requests back from the default down to 8, the hang will not
occur.  The hang does occur at 16.  I haven't tested values between the
two, but I'm not too optimistic.  From what I can see, it looks like 8
should be a magic number to make the queue look congested more often
than not.
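
The nr_requests workaround described above is applied per block device through sysfs. A minimal sketch follows, assuming the usual /sys/block/<disk>/queue/nr_requests attribute. The function names and the `sysfs_root` parameter are illustrative, writing the attribute needs root, and the kernel may adjust the value it is given, which is why the sketch reads it back.

```python
from pathlib import Path


def nr_requests_path(disk: str, sysfs_root: str = "/sys") -> Path:
    """Path of the request-queue depth attribute for a disk such as 'sda'."""
    return Path(sysfs_root) / "block" / disk / "queue" / "nr_requests"


def get_nr_requests(disk: str, sysfs_root: str = "/sys") -> int:
    """Read the current request-queue depth."""
    return int(nr_requests_path(disk, sysfs_root).read_text())


def set_nr_requests(disk: str, value: int, sysfs_root: str = "/sys") -> int:
    """Write a new queue depth and return the value that is in effect afterwards."""
    if value < 1:
        raise ValueError("nr_requests must be a positive integer")
    nr_requests_path(disk, sysfs_root).write_text(f"{value}\n")
    return get_nr_requests(disk, sysfs_root)


if __name__ == "__main__":
    # 8 is the value reported above as avoiding the hang; 'sda' is a placeholder.
    print(set_nr_requests("sda", 8))
```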

Here are the messages I get when the devices go out to lunch:
Jul 11 14:13:34 systemname kernel: sd 4:2:0:0: megasas: RESET -40213
cmd=2a
Jul 11 14:13:34 systemname kernel: megasas: [ 0]waiting for 256 commands
to complete
Jul 11 14:13:39 systemname kernel: megasas: [ 5]waiting for 256 commands
to complete
Jul 11 14:13:44 systemname kernel: megasas: [10]waiting for 256 commands
to complete
Jul 11 14:13:49 systemname kernel: megasas: [15]waiting for 256 commands
to complete

[...]

Jul 11 14:16:35 systemname kernel: megasas: [175]waiting for 256 commands to complete
Jul 11 14:16:35 systemname kernel: megasas: failed to do reset
Jul 11 14:16:35 systemname kernel: sd 4:2:1:0: megasas: RESET -40216 cmd=2a
Jul 11 14:16:35 systemname kernel: megasas: cannot recover from previous reset failures
Jul 11 14:16:35 systemname kernel: sd 4:2:0:0: megasas: RESET -40213 cmd=2a
Jul 11 14:16:35 systemname kernel: megasas: cannot recover from previous reset failures
Jul 11 14:16:35 systemname kernel: sd 4:2:0:0: megasas: RESET -40213 cmd=2a
Jul 11 14:16:35 systemname kernel: megasas: cannot recover from previous reset failures
Jul 11 14:16:35 systemname kernel: sd 4:2:0:0: scsi: Device offlined - not ready after error recovery
Jul 11 14:16:36 systemname last message repeated 13 times
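
When gathering debugging info from several machines, a small script can pull the "waiting for N commands to complete" samples out of syslog and flag the case seen above, where the outstanding count never drops. This is only a sketch: it assumes each log entry is on a single line, and it treats the bracketed number as elapsed seconds, which is an inference from the timestamps rather than something stated in the thread. The function names are made up for illustration.

```python
import re

# Matches e.g. "megasas: [ 5]waiting for 256 commands to complete"
WAIT_RE = re.compile(
    r"megasas: \[\s*(\d+)\]waiting for (\d+) commands to complete"
)


def parse_wait_samples(lines):
    """Return (bracketed_counter, outstanding_commands) for each match."""
    samples = []
    for line in lines:
        match = WAIT_RE.search(line)
        if match:
            samples.append((int(match.group(1)), int(match.group(2))))
    return samples


def looks_stalled(samples):
    """True if the outstanding count at the last sample is no lower than
    at the first sample (needs at least two samples to say anything)."""
    return len(samples) >= 2 and samples[-1][1] >= samples[0][1]


if __name__ == "__main__":
    log = [
        "Jul 11 14:13:34 systemname kernel: megasas: [ 0]waiting for 256 commands to complete",
        "Jul 11 14:13:39 systemname kernel: megasas: [ 5]waiting for 256 commands to complete",
        "Jul 11 14:16:35 systemname kernel: megasas: [175]waiting for 256 commands to complete",
        "Jul 11 14:16:35 systemname kernel: megasas: failed to do reset",
    ]
    samples = parse_wait_samples(log)
    print(samples, looks_stalled(samples))
```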

Interestingly, the machine will hang on shutdown and requires a hard
reset to reboot.  Bummer!

My next step is to try and reproduce and dig into this some in KDB.

Has anyone else seen this and/or does anyone have some suggestions for
further debugging info?

-- 
Dave Lloyd
Test Engineer, Exegy, Inc.
314.450.5342
dlloyd@exegy.com


* Re: megaraid_sas waiting for command and then offline
  2006-09-06  4:49 Brett G. Durrett
@ 2006-09-06 14:11 ` Dave Lloyd
  2006-09-06 16:04   ` Brett G. Durrett
  0 siblings, 1 reply; 9+ messages in thread
From: Dave Lloyd @ 2006-09-06 14:11 UTC (permalink / raw)
  To: Brett G. Durrett; +Cc: Sumant.Patro, Sreenivas.Bagalkote, lkml, Berkley Shands

Brett G. Durrett wrote:
 >
 > I have the same or a similar issue running 2.6.17 SMP x86_64 - the
 > megaraid_sas driver hangs waiting for commands and then the filesystem
 > unmounts, leaving the machine in an unusable state until there is a hard
 > reboot (the machine is responsive but any access, shell or otherwise, is
 > impossible without the filesystem).  While I do not have much debugging
 > information available, this happens to me about once every 6-7 days in
 > my pool of seven machines, so I can probably get debugging info.  Since
 > the disk is offline and I can't get remote console, I don't have any
 > details except something similar to Dave Lloyd's post, below.
 >
 > The only thing that the machines with these failures seem to have in
 > common is the fact that they are almost exclusively writes - they are
 > slave database machines with large memory and pretty much just
 > replicate.  The read/write machines seem to have fewer failures.
 >
 > I am happy to help provide debugging information in any reasonable way.
 > In the meantime, if there are any known suggestions or workarounds for
 > the problem, I would be grateful for the guidance.
 >
 > Here are the details on the controller.  If you want additional info,
 > let me know exactly what you need and I will do what I can to get it to
 > you:
 >
 > Product Name    : PERC 5/i Integrated
 > Serial No       : 12345
 > FW Package Build: 5.0.1-0030
 > FW Version      : 1.00.01-0088
 > BIOS Version    : MT23
 > Ctrl-R Version  :1.02-007
 >
 > B-
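
Elsewhere in this thread, LSI reports that a DMA-related fix landed in firmware *.0095 and asks for an upgrade to >= *.0095. A rough sketch of checking a reported "FW Version" string against that threshold is below. It assumes the number being compared is the trailing numeric field of the version string (the 0088 in 1.00.01-0088), which is a guess from the notation and not confirmed by the vendor; the function names are hypothetical.

```python
import re


def firmware_build(version: str) -> int:
    """Extract the trailing numeric field, e.g. '1.00.01-0088' -> 88."""
    match = re.search(r"(\d+)\s*$", version)
    if match is None:
        raise ValueError(f"no trailing build number in {version!r}")
    return int(match.group(1))


def needs_upgrade(version: str, minimum_build: int = 95) -> bool:
    """True if the trailing build number is below the minimum."""
    return firmware_build(version) < minimum_build


if __name__ == "__main__":
    # Version string taken from the controller details quoted above.
    print(needs_upgrade("1.00.01-0088"))
```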

Which motherboard are you using?  We believe that this may be a
motherboard-specific issue.  It appears to happen on a SuperMicro
motherboard but not on a Tyan motherboard.

-- 
Dave Lloyd
Test Engineer, Exegy, Inc.
314.450.5342
dlloyd@exegy.com


* Re: megaraid_sas waiting for command and then offline
  2006-09-06 14:11 ` Dave Lloyd
@ 2006-09-06 16:04   ` Brett G. Durrett
  0 siblings, 0 replies; 9+ messages in thread
From: Brett G. Durrett @ 2006-09-06 16:04 UTC (permalink / raw)
  To: Dave Lloyd; +Cc: Sumant.Patro, Sreenivas.Bagalkote, lkml, Berkley Shands


The machines are Dell 2900s, so the mobo is custom.  From a Dell SE:
"Dell uses a custom mobo that is Dell branded with the Intel chipset
Greencreek."

B-




Dave Lloyd wrote:

> Brett G. Durrett wrote:
> >
> > I have the same or a similar issue running 2.6.17 SMP x86_64 - the
> > megaraid_sas driver hangs waiting for commands and then the filesystem
> > unmounts, leaving the machine in an unusable state until there is a hard
> > reboot (the machine is responsive but any access, shell or otherwise, is
> > impossible without the filesystem).  While I do not have much debugging
> > information available, this happens to me about once every 6-7 days in
> > my pool of seven machines, so I can probably get debugging info.  Since
> > the disk is offline and I can't get remote console, I don't have any
> > details except something similar to Dave Lloyd's post, below.
> >
> > The only thing that the machines with these failures seem to have in
> > common is the fact that they are almost exclusively writes - they are
> > slave database machines with large memory and pretty much just
> > replicate.  The read/write machines seem to have fewer failures.
> >
> > I am happy to help provide debugging information in any reasonable way.
> > In the meantime, if there are any known suggestions or workarounds for
> > the problem, I would be grateful for the guidance.
> >
> > Here are the details on the controller.  If you want additional info,
> > let me know exactly what you need and I will do what I can to get it to
> > you:
> >
> > Product Name    : PERC 5/i Integrated
> > Serial No       : 12345
> > FW Package Build: 5.0.1-0030
> > FW Version      : 1.00.01-0088
> > BIOS Version    : MT23
> > Ctrl-R Version  :1.02-007
> >
> > B-
>
> Which motherboard are you using?  We believe that this may be a
> motherboard specific issue.  It appears to happen on a SuperMicro
> motherboard but not a Tyan motherboard.
>



Thread overview: 9+ messages
2006-09-06 17:14 megaraid_sas waiting for command and then offline Patro, Sumant
2006-09-06 20:44 ` Brett G. Durrett
  -- strict thread matches above, loose matches on Subject: below --
2006-10-25  8:46 David N. Welton
2006-10-25 22:48 ` Brett G. Durrett
2006-10-25 23:03   ` Alan Cox
2006-11-13 21:40   ` Brett G. Durrett
2006-09-06  4:49 Brett G. Durrett
2006-09-06 14:11 ` Dave Lloyd
2006-09-06 16:04   ` Brett G. Durrett
