public inbox for linux-scsi@vger.kernel.org
* RE: Fw: [Bugme-new] [Bug 3651] New: dell poweredge 4600 aacraidPERC 3/Di Container goes offline
@ 2004-11-24 12:59 Salyzyn, Mark
  2004-11-24 13:09 ` Christoph Hellwig
  0 siblings, 1 reply; 9+ messages in thread
From: Salyzyn, Mark @ 2004-11-24 12:59 UTC
  To: Christoph Hellwig; +Cc: Ryan Anderson, linux-scsi

I dropped Andrew Morton from the direct mail recipients.

Thanks, made the change in my branch of the code. This adjustment will
probably never be submitted to MarkH since I use it only for debugging
purposes. However, wouldn't it be nice if the global SCSI timeout could be
`user' adjustable?

Not that I advocate making it readily accessible, since any storage device
that takes longer than ten seconds is in `trouble', and any timeout
longer than two minutes will no doubt cause servers to go offline on the
internet. Its only purpose would be troubleshooting.
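
A minimal sketch of such a knob, written as a user-space C model rather than kernel code: the variable `scsi_debug_timeout_secs`, the helpers `set_global_timeout()` and `get_global_timeout()`, and the 10 to 120 second window (borrowed from the rough figures above) are all hypothetical. In a real driver the value would more likely be exposed through `module_param()`, and kernel timeouts are kept in jiffies rather than seconds.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical bounds taken from the rough figures in the mail above:
 * a device slower than ~10 s is already "in trouble", and a timeout
 * beyond ~2 min risks servers dropping off the network. */
#define TIMEOUT_MIN_SECS 10
#define TIMEOUT_MAX_SECS 120
#define TIMEOUT_DEFAULT_SECS 60 /* the 60-second timeout cited in the thread */

static_assert(TIMEOUT_MIN_SECS <= TIMEOUT_DEFAULT_SECS &&
              TIMEOUT_DEFAULT_SECS <= TIMEOUT_MAX_SECS,
              "default must lie inside the allowed window");

/* Hypothetical global knob, intended for troubleshooting only. */
static int scsi_debug_timeout_secs = TIMEOUT_DEFAULT_SECS;

/* Accept a new timeout only if it lies inside the window; otherwise keep
 * the current value and report failure to the caller. */
bool set_global_timeout(int secs)
{
    if (secs < TIMEOUT_MIN_SECS || secs > TIMEOUT_MAX_SECS)
        return false;
    scsi_debug_timeout_secs = secs;
    return true;
}

int get_global_timeout(void)
{
    return scsi_debug_timeout_secs;
}
```

Rejecting out-of-range values, rather than silently clamping them, keeps a mistyped setting from quietly producing a timeout nobody asked for.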

Sincerely -- Mark Salyzyn

-----Original Message-----
From: Christoph Hellwig [mailto:hch@infradead.org] 
Sent: Tuesday, November 23, 2004 5:35 PM
To: Salyzyn, Mark
Cc: Ryan Anderson; Andrew Morton; linux-scsi@vger.kernel.org
Subject: Re: Fw: [Bugme-new] [Bug 3651] New: dell poweredge 4600
aacraidPERC 3/Di Container goes offline

On Tue, Nov 23, 2004 at 05:07:51PM -0500, Salyzyn, Mark wrote:
> Do you have the latest Firmware from Dell? Do you have the Read and
> Write Cache disabled as Dell has recommended (for pre 6091(?) Firmware)?
> 
> The `container going offline' is a result of the Firmware in the card
> not responding to a SCSI command within 60 seconds (the Linux SCSI layer
> timeout). In the older firmware this would occur under a combination of
> high load, drive or SCSI bus problems, and the card flushing its cache.
> If the problem persists, preventing the card from building up a large
> amount of cached data may be the only way to mitigate this.
> 
> I have had others experiment with overriding the SCSI timeout (the
> Adaptec driver branch has an AAC_EXTENDED_TIMEOUT), with limited success.
> Turning off the SCSI timeout (adding a scsi_del_timer as the command is
> issued to the controller, and a scsi_add_timer in the interrupt service
> routine before completion) worked extremely well, but it makes me
> understandably nervous.

You can do this without these horrible timer hacks by setting
sdev->timeout to a bigger value in your ->slave_configure method.
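
Christoph's suggestion can be sketched as follows. This is a user-space model, not driver code: `struct mock_scsi_device`, `example_slave_configure()`, the 120-second value, and `HZ` = 1000 are assumptions for illustration. In the kernel the hook is the `slave_configure` member of `struct scsi_host_template`, which receives a `struct scsi_device *`, and the per-device timeout is measured in jiffies.

```c
#include <assert.h>
#include <stddef.h>

/* User-space stand-ins (hypothetical): the real kernel defines HZ at build
 * time and keeps the per-device timeout in struct scsi_device. */
#define HZ 1000
#define EXAMPLE_TIMEOUT_SECS 120 /* hypothetical extended timeout */

struct mock_scsi_device {
    int timeout; /* command timeout for this device, in jiffies */
};

/* Modeled on a ->slave_configure hook: called once for each newly
 * scanned device and returning 0 on success. The longer timeout is a
 * plain field assignment; no timers are touched. */
int example_slave_configure(struct mock_scsi_device *sdev)
{
    assert(sdev != NULL);
    sdev->timeout = EXAMPLE_TIMEOUT_SECS * HZ;
    return 0;
}
```

Because the value lives in the per-device structure, only devices attached to this driver get the longer timeout, and no timer is ever added or removed by hand.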


* RE: Fw: [Bugme-new] [Bug 3651] New: dell poweredge 4600 aacraidPERC 3/Di Container goes offline
@ 2004-11-23 22:07 Salyzyn, Mark
  2004-11-23 22:15 ` Ryan Anderson
  2004-11-23 22:35 ` Christoph Hellwig
  0 siblings, 2 replies; 9+ messages in thread
From: Salyzyn, Mark @ 2004-11-23 22:07 UTC
  To: Ryan Anderson, Andrew Morton; +Cc: linux-scsi

Do you have the latest Firmware from Dell? Do you have the Read and
Write Cache disabled as Dell has recommended (for pre 6091(?) Firmware)?

The `container going offline' is a result of the Firmware in the card
not responding to a SCSI command within 60 seconds (the Linux SCSI layer
timeout). In the older firmware this would occur under a combination of
high load, drive or SCSI bus problems, and the card flushing its cache.
If the problem persists, preventing the card from building up a large
amount of cached data may be the only way to mitigate this.

I have had others experiment with overriding the SCSI timeout (the
Adaptec driver branch has an AAC_EXTENDED_TIMEOUT), with limited success.
Turning off the SCSI timeout (adding a scsi_del_timer as the command is
issued to the controller, and a scsi_add_timer in the interrupt service
routine before completion) worked extremely well, but it makes me
understandably nervous.

Sincerely -- Mark Salyzyn

-----Original Message-----
From: linux-scsi-owner@vger.kernel.org
[mailto:linux-scsi-owner@vger.kernel.org] On Behalf Of Ryan Anderson
Sent: Tuesday, November 23, 2004 4:42 PM
To: Andrew Morton
Cc: linux-scsi@vger.kernel.org
Subject: Re: Fw: [Bugme-new] [Bug 3651] New: dell poweredge 4600
aacraidPERC 3/Di Container goes offline

On Thu, 2004-10-28 at 00:53 -0700, Andrew Morton wrote:
> Subject: [Bugme-new] [Bug 3651] New: dell poweredge 4600 aacraid PERC 3/Di Container goes offline
> 
> 
> http://bugme.osdl.org/show_bug.cgi?id=3651
> 
>            Summary: dell poweredge 4600 aacraid PERC 3/Di Container goes
>                     offline
>     Kernel Version: 2.6.10-rc1, 2.6.9, 2.6.8, 2.6.7, 2.6.6
>             Status: NEW
>           Severity: high
>              Owner: andmike@us.ibm.com
>          Submitter: oliver.polterauer@ewave.at
>                 CC: oliver.polterauer@ewave.at

Is there any update on this problem?
To reiterate, here is the hardware on which I can trigger this
problem:

Dell 2650, dual 2.4 GHz Xeon processors (hyperthreading: no, though the
problem occurred in 2.4.20 without hyperthreading disabled via "noht")

4 GB of RAM
The only load is PostgreSQL-related (i.e., network queries, plus twice-daily
dumps of the database to an NFS store, and an rsync back to the server for
a second copy)

Under load, I repeatedly saw containers go offline.

Dell's recommended hardware diagnostics do not turn up anything (at
all!)

The hard drives are Fujitsu drives, so the Seagate firmware issue should
not affect them.

I have since taken this server out of production. Unfortunately, this
makes the error much harder to trigger (i.e., I have so far failed to
trigger it, even with multiple bonnie++ runs).

Suggestions, diagnostics, etc., would be greatly appreciated.


-- 

Ryan Anderson                
AutoWeb Communications, Inc. 
email: ryan@autoweb.net 



Thread overview: 9+ messages
2004-11-24 12:59 Fw: [Bugme-new] [Bug 3651] New: dell poweredge 4600 aacraidPERC 3/Di Container goes offline Salyzyn, Mark
2004-11-24 13:09 ` Christoph Hellwig
2004-11-24 14:58   ` Brian King
2004-11-24 20:29     ` Mike Christie
2004-11-24 20:28       ` Brian King
2004-11-24 20:31       ` Mike Christie
  -- strict thread matches above, loose matches on Subject: below --
2004-11-23 22:07 Salyzyn, Mark
2004-11-23 22:15 ` Ryan Anderson
2004-11-23 22:35 ` Christoph Hellwig
