* RE: Fw: [Bugme-new] [Bug 3651] New: dell poweredge 4600 aacraidPERC 3/Di Container goes offline
@ 2004-11-23 22:07 Salyzyn, Mark
2004-11-23 22:15 ` Ryan Anderson
2004-11-23 22:35 ` Christoph Hellwig
0 siblings, 2 replies; 9+ messages in thread
From: Salyzyn, Mark @ 2004-11-23 22:07 UTC (permalink / raw)
To: Ryan Anderson, Andrew Morton; +Cc: linux-scsi
Do you have the latest Firmware from Dell? Do you have the Read and
Write Cache disabled as Dell has recommended (for pre 6091(?) Firmware)?
The `container going offline' is a result of the Firmware in the card
not responding to a SCSI command within 60 seconds (the Linux SCSI layer
timeout). In the older firmware this would occur at the combination of
high load, drive or scsi bus problems and the card flushing the cache.
If the problem persists, preventing the card building up a large amount
of cache data may be the only way to mitigate this.
I have had others experiment with overriding the SCSI timeout (the
Adaptec driver branch has an AAC_EXTENDED_TIMEOUT) to limited success.
Turning off the SCSI timeout (add a scsi_del_timer as command is issued
to the controller, and a scsi_add_timer in the interrupt service routine
before completion) worked extremely well, but this makes me
understandably nervous.
Sincerely -- Mark Salyzyn
-----Original Message-----
From: linux-scsi-owner@vger.kernel.org
[mailto:linux-scsi-owner@vger.kernel.org] On Behalf Of Ryan Anderson
Sent: Tuesday, November 23, 2004 4:42 PM
To: Andrew Morton
Cc: linux-scsi@vger.kernel.org
Subject: Re: Fw: [Bugme-new] [Bug 3651] New: dell poweredge 4600
aacraidPERC 3/Di Container goes offline
On Thu, 2004-10-28 at 00:53 -0700, Andrew Morton wrote:
> Subject: [Bugme-new] [Bug 3651] New: dell poweredge 4600 aacraid PERC 3/Di Container goes offline
>
>
> http://bugme.osdl.org/show_bug.cgi?id=3651
>
> Summary: dell poweredge 4600 aacraid PERC 3/Di Container goes offline
> Kernel Version: 2.6.10-rc1, 2.6.9, 2.6.8, 2.6.7, 2.6.6
> Status: NEW
> Severity: high
> Owner: andmike@us.ibm.com
> Submitter: oliver.polterauer@ewave.at
> CC: oliver.polterauer@ewave.at
Is there any update on this problem?
To reiterate my particular hardware involved that can trigger this
problem:
Dell 2650, dual 2.4 GHz Xeon processors (hyperthreading no, though the
problem occurred in 2.4.20 without hyperthreading disabled via "noht")
4 GB of RAM
Only load is PostgreSQL related (i.e., network queries, plus twice-daily
dumps of the database to an NFS store, and an rsync back to the server for
a second copy)
Under load, I repeatedly saw containers go offline.
Dell's recommended hardware diagnostics do not turn up anything (at
all!)
The hard drives are Fujitsu drives, so the Seagate firmware issue should
not affect them.
I have since taken this server out of production. Unfortunately, this
makes the error much harder to trigger (i.e., I have failed so far to
trigger it, even with multiple bonnie++ runs).
Suggestions, diagnostics, etc, would be greatly appreciated.
--
Ryan Anderson
AutoWeb Communications, Inc.
email: ryan@autoweb.net
* RE: Fw: [Bugme-new] [Bug 3651] New: dell poweredge 4600 aacraidPERC 3/Di Container goes offline
2004-11-23 22:07 Fw: [Bugme-new] [Bug 3651] New: dell poweredge 4600 aacraidPERC 3/Di Container goes offline Salyzyn, Mark
@ 2004-11-23 22:15 ` Ryan Anderson
2004-11-23 22:35 ` Christoph Hellwig
1 sibling, 0 replies; 9+ messages in thread
From: Ryan Anderson @ 2004-11-23 22:15 UTC (permalink / raw)
To: Salyzyn, Mark; +Cc: Andrew Morton, linux-scsi
On Tue, 2004-11-23 at 17:07 -0500, Salyzyn, Mark wrote:
> Do you have the latest Firmware from Dell? Do you have the Read and
> Write Cache disabled as Dell has recommended (for pre 6091(?) Firmware)?
The latest Dell firmware that I have seen is 6092, which would seem to
make your second question irrelevant. Is that correct?
dmesg says this about the controller:
AAC0: kernel 2.8.4 build 6092
AAC0: monitor 2.8.4 build 6092
AAC0: bios 2.8.0 build 6092
AAC0: serial 83ac41d3fafaf001
scsi0 : percraid
Vendor: DELL Model: PERCRAID RAID5 Rev: V1.0
Type: Direct-Access ANSI SCSI revision: 02
Vendor: DELL Model: PERCRAID RAID5 Rev: V1.0
Type: Direct-Access ANSI SCSI revision: 02
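The build numbers in the dmesg lines above can be checked against the "pre 6091(?)" cutoff mentioned earlier in the thread. The following is a minimal sketch in C, assuming the `AAC0: <component> <version> build <n>` line format shown here. The function names are hypothetical, and 6091 is the uncertain threshold from the quoted message, not a confirmed value.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Extract the firmware build number from one dmesg line such as
 * "AAC0: kernel 2.8.4 build 6092".
 * Returns the build number, or -1 if the line carries none. */
long aac_build_from_line(const char *line)
{
    const char *p;
    char *end;
    long build;

    assert(line != NULL);
    p = strstr(line, " build ");
    if (!p)
        return -1;
    p += strlen(" build ");
    build = strtol(p, &end, 10);
    if (end == p || build < 0)
        return -1;              /* no digits after "build" */
    return build;
}

/* Nonzero if a valid build number is older than the given threshold.
 * The threshold itself (e.g. 6091) is an assumption taken from the thread. */
int aac_build_predates(long build, long threshold)
{
    return build >= 0 && build < threshold;
}
```

With the output above, every component reports build 6092, so a check against 6091 would report the firmware as not predating the cutoff.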
> The `container going offline' is a result of the Firmware in the card
> not responding to a SCSI command within 60 seconds (the Linux SCSI layer
> timeout). In the older firmware this would occur at the combination of
> high load, drive or scsi bus problems and the card flushing the cache.
> If the problem persists, preventing the card building up a large amount
> of cache data may be the only way to mitigate this.
Ok, I can experiment with this. Where should I start? (I'm not afraid
of source-level hacking, just don't know where to start.)
> I have had others experiment with overriding the SCSI timeout (the
> Adaptec driver branch has an AAC_EXTENDED_TIMEOUT) to limited success.
> Turning off the SCSI timeout (add a scsi_del_timer as command is issued
> to the controller, and a scsi_add_timer in the interrupt service routine
> before completion) worked extremely well, but this makes me
> understandably nervous.
That completely disables *any* timeout, correct?
That would make me a .. bit nervous, too.
I'm going to try to build a load today that can trigger the problem.
(It's rather hard to debug when the problem is very hard to trigger -
sigh)
--
Ryan Anderson
AutoWeb Communications, Inc.
email: ryan@autoweb.net
* Re: Fw: [Bugme-new] [Bug 3651] New: dell poweredge 4600 aacraidPERC 3/Di Container goes offline
2004-11-23 22:07 Fw: [Bugme-new] [Bug 3651] New: dell poweredge 4600 aacraidPERC 3/Di Container goes offline Salyzyn, Mark
2004-11-23 22:15 ` Ryan Anderson
@ 2004-11-23 22:35 ` Christoph Hellwig
1 sibling, 0 replies; 9+ messages in thread
From: Christoph Hellwig @ 2004-11-23 22:35 UTC (permalink / raw)
To: Salyzyn, Mark; +Cc: Ryan Anderson, Andrew Morton, linux-scsi
On Tue, Nov 23, 2004 at 05:07:51PM -0500, Salyzyn, Mark wrote:
> Do you have the latest Firmware from Dell? Do you have the Read and
> Write Cache disabled as Dell has recommended (for pre 6091(?) Firmware)?
>
> The `container going offline' is a result of the Firmware in the card
> not responding to a SCSI command within 60 seconds (the Linux SCSI layer
> timeout). In the older firmware this would occur at the combination of
> high load, drive or scsi bus problems and the card flushing the cache.
> If the problem persists, preventing the card building up a large amount
> of cache data may be the only way to mitigate this.
>
> I have had others experiment with overriding the SCSI timeout (the
> Adaptec driver branch has an AAC_EXTENDED_TIMEOUT) to limited success.
> Turning off the SCSI timeout (add a scsi_del_timer as command is issued
> to the controller, and a scsi_add_timer in the interrupt service routine
> before completion) worked extremely well, but this makes me
> understandably nervous.
You can do this without these horrible timer hacks by setting sdev->timeout
to a bigger value in your ->slave_configure method.
* RE: Fw: [Bugme-new] [Bug 3651] New: dell poweredge 4600 aacraidPERC 3/Di Container goes offline
@ 2004-11-24 12:59 Salyzyn, Mark
2004-11-24 13:09 ` Christoph Hellwig
0 siblings, 1 reply; 9+ messages in thread
From: Salyzyn, Mark @ 2004-11-24 12:59 UTC (permalink / raw)
To: Christoph Hellwig; +Cc: Ryan Anderson, linux-scsi
I dropped Andrew Morton from the direct mail recipients.
Thanks, made the change in my branch of the code. This adjustment will
probably never be submitted to MarkH since I use it only for debugging
purposes. However, would it be nice if the global scsi timeout could be
`user' adjustable?
Not that I advocate it as readily accessible, since any storage device
that takes longer than ten seconds is in `trouble', and any timeout
longer than two minutes will no doubt cause servers to go offline on the
internet. Its purpose is only for troubleshooting.
Sincerely -- Mark Salyzyn
-----Original Message-----
From: Christoph Hellwig [mailto:hch@infradead.org]
Sent: Tuesday, November 23, 2004 5:35 PM
To: Salyzyn, Mark
Cc: Ryan Anderson; Andrew Morton; linux-scsi@vger.kernel.org
Subject: Re: Fw: [Bugme-new] [Bug 3651] New: dell poweredge 4600
aacraidPERC 3/Di Container goes offline
On Tue, Nov 23, 2004 at 05:07:51PM -0500, Salyzyn, Mark wrote:
> Do you have the latest Firmware from Dell? Do you have the Read and
> Write Cache disabled as Dell has recommended (for pre 6091(?) Firmware)?
>
> The `container going offline' is a result of the Firmware in the card
> not responding to a SCSI command within 60 seconds (the Linux SCSI layer
> timeout). In the older firmware this would occur at the combination of
> high load, drive or scsi bus problems and the card flushing the cache.
> If the problem persists, preventing the card building up a large amount
> of cache data may be the only way to mitigate this.
>
> I have had others experiment with overriding the SCSI timeout (the
> Adaptec driver branch has an AAC_EXTENDED_TIMEOUT) to limited success.
> Turning off the SCSI timeout (add a scsi_del_timer as command is issued
> to the controller, and a scsi_add_timer in the interrupt service routine
> before completion) worked extremely well, but this makes me
> understandably nervous.
You can do this without these horrible timer hacks by setting sdev->timeout
to a bigger value in your ->slave_configure method.
* Re: Fw: [Bugme-new] [Bug 3651] New: dell poweredge 4600 aacraidPERC 3/Di Container goes offline
2004-11-24 12:59 Salyzyn, Mark
@ 2004-11-24 13:09 ` Christoph Hellwig
2004-11-24 14:58 ` Brian King
0 siblings, 1 reply; 9+ messages in thread
From: Christoph Hellwig @ 2004-11-24 13:09 UTC (permalink / raw)
To: Salyzyn, Mark; +Cc: Christoph Hellwig, Ryan Anderson, linux-scsi
On Wed, Nov 24, 2004 at 07:59:04AM -0500, Salyzyn, Mark wrote:
> I dropped Andrew Morton from the direct mail recipients.
>
> Thanks, made the change in my branch of the code. This adjustment will
> probably never be submitted to MarkH since I use it only for debugging
> purposes. However, would it be nice if the global scsi timeout could be
> `user' adjustable?
>
> Not that I advocate it as readily accessible, since any storage device
> that takes longer than ten seconds is in `trouble', and any timeout
> longer than two minutes will no doubt cause servers to go offline on the
> internet. Its purpose is only for troubleshooting.
Yes, it would probably be nice to have a writeable timeout attribute for
the scsi device. Care to submit a patch?
* Re: Fw: [Bugme-new] [Bug 3651] New: dell poweredge 4600 aacraidPERC 3/Di Container goes offline
2004-11-24 13:09 ` Christoph Hellwig
@ 2004-11-24 14:58 ` Brian King
2004-11-24 20:29 ` Mike Christie
0 siblings, 1 reply; 9+ messages in thread
From: Brian King @ 2004-11-24 14:58 UTC (permalink / raw)
To: Christoph Hellwig; +Cc: Salyzyn, Mark, Ryan Anderson, linux-scsi
Christoph Hellwig wrote:
> On Wed, Nov 24, 2004 at 07:59:04AM -0500, Salyzyn, Mark wrote:
>
>>I dropped Andrew Morton from the direct mail recipients.
>>
>>Thanks, made the change in my branch of the code. This adjustment will
>>probably never be submitted to MarkH since I use it only for debugging
>>purposes. However, would it be nice if the global scsi timeout could be
>>`user' adjustable?
>>
>>Not that I advocate it as readily accessible, since any storage device
>>that takes longer than ten seconds is in `trouble', and any timeout
>>longer than two minutes will no doubt cause servers to go offline on the
>>internet. Its purpose is only for troubleshooting.
>
>
> Yes, it would probably be nice to have a writeable timeout attribute for
> the scsi device. Care to submit a patch?
This already exists today.
--
Brian King
eServer Storage I/O
IBM Linux Technology Center
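
For reference, a minimal user-space sketch in C of reading and writing that attribute. It assumes the attribute is exposed as `/sys/block/<disk>/device/timeout` and holds a value in seconds; the helper names are hypothetical, and writing to the real file needs root.

```c
#include <assert.h>
#include <stdio.h>

/* Build the sysfs path of the timeout attribute for a disk such as "sda".
 * The path layout is an assumption. Returns 0 on success, -1 if buf is too small. */
int timeout_attr_path(const char *disk, char *buf, size_t len)
{
    int n;

    assert(disk != NULL && buf != NULL);
    n = snprintf(buf, len, "/sys/block/%s/device/timeout", disk);
    return (n < 0 || (size_t)n >= len) ? -1 : 0;
}

/* Read the current timeout in seconds. Returns -1 on any error. */
long read_timeout_secs(const char *path)
{
    long secs = -1;
    FILE *f = fopen(path, "r");

    if (!f)
        return -1;
    if (fscanf(f, "%ld", &secs) != 1)
        secs = -1;
    fclose(f);
    return secs;
}

/* Write a new timeout in seconds. Returns 0 on success, -1 on error. */
int write_timeout_secs(const char *path, long secs)
{
    FILE *f = fopen(path, "w");
    int ok;

    if (!f)
        return -1;
    ok = fprintf(f, "%ld\n", secs) > 0;
    if (fclose(f) != 0)
        ok = 0;                 /* the write may only fail at close time */
    return ok ? 0 : -1;
}
```

The helpers take the full path rather than the disk name so they can be pointed at an ordinary file when experimenting.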
* Re: Fw: [Bugme-new] [Bug 3651] New: dell poweredge 4600 aacraidPERC 3/Di Container goes offline
2004-11-24 20:29 ` Mike Christie
@ 2004-11-24 20:28 ` Brian King
2004-11-24 20:31 ` Mike Christie
1 sibling, 0 replies; 9+ messages in thread
From: Brian King @ 2004-11-24 20:28 UTC (permalink / raw)
To: Mike Christie; +Cc: Christoph Hellwig, Salyzyn, Mark, Ryan Anderson, linux-scsi
Mike Christie wrote:
> Brian King wrote:
>
>>
>>
>> Christoph Hellwig wrote:
>>
>>> On Wed, Nov 24, 2004 at 07:59:04AM -0500, Salyzyn, Mark wrote:
>>>
>>>> I dropped Andrew Morton from the direct mail recipients.
>>>>
>>>> Thanks, made the change in my branch of the code. This adjustment will
>>>> probably never be submitted to MarkH since I use it only for debugging
>>>> purposes. However, would it be nice if the global scsi timeout could be
>>>> `user' adjustable?
>>>>
>>>> Not that I advocate it as readily accessible, since any storage device
>>>> that takes longer than ten seconds is in `trouble', and any timeout
>>>> longer than two minutes will no doubt cause servers to go offline on
>>>> the
>>>> internet. Its purpose is only for troubleshooting.
>>>
>>>
>>>
>>>
>>> Yes, it would probably be nice to have a writeable timeout attribute for
>>> the scsi device. Care to submit a patch?
>>
>>
>>
>> This already exists today.
>>
>
> Are there any problems with that attr and drivers not being able to handle
> the timeouts someone sets in sysfs? I mean if in my slave_configure I set
> 60 secs because for some reason my HW takes this long no matter what, someone
> can reset this to 5 or 1 sec and there are is chance for the lld to override
> this if it is invalid. Is this just one of those things where we say
> only touch the timeout if you know what you are doing, or is it that the lld
> should not have that power, or something else?
Well, personally, I think this is a case of only touch the timeout if you know
what you are doing, but the infrastructure is certainly there for the LLD to
override the timeout attribute and do whatever policing might need to be done.
--
Brian King
eServer Storage I/O
IBM Linux Technology Center
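
The kind of policing mentioned above could look roughly like the following. This is only a sketch of the clamping logic in plain C, not the actual attribute-override mechanism: the function name and the 60 second floor (taken from the example in the question) are hypothetical, and a real LLD would apply such a check inside its own store handler for the attribute.

```c
#include <assert.h>

#define EXAMPLE_MIN_TIMEOUT_SECS 60u  /* hypothetical floor for slow hardware */

/* Return the timeout the driver is willing to accept for a value
 * requested from user space: anything below the floor is raised to it. */
unsigned int example_police_timeout(unsigned int requested_secs)
{
    unsigned int accepted = requested_secs;

    if (accepted < EXAMPLE_MIN_TIMEOUT_SECS)
        accepted = EXAMPLE_MIN_TIMEOUT_SECS;

    assert(accepted >= EXAMPLE_MIN_TIMEOUT_SECS);
    return accepted;
}
```

A driver could equally reject an out-of-range value with an error instead of silently raising it; which is better is a policy choice this sketch does not settle.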
* Re: Fw: [Bugme-new] [Bug 3651] New: dell poweredge 4600 aacraidPERC 3/Di Container goes offline
2004-11-24 14:58 ` Brian King
@ 2004-11-24 20:29 ` Mike Christie
2004-11-24 20:28 ` Brian King
2004-11-24 20:31 ` Mike Christie
0 siblings, 2 replies; 9+ messages in thread
From: Mike Christie @ 2004-11-24 20:29 UTC (permalink / raw)
To: brking; +Cc: Christoph Hellwig, Salyzyn, Mark, Ryan Anderson, linux-scsi
Brian King wrote:
>
>
> Christoph Hellwig wrote:
>
>> On Wed, Nov 24, 2004 at 07:59:04AM -0500, Salyzyn, Mark wrote:
>>
>>> I dropped Andrew Morton from the direct mail recipients.
>>>
>>> Thanks, made the change in my branch of the code. This adjustment will
>>> probably never be submitted to MarkH since I use it only for debugging
>>> purposes. However, would it be nice if the global scsi timeout could be
>>> `user' adjustable?
>>>
>>> Not that I advocate it as readily accessible, since any storage device
>>> that takes longer than ten seconds is in `trouble', and any timeout
>>> longer than two minutes will no doubt cause servers to go offline on the
>>> internet. Its purpose is only for troubleshooting.
>>
>>
>>
>> Yes, it would probably be nice to have a writeable timeout attribute for
>> the scsi device. Care to submit a patch?
>
>
> This already exists today.
>
Are there any problems with that attr and drivers not being able to handle
the timeouts someone sets in sysfs? I mean if in my slave_configure I set
60 secs because for some reason my HW takes this long no matter what, someone
can reset this to 5 or 1 sec and there are is chance for the lld to override
this if it is invalid. Is this just one of those things where we say
only touch the timeout if you know what you are doing, or is it that the lld
should not have that power, or something else?
* Re: Fw: [Bugme-new] [Bug 3651] New: dell poweredge 4600 aacraidPERC 3/Di Container goes offline
2004-11-24 20:29 ` Mike Christie
2004-11-24 20:28 ` Brian King
@ 2004-11-24 20:31 ` Mike Christie
1 sibling, 0 replies; 9+ messages in thread
From: Mike Christie @ 2004-11-24 20:31 UTC (permalink / raw)
To: Mike Christie
Cc: brking, Christoph Hellwig, Salyzyn, Mark, Ryan Anderson,
linux-scsi
Mike Christie wrote:
> Brian King wrote:
>
>>
>>
>> Christoph Hellwig wrote:
>>
>>> On Wed, Nov 24, 2004 at 07:59:04AM -0500, Salyzyn, Mark wrote:
>>>
>>>> I dropped Andrew Morton from the direct mail recipients.
>>>>
>>>> Thanks, made the change in my branch of the code. This adjustment will
>>>> probably never be submitted to MarkH since I use it only for debugging
>>>> purposes. However, would it be nice if the global scsi timeout could be
>>>> `user' adjustable?
>>>>
>>>> Not that I advocate it as readily accessible, since any storage device
>>>> that takes longer than ten seconds is in `trouble', and any timeout
>>>> longer than two minutes will no doubt cause servers to go offline on
>>>> the
>>>> internet. Its purpose is only for troubleshooting.
>>>
>>>
>>>
>>>
>>> Yes, it would probably be nice to have a writeable timeout attribute for
>>> the scsi device. Care to submit a patch?
>>
>>
>>
>> This already exists today.
>>
>
> Are there any problems with that attr and drivers not being able to handle
> the timeouts someone sets in sysfs? I mean if in my slave_configure I set
> 60 secs because for some reason my HW takes this long no matter what, someone
> can reset this to 5 or 1 sec and there are is chance for the lld to override
Sorry, that should have been
"there is no chance"
> this if it is invalid. Is this just one of those things where we say
> only touch the timeout if you know what you are doing, or is it that the lld
> should not have that power, or something else?
>