* 2.5.59-dcl2
@ 2003-01-28 22:51 Stephen Hemminger
2003-01-28 23:10 ` 2.5.59-dcl2 Christoph Hellwig
2003-01-29 0:07 ` 2.5.59-dcl2 Stephen Hemminger
0 siblings, 2 replies; 16+ messages in thread
From: Stephen Hemminger @ 2003-01-28 22:51 UTC (permalink / raw)
To: Linux Kernel Mailing List
Update to the OSDL DCL patch set.
The OSDL common code includes RAS-related enhancements, bug fixes, and the
latest versions of drivers for the OSDL server test machines.
Linux Trace Toolkit (LTT) (Karim Yaghmour)
Linux Kernel Crash Dump (LKCD) (Matt Robinson, LKCD team)
Kernel Probes (kprobes) (Rusty Russell)
Megaraid 2 driver (Matt Domsch)
DAC960 driver (Dave Olien)
The goal is to make these projects more robust and resolve potential
overlaps. Also this set keeps up to date with the latest version of
drivers for the disk devices that are present on the OSDL test
machines.
The DCL-only patch contains performance and tuning related
enhancements. The goal is to use these for database performance
tuning and give these projects more testing.
2.5.59-osdl2:
. Dac960 error retry (Dave Olien)
2.5.59-dcl2:
. Lost timer tick compensation (John Stultz)
. Improved boot time TSC synchronization (Jim Houston)
. Lockless gettimeofday (Andi Kleen, me)
. Performance monitoring counters for x86 (Mikael Pettersson)
2.5.59-osdl1:
. Bug fix for vmlinux.ld.S (Kai Germaschewski)
. Update to LKCD for multiple schemes (Bharata B Rao)
. Bug fixes for LKCD locking (me)
. Improved i386 fatal event notifiers (me)
. Kprobe using notify_die (me)
2.5.59-dcl1:
. RCU statistics (Dipankar Sarma)
. Scheduler tunables (Robert Love)
The latest release is available in downloadable patches from
http://sourceforge.net/projects/osdldcl
or public BitKeeper repositories
Common code: bk://bk.osdl.org/linux-2.5-osdl
Common code + CGL: bk://bk.osdl.org/linux-2.5-cgl
Common code + DCL: bk://bk.osdl.org/linux-2.5-dcl
Getting Involved
----------------
If interested in development of DCL, please subscribe to the mailing
list at http://lists.osdl.org/mailman/listinfo/dcl_discussion.
Developers are encouraged to send any enhancements or bug fix
patches. Patches should be tested by using the OSDL Scalable Test
Platform (STP) and Patch Lifecycle Manager (PLM) facilities.
Project information:
http://www.osdl.org/projects/dcl/
http://osdldcl.sourceforge.net
http://sourceforge.net/projects/osdldcl
* Re: 2.5.59-dcl2
2003-01-28 22:51 2.5.59-dcl2 Stephen Hemminger
@ 2003-01-28 23:10 ` Christoph Hellwig
2003-01-28 23:24 ` 2.5.59-dcl2 Stephen Hemminger
2003-01-29 0:13 ` 2.5.59-dcl2 Alan Cox
2003-01-29 0:07 ` 2.5.59-dcl2 Stephen Hemminger
1 sibling, 2 replies; 16+ messages in thread
From: Christoph Hellwig @ 2003-01-28 23:10 UTC (permalink / raw)
To: Stephen Hemminger; +Cc: Linux Kernel Mailing List
On Tue, Jan 28, 2003 at 02:51:38PM -0800, Stephen Hemminger wrote:
> Megaraid 2 driver (Matt Domsch)
> DAC960 driver (Dave Olien)
Is there a reason these aren't submitted to Linus?
* RE: 2.5.59-dcl2
@ 2003-01-28 23:17 Matt_Domsch
0 siblings, 0 replies; 16+ messages in thread
From: Matt_Domsch @ 2003-01-28 23:17 UTC (permalink / raw)
To: hch, shemminger; +Cc: linux-kernel
> > Megaraid 2 driver (Matt Domsch)
> Is there a reason these aren't submitted to Linus?
Just timing on this one. It's been in Stephen's tree for a while, and in
2.5.50-ac. I've not heard of negative feedback, nor really any positive
feedback either. I'll be happy to submit this.
Thanks,
Matt
--
Matt Domsch
Sr. Software Engineer, Lead Engineer, Architect
Dell Linux Solutions www.dell.com/linux
Linux on Dell mailing lists @ http://lists.us.dell.com
* Re: 2.5.59-dcl2
2003-01-28 23:10 ` 2.5.59-dcl2 Christoph Hellwig
@ 2003-01-28 23:24 ` Stephen Hemminger
2003-01-29 0:13 ` 2.5.59-dcl2 Alan Cox
1 sibling, 0 replies; 16+ messages in thread
From: Stephen Hemminger @ 2003-01-28 23:24 UTC (permalink / raw)
To: Christoph Hellwig; +Cc: Linux Kernel Mailing List
On Tue, 2003-01-28 at 15:10, Christoph Hellwig wrote:
> On Tue, Jan 28, 2003 at 02:51:38PM -0800, Stephen Hemminger wrote:
> > Megaraid 2 driver (Matt Domsch)
> > DAC960 driver (Dave Olien)
>
> Is there a reason these aren't submitted to Linus?
They're in process; I just don't know where.
Dave has submitted the DAC960 driver several times;
the version in 2.5.59 is only one rev behind.
I don't know the status of the Megaraid 2 driver.
If you are interested, check the megaraid mailing
list.
* Re: 2.5.59-dcl2
2003-01-28 22:51 2.5.59-dcl2 Stephen Hemminger
2003-01-28 23:10 ` 2.5.59-dcl2 Christoph Hellwig
@ 2003-01-29 0:07 ` Stephen Hemminger
2003-01-29 0:17 ` 2.5.59-dcl2 Andrew Morton
1 sibling, 1 reply; 16+ messages in thread
From: Stephen Hemminger @ 2003-01-29 0:07 UTC (permalink / raw)
To: Linux Kernel Mailing List
Missed one item in the credits.
Also, added Nick Piggin's anticipatory I/O scheduler (via -mm5)
to 2.5.59-dcl2 to evaluate the performance impact under different loads.
> 2.5.59-osdl2:
> . Dac960 error retry (Dave Olien)
>
> 2.5.59-dcl2:
> . Lost timer tick compensation (John Stultz)
> . Improved boot time TSC synchronization (Jim Houston)
> . Lockless gettimeofday (Andi Kleen, me)
> . Performance monitoring counters for x86 (Mikael Pettersson)
>
> 2.5.59-osdl1:
> . Bug fix for vmlinux.ld.S (Kai Germaschewski)
> . Update to LKCD for multiple schemes (Bharata B Rao)
> . Bug fixes for LKCD locking (me)
> . Improved i386 fatal event notifiers (me)
> . Kprobe using notify_die (me)
>
> 2.5.59-dcl1:
> . RCU statistics (Dipankar Sarma)
> . Scheduler tunables (Robert Love)
* Re: 2.5.59-dcl2
2003-01-28 23:10 ` 2.5.59-dcl2 Christoph Hellwig
2003-01-28 23:24 ` 2.5.59-dcl2 Stephen Hemminger
@ 2003-01-29 0:13 ` Alan Cox
1 sibling, 0 replies; 16+ messages in thread
From: Alan Cox @ 2003-01-29 0:13 UTC (permalink / raw)
To: Christoph Hellwig; +Cc: Stephen Hemminger, Linux Kernel Mailing List
On Tue, 2003-01-28 at 23:10, Christoph Hellwig wrote:
> On Tue, Jan 28, 2003 at 02:51:38PM -0800, Stephen Hemminger wrote:
> > Megaraid 2 driver (Matt Domsch)
> > DAC960 driver (Dave Olien)
>
> Is there a reason these aren't submitted to Linus?
>
Wrong question 8) You need to ask Linus why he didn't take them.
* Re: 2.5.59-dcl2
2003-01-29 0:07 ` 2.5.59-dcl2 Stephen Hemminger
@ 2003-01-29 0:17 ` Andrew Morton
0 siblings, 0 replies; 16+ messages in thread
From: Andrew Morton @ 2003-01-29 0:17 UTC (permalink / raw)
To: Stephen Hemminger; +Cc: Linux Kernel Mailing List
Stephen Hemminger wrote:
>
> Missed one item in the credits.
>
> Also, added Nick Piggin's anticipatory I/O scheduler (via -mm5)
> to 2.5.59-dcl2 to evaluate the performance impact under different loads.
>
It caused a regression in David Mansfield's database test. That was
recovered in -mm6.
* RE: 2.5.59-dcl2
@ 2003-01-29 5:35 Matt_Domsch
2003-01-29 6:51 ` 2.5.59-dcl2 Mike Anderson
2003-01-29 17:02 ` 2.5.59-dcl2 Mark Haverkamp
0 siblings, 2 replies; 16+ messages in thread
From: Matt_Domsch @ 2003-01-29 5:35 UTC (permalink / raw)
To: markh; +Cc: andmike, linux-scsi, atulm
> From: Mark Haverkamp [mailto:markh@osdl.org]
> I sent a bug report to the kernel bugzilla a while ago. It got assigned
> to someone at IBM by default. This is a problem we have been having
> between our megaraid cards and an IBM enclosure for some time. I get
> around it by disabling report luns in my kernel configuration.
>
> Could you take a look at the bug and let me know if I've included enough
> information to make sense. I included debug output for the
> 1.18 and 2.0 driver.
>
> http://bugme.osdl.org/show_bug.cgi?id=183
>
> or
>
> http://bugzilla.kernel.org/show_bug.cgi?id=183
Mark, your description in the bug appears accurate. Here's what happens:
1) REPORT_LUNS is sent to the IBM enclosure, command times out.
2) scsi_unjam_host() runs, trying the error handlers...
3) megaraid abort handler, then the reset handler called (3 times), none of
which do anything at all because the command has already been issued to the
firmware, and the card itself isn't getting reset. Both those routines
return failure.
4) so scsi_unjam_host() offlines the device
5) scsi_decide_disposition() sees device is offline and returns success,
freeing the command for future use again.
6) command eventually completes and the mega_rundoneq() calls
cmd->scsi_done() on it, but cmd has been filled with 0x5a's so it Oopses.
The scsi mid-layer shouldn't free a command that hasn't actually been
aborted/reset because it *could* come back from the firmware after the
timeout has expired, and the driver has a reference to it (need
refcounting...) This could potentially lead to an exhaustion of the command
pool though, if a command *never* comes back.
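A minimal sketch of the refcounting idea mentioned above, in userspace C. The names `struct cmd`, `cmd_get()` and `cmd_put()` are hypothetical and are not the SCSI mid-layer API. The point is only that a command still referenced by the driver is not released when the mid-layer drops its own reference. A plain `int` counter is used for brevity; kernel code would need an atomic counter.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdlib.h>

/* Hypothetical stand-in for a SCSI command; not the kernel's structure. */
struct cmd {
    int refs;        /* one reference per holder (mid-layer, driver) */
    bool completed;  /* set when the firmware finally answers */
};

static int cmds_freed;  /* counts releases, for demonstration only */

/* Allocate a command holding the mid-layer's initial reference. */
static struct cmd *cmd_alloc(void)
{
    struct cmd *c = calloc(1, sizeof(*c));
    if (c)
        c->refs = 1;
    return c;
}

/* Take an extra reference, e.g. when the driver hands the command to firmware. */
static void cmd_get(struct cmd *c)
{
    assert(c->refs > 0);
    c->refs++;
}

/* Drop one reference; free only when the last holder lets go.
 * Returns true if the command was actually freed. */
static bool cmd_put(struct cmd *c)
{
    assert(c->refs > 0);
    if (--c->refs > 0)
        return false;
    free(c);
    cmds_freed++;
    return true;
}
```

With this scheme a late completion from the firmware touches memory that is still valid. The cost, as noted above, is that a command which never comes back is never released.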
How long does it take for the IBM enclosure to return REPORT LUNS? Since
this works on aic7xxx within the timeout period, I'm guessing the megaraid
firmware takes a long time to deal with it since it's a pass-through device.
I believe there is a way to issue an adapter reset command to the megaraid
firmware, though neither driver series 1.18 nor 2.00 does so presently.
Copying Atul for insight as to what effects this would have on the
controller and commands in flight...
Thanks,
Matt
--
Matt Domsch
Sr. Software Engineer, Lead Engineer, Architect
Dell Linux Solutions www.dell.com/linux
Linux on Dell mailing lists @ http://lists.us.dell.com
* Re: 2.5.59-dcl2
2003-01-29 5:35 2.5.59-dcl2 Matt_Domsch
@ 2003-01-29 6:51 ` Mike Anderson
2003-01-29 18:05 ` 2.5.59-dcl2 Luben Tuikov
2003-01-29 17:02 ` 2.5.59-dcl2 Mark Haverkamp
1 sibling, 1 reply; 16+ messages in thread
From: Mike Anderson @ 2003-01-29 6:51 UTC (permalink / raw)
To: Matt_Domsch; +Cc: markh, linux-scsi, atulm
Matt_Domsch@Dell.com [Matt_Domsch@Dell.com] wrote:
> > From: Mark Haverkamp [mailto:markh@osdl.org]
> > I sent a bug report to the kernel bugzilla a while ago. It
> > got assigned
> > to someone at IBM by default. This is a problem we have been having
> > between our megaraid cards and an IBM enclosure for some time. I get
> > around it by disabling report luns in my kernel configuration.
> >
> > Could you take a look at the bug and let me know if I've
> > included enough
> > information to make sense. I included debug output for the
> > 1.18 and 2.0 driver.
> >
> > http://bugme.osdl.org/show_bug.cgi?id=183
> >
> > or
> >
> > http://bugzilla.kernel.org/show_bug.cgi?id=183
>
> Mark, your description in the bug appears accurate. Here's what happens:
>
> 1) REPORT_LUNS is sent to the IBM enclosure, command times out.
> 2) scsi_unjam_host() runs, trying the error handlers...
> 3) megaraid abort handler, then the reset handler called (3 times), none of
> which do anything at all because the command has already been issued to the
> firmware, and the card itself isn't getting reset. Both those routines
> return failure.
> 4) so scsi_unjam_host() offlines the device
> 5) scsi_decide_disposition() sees device is offline and returns success,
> freeing the command for future use again.
> 6) command eventually completes and the mega_rundoneq() calls
> cmd->scsi_done() on it, but cmd has been filled with 0x5a's so it Oopses.
>
>
> The scsi mid-layer shouldn't free a command that hasn't actually been
> aborted/reset because it *could* come back from the firmware after the
> timeout has expired, and the driver has a reference to it (need
> refcounting...) This could potentially lead to an exhaustion of the command
> pool though, if a command *never* comes back.
>
It is bad to return the command if the eh cannot cancel it in
the LLDD. Then again, the eh routines provided by the driver are not very
useful. We could change the error handler to mark which cmds have and
have not been canceled, and then at the end do something with the
un-canceled ones. What that something should be is unclear.
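A rough sketch of the marking idea, assuming a per-command flag. Everything here is hypothetical (`struct cmd`, `cancel_fn`, `eh_mark_cmds()`); it does not reflect the real error-handler interfaces, and it leaves open what to do with the commands that stay un-canceled.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical command record carrying its own "canceled" mark. */
struct cmd {
    int tag;
    bool canceled;
};

/* Hypothetical driver hook: returns true only if the LLDD really
 * gave the command back. */
typedef bool (*cancel_fn)(struct cmd *);

/* One error-handler pass: mark every command as canceled or not and
 * return how many are still owned by the LLDD. */
static size_t eh_mark_cmds(struct cmd *cmds, size_t n, cancel_fn cancel)
{
    size_t stuck = 0;

    for (size_t i = 0; i < n; i++) {
        cmds[i].canceled = cancel(&cmds[i]);
        if (!cmds[i].canceled)
            stuck++;
    }
    return stuck;
}

/* Demonstration hook: pretends commands with odd tags are stuck in firmware. */
static bool cancel_even_tags(struct cmd *c)
{
    return c->tag % 2 == 0;
}
```

Only commands with `canceled` set would be safe to return to the pool; the others need some separate policy.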
> How long does it take for the IBM enclosure to return REPORT LUNS? Since
> this works on aic7xxx within the timeout period, I'm guessing the megaraid
> firmware takes a long time to deal with it since it's a pass-through device.
>
> I believe there is a way to issue an adapter reset command to the megaraid
> firmware, though neither driver series 1.18 or 2.00 do so presently.
> Copying Atul for insight as to what effects this would have on the
> controller and commands in flight...
>
It would be good to add an eh function that could cancel a command.
-andmike
--
Michael Anderson
andmike@us.ibm.com
* Re: 2.5.59-dcl2
[not found] ` <1043798822.10150.318.camel@dell_ss3.pdx.osdl.net.suse.lists.linux.kernel>
@ 2003-01-29 12:35 ` Andi Kleen
2003-01-29 12:55 ` 2.5.59-dcl2 Mikael Pettersson
2003-01-29 16:53 ` 2.5.59-dcl2 Stephen Hemminger
0 siblings, 2 replies; 16+ messages in thread
From: Andi Kleen @ 2003-01-29 12:35 UTC (permalink / raw)
To: Stephen Hemminger; +Cc: linux-kernel
Stephen Hemminger <shemminger@osdl.org> writes:
> > . Lockless gettimeofday (Andi Kleen, me)
The original algorithm actually came from Andrea Arcangeli,
I just ported it from vsyscalls to do_gettimeofday.
> > . Performance monitoring counters for x86 (Mikael Pettersson)
Isn't that slightly redundant with oprofile?
They have different capabilities, but there is still much overlap.
-Andi
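The thread does not spell out the lockless gettimeofday algorithm. One widely used way to let readers fetch a multi-word timestamp without taking a lock is a sequence counter with read-retry; the sketch below shows that pattern in userspace C11. That the DCL patch works exactly this way is an assumption, and all names (`time_update()`, `time_read()`, `seq`) are illustrative. A single writer is assumed.

```c
#include <assert.h>
#include <stdatomic.h>

/* Sequence counter: even means stable, odd means a write is in progress. */
static atomic_uint seq;
static _Atomic long now_sec, now_usec;

/* Writer side (e.g. the timer tick). Single writer assumed. */
static void time_update(long sec, long usec)
{
    unsigned s = atomic_load_explicit(&seq, memory_order_relaxed);

    atomic_store_explicit(&seq, s + 1, memory_order_relaxed);   /* odd */
    atomic_thread_fence(memory_order_release);
    atomic_store_explicit(&now_sec, sec, memory_order_relaxed);
    atomic_store_explicit(&now_usec, usec, memory_order_relaxed);
    atomic_store_explicit(&seq, s + 2, memory_order_release);   /* even */
}

/* Reader side: never blocks the writer; retries if it raced with an update. */
static void time_read(long *sec, long *usec)
{
    unsigned s1, s2;

    do {
        s1 = atomic_load_explicit(&seq, memory_order_acquire);
        *sec = atomic_load_explicit(&now_sec, memory_order_relaxed);
        *usec = atomic_load_explicit(&now_usec, memory_order_relaxed);
        atomic_thread_fence(memory_order_acquire);
        s2 = atomic_load_explicit(&seq, memory_order_relaxed);
    } while ((s1 & 1) || s1 != s2);
}
```

Readers pay only for a retry when they overlap a writer, which is rare for a value updated once per tick.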
* Re: 2.5.59-dcl2
2003-01-29 12:35 ` 2.5.59-dcl2 Andi Kleen
@ 2003-01-29 12:55 ` Mikael Pettersson
2003-01-29 16:53 ` 2.5.59-dcl2 Stephen Hemminger
1 sibling, 0 replies; 16+ messages in thread
From: Mikael Pettersson @ 2003-01-29 12:55 UTC (permalink / raw)
To: Andi Kleen; +Cc: Stephen Hemminger, linux-kernel
Andi Kleen writes:
> Stephen Hemminger <shemminger@osdl.org> writes:
>
> > > . Lockless gettimeofday (Andi Kleen, me)
>
> The original algorithm actually came from Andrea Arcangeli,
> I just ported it from vsyscalls to do_gettimeofday.
>
> > > . Performance monitoring counters for x86 (Mikael Pettersson)
>
> Isn't that slightly redundant with oprofile?
> They have different capabilities, but there is still much overlap.
They're _completely_ different. The overlap, if any, is minuscule. Trust me :-)
* Re: 2.5.59-dcl2
2003-01-29 12:35 ` 2.5.59-dcl2 Andi Kleen
2003-01-29 12:55 ` 2.5.59-dcl2 Mikael Pettersson
@ 2003-01-29 16:53 ` Stephen Hemminger
1 sibling, 0 replies; 16+ messages in thread
From: Stephen Hemminger @ 2003-01-29 16:53 UTC (permalink / raw)
To: Andi Kleen; +Cc: Linux Kernel Mailing List
On Wed, 2003-01-29 at 04:35, Andi Kleen wrote:
> Stephen Hemminger <shemminger@osdl.org> writes:
>
> > > . Lockless gettimeofday (Andi Kleen, me)
>
> The original algorithm actually came from Andrea Arcangeli,
> I just ported it from vsyscalls to do_gettimeofday.
>
> > > . Performance monitoring counters for x86 (Mikael Pettersson)
>
> Isn't that slightly redundant with oprofile?
> They have different capabilities, but there is still much overlap.
They are different. My point was to have both and let the performance
team at OSDL try both, and comment on what works/doesn't work.
* RE: 2.5.59-dcl2
2003-01-29 5:35 2.5.59-dcl2 Matt_Domsch
2003-01-29 6:51 ` 2.5.59-dcl2 Mike Anderson
@ 2003-01-29 17:02 ` Mark Haverkamp
1 sibling, 0 replies; 16+ messages in thread
From: Mark Haverkamp @ 2003-01-29 17:02 UTC (permalink / raw)
To: Matt_Domsch; +Cc: andmike, linux-scsi, atulm
On Tue, 2003-01-28 at 21:35, Matt_Domsch@Dell.com wrote:
> > From: Mark Haverkamp [mailto:markh@osdl.org]
> > I sent a bug report to the kernel bugzilla a while ago. It
> > got assigned
> > to someone at IBM by default. This is a problem we have been having
> > between our megaraid cards and an IBM enclosure for some time. I get
> > around it by disabling report luns in my kernel configuration.
> >
> > Could you take a look at the bug and let me know if I've
> > included enough
> > information to make sense. I included debug output for the
> > 1.18 and 2.0 driver.
> >
> > http://bugme.osdl.org/show_bug.cgi?id=183
> >
> > or
> >
> > http://bugzilla.kernel.org/show_bug.cgi?id=183
>
> Mark, your description in the bug appears accurate. Here's what happens:
>
> 1) REPORT_LUNS is sent to the IBM enclosure, command times out.
> 2) scsi_unjam_host() runs, trying the error handlers...
> 3) megaraid abort handler, then the reset handler called (3 times), none of
> which do anything at all because the command has already been issued to the
> firmware, and the card itself isn't getting reset. Both those routines
> return failure.
> 4) so scsi_unjam_host() offlines the device
> 5) scsi_decide_disposition() sees device is offline and returns success,
> freeing the command for future use again.
> 6) command eventually completes and the mega_rundoneq() calls
> cmd->scsi_done() on it, but cmd has been filled with 0x5a's so it Oopses.
>
>
> The scsi mid-layer shouldn't free a command that hasn't actually been
> aborted/reset because it *could* come back from the firmware after the
> timeout has expired, and the driver has a reference to it (need
> refcounting...) This could potentially lead to an exhaustion of the command
> pool though, if a command *never* comes back.
>
> How long does it take for the IBM enclosure to return REPORT LUNS? Since
> this works on aic7xxx within the timeout period, I'm guessing the megaraid
> firmware takes a long time to deal with it since it's a pass-through device.
I'm not sure how long the report luns takes on the megaraid card. As an
experiment I made the timeout about 30 seconds and the command still
timed out. Is there a chance that the response is getting lost in the
adapter and doesn't get returned to the driver until another command is
issued that does complete?
>
> I believe there is a way to issue an adapter reset command to the megaraid
> firmware, though neither driver series 1.18 or 2.00 do so presently.
> Copying Atul for insight as to what effects this would have on the
> controller and commands in flight...
>
> Thanks,
> Matt
>
> --
> Matt Domsch
> Sr. Software Engineer, Lead Engineer, Architect
> Dell Linux Solutions www.dell.com/linux
> Linux on Dell mailing lists @ http://lists.us.dell.com
--
Mark Haverkamp <markh@osdl.org>
* Re: 2.5.59-dcl2
2003-01-29 6:51 ` 2.5.59-dcl2 Mike Anderson
@ 2003-01-29 18:05 ` Luben Tuikov
0 siblings, 0 replies; 16+ messages in thread
From: Luben Tuikov @ 2003-01-29 18:05 UTC (permalink / raw)
To: Mike Anderson; +Cc: linux-scsi
Mike Anderson wrote:
>>
>>The scsi mid-layer shouldn't free a command that hasn't actually been
>>aborted/reset because it *could* come back from the firmware after the
>>timeout has expired, and the driver has a reference to it (need
>>refcounting...) This could potentially lead to an exhaustion of the command
>>pool though, if a command *never* comes back.
>>
>
>
> This is bad to return the command if the eh cannot cancel it from
> the LLDD. Then again the eh routines provided by the driver are not very
> useful. We could change the error handler to mark cmds that have and
> have not been canceled and then at the end do something with the
> un-canceled ones. The do something is unclear.
1. Right, if the eh_abort_handler() (read: eh_cancel_command())
could *not* complete cancellation of the command, SCSI Core
should *not* repossess it.
OTOH, after successful eh_device_reset() (use your imagination: UYI),
or eh_host_reset() (again UYI), LLDD is in its full right to *repossess*
all pending LLDD status commands. (See bottom of text.)
2. There are two ways to implement object states. One is to ``mark them''
via an object variable which represents the state and the other is to
use containers* /representing/ the different states.
* Queues, lists, etc.
The first method is quite, quite trivial and worth no discussion except
to mention that it just adds more cruft to the object itself.
The other is what would normally be implemented in any self-respecting program,
and has many, many advantages, e.g. one can use a priority queue, on top of it
being a state queue, etc.
For this reason what you'd want to do is use a per device variable
struct scsi_device::struct list_head lost_cmds;
where
struct scsi_cmnd::struct list_head list; /* scsi_cmnd participates in queue lists */
is hooked. (The comment is copied verbatim from my patches, now in scsi-combined-2.5.)
When a command couldn't be cancelled (eh_* returned failure), you *could* move
the command into that queue and make SCSI Core be aware of it, e.g. when freeing
the device. I said ``could'' since this is a policy issue and will have to be
decided by more than one knowledgeable scsi developer of linux-scsi.
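The container approach can be sketched in userspace C as follows. The tiny list implementation is modelled on the kernel's `list_head` but is written out here so the example stands alone; `struct device`, `struct cmnd` and `cmd_mark_lost()` are hypothetical stand-ins, not the actual `scsi_device` and `scsi_cmnd` definitions.

```c
#include <assert.h>
#include <stddef.h>

/* Minimal circular doubly linked list, modelled on the kernel's list_head. */
struct list_head {
    struct list_head *next, *prev;
};

static void list_init(struct list_head *h)
{
    h->next = h->prev = h;
}

static void list_del(struct list_head *e)
{
    e->prev->next = e->next;
    e->next->prev = e->prev;
    e->next = e->prev = e;
}

static void list_add_tail(struct list_head *e, struct list_head *h)
{
    e->prev = h->prev;
    e->next = h;
    h->prev->next = e;
    h->prev = e;
}

static int list_empty(const struct list_head *h)
{
    return h->next == h;
}

/* Hypothetical device: one queue per command state. */
struct device {
    struct list_head active_cmds;
    struct list_head lost_cmds;   /* cancellation failed, LLDD still owns them */
};

struct cmnd {
    struct list_head list;        /* links the command into exactly one queue */
    int tag;
};

/* Called when every eh_* handler failed for this command. The queue the
 * command sits on records its state, so no extra flag is needed. */
static void cmd_mark_lost(struct device *d, struct cmnd *c)
{
    list_del(&c->list);
    list_add_tail(&c->list, &d->lost_cmds);
}
```

Code that tears down the device could then check `list_empty(&d->lost_cmds)` before releasing anything; whether it should wait, leak, or reset is the policy question left open above.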
As to LLDD: after ABORT TASK TMF, no further responses should be sent from
the target device to the initiator port regarding that task. The initiator port
is the LLDD or SCSI Core *if* LLDD does its own queuing. ((Strictly speaking
the Initiator Port is always the LLDD, but if it does its own queuing it will have
to notice the ABORT TASK TMF.)) What this means is that SCSI Core *can* make some
assumptions.
--
Luben
* RE: 2.5.59-dcl2
@ 2003-01-29 19:41 Mukker, Atul
2003-01-29 20:09 ` 2.5.59-dcl2 Mark Haverkamp
0 siblings, 1 reply; 16+ messages in thread
From: Mukker, Atul @ 2003-01-29 19:41 UTC (permalink / raw)
To: 'Matt_Domsch@Dell.com', markh; +Cc: andmike, linux-scsi, Mukker, Atul
What might be happening here is that the FW is retrying the command for
several iterations and each of them times out, so the total timeout is
greater than the expected window.
A SCSI trace would tell us exactly what is going on there.
Thanks
-Atul Mukker
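The arithmetic behind this hypothesis can be sketched as below. The retry count and per-attempt timeout are purely illustrative assumptions; the thread gives no actual firmware values, only that a mid-layer timeout of about 30 seconds still expired.

```c
#include <assert.h>
#include <stdbool.h>

/* Worst-case time the firmware may hold a command before giving up,
 * assuming it retries a fixed number of times with a fixed timeout. */
static unsigned fw_worst_case_secs(unsigned attempts, unsigned per_attempt_secs)
{
    return attempts * per_attempt_secs;
}

/* True if the mid-layer timer expires before the firmware finishes retrying. */
static bool midlayer_times_out_first(unsigned midlayer_secs,
                                     unsigned attempts,
                                     unsigned per_attempt_secs)
{
    return fw_worst_case_secs(attempts, per_attempt_secs) > midlayer_secs;
}
```

For example, with a hypothetical 4 attempts of 10 seconds each, the firmware could hold the command for 40 seconds, longer than a 30 second mid-layer timeout.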
* RE: 2.5.59-dcl2
2003-01-29 19:41 2.5.59-dcl2 Mukker, Atul
@ 2003-01-29 20:09 ` Mark Haverkamp
0 siblings, 0 replies; 16+ messages in thread
From: Mark Haverkamp @ 2003-01-29 20:09 UTC (permalink / raw)
To: Mukker, Atul; +Cc: 'Matt_Domsch@Dell.com', andmike, linux-scsi
On Wed, 2003-01-29 at 11:41, Mukker, Atul wrote:
> What might be happening here is, FW is retrying the command for some
> iterations and each of them times out. The total timeout is greater than the
> expected window.
>
> A SCSI trace would tell us exactly what is going on there.
Unfortunately we don't have a SCSI analyzer here.
Mark.
--
Mark Haverkamp <markh@osdl.org>