* RE: 2.5.59-dcl2
From: Matt_Domsch @ 2003-01-29  5:35 UTC
  To: markh; +Cc: andmike, linux-scsi, atulm

> From: Mark Haverkamp [mailto:markh@osdl.org]
> I sent a bug report to the kernel bugzilla a while ago. It got assigned
> to someone at IBM by default.  This is a problem we have been having
> between our megaraid cards and an IBM enclosure for some time.  I get
> around it by disabling report luns in my kernel configuration.  
> 
> Could you take a look at the bug and let me know if I've included enough
> information to make sense.  I included debug output for the 1.18 and 2.0
> drivers.
> 
> http://bugme.osdl.org/show_bug.cgi?id=183
> 
> or
> 
> http://bugzilla.kernel.org/show_bug.cgi?id=183

Mark, your description in the bug appears accurate.  Here's what happens:

1) REPORT_LUNS is sent to the IBM enclosure, command times out.
2) scsi_unjam_host() runs, trying the error handlers...
3) The megaraid abort handler and then the reset handler get called (3 times).
Neither does anything at all, because the command has already been issued to
the firmware and the card itself isn't getting reset.  Both routines return
failure.
4) so scsi_unjam_host() offlines the device
5) scsi_decide_disposition() sees device is offline and returns success,
freeing the command for future use again.
6) The command eventually completes and mega_rundoneq() calls
cmd->scsi_done() on it, but cmd has been filled with 0x5a's, so it Oopses.


The scsi mid-layer shouldn't free a command that hasn't actually been
aborted/reset because it *could* come back from the firmware after the
timeout has expired, and the driver has a reference to it (need
refcounting...)  This could potentially lead to an exhaustion of the command
pool though, if a command *never* comes back.
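
To illustrate the refcounting idea above, here is a minimal user-space sketch
in C.  Everything in it is hypothetical: struct demo_cmd, demo_cmd_get() and
demo_cmd_put() are illustration names, not the real struct scsi_cmnd or any
mid-layer API, and the plain int counter is only safe single-threaded (kernel
code would need an atomic count).  It shows only the ownership rule: the
command is freed when the last holder drops its reference, so a late
completion from the firmware still finds valid memory.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdlib.h>

/* Hypothetical command object; NOT the real struct scsi_cmnd. */
struct demo_cmd {
    int refcount;                     /* one reference per holder */
    bool completed;                   /* set once done() has run */
    void (*done)(struct demo_cmd *);  /* completion callback */
};

/* Number of commands allocated and not yet freed (shows pool usage). */
static int demo_live_cmds;

static void demo_done(struct demo_cmd *cmd)
{
    cmd->completed = true;
}

/* Allocate a command holding one reference (the mid-layer's). */
static struct demo_cmd *demo_cmd_alloc(void)
{
    struct demo_cmd *cmd = calloc(1, sizeof(*cmd));

    if (!cmd)
        return NULL;
    cmd->refcount = 1;
    cmd->done = demo_done;
    demo_live_cmds++;
    return cmd;
}

/* Take an extra reference, e.g. when the driver hands cmd to firmware. */
static void demo_cmd_get(struct demo_cmd *cmd)
{
    assert(cmd->refcount > 0);
    cmd->refcount++;
}

/* Drop one reference; free on the last one.  Returns true if freed. */
static bool demo_cmd_put(struct demo_cmd *cmd)
{
    assert(cmd->refcount > 0);
    if (--cmd->refcount > 0)
        return false;
    free(cmd);
    demo_live_cmds--;
    return true;
}
```

In this sketch the driver would call demo_cmd_get() before issuing the
command and demo_cmd_put() after running done().  If error handling gives up
first, its demo_cmd_put() only drops the count and the memory stays valid
until the firmware returns the command.  The cost is the one already noted: a
command that never comes back is never freed.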

How long does it take for the IBM enclosure to return REPORT LUNS?  Since
this works on aic7xxx within the timeout period, I'm guessing the megaraid
firmware takes a long time to deal with it since it's a pass-through device.

I believe there is a way to issue an adapter reset command to the megaraid
firmware, though neither driver series 1.18 nor 2.00 does so presently.
Copying Atul for insight as to what effects this would have on the
controller and commands in flight...

Thanks,
Matt

--
Matt Domsch
Sr. Software Engineer, Lead Engineer, Architect
Dell Linux Solutions www.dell.com/linux
Linux on Dell mailing lists @ http://lists.us.dell.com


* RE: 2.5.59-dcl2
From: Mukker, Atul @ 2003-01-29 19:41 UTC
  To: 'Matt_Domsch@Dell.com', markh; +Cc: andmike, linux-scsi, Mukker, Atul

What might be happening here is, FW is retrying the command for some
iterations and each of them times out. The total timeout is greater than the
expected window.
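
The arithmetic behind that guess can be sketched in a few lines of C.  The
function names and any numbers plugged into them are hypothetical; the actual
firmware retry count, per-attempt timeout, and mid-layer timeout are not
stated in this thread.

```c
#include <assert.h>
#include <stdbool.h>

/* Worst-case time the firmware holds a command if every attempt times out. */
static unsigned int fw_worst_case_secs(unsigned int attempts,
                                       unsigned int per_attempt_secs)
{
    assert(attempts > 0);  /* at least the initial attempt is made */
    return attempts * per_attempt_secs;
}

/* True if the firmware's total retry time outlasts the mid-layer timeout. */
static bool exceeds_window(unsigned int attempts,
                           unsigned int per_attempt_secs,
                           unsigned int midlayer_timeout_secs)
{
    return fw_worst_case_secs(attempts, per_attempt_secs) >
           midlayer_timeout_secs;
}
```

With made-up values of 3 attempts at 10 seconds each against a 20 second
mid-layer timeout, the firmware would hold the command for 30 seconds, so the
mid-layer would start error handling before the firmware gives up.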

A SCSI trace would tell us exactly what is going on there.

Thanks
-Atul Mukker


* RE: 2.5.59-dcl2
From: Matt_Domsch @ 2003-01-28 23:17 UTC
  To: hch, shemminger; +Cc: linux-kernel

> >    Megaraid 2 driver                    (Matt Domsch)
> Is there a reason these aren't submitted to Linus?

Just timing on this one.  It's been in Stephen's tree for a while, and in
2.5.50-ac.  I've not heard of negative feedback, nor really any positive
feedback either.  I'll be happy to submit this.

Thanks,
Matt


--
Matt Domsch
Sr. Software Engineer, Lead Engineer, Architect
Dell Linux Solutions www.dell.com/linux
Linux on Dell mailing lists @ http://lists.us.dell.com


* 2.5.59-dcl2
From: Stephen Hemminger @ 2003-01-28 22:51 UTC
  To: Linux Kernel Mailing List

Update to the OSDL DCL patch set. 

The OSDL common code includes RAS-related enhancements, bug fixes, and the
latest versions of drivers for the OSDL server test machines.
   Linux Trace Toolkit (LTT)            (Karim Yaghmour)
   Linux Kernel Crash Dump (LKCD)       (Matt Robinson, LKCD team)
   Kernel Probes (kprobes)              (Rusty Russell)
   Megaraid 2 driver                    (Matt Domsch)
   DAC960 driver                        (Dave Olien)
        
The goal is to make these projects more robust and resolve potential
overlaps. Also this set keeps up to date with the latest version of
drivers for the disk devices that are present on the OSDL test
machines.

The DCL-only patch contains performance and tuning related
enhancements.  The goal is to use these for database performance
tuning and give these projects more testing.

2.5.59-osdl2:
. DAC960 error retry                    (Dave Olien)

2.5.59-dcl2:
. Lost timer tick compensation          (John Stultz)
. Improved boot time TSC synchronization (Jim Houston)
. Lockless gettimeofday                 (Andi Kleen, me)
. Performance monitoring counters for x86 (Mikael Pettersson)

2.5.59-osdl1:
. Bug fix for vmlinux.ld.S              (Kai Germaschewski)
. Update to LKCD for multiple schemes   (Bharata B Rao)
. Bug fixes for LKCD locking            (me)
. Improved i386 fatal event notifiers   (me)
. Kprobe using notify_die               (me)

2.5.59-dcl1:
.  RCU statistics                   (Dipankar Sarma)
.  Scheduler tunables               (Robert Love)

The latest release is available in downloadable patches from
        http://sourceforge.net/projects/osdldcl

or public BitKeeper repositories
        Common code:            bk://bk.osdl.org/linux-2.5-osdl
        Common code + CGL:      bk://bk.osdl.org/linux-2.5-cgl
        Common code + DCL:      bk://bk.osdl.org/linux-2.5-dcl

Getting Involved
----------------
If you are interested in the development of DCL, please subscribe to the
mailing list at http://lists.osdl.org/mailman/listinfo/dcl_discussion.

Developers are encouraged to send any enhancements or bug fix
patches.  Patches should be tested by using the OSDL Scalable Test
Platform (STP) and Patch Lifecycle Manager (PLM) facilities.

Project information:
        http://www.osdl.org/projects/dcl/
        http://osdldcl.sourceforge.net
        http://sourceforge.net/projects/osdldcl
