2.6.15-rc4 error messages with multiple qla2300 hba ports on fabric

linux-scsi.vger.kernel.org archive mirror

* 2.6.15-rc4 error messages with multiple qla2300 hba ports on fabric
@ 2005-12-02 15:26 Michael Reed
  0 siblings, 0 replies; 7+ messages in thread
From: Michael Reed @ 2005-12-02 15:26 UTC (permalink / raw)
  To: linux-scsi, James.Smart, Christoph Hellwig, Andrew Vasquez


Hello,

I've been testing with the qla2300 driver with 2.6.14.3 and 2.6.15-rc4.
I've observed two sets of error messages which are not present with
2.6.14.3.

First, the qla2300 driver is generating soft lockups.
Second, several error messages indicating that remote
ports are being deleted are being emitted.

 rport-2:0-16: blocked FC remote port time out: removing target and saving binding
 run_workqueue: recursion depth exceeded: 29

If the timing is just right, scsi errors are generated, though not evident
in the attached dmesg file.

I've observed similar behavior with my modified mpt fusion driver
when multiple hba ports are on the fabric.  The kernels tested
are as downloaded from kernel.org, without my mpt mods.

(Andrew, I'm not "blaming" your driver for the rport issues.  I chose
your driver to be the "victim" 'cause I didn't want to post this using
under-development mpt fusion code.)

Platform: SGI Altix IA64.

What additional information should I acquire?

Mike Reed
mdr@sgi.com

[-- Attachment #2: dmesg-2.6.15-rc4.bz2 --]
[-- Type: application/x-bzip2, Size: 10646 bytes --]


[parent not found: <B179AE41C1147041AA1121F44614F0B0012AD98D@AVEXCH02.qlogic.org>]

* Re: 2.6.15-rc4 error messages with multiple qla2300 hba ports on fabric
       [not found] <B179AE41C1147041AA1121F44614F0B0012AD98D@AVEXCH02.qlogic.org>
@ 2005-12-02 16:41 ` Michael Reed
  2005-12-02 17:05   ` Michele Baldessari
  2005-12-07 23:54   ` Andrew Vasquez
  0 siblings, 2 replies; 7+ messages in thread
From: Michael Reed @ 2005-12-02 16:41 UTC (permalink / raw)
  To: Andrew Vasquez; +Cc: linux-scsi



Andrew Vasquez wrote:
> 
>> From: Michael Reed [mailto:mdr@sgi.com]
>>
> 
> Sidenote:  I'm on the east-coast until hopefully tonight -- won't
> have a chance to look at debugging this for a couple of days...
> 
>> I've been testing with the qla2300 driver with 2.6.14.3 and 2.6.15-rc4.
>> I've observed two sets of error messages which are not present with
>> 2.6.14.3.
>>
>> First, the qla2300 driver is generating soft lockups.
> 
> Have a backtrace?

It's in the previously attached dmesg file; the relevant part is extracted
below.  Is the driver polling for the mailbox interrupt?

QLogic Fibre Channel HBA Driver
ACPI: PCI Interrupt 0002:01:04.0[A]: no GSI
qla2300 0002:01:04.0: Found an ISP2312, irq 74, iobase 0xc00000080fd00000
qla2300 0002:01:04.0: Configuring PCI space...
PCI: slot 0002:01:04.0 has incorrect PCI cache line size of 0 bytes, correcting to 128
qla2300 0002:01:04.0: Configure NVRAM parameters...
qla2300 0002:01:04.0: Verifying loaded RISC code...
qla2300 0002:01:04.0: Waiting for LIP to complete...
qla2300 0002:01:04.0: LIP reset occured (f800).
qla2300 0002:01:04.0: LOOP UP detected (2 Gbps).
qla2300 0002:01:04.0: Topology - (F_Port), Host Loop address 0xffff
BUG: soft lockup detected on CPU#0!
Modules linked in:

Pid: 1, CPU 0, comm:              swapper
psr : 00001010081a6018 ifs : 8000000000000a98 ip  : [<a000000100580380>]    Not tainted
ip is at qla2x00_mailbox_command+0x860/0xc40
unat: 0000000000000000 pfs : 0000000000000a98 rsc : 0000000000000003
rnat: aaaaa665556a9a55 bsps: a000000100589470 pr  : aaaaa655556a5555
ldrs: 0000000000000000 ccv : 0000000000000000 fpsr: 0009804c8a70433f
csd : 0000000000000000 ssd : 0000000000000000
b0  : a0000001005802f0 b6  : e0000030023e5490 b7  : a000000100589260
f6  : 0fff6fffffffff0000000 f7  : 1003e0000000000002710
f8  : 1003e00000000000003e8 f9  : 100078000000000000000
f10 : 10000bffffffff4000000 f11 : 1003e0000000000000003
r1  : a000000100c5a350 r2  : 0000000000000001 r3  : e00000b07bb20db0
r8  : 0000000000000001 r9  : 0000000000004000 r10 : 0000000000000000
r11 : 0000000000000002 r12 : e00000b07bb27af0 r13 : e00000b07bb20000
r14 : 0000000000000012 r15 : 0000000000002710 r16 : 0000000c610edd5d
r17 : 0000000000000001 r18 : 0000000000000002 r19 : 0000000000000000
r20 : 0000000000000000 r21 : ffffffffffff0028 r22 : e000003003104234
r23 : e000003003104231 r24 : e0000034f79183f8 r25 : 000000000000000d
r26 : 0000000c610edd6a r27 : 00000000000003e8 r28 : a0000001007b8ec0
r29 : 00000000000001fb r30 : 0000000c610edf7e r31 : a000000100a5aa58

Call Trace:
 [<a000000100013760>] show_stack+0x80/0xa0
                                sp=e00000b07bb27730 bsp=e00000b07bb21788
 [<a000000100013fd0>] show_regs+0x850/0x880
                                sp=e00000b07bb27900 bsp=e00000b07bb21728
 [<a000000100104600>] softlockup_tick+0x160/0x180
                                sp=e00000b07bb27910 bsp=e00000b07bb216f8
 [<a0000001000cb0c0>] do_timer+0x6a0/0x9a0
                                sp=e00000b07bb27920 bsp=e00000b07bb21688
 [<a000000100037100>] timer_interrupt+0x260/0x3e0
                                sp=e00000b07bb27920 bsp=e00000b07bb21638
 [<a0000001001047b0>] handle_IRQ_event+0x90/0x120
                                sp=e00000b07bb27920 bsp=e00000b07bb215f0
 [<a000000100104930>] __do_IRQ+0xf0/0x360
                                sp=e00000b07bb27920 bsp=e00000b07bb21598
 [<a000000100010400>] ia64_handle_irq+0xc0/0x160
                                sp=e00000b07bb27920 bsp=e00000b07bb21558
 [<a00000010000bce0>] ia64_leave_kernel+0x0/0x290
                                sp=e00000b07bb27920 bsp=e00000b07bb21558
 [<a000000100580380>] qla2x00_mailbox_command+0x860/0xc40
                                sp=e00000b07bb27af0 bsp=e00000b07bb21498
 [<a000000100581070>] qla2x00_login_fabric+0x150/0x260
                                sp=e00000b07bb27b40 bsp=e00000b07bb21440
 [<a000000100575580>] qla2x00_fabric_login+0xc0/0x3a0
                                sp=e00000b07bb27ba0 bsp=e00000b07bb21388
 [<a0000001005761f0>] qla2x00_fabric_dev_login+0x30/0x180
                                sp=e00000b07bb27be0 bsp=e00000b07bb21358
 [<a000000100578ac0>] qla2x00_configure_loop+0x2280/0x2a80
                                sp=e00000b07bb27be0 bsp=e00000b07bb21258
 [<a00000010057a320>] qla2x00_initialize_adapter+0x440/0x6c0
                                sp=e00000b07bb27ca0 bsp=e00000b07bb211c8
 [<a00000010056f4e0>] qla2x00_probe_one+0x1200/0x1fc0
                                sp=e00000b07bb27ca0 bsp=e00000b07bb21078
 [<a0000001005a15f0>] qla2300_probe_one+0x30/0x60
                                sp=e00000b07bb27d50 bsp=e00000b07bb21050
 [<a000000100420c70>] pci_device_probe+0x2d0/0x4a0
                                sp=e00000b07bb27d50 bsp=e00000b07bb21008
 [<a0000001004d05d0>] driver_probe_device+0xb0/0x1a0
                                sp=e00000b07bb27da0 bsp=e00000b07bb20fc8
 [<a0000001004d0730>] __driver_attach+0x70/0xc0
                                sp=e00000b07bb27da0 bsp=e00000b07bb20f98
 [<a0000001004cf880>] bus_for_each_dev+0xc0/0x140
                                sp=e00000b07bb27da0 bsp=e00000b07bb20f58
 [<a0000001004d02b0>] driver_attach+0x30/0x60
                                sp=e00000b07bb27dc0 bsp=e00000b07bb20f38
 [<a0000001004cec70>] bus_add_driver+0xf0/0x300
                                sp=e00000b07bb27dc0 bsp=e00000b07bb20ef8
 [<a0000001004d0940>] driver_register+0xa0/0xc0
                                sp=e00000b07bb27dc0 bsp=e00000b07bb20ed8
 [<a000000100420640>] __pci_register_driver+0x140/0x1e0
                                sp=e00000b07bb27dd0 bsp=e00000b07bb20ea8
 [<a000000100870fd0>] qla2300_init+0x30/0x60
                                sp=e00000b07bb27de0 bsp=e00000b07bb20e90
 [<a000000100009730>] init+0x470/0x920
                                sp=e00000b07bb27de0 bsp=e00000b07bb20e28
 [<a0000001000119e0>] kernel_thread_helper+0xe0/0x100
                                sp=e00000b07bb27e30 bsp=e00000b07bb20e00
 [<a000000100009120>] start_kernel_thread+0x20/0x40
                                sp=e00000b07bb27e30 bsp=e00000b07bb20e00


* Re: 2.6.15-rc4 error messages with multiple qla2300 hba ports on fabric
  2005-12-02 16:41 ` Michael Reed
@ 2005-12-02 17:05   ` Michele Baldessari
  2005-12-07 23:54   ` Andrew Vasquez
  1 sibling, 0 replies; 7+ messages in thread
From: Michele Baldessari @ 2005-12-02 17:05 UTC (permalink / raw)
  To: Michael Reed; +Cc: Andrew Vasquez, linux-scsi

* Michael Reed (mdr@sgi.com) wrote:
> > 
> > Sidenote:  I'm on the east-coast until hopefully tonight -- won't
> > have a chance to look at debugging this for a couple of days...
> > 
> >> I've been testing with the qla2300 driver with 2.6.14.3 and 2.6.15-rc4.
> >> I've observed two sets of error messages which are not present with
> >> 2.6.14.3.
> >>
> >> First, the qla2300 driver is generating soft lockups.
> > 
> > Have a backtrace?
> 
> In the previously attached dmesg file, extracted below.  Is the driver polling
> for mailbox interrupt?

Just adding a "me too": the platform here is a dual P4-Xeon IBM
box (happens on both 2.6.14.x and 2.6.15-rc1).

My report is here : http://lkml.org/lkml/mbox/2005/11/16/262

(The trace is much the same as Michael's.)

thanks,
Michele


* Re: 2.6.15-rc4 error messages with multiple qla2300 hba ports on fabric
  2005-12-02 16:41 ` Michael Reed
  2005-12-02 17:05   ` Michele Baldessari
@ 2005-12-07 23:54   ` Andrew Vasquez
  1 sibling, 0 replies; 7+ messages in thread
From: Andrew Vasquez @ 2005-12-07 23:54 UTC (permalink / raw)
  To: Michael Reed; +Cc: linux-scsi

On Fri, 02 Dec 2005, Michael Reed wrote:

> Andrew Vasquez wrote:
> > 
> >> From: Michael Reed [mailto:mdr@sgi.com]
> >>
> > 
> > Sidenote:  I'm on the east-coast until hopefully tonight -- won't
> > have a chance to look at debugging this for a couple of days...
> > 
> >> I've been testing with the qla2300 driver with 2.6.14.3 and 2.6.15-rc4.
> >> I've observed two sets of error messages which are not present with
> >> 2.6.14.3.
> >>
> >> First, the qla2300 driver is generating soft lockups.
> > 
> > Have a backtrace?
> 
> In the previously attached dmesg file, extracted below.  Is the driver polling
> for mailbox interrupt?

Yes, during init-time the driver polls for mailbox completions; in
this case it's for a PLOGI completion...

...
> qla2300 0002:01:04.0: Topology - (F_Port), Host Loop address 0xffff
> BUG: soft lockup detected on CPU#0!
> Modules linked in:
...
>  [<a00000010000bce0>] ia64_leave_kernel+0x0/0x290
>                                 sp=e00000b07bb27920 bsp=e00000b07bb21558
>  [<a000000100580380>] qla2x00_mailbox_command+0x860/0xc40
>                                 sp=e00000b07bb27af0 bsp=e00000b07bb21498
>  [<a000000100581070>] qla2x00_login_fabric+0x150/0x260
>                                 sp=e00000b07bb27b40 bsp=e00000b07bb21440
>  [<a000000100575580>] qla2x00_fabric_login+0xc0/0x3a0
>                                 sp=e00000b07bb27ba0 bsp=e00000b07bb21388

I'm wondering if something like the following is more appropriate.

---

Slight variant from a patch submitted by Jeff Layton
<jlayton@redhat.com>.

diff --git a/drivers/scsi/qla2xxx/qla_mbx.c b/drivers/scsi/qla2xxx/qla_mbx.c
index 9746cd1..3de8fee 100644
--- a/drivers/scsi/qla2xxx/qla_mbx.c
+++ b/drivers/scsi/qla2xxx/qla_mbx.c
@@ -196,7 +196,9 @@ qla2x00_mailbox_command(scsi_qla_host_t 
 			/* Check for pending interrupts. */
 			qla2x00_poll(ha);
 
-			udelay(10); /* v4.27 */
+			if (command != MBC_LOAD_RISC_RAM_EXTENDED &&
+			    !ha->flags.mbox_int)
+				msleep(10);
 		} /* while */
 	}
 

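For readers following the patch above, here is a minimal user-space sketch
of the same idea: poll for a completion, and sleep between polls only while
the completion flag is still clear.  Everything in it (struct fake_hba,
fake_poll, fake_sleep_ms, wait_for_mailbox, the may_sleep parameter) is a
hypothetical stand-in invented for illustration, not the driver's code.
The sleep is only recorded rather than performed so the behaviour is
deterministic.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for adapter state; not the driver's real structure. */
struct fake_hba {
    int polls_until_done;   /* simulated firmware latency, in poll calls */
    bool mbox_int;          /* set when the simulated completion arrives */
    long slept_ms;          /* total time the waiter asked to sleep */
};

/* Simulated check for a pending completion (stands in for a hardware poll). */
void fake_poll(struct fake_hba *ha)
{
    if (ha->polls_until_done > 0)
        ha->polls_until_done--;
    if (ha->polls_until_done == 0)
        ha->mbox_int = true;
}

/* Stand-in for a sleeping delay: it only records the request so the
 * example stays deterministic.  Real code would block here. */
void fake_sleep_ms(struct fake_hba *ha, long ms)
{
    ha->slept_ms += ms;
}

/* Poll up to max_polls times.  Sleep between polls only while the
 * completion flag is still clear and the caller allows sleeping. */
bool wait_for_mailbox(struct fake_hba *ha, int max_polls, bool may_sleep)
{
    assert(ha != NULL && max_polls > 0);
    for (int i = 0; i < max_polls && !ha->mbox_int; i++) {
        fake_poll(ha);
        if (may_sleep && !ha->mbox_int)
            fake_sleep_ms(ha, 10);
    }
    return ha->mbox_int;
}
```

In this model a completion that arrives on the third poll costs two sleeps
and no delay after the flag is set.  The trade-off in the real patch is
that a sleeping wait lets the scheduler (and the soft lockup detector) run,
at the cost of a much coarser polling interval than udelay(10), and it is
only valid in a context that is allowed to sleep.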

* RE: 2.6.15-rc4 error messages with multiple qla2300 hba ports on fabric
@ 2005-12-02 16:47 James.Smart
  2005-12-05 19:06 ` Michael Reed
  0 siblings, 1 reply; 7+ messages in thread
From: James.Smart @ 2005-12-02 16:47 UTC (permalink / raw)
  To: andrew.vasquez, mdr, linux-scsi, hch

We recently saw this as well. It's related to the number of targets
that go away simultaneously.

There end up being many delete-rport items on the work queue, and
when the 1st one stalls to flush the work queues, it starts the 2nd,
which stops to flush, and so on.
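
The nesting described above can be illustrated with a small user-space
model.  This is only a sketch under stated assumptions: the names
(struct sim_workqueue, sim_delete_rport, sim_run_workqueue, sim_max_depth)
are hypothetical and the model ignores locking, timers and multiple CPUs.
It assumes that flushing a queue from the thread that services it simply
re-enters the run loop, which then starts the next queued item.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical user-space model of one work queue. */
struct sim_workqueue {
    int pending;             /* queued "delete rport" items not yet started */
    int run_depth;           /* current nesting of sim_run_workqueue() */
    int max_depth;           /* deepest nesting seen so far */
    bool flush_in_handler;   /* does each item flush the queue it runs on? */
};

void sim_run_workqueue(struct sim_workqueue *wq);

/* One simulated work item.  Flushing from inside the item re-enters the
 * run loop on the same thread, which then starts the next queued item. */
void sim_delete_rport(struct sim_workqueue *wq)
{
    if (wq->flush_in_handler)
        sim_run_workqueue(wq);
}

/* Run every pending item, tracking how deeply the calls nest. */
void sim_run_workqueue(struct sim_workqueue *wq)
{
    assert(wq != NULL);
    wq->run_depth++;
    if (wq->run_depth > wq->max_depth)
        wq->max_depth = wq->run_depth;
    while (wq->pending > 0) {
        wq->pending--;
        sim_delete_rport(wq);
    }
    wq->run_depth--;
}

/* Queue `items` deletions, run them, and report the deepest nesting. */
int sim_max_depth(int items, bool flush_in_handler)
{
    struct sim_workqueue wq = { items, 0, 0, flush_in_handler };
    sim_run_workqueue(&wq);
    return wq.max_depth;
}
```

In this model the nesting grows by one level per queued item when every
item flushes, and stays at one level when none does, which is why the
depth tracks the number of targets that disappear at the same time.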

My inclination is to look at what we have on the work queue and see if we can
circumvent some of the flush calls.

-- james s

Here's a backtrace:
rport-4:0-37: blocked FC remote port time out: removing target and saving binding
rport-4:0-42: blocked FC remote port time out: removing target and saving binding
rport-4:0-55: blocked FC remote port time out: removing target and saving binding
run_workqueue: recursion depth exceeded: 4
Call Trace:<ffffffff80146270>{flush_cpu_workqueue+96} <ffffffff8036a9b0>{_spin_lock_irqsave+32}
<ffffffff80146473>{flush_workqueue+115} <ffffffff880a05b5>{:scsi_transport_fc:fc_rport_tgt_remove+85}
<ffffffff801462e0>{flush_cpu_workqueue+208} <ffffffff880a0fc0>{:scsi_transport_fc:fc_timeout_deleted_rport+0}
<ffffffff8036a9b0>{_spin_lock_irqsave+32} <ffffffff80146473>{flush_workqueue+115}
<ffffffff880a05b5>{:scsi_transport_fc:fc_rport_tgt_remove+85}
<ffffffff801462e0>{flush_cpu_workqueue+208} <ffffffff880a0fc0>{:scsi_transport_fc:fc_timeout_deleted_rport+0}
<ffffffff880a0fc0>{:scsi_transport_fc:fc_timeout_deleted_rport+0}
<ffffffff8036a9b0>{_spin_lock_irqsave+32} <ffffffff880a0fc0>{:scsi_transport_fc:fc_timeout_deleted_rport+0}
<ffffffff80146473>{flush_workqueue+115} <ffffffff880a05b5>{:scsi_transport_fc:fc_rport_tgt_remove+85}
<ffffffff8014617c>{worker_thread+508} <ffffffff8012f0e0>{default_wake_function+0}
<ffffffff8012f0e0>{default_wake_function+0} <ffffffff80145f80>{worker_thread+0}
<ffffffff8014a9c9>{kthread+217} <ffffffff8010edbe>{child_rip+8}
<ffffffff8014a8f0>{kthread+0} <ffffffff8010edb6>{child_rip+0}
rport-4:0-53: blocked FC remote port time out: removing target and saving binding
run_workqueue: recursion depth exceeded: 5
Call Trace:<ffffffff80146270>{flush_cpu_workqueue+96} <ffffffff8036a9b0>{_spin_lock_irqsave+32}
<ffffffff80146473>{flush_workqueue+115} <ffffffff880a05b5>{:scsi_transport_fc:fc_rport_tgt_remove+85}
<ffffffff801462e0>{flush_cpu_workqueue+208} <ffffffff880a0fc0>{:scsi_transport_fc:fc_timeout_deleted_rport+0}
<ffffffff8036a9b0>{_spin_lock_irqsave+32} <ffffffff80146473>{flush_workqueue+115}
<ffffffff880a05b5>{:scsi_transport_fc:fc_rport_tgt_remove+85}
<ffffffff801462e0>{flush_cpu_workqueue+208} <ffffffff880a0fc0>{:scsi_transport_fc:fc_timeout_deleted_rport+0}
<ffffffff8036a9b0>{_spin_lock_irqsave+32} <ffffffff80146473>{flush_workqueue+115}
<ffffffff880a05b5>{:scsi_transport_fc:fc_rport_tgt_remove+85}
<ffffffff801462e0>{flush_cpu_workqueue+208} <ffffffff880a0fc0>{:scsi_transport_fc:fc_timeout_deleted_rport+0}
<ffffffff880a0fc0>{:scsi_transport_fc:fc_timeout_deleted_rport+0}
<ffffffff8036a9b0>{_spin_lock_irqsave+32} <ffffffff880a0fc0>{:scsi_transport_fc:fc_timeout_deleted_rport+0}
<ffffffff80146473>{flush_workqueue+115} <ffffffff880a05b5>{:scsi_transport_fc:fc_rport_tgt_remove+85}
<ffffffff8014617c>{worker_thread+508} <ffffffff8012f0e0>{default_wake_function+0}
<ffffffff8012f0e0>{default_wake_function+0} <ffffffff80145f80>{worker_thread+0}
<ffffffff8014a9c9>{kthread+217} <ffffffff8010edbe>{child_rip+8}
<ffffffff8014a8f0>{kthread+0} <ffffffff8010edb6>{child_rip+0}
...
<ffffffff8014a9c9>{kthread+217} <ffffffff8010edbe>{child_rip+8}
<ffffffff8014a8f0>{kthread+0} <ffffffff8010edb6>{child_rip+0}
rport-4:0-38: blocked FC remote port time out: removing target and saving binding
run_workqueue: recursion depth exceeded: 30

-----Original Message-----
From: Andrew Vasquez [mailto:andrew.vasquez@qlogic.com]
Sent: Friday, December 02, 2005 11:29 AM
To: Michael Reed; linux-scsi@vger.kernel.org; Smart, James; Christoph Hellwig
Subject: RE: 2.6.15-rc4 error messages with multiple qla2300 hba ports on fabric

> From: Michael Reed [mailto:mdr@sgi.com]
>

Sidenote:  I'm on the east-coast until hopefully tonight -- won't
have a chance to look at debugging this for a couple of days...

> I've been testing with the qla2300 driver with 2.6.14.3 and 2.6.15-rc4.
> I've observed two sets of error messages which are not present with
> 2.6.14.3.
>
> First, the qla2300 driver is generating soft lockups.

Have a backtrace?

> Second, several error messages indicating that remote
> ports are being deleted are being emitted.
>
>  rport-2:0-16: blocked FC remote port time out: removing target and saving binding
>  run_workqueue: recursion depth exceeded: 29
>
> If the timing is just right, scsi errors are generated, though not evident
> in the attached dmesg file.
>
> I've observed similar behavior with my modified mpt fusion driver
> when multiple hba ports are on the fabric.  The kernels tested
> are as downloaded from kernel.org, without my mpt mods.
>
> (Andrew, I'm not "blaming" your driver for the rport issues.  I chose
> your driver to be the "victim" 'cause I didn't want to post this using
> under development code with mpt fusion.)
>
> Platform: SGI Altix IA64.
>
> What additional information should I acquire?


* Re: 2.6.15-rc4 error messages with multiple qla2300 hba ports on fabric
  2005-12-02 16:47 James.Smart
@ 2005-12-05 19:06 ` Michael Reed
  0 siblings, 0 replies; 7+ messages in thread
From: Michael Reed @ 2005-12-05 19:06 UTC (permalink / raw)
  To: James.Smart; +Cc: andrew.vasquez, linux-scsi, hch



James.Smart@Emulex.Com wrote:
> We recently saw this as well. It's related to the number of targets
> that go away simultaneously.
> 
> There ends up being many delete rport items on the work queue, and 
> when the 1st one stalls to flush the work queues, it starts the 2nd,
> which stops to flush, and so on.
> 
> My inclination is to look at what we have on the work queue and see if we can
> circumvent some of the flush calls.

Snooping the work queue?  That sounds a little, um, like a hack?
If the code is correct, and the end result is correct, is the
test for recursion level and the associated dump_stack() necessary?
(Yeah, I know, newbie questions. :)


Mike

...snip...



* RE: 2.6.15-rc4 error messages with multiple qla2300 hba ports on fabric
@ 2005-12-07 21:41 James.Smart
  0 siblings, 0 replies; 7+ messages in thread
From: James.Smart @ 2005-12-07 21:41 UTC (permalink / raw)
  To: mdr; +Cc: andrew.vasquez, linux-scsi, hch

> > My inclination is to look at what we have on the work queue and see if we can
> > circumvent some of the flush calls.
> 
> Snooping the work queue?  That sounds a little, um, like a hack?
> If the code is correct, and the end result is correct, is the
> test for recursion level and the associated dump_stack() necessary?
> (Yeah, I know, newbie questions. :)

Well, my thinking is along the same lines... Also, there is too much other
work on the sdevs, etc., to really get a good feel for things. Plus, that
should be handled by default in the other layers. However, I wouldn't
address it by eliminating the recursion level check.

We're testing a patch that deals with the recursion by not doing it. Oldest
trick in the book :)   Will keep you posted once we know the results.

-- james s


end of thread, other threads:[~2005-12-07 23:54 UTC | newest]

Thread overview: 7+ messages
2005-12-02 15:26 2.6.15-rc4 error messages with multiple qla2300 hba ports on fabric Michael Reed
     [not found] <B179AE41C1147041AA1121F44614F0B0012AD98D@AVEXCH02.qlogic.org>
2005-12-02 16:41 ` Michael Reed
2005-12-02 17:05   ` Michele Baldessari
2005-12-07 23:54   ` Andrew Vasquez
  -- strict thread matches above, loose matches on Subject: below --
2005-12-02 16:47 James.Smart
2005-12-05 19:06 ` Michael Reed
2005-12-07 21:41 James.Smart
