From: Andrew Vasquez <andrew.vasquez@qlogic.com>
To: Tore Anderson <tore@linpro.no>
Cc: Linux SCSI Mailing List <linux-scsi@vger.kernel.org>
Subject: Re: Recurring qla2xxx crashes (maybe APIC related)
Date: Fri, 25 Apr 2008 08:50:18 -0700
Message-ID: <20080425155018.GG8849@plap4>
In-Reply-To: <4811ACAC.7060808@linpro.no>

On Fri, 25 Apr 2008, Tore Anderson wrote:

> Hi.  I've been having recurring problems with the qla2xxx driver or
> firmware lockups.  Seems to happen out of the blue, with nothing special 
> going on on the SAN.
> 
> The servers are IBM BladeCenter HS21 8853A2Gs, with dual-port QLA2422
> cards connected to a dual-fabric topology.  They are running Ubuntu
> 6.06, kernel 2.6.22.19 with some OCFS2 patches applied.  qla2xxx driver
> version is 8.01.07-k7, loaded with params qlport_down_retry=35 and
> ql2xextended_error_logging=1.  Firmware is the latest from QLogic's FTP.
> 
> When they crash, the following is logged:
> 
> Apr 21 09:50:33 xander kernel: APIC error on CPU3: 00(40)

There's a slew of problem reports noted on the web with this 'APIC
error' signature...  From the qla2xxx driver perspective the following
logs show the classic 'no interrupts being routed' failures:

I/O needs to be aborted, request issued:

> Apr 21 09:51:18 xander kernel: qla2xxx_eh_abort(1): aborting sp ffff81010ae4c7c0 from RISC. pid=1024761.

The request times out, so the driver's only recourse is to perform a
full RISC reset:

> Apr 21 09:51:48 xander kernel: qla2x00_mailbox_command(1): timeout calling abort_isp
> Apr 21 09:51:48 xander kernel: qla2x00_mailbox_command(1): timeout calling abort_isp
> Apr 21 09:51:48 xander kernel: qla2xxx 0000:08:01.0: Mailbox command timeout occured. Issuing ISP abort.
> Apr 21 09:51:48 xander kernel: qla2xxx 0000:08:01.0: Performing ISP error recovery - ha= ffff81021f5ec530.
> Apr 21 09:51:48 xander kernel: scsi(1): **** Load RISC code ****
> Apr 21 09:51:48 xander kernel: scsi(1): Verifying Checksum of loaded RISC code.
> Apr 21 09:51:48 xander kernel: scsi(1): Checksum OK, start firmware.
> Apr 21 09:51:48 xander kernel: scsi(1): Issue init firmware.
> Apr 21 09:51:49 xander kernel: scsi(1): fcport-0 - port retry count: 34 remaining
> Apr 21 09:51:49 xander kernel: scsi(1): fcport-1 - port retry count: 34 remaining
> Apr 21 09:51:49 xander kernel: scsi(1): Asynchronous P2P MODE received.
> Apr 21 09:51:49 xander kernel: scsi(1): Asynchronous LOOP UP (4 Gbps).
> Apr 21 09:51:49 xander kernel: qla2xxx 0000:08:01.0: LOOP UP detected (4 Gbps).
> Apr 21 09:51:49 xander kernel: scsi(1): F/W Ready - OK 
> Apr 21 09:51:49 xander kernel: scsi(1): Asynchronous PORT UPDATE.

Note, the driver is in 'polling' mode during error recovery...

> Apr 21 09:51:49 xander kernel: scsi(1): Port database changed ffff 0006 0000.
> Apr 21 09:51:49 xander kernel: scsi(1): Asynchronous PORT UPDATE ignored 0000/0007/0b00.
> Apr 21 09:51:49 xander kernel: scsi(1): Asynchronous PORT UPDATE ignored 0001/0007/0b00.
> Apr 21 09:51:49 xander kernel: scsi(1): Asynchronous PORT UPDATE ignored 0002/0004/0600.
> Apr 21 09:51:49 xander kernel: scsi(1): Asynchronous PORT UPDATE ignored 0002/0007/0b00.
> Apr 21 09:51:49 xander kernel: scsi(1): fw_state=3 curr time=102a04c2d.
> Apr 21 09:51:49 xander kernel: qla2x00_restart_isp(): Start configure loop, status = 0
> Apr 21 09:51:49 xander kernel: scsi(1): Configure loop -- dpc flags =0x4080048
> Apr 21 09:51:49 xander kernel: scsi(1): RSCN queue entry[0] = [00/000000].
> Apr 21 09:51:49 xander kernel: scsi(1): device_resync: rscn overflow.
> Apr 21 09:51:50 xander kernel: scsi(1): RFT_ID failed, completion status (280).
> Apr 21 09:51:50 xander kernel: scsi(1): Register FC-4 TYPE failed.
> Apr 21 09:51:50 xander kernel: scsi(1): RFF_ID failed, completion status (280).
> Apr 21 09:51:50 xander kernel: scsi(1): fcport-0 - port retry count: 33 remaining
> Apr 21 09:51:50 xander kernel: scsi(1): fcport-1 - port retry count: 33 remaining
> Apr 21 09:51:50 xander kernel: scsi(1): Register FC-4 Features failed.
> Apr 21 09:51:50 xander kernel: scsi(1): RNN_ID failed, completion status (280).
> Apr 21 09:51:50 xander kernel: scsi(1): Register Node Name failed.
> Apr 21 09:51:50 xander kernel: scsi(1): GID_PT failed, completion status (6380).
> Apr 21 09:51:50 xander kernel: scsi(1): GA_NXT failed, rejected request:
> Apr 21 09:51:50 xander kernel: 0   1   2   3   4   5   6   7   8   9  Ah  Bh  Ch  Dh  Eh  Fh
> Apr 21 09:51:50 xander kernel: --------------------------------------------------------------
> Apr 21 09:51:50 xander kernel: 14  00  00  00  00  70  26  1f  02  00  00  00  10  08  00  00
> Apr 21 09:51:50 xander kernel: qla2xxx 0000:08:01.0: SNS scan failed -- assuming zero-entry result...
> Apr 21 09:51:50 xander kernel: qla24xx_fabric_logout(1): failed to complete IOCB -- completion status (2)  ioparam=0/810031.
> Apr 21 09:51:50 xander kernel: scsi(1): LOOP READY
> Apr 21 09:51:50 xander kernel: qla2x00_restart_isp(): Configure loop done, status = 0x0
> Apr 21 09:51:50 xander kernel: APIC error on CPU4: 00(40)
> Apr 21 09:51:50 xander kernel: qla2x00_abort_isp(1): exiting.
> Apr 21 09:51:50 xander kernel: qla2x00_mailbox_command(1): finished abort_isp
> Apr 21 09:51:50 xander kernel: qla2x00_mailbox_command(1): finished abort_isp
> Apr 21 09:51:50 xander kernel: qla2x00_mailbox_command(1): **** FAILED. mbx0=54, mbx1=0, mbx2=1f58, cmd=54 ****
> Apr 21 09:51:50 xander kernel: qla2x00_issue_iocb(1): failed rval 0x100
> Apr 21 09:51:50 xander kernel: qla2x00_issue_iocb(1): failed rval 0x100
> Apr 21 09:51:50 xander kernel: qla24xx_abort_command(1): failed to issue IOCB (100).
> Apr 21 09:51:50 xander kernel: qla2xxx_eh_abort(1): abort_command mbx failed.

Error recovery (RISC reset) completes, and the driver transitions back
to normal INTx processing and continues.  The next abort request comes
down:

> Apr 21 09:51:50 xander kernel: qla2xxx 0000:08:01.0: scsi(1:1:5): Abort command issued -- 0 fa2f9 2002.
> Apr 21 09:51:51 xander kernel: scsi(1): fcport-0 - port retry count: 32 remaining
...
> Apr 21 09:52:24 xander kernel: scsi(1): fcport-0 - port retry count: 0 remaining
> Apr 21 09:52:24 xander kernel: scsi(1): fcport-1 - port retry count: 0 remaining

So do the subsequent abort requests fail with a similar signature
(timeout)?
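
One way to answer that across several failures is to measure how long
each abort request waits before the mailbox timeout is logged.  The
following is only a rough sketch, not anything the driver ships: the
function names are made up for illustration, the year is an assumption
(syslog lines carry none), and the two message fragments are simply
copied from the excerpt above.

```python
import re
from datetime import datetime

# Message fragments copied from the log excerpt in this thread.
ABORT_REQUEST = "aborting sp"
MAILBOX_TIMEOUT = "timeout calling abort_isp"

# Classic syslog prefix, e.g. "Apr 21 09:51:18 ".
STAMP = re.compile(r"^(\w{3}\s+\d{1,2} \d\d:\d\d:\d\d) ")


def parse_stamp(line, year):
    """Return the line's timestamp, or None if it has no syslog prefix."""
    match = STAMP.match(line)
    if match is None:
        return None
    # Syslog omits the year, so the caller has to supply one (assumption).
    return datetime.strptime(f"{year} {match.group(1)}", "%Y %b %d %H:%M:%S")


def abort_to_timeout_delays(lines, year=2008):
    """Seconds between each abort request and the next mailbox timeout."""
    delays = []
    pending = None  # time of the most recent unanswered abort request
    for line in lines:
        when = parse_stamp(line, year)
        if when is None:
            continue
        if ABORT_REQUEST in line:
            pending = when
        elif MAILBOX_TIMEOUT in line and pending is not None:
            delays.append((when - pending).total_seconds())
            pending = None  # skip the duplicate timeout line that follows
    return delays


# Three lines taken verbatim from the excerpt (without the "> " quoting).
SAMPLE = [
    "Apr 21 09:51:18 xander kernel: qla2xxx_eh_abort(1): aborting sp ffff81010ae4c7c0 from RISC. pid=1024761.",
    "Apr 21 09:51:48 xander kernel: qla2x00_mailbox_command(1): timeout calling abort_isp",
    "Apr 21 09:51:48 xander kernel: qla2x00_mailbox_command(1): timeout calling abort_isp",
]

print(abort_to_timeout_delays(SAMPLE))  # [30.0]
```

If later aborts show the same 30-second gap, they are probably timing
out the same way; an abort that never pairs with a timeout line would
point at a different failure path.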

> It varies on which CPU the APIC error happens, but after that it's
> always the same:  qla2xxx complaining and attempting to restart the
> firmware without any success, and I/O service never recovers.  Soon
> thereafter other cluster members fence out the problematic machine by
> rebooting it.
> 
> Any ideas on what could cause this, or how to track down the problem?

There's a blanket suggestion that has helped others (perhaps by
ignoring the problem), which is to disable the APIC with these kernel
boot parameters:

	apm=force noapic acpi=off pci=noacpi

but that seems like a bandaid.  I'd suggest you work this through your
IBM support contract, if possible.
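
If you do try those parameters, it's worth confirming after the reboot
that the kernel actually received them.  A minimal sketch, assuming the
running kernel exposes its command line in /proc/cmdline; the helper
names are hypothetical:

```python
# The four parameters suggested above.
SUGGESTED = ("apm=force", "noapic", "acpi=off", "pci=noacpi")


def missing_params(cmdline, wanted=SUGGESTED):
    """Return the wanted parameters that are absent from a command line."""
    present = set(cmdline.split())
    return [param for param in wanted if param not in present]


def read_cmdline(path="/proc/cmdline"):
    """Read the running kernel's command line (the path is an assumption)."""
    with open(path) as handle:
        return handle.read()
```

Calling missing_params(read_cmdline()) on the rebooted machine should
return an empty list if all four parameters took effect.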

BTW: I'd like to take a look at several failure iterations.  Could you
send the messages file covering the failures?
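
If the full file is too large to mail, something along these lines
could trim it down to the kernel messages from the days on which the
APIC error was logged.  This is only a sketch under assumptions: the
function names are invented here, the file is assumed to be in classic
syslog format (a six-character "Mmm dd" date prefix and a " kernel: "
tag, as in the excerpt above), and the path is whatever your syslog
writes to.

```python
def failure_days(lines):
    """Collect the 'Mmm dd' date prefixes of lines reporting an APIC error."""
    return {line[:6] for line in lines if "APIC error" in line}


def kernel_lines_on_failure_days(lines):
    """Keep kernel messages logged on any day that saw an APIC error."""
    lines = list(lines)  # two passes are needed, so materialise the input
    days = failure_days(lines)
    return [
        line.rstrip("\n")
        for line in lines
        if " kernel: " in line and line[:6] in days
    ]


def extract(path):
    """Apply the filter to a log file; the path is supplied by the caller."""
    with open(path, errors="replace") as handle:
        return kernel_lines_on_failure_days(handle)
```

Filtering by whole days is deliberately coarse, so that the lines
leading up to each failure are kept as well.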
