Poisoning of Linux initiators on SCST reboot.

public inbox for linux-scsi@vger.kernel.org

* Poisoning of Linux initiators on SCST reboot.
@ 2008-07-24 17:50 greg
  2008-07-25 13:11 ` Stanislaw Gruszka
  2008-07-25 13:45 ` Andrew Vasquez
  0 siblings, 2 replies; 15+ messages in thread
From: greg @ 2008-07-24 17:50 UTC
  To: scst-devel; +Cc: linux-driver, linux-scsi, stf_xl, vst

Good morning to everyone, hope your respective days are going well.
Sorry for the wide cast on this but I wanted to get what would seem to
be the concerned parties on this issue in the loop.

We have been putting SCST through an extensive round of pre-production
testing.  I wanted to start following up on some of the issues we have
noted.

We will be putting SCST into service to support mirrored storage from
client initiators to two separate data-centers.  The filesystems on
the client initiators access storage at the two data-centers via a
Linux MD RAID1 device.  The SAN architecture is based on Cisco
MDS-9509 switches.

Just as an aside for people considering use of SCST.  The core engine
has been rock solid.  Our testing rounds consist of driving around
1/8 of a petabyte of widely disparate I/O types from multiple
initiators to a pair of targets in the two data-centers.  SCST hasn't
missed a beat so far, so kudos to Vlad and everyone involved in its
development.

As we began forced failure testing one issue has come up that I wanted
to advise people of.  A hard reboot of an SCST target server results
in the 'poisoning' of Linux based initiators.  We verified the issue
as being present on client initiators running the stock RHEL5 kernel
up through 2.6.26.

The targets are using Qlogic 2462 cards using the isp_mod driver.  The
client initiators are using Qlogic 2342 cards with the qla2xxx driver.

Failure mode is as follows:

        1.) Configure SCST based storage for an initiator (vdisk
            based).

        2.) Activate initiator.  Initiator logs into fabric and
            discovers SCST based storage.

        3.) Force SCST target failure by rebooting or pulling power.

        4.) SCST target returns to service and logs into zone.

        5.) Initiator picks up RSCN but re-activates the rport for
            SCST server as an INITIATOR rather than TARGET role.

After this point in time the initiator is effectively 'poisoned'.

Nothing short of unloading and reloading the Qlogic 2xxx driver on the
client initiator will allow the initiator to recognize the SCST server
as a target device.  A driver unload/reload of course is not an option
to restore connectivity since it would take the remaining live side of
the mirror off-line as well.

We finally figured out what seems to be happening by watching the logs
on the client and comparing what was going on there to the FLOGI login
status on the fabric.

When the SCST target server reboots the initiator times out the remote
port and places it into 'unknown' state.  The qla2xxx driver,
according to the source code, maintains the previous rport state in
driver internal data.

The 2462 card in the target on boot logs into the fabric with an
initiator role, I'm assuming in support of BIOS based SAN booting. The
client initiator picks up on this and re-activates the rport as being
in an INITIATOR role.

Loading the isp_mod driver causes the 2462 card in the target to be
shut down.  The client initiator picks up on this and times out the
rport, retaining the last rport state as INITIATOR.

Enabling target mode on the 2462 causes it to log back into the
fabric.  The client initiator picks up on the RSCN but refuses to
transition the rport from INITIATOR to TARGET state.  Without going
into TARGET state the remote port won't have SCSI device discovery
initiated against it and hence the SCST based storage is inaccessible.

Activating a LIP on the client initiates a new fabric login attempt
which completes with the following message:

Jul 24 02:53:59 init-test kernel: rport-2:0-0: blocked FC remote port
time out: no longer a FCP target, removing starget

Which from a review of the source code seems consistent with our
analysis of the problem.

The culprit is the following code from drivers/scsi/scsi_transport_fc.c:

        if ((rport->port_state == FC_PORTSTATE_ONLINE) &&
            (rport->scsi_target_id != -1) &&
            !(rport->roles & FC_PORT_ROLE_FCP_TARGET)) {
                dev_printk(KERN_ERR, &rport->dev,
                        "blocked FC remote port time out: no longer"
                        " a FCP target, removing starget\n");
                spin_unlock_irqrestore(shost->host_lock, flags);
                scsi_target_unblock(&rport->dev);
                fc_queue_work(shost, &rport->stgt_delete_work);
                return;
        }

The above gets executed in response to the LIP on the initiator.  The
value in rport->roles is being populated with the remote port's
previous INITIATOR role rather than its current TARGET role.
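
For readers following along, here is a minimal Python sketch of the
three-part condition in the excerpt above.  It is an illustration only,
not kernel code.  The role bit values are assumed to match
include/scsi/scsi_transport_fc.h, the string used for the port state is
a placeholder, and Rport and timeout_removes_starget are hypothetical
names.

```python
from dataclasses import dataclass

# Assumed values, mirroring include/scsi/scsi_transport_fc.h.
FC_PORT_ROLE_UNKNOWN = 0x00
FC_PORT_ROLE_FCP_TARGET = 0x01
FC_PORT_ROLE_FCP_INITIATOR = 0x02

FC_PORTSTATE_ONLINE = "online"  # placeholder for the kernel enum value


@dataclass
class Rport:
    """Hypothetical stand-in for the struct fc_rport fields used in the check."""
    port_state: str
    scsi_target_id: int
    roles: int


def timeout_removes_starget(rport: Rport) -> bool:
    """Return True when the quoted branch would queue stgt_delete_work."""
    return (
        rport.port_state == FC_PORTSTATE_ONLINE
        and rport.scsi_target_id != -1
        and not (rport.roles & FC_PORT_ROLE_FCP_TARGET)
    )


# The reported situation: the port is online again and held a target id
# before, but the roles recorded for it say INITIATOR only.
poisoned = Rport(FC_PORTSTATE_ONLINE, scsi_target_id=0,
                 roles=FC_PORT_ROLE_FCP_INITIATOR)
healthy = Rport(FC_PORTSTATE_ONLINE, scsi_target_id=0,
                roles=FC_PORT_ROLE_FCP_TARGET)

print(timeout_removes_starget(poisoned))  # True
print(timeout_removes_starget(healthy))   # False
```

With the roles recorded as INITIATOR the model takes the removal
branch, which matches the "no longer a FCP target" message seen on the
client.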

Windows client initiators running against the SCST targets get the
transition and login sequence correct.  When the SCST target is
re-activated after the cold boot those clients immediately re-discover
their storage while the Linux clients issue error messages about loss
of the remote target.

While all this doesn't seem to be technically a bug with SCST it
certainly is a problematic usage scenario.  It may also explain why
some individuals may have had problems getting SCST clients to access
their storage.

If a test SCST server was plugged into an active zone and turned on it
would immediately poison any Linux clients.  No amount of proper
configuration on the target would allow the client to access storage
until the client was rebooted or its drivers reloaded.

Any suggestions on how to move forward would be appreciated.  We've
got a pretty extensive test environment and would be happy to test run
any suggested changes or patches.

Once again a thank you to everyone who has contributed to SCST
development.  Other than this and a few additional glitches I will
follow up with via additional e-mails it is presenting itself as a
very solid platform for storage delivery.

Best wishes for a pleasant weekend to everyone.

As always,
Dr. G.W. Wettstein, Ph.D.   Enjellic Systems Development, LLC.
4206 N. 19th Ave.           Specializing in information infra-structure
Fargo, ND  58102            development.
PH: 701-281-1686
FAX: 701-281-3949           EMAIL: greg@enjellic.com
------------------------------------------------------------------------------
"Much work remains to be done before we can announce our total failure
 to make any progress."
                                -- Mike Kelly


* Re: Poisoning of Linux initiators on SCST reboot.
  2008-07-24 17:50 greg
@ 2008-07-25 13:11 ` Stanislaw Gruszka
  2008-07-25 13:45 ` Andrew Vasquez
  1 sibling, 0 replies; 15+ messages in thread
From: Stanislaw Gruszka @ 2008-07-25 13:11 UTC
  To: greg; +Cc: scst-devel, linux-driver, linux-scsi, vst

On Thursday 24 July 2008 19:50:51 greg@enjellic.com wrote:
> As we began forced failure testing one issue has come up that I wanted
> to advise people of.  A hard reboot of an SCST target server results
> in the 'poisoning' of Linux based initiators.  We verified the issue
> as being present on client initiators running the stock RHEL5 kernel
> up through 2.6.26.
>
> The targets are using Qlogic 2462 cards using the isp_mod driver.  The
> client initiators are using Qlogic 2342 cards with the qla2xxx driver.
[snip]
> Any suggestions on how to move forward would be appreciated.  We've
> got a pretty extensive test environment and would be happy to test run
> any suggested changes or patches.
You might try running the isp_mod driver in initiator mode (ideally
compiled with the initiator role only) to see how it copes with this
situation.

Cheers
Stanislaw Gruszka


* Re: Poisoning of Linux initiators on SCST reboot.
  2008-07-24 17:50 greg
  2008-07-25 13:11 ` Stanislaw Gruszka
@ 2008-07-25 13:45 ` Andrew Vasquez
  1 sibling, 0 replies; 15+ messages in thread
From: Andrew Vasquez @ 2008-07-25 13:45 UTC
  To: greg; +Cc: scst-devel, linux-driver, linux-scsi, stf_xl, vst

On Thu, 24 Jul 2008, greg@enjellic.com wrote:

> Good morning to everyone, hope your respective days are going well.
> Sorry for the wide cast on this but I wanted to get what would seem to
> be the concerned parties on this issue in the loop.
<snip>
> The targets are using Qlogic 2462 cards using the isp_mod driver.  The
> client initiators are using Qlogic 2342 cards with the qla2xxx driver.
> 
> Failure mode is as follows:
> 
>         1.) Configure SCST based storage for an initiator (vdisk
>             based).
> 
>         2.) Activate initiator.  Initiator logs into fabric and
>             discovers SCST based storage.
> 
>         3.) Force SCST target failure by rebooting or pulling power.
> 
>         4.) SCST target returns to service and logs into zone.
> 
>         5.) Initiator picks up RSCN but re-activates the rport for
>             SCST server as an INITIATOR rather than TARGET role.
> 
> After this point in time the initiator is effectively 'poisoned'.
> 
> Nothing short of unloading and reloading the Qlogic 2xxx driver on the
> client initiator will allow the initiator to recognize the SCST server
> as a target device.  A driver unload/reload of course is not an option
> to restore connectivity since it would take the remaining live side of
> the mirror off-line as well.
> 
> We finally figured out what seems to be happening by watching the logs
> on the client and comparing what was going on there to the FLOGI login
> status on the fabric.
> 
> When the SCST target server reboots the initiator times out the remote
> port and places it into 'unknown' state.  The qla2xxx driver,
> according to the source code, maintains the previous rport state in
> driver internal data.

Ok, thanks for the detailed description of the problem...

> The 2462 card in the target on boot logs into the fabric with an
> initiator role, I'm assuming in support of BIOS based SAN booting. The
> client initiator picks up on this and re-activates the rport as being
> in an INITIATOR role.

Yes, BIOS would FLOGI into the switch...  RSCN received on initiator
side, and the role registered for the rport would have been migrated
from target->initiator... (Step 1)

> Loading the isp_mod driver causes the 2462 card in the target to be
> shut down.  The client initiator picks up on this and times out the
> rport, retaining the last rport state as INITIATOR.

Ok, I would have expected this to at least start when the BIOS FLOGI'd
into the switch above...

> Enabling target mode on the 2462 causes it to log back into the
> fabric.  The client initiator picks up on the RSCN but refuses to
> transition the rport from INITIATOR to TARGET state.

Ok, so on the initiator side, I'd expect an RSCN, PLOGI and PRLI to
target side, the bits processed from the PRLI response, role migrated
from UNKNOWN during fc_remote_port_add(), then to TARGET during
fc_remote_port_rolechg(). (Step 2)

> Without going
> into TARGET state the remote port won't have SCSI device discovery
> initiated against it and hence the SCST based storage is inaccessible.

Ok, could you provide the kernel log of the full failure with the
qla2xxx driver loaded with the ql2xextended_error_logging module
parameter set to 1?

> Activating a LIP on the client initiates a new fabric login attempt
> which completes with the following message:
> 
> Jul 24 02:53:59 init-test kernel: rport-2:0-0: blocked FC remote port
> time out: no longer a FCP target, removing starget
> 
> Which from a review of the source code seems consistent with our
> analysis of the problem.
> 
> The culprit is the following code from drivers/scsi/scsi_transport_fc.c:
> 
>         if ((rport->port_state == FC_PORTSTATE_ONLINE) &&
>             (rport->scsi_target_id != -1) &&
>             !(rport->roles & FC_PORT_ROLE_FCP_TARGET)) {
>                 dev_printk(KERN_ERR, &rport->dev,
>                         "blocked FC remote port time out: no longer"
>                         " a FCP target, removing starget\n");
>                 spin_unlock_irqrestore(shost->host_lock, flags);
>                 scsi_target_unblock(&rport->dev);
>                 fc_queue_work(shost, &rport->stgt_delete_work);
>                 return;
>         }

I would have expected this during step 1 (guess it depends on timing
latency during reboot/BIOS-flogi_plogi/isp_mod-load)...

> The above gets executed in response to the LIP on the initiator.  The
> value in rport->roles is being populated with the remote port's
> previous INITIATOR role rather than its current TARGET role.

Ok, so at step-1, the fc_remote_port_add() should have fallen into
this code:

      ...
      /* was a target, not in roles */
	if ((rport->scsi_target_id != -1) &&
	    (!(ids->roles & FC_PORT_ROLE_FCP_TARGET)))
		return rport;

since role is unknown... then transitioned to initiator during
rolechg().

then at step-2, the same process during fc_remote_port_add() (since
the transition to target is again deferred to rolechg()), then during
rolechg() I'd expect the transport would fall into the 'else if' here:

        spin_lock_irqsave(shost->host_lock, flags);
        if (roles & FC_PORT_ROLE_FCP_TARGET) {
                if (rport->scsi_target_id == -1) {
                        rport->scsi_target_id = fc_host->next_target_id++;
                        create = 1;
                } else if (!(rport->roles & FC_PORT_ROLE_FCP_TARGET))
                        create = 1;

Hmm, but that doesn't seem to be the case here...
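
The two excerpts above can be modeled in a few lines of Python to see
what each step is expected to do.  This is a sketch under stated
assumptions, not the transport code itself: the role bit values are
assumed to match include/scsi/scsi_transport_fc.h, and
add_returns_early and rolechg_create are hypothetical names that only
reproduce the quoted branches.

```python
FC_PORT_ROLE_UNKNOWN = 0x00        # assumed value
FC_PORT_ROLE_FCP_TARGET = 0x01     # assumed value
FC_PORT_ROLE_FCP_INITIATOR = 0x02  # assumed value


def add_returns_early(scsi_target_id: int, ids_roles: int) -> bool:
    """Model of the first excerpt: a former target reported without the
    target role makes fc_remote_port_add() return the rport as-is."""
    return scsi_target_id != -1 and not (ids_roles & FC_PORT_ROLE_FCP_TARGET)


def rolechg_create(scsi_target_id: int, old_roles: int, new_roles: int,
                   next_target_id: int = 0):
    """Model of the second excerpt.

    Returns (create, scsi_target_id): whether the create flag would be
    set, and the target id after the call.
    """
    create = False
    if new_roles & FC_PORT_ROLE_FCP_TARGET:
        if scsi_target_id == -1:
            scsi_target_id = next_target_id
            create = True
        elif not (old_roles & FC_PORT_ROLE_FCP_TARGET):
            create = True
    return create, scsi_target_id


# Step 1: the port comes back with role unknown after having been a target.
print(add_returns_early(0, FC_PORT_ROLE_UNKNOWN))  # True

# Step 2: role change from INITIATOR to TARGET on a port that kept its id.
print(rolechg_create(0, FC_PORT_ROLE_FCP_INITIATOR,
                     FC_PORT_ROLE_FCP_TARGET))  # (True, 0)

# No change of role: the flag stays clear.
print(rolechg_create(0, FC_PORT_ROLE_FCP_TARGET,
                     FC_PORT_ROLE_FCP_TARGET))  # (False, 0)
```

In this model the INITIATOR to TARGET change does set the create flag,
which is the expected path; the report suggests the running system is
not reaching it.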

Let's start with the driver logs, just so I get a full picture of at
least what's happening with qla2xxx on the wire side.

Thanks, AV


* Re: Poisoning of Linux initiators on SCST reboot.
@ 2008-07-28 20:40 greg
  2008-07-29  1:34 ` Andrew Vasquez
  0 siblings, 1 reply; 15+ messages in thread
From: greg @ 2008-07-28 20:40 UTC
  To: Andrew Vasquez; +Cc: scst-devel, linux-driver, linux-scsi, stf_xl, vst

On Jul 25,  6:45am, Andrew Vasquez wrote:
} Subject: Re: Poisoning of Linux initiators on SCST reboot.

Good afternoon to everyone, hope your week is starting out well.

First off thanks to Andrew for the very helpful reply.  Hopefully this
mail will provide additional information which is helpful in getting
to the bottom of these issues.

> On Thu, 24 Jul 2008, greg@enjellic.com wrote:

> > When the SCST target server reboots the initiator times out the remote
> > port and places it into 'unknown' state.  The qla2xxx driver,
> > according to the source code, maintains the previous rport state in
> > driver internal data.

> Ok, thanks for the detailed description of the problem...

> > The 2462 card in the target on boot logs into the fabric with an
> > initiator role, I'm assuming in support of BIOS based SAN booting. The
> > client initiator picks up on this and re-activates the rport as being
> > in an INITIATOR role.

> Yes, BIOS would FLOGI into the switch...  RSCN received on initiator
> side, and the role registered for the rport would have been migrated
> from target->initiator... (Step 1)

Vlad and Stanislaw had suggested disabling SAN based booting on the
cards as a solution to the problem.  It would seem you agree this is
the culprit with respect to the client initiator seeing the target in
an initiator role.

We currently have the BIOS disabled on the 2462 card in the target
box.  We also have 'Selective SAN boot' disabled and we do not see any
other options which may be relevant to keeping the card from issuing
a FLOGI into the fabric on power up.

We've also had a go-round with the scli utility and didn't see
anything in there which would shut down this behavior.  So, Andrew,
let us know if we are missing something with respect to how the
card's behavior can be modified.

It doesn't address the rport behavior but may be useful to others who
pick up on this thread in the future.

.. [ things deleted ] ...

> then at step-2, the same process during fc_remote_port_add() (since
> the transition to target is again deferred to rolechg()), then during
> rolechg() I'd expect the transport would fall into the 'else if' here:
> 
>         spin_lock_irqsave(shost->host_lock, flags);
>         if (roles & FC_PORT_ROLE_FCP_TARGET) {
>                 if (rport->scsi_target_id == -1) {
>                         rport->scsi_target_id = fc_host->next_target_id++;
>                         create = 1;
>                 } else if (!(rport->roles & FC_PORT_ROLE_FCP_TARGET))
>                         create = 1;
> 
> Hmm, but that doesn't seem to be the case here...
> 
> Let's start with the driver logs, just so I get a full picture of at
> least what's happening with qla2xxx on the wire side.

Thanks for your continued interest in following up on this Andrew.

We executed a full test cycle starting from a freshly rebooted client
and server.

Initiator client is stock 2.6.26 freshly downloaded from kernel.org.
The client module was loaded with the option to enable extended
debugging information.

SCST target code and isp_mod driver are from the current head of the
1.0.0.x branch in SVN.

Annotations were injected into the logs with the logger utility.  A
search for the **** string will find the annotations, e.g. the target
reboot.
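
Note that the qla2xxx driver also prints **** in its own "Load RISC
code" messages, so a plain search returns those lines as well.  A
minimal sketch of one way to keep only the injected annotations follows.
It assumes the annotations are the lines whose syslog tag is not
"kernel"; SYSLOG_RE and find_annotations are hypothetical names, and the
sample lines are copied from the log below.

```python
import re

# syslog line layout: "<Mon> <day> <time> <host> <tag>: <message>"
SYSLOG_RE = re.compile(r"^(\w{3}\s+\d+\s+[\d:]+)\s+(\S+)\s+([^:\s]+):\s?(.*)$")


def find_annotations(lines, marker="****"):
    """Yield (timestamp, tag, message) for non-kernel lines containing marker."""
    for line in lines:
        m = SYSLOG_RE.match(line)
        if not m:
            continue
        stamp, _host, tag, message = m.groups()
        if marker in message and tag != "kernel":
            yield stamp, tag, message


sample = [
    "Jul 28 14:40:49 init-test kernel: scsi(4): **** Load RISC code ****",
    "Jul 28 14:40:49 init-test kernel: firmware: requesting ql2400_fw.bin",
    "Jul 28 14:41:55 init-test wettstei: **** Rebooting SCST arget.",
]

for stamp, tag, message in find_annotations(sample):
    print(stamp, tag, message)
# Jul 28 14:41:55 wettstei **** Rebooting SCST arget.
```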

SCSI host 4 is being used to access the target which is undergoing the
reboot test.  PWWN of the target is 210000e08b8665b5.
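
One way to follow the target through the log is to check, for each
name-server query on host 4, whether the target's PWWN appears among the
GID_PT entries.  The sketch below is an illustration, not part of the
test procedure; GID_PT_RE and pwwn_sightings are hypothetical names, and
the sample lines are copied from the log below.

```python
import re

TARGET_PWWN = "210000e08b8665b5"  # PWWN of the target, as given above

GID_PT_RE = re.compile(
    r"^\w{3}\s+\d+\s+([\d:]+)\s+\S+\s+kernel:\s+scsi\((\d+)\): "
    r"GID_PT entry - nn (\w+) pn (\w+) portid=(\w+)\."
)


def pwwn_sightings(lines, pwwn=TARGET_PWWN, host=4):
    """Map each timestamp at which `host` logged GID_PT entries to whether
    `pwwn` was among the port names listed at that timestamp."""
    seen = {}
    for line in lines:
        m = GID_PT_RE.match(line)
        if not m or int(m.group(2)) != host:
            continue
        stamp = m.group(1)
        seen[stamp] = seen.get(stamp, False) or (m.group(4) == pwwn)
    return seen


sample = [
    "Jul 28 14:40:50 init-test kernel: scsi(4): GID_PT entry - nn 200000e08b8665b5 pn 210000e08b8665b5 portid=d60400.",
    "Jul 28 14:40:50 init-test kernel: scsi(4): GID_PT entry - nn 200000e08b9db189 pn 210000e08b9db189 portid=d61d00.",
    "Jul 28 14:42:35 init-test kernel: scsi(4): GID_PT entry - nn 200000e08b9db189 pn 210000e08b9db189 portid=d61d00.",
    "Jul 28 14:43:16 init-test kernel: scsi(4): GID_PT entry - nn 200000e08b8665b5 pn 210000e08b8665b5 portid=d60400.",
    "Jul 28 14:43:16 init-test kernel: scsi(4): GID_PT entry - nn 200000e08b9db189 pn 210000e08b9db189 portid=d61d00.",
]

print(pwwn_sightings(sample))
# {'14:40:50': True, '14:42:35': False, '14:43:16': True}
```

On these lines the target is listed at driver load, missing at 14:42:35
after the reboot, and listed again at 14:43:16.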

---------------------------------------------------------------------------
Jul 28 14:39:00 init-test syslogd 1.4.1: restart.
Jul 28 14:40:01 init-test crond[4312]: (root) CMD (/usr/lib/sa/sa1 1 1)
Jul 28 14:40:49 init-test kernel: QLogic Fibre Channel HBA Driver: 8.02.01-k4-debug
Jul 28 14:40:49 init-test kernel: ACPI: PCI Interrupt 0000:04:00.0[A] -> GSI 16 (level, low) -> IRQ 16
Jul 28 14:40:49 init-test kernel: qla2xxx 0000:04:00.0: Found an ISP2432, irq 16, iobase 0xf8858000
Jul 28 14:40:49 init-test kernel: qla2xxx 0000:04:00.0: Configuring PCI space...
Jul 28 14:40:49 init-test kernel: qla2xxx 0000:04:00.0: Configure NVRAM parameters...
Jul 28 14:40:49 init-test kernel: qla2xxx 0000:04:00.0: Verifying loaded RISC code...
Jul 28 14:40:49 init-test kernel: scsi(4): **** Load RISC code ****
Jul 28 14:40:49 init-test kernel: firmware: requesting ql2400_fw.bin
Jul 28 14:40:49 init-test kernel: scsi(4): Verifying Checksum of loaded RISC code.
Jul 28 14:40:49 init-test kernel: scsi(4): Checksum OK, start firmware.
Jul 28 14:40:49 init-test kernel: qla2xxx 0000:04:00.0: Allocated (64 KB) for EFT...
Jul 28 14:40:49 init-test kernel: qla2xxx 0000:04:00.0: Allocated (1413 KB) for firmware dump...
Jul 28 14:40:49 init-test kernel: scsi(4): Issue init firmware.
Jul 28 14:40:49 init-test kernel: qla2xxx 0000:04:00.0: MSI-X: Unsupported ISP2432 (0x2, 0x482).
Jul 28 14:40:49 init-test kernel: qla2xxx 0000:04:00.0: MSI: Enabled.
Jul 28 14:40:49 init-test kernel: DEBUG: detect hba 4 at address = f7bda33c
Jul 28 14:40:49 init-test kernel: scsi4 : qla2xxx
Jul 28 14:40:49 init-test kernel: scsi(4): Asynchronous LIP RESET (f7f7).
Jul 28 14:40:49 init-test kernel: qla2xxx 0000:04:00.0: LIP reset occured (f7f7).
Jul 28 14:40:50 init-test kernel: scsi(4): Asynchronous P2P MODE received.
Jul 28 14:40:50 init-test kernel: scsi(4): Asynchronous LOOP UP (4 Gbps).
Jul 28 14:40:50 init-test kernel: qla2xxx 0000:04:00.0: LOOP UP detected (4 Gbps).
Jul 28 14:40:50 init-test kernel: scsi(4): Asynchronous PORT UPDATE.
Jul 28 14:40:50 init-test kernel: scsi(4): Port database changed ffff 0006 0000.
Jul 28 14:40:50 init-test kernel: scsi(4): qla2x00_reset_marker()
Jul 28 14:40:50 init-test kernel: scsi(4): qla2x00_loop_resync()
Jul 28 14:40:50 init-test kernel: scsi(4): F/W Ready - OK 
Jul 28 14:40:50 init-test kernel: scsi(4): fw_state=3 curr time=3052.
Jul 28 14:40:50 init-test kernel: scsi(4): Configure loop -- dpc flags =0x40800e0
Jul 28 14:40:50 init-test kernel: scsi(4): RSCN queue entry[0] = [00/000000].
Jul 28 14:40:50 init-test kernel: scsi(4): device_resync: rscn overflow.
Jul 28 14:40:50 init-test kernel: scsi(4): RHBA exiting normally.
Jul 28 14:40:50 init-test kernel: scsi(4): RPA failed, rejected request:
Jul 28 14:40:50 init-test kernel:  0   1   2   3   4   5   6   7   8   9  Ah  Bh  Ch  Dh  Eh  Fh
Jul 28 14:40:50 init-test kernel: --------------------------------------------------------------
Jul 28 14:40:50 init-test kernel: 01  00  00  00  fa  10  00  00  80  01  00  00  00  09  00  00
Jul 28 14:40:50 init-test kernel: scsi(4): RFT_ID exiting normally.
Jul 28 14:40:50 init-test kernel: scsi(4): RFF_ID exiting normally.
Jul 28 14:40:50 init-test kernel: scsi(4): RNN_ID exiting normally.
Jul 28 14:40:50 init-test kernel: scsi(4): RSNN_NN exiting normally.
Jul 28 14:40:50 init-test kernel: scsi(4): GID_PT entry - nn 200000e08b8665b5 pn 210000e08b8665b5 portid=d60400.
Jul 28 14:40:50 init-test kernel: scsi(4): GID_PT entry - nn 200000e08b9db189 pn 210000e08b9db189 portid=d61d00.
Jul 28 14:40:50 init-test kernel: scsi(4): GPSC ext entry - fpn 21c7000dbd5c2d40 speeds=0000 speed=4000.
Jul 28 14:40:50 init-test kernel: scsi(4): GPSC ext entry - fpn 218a000dbd5c2d40 speeds=0000 speed=4000.
Jul 28 14:40:50 init-test kernel: scsi(4): device wrap (d61d00)
Jul 28 14:40:50 init-test kernel: scsi(4): Trying Fabric Login w/loop id 0x0081 for port d60400.
Jul 28 14:40:50 init-test kernel: qla2xxx 0000:04:00.0: iIDMA adjusted to 2 GB/s on 210000e08b8665b5.
Jul 28 14:40:50 init-test kernel: scsi(4): LOOP READY
Jul 28 14:40:50 init-test kernel: scsi(4): qla2x00_loop_resync - end
Jul 28 14:40:50 init-test kernel: scsi 4:0:0:0: Direct-Access     SCST_BIO ia_3:bbfs2_ndsu   096 PQ: 0 ANSI: 4
Jul 28 14:40:50 init-test kernel: sd 4:0:0:0: [sdc] 2147483648 512-byte hardware sectors (1099512 MB)
Jul 28 14:40:50 init-test kernel: qla2xxx 0000:04:00.0: 
Jul 28 14:40:50 init-test kernel:  QLogic Fibre Channel HBA Driver: 8.02.01-k4-debug
Jul 28 14:40:50 init-test kernel:   QLogic QLE2462 - PCI-Express to 4Gb FC, Dual Channel
Jul 28 14:40:50 init-test kernel:   ISP2432: PCIe (2.5GT/s x4) @ 0000:04:00.0 hdma-, host#=4, fw=4.04.00 [IP] [84XX] 
Jul 28 14:40:50 init-test kernel: ACPI: PCI Interrupt 0000:04:00.1[B] -> GSI 17 (level, low) -> IRQ 17
Jul 28 14:40:50 init-test kernel: qla2xxx 0000:04:00.1: Found an ISP2432, irq 17, iobase 0xf8b4a000
Jul 28 14:40:50 init-test kernel: sd 4:0:0:0: [sdc] Write Protect is off
Jul 28 14:40:50 init-test kernel: sd 4:0:0:0: [sdc] Write cache: enabled, read cache: enabled, supports DPO and FUA
Jul 28 14:40:50 init-test kernel: qla2xxx 0000:04:00.1: Configuring PCI space...
Jul 28 14:40:50 init-test kernel: sd 4:0:0:0: [sdc] 2147483648 512-byte hardware sectors (1099512 MB)
Jul 28 14:40:50 init-test kernel: sd 4:0:0:0: [sdc] Write Protect is off
Jul 28 14:40:50 init-test kernel: sd 4:0:0:0: [sdc] Write cache: enabled, read cache: enabled, supports DPO and FUA
Jul 28 14:40:50 init-test kernel:  sdc: sdc1
Jul 28 14:40:50 init-test kernel: sd 4:0:0:0: [sdc] Attached SCSI disk
Jul 28 14:40:50 init-test kernel: sd 4:0:0:0: Attached scsi generic sg3 type 0
Jul 28 14:40:50 init-test kernel: qla2xxx 0000:04:00.1: Configure NVRAM parameters...
Jul 28 14:40:50 init-test kernel: qla2xxx 0000:04:00.1: Verifying loaded RISC code...
Jul 28 14:40:50 init-test kernel: scsi(5): **** Load RISC code ****
Jul 28 14:40:50 init-test kernel: scsi(5): Verifying Checksum of loaded RISC code.
Jul 28 14:40:50 init-test kernel: scsi(5): Checksum OK, start firmware.
Jul 28 14:40:50 init-test kernel: qla2xxx 0000:04:00.1: Allocated (64 KB) for EFT...
Jul 28 14:40:50 init-test kernel: qla2xxx 0000:04:00.1: Allocated (1413 KB) for firmware dump...
Jul 28 14:40:50 init-test kernel: scsi(5): Issue init firmware.
Jul 28 14:40:50 init-test kernel: qla2xxx 0000:04:00.1: MSI-X: Unsupported ISP2432 (0x2, 0x482).
Jul 28 14:40:50 init-test kernel: qla2xxx 0000:04:00.1: MSI: Enabled.
Jul 28 14:40:50 init-test kernel: DEBUG: detect hba 5 at address = f7be033c
Jul 28 14:40:50 init-test kernel: scsi5 : qla2xxx
Jul 28 14:40:50 init-test kernel: scsi(5): Asynchronous LIP RESET (f7f7).
Jul 28 14:40:50 init-test kernel: qla2xxx 0000:04:00.1: LIP reset occured (f7f7).
Jul 28 14:40:51 init-test kernel: scsi(5): qla2x00_reset_marker()
Jul 28 14:40:51 init-test kernel: scsi(5): qla2x00_loop_resync()
Jul 28 14:40:52 init-test kernel: scsi(5): Asynchronous P2P MODE received.
Jul 28 14:40:52 init-test kernel: scsi(5): Asynchronous LOOP UP (4 Gbps).
Jul 28 14:40:52 init-test kernel: qla2xxx 0000:04:00.1: LOOP UP detected (4 Gbps).
Jul 28 14:40:52 init-test kernel: scsi(5): Asynchronous PORT UPDATE.
Jul 28 14:40:52 init-test kernel: scsi(5): Port database changed ffff 0006 0000.
Jul 28 14:40:52 init-test kernel: scsi(5): F/W Ready - OK 
Jul 28 14:40:52 init-test kernel: scsi(5): fw_state=3 curr time=331a.
Jul 28 14:40:52 init-test kernel: scsi(5): Configure loop -- dpc flags =0x40800e1
Jul 28 14:40:52 init-test kernel: scsi(5): RSCN queue entry[0] = [00/000000].
Jul 28 14:40:52 init-test kernel: scsi(5): device_resync: rscn overflow.
Jul 28 14:40:52 init-test kernel: scsi(5): RHBA exiting normally.
Jul 28 14:40:52 init-test kernel: scsi(5): RPA failed, rejected request:
Jul 28 14:40:52 init-test kernel:  0   1   2   3   4   5   6   7   8   9  Ah  Bh  Ch  Dh  Eh  Fh
Jul 28 14:40:52 init-test kernel: --------------------------------------------------------------
Jul 28 14:40:52 init-test kernel: 01  00  00  00  fa  10  00  00  80  01  00  00  00  09  00  00
Jul 28 14:40:52 init-test kernel: scsi(5): RFT_ID exiting normally.
Jul 28 14:40:52 init-test kernel: scsi(5): RFF_ID exiting normally.
Jul 28 14:40:52 init-test kernel: scsi(5): RNN_ID exiting normally.
Jul 28 14:40:52 init-test kernel: scsi(5): RSNN_NN exiting normally.
Jul 28 14:40:52 init-test kernel: scsi(5): GID_PT entry - nn 200100e08ba7c394 pn 210100e08ba7c394 portid=6c0700.
Jul 28 14:40:52 init-test kernel: scsi(5): GID_PT entry - nn 200000e08b059b49 pn 210000e08b059b49 portid=6c0e00.
Jul 28 14:40:52 init-test kernel: scsi(5): GID_PT entry - nn 200100e08bbee923 pn 210100e08bbee923 portid=6c0f00.
Jul 28 14:40:52 init-test kernel: scsi(5): GID_PT entry - nn 200100e08bbdb189 pn 210100e08bbdb189 portid=cd0e00.
Jul 28 14:40:52 init-test kernel: scsi(5): GPSC ext entry - fpn 204f000dec2e62c0 speeds=0000 speed=4000.
Jul 28 14:40:52 init-test kernel: scsi(5): GPSC ext entry - fpn 2006000dec2e62c0 speeds=0000 speed=1000.
Jul 28 14:40:52 init-test kernel: scsi(5): GPSC ext entry - fpn 2050000dec2e62c0 speeds=0000 speed=4000.
Jul 28 14:40:52 init-test kernel: scsi(5): GPSC ext entry - fpn 21ca000dbd5c2d40 speeds=0000 speed=4000.
Jul 28 14:40:52 init-test kernel: scsi(5): device wrap (cd0e00)
Jul 28 14:40:53 init-test kernel: scsi(5): Trying Fabric Login w/loop id 0x0081 for port 6c0700.
Jul 28 14:40:53 init-test kernel: qla2xxx 0000:04:00.1: iIDMA adjusted to 2 GB/s on 210100e08ba7c394.
Jul 28 14:40:53 init-test kernel: scsi(5): Trying Fabric Login w/loop id 0x0082 for port 6c0e00.
Jul 28 14:40:53 init-test kernel: scsi 5:0:0:0: Direct-Access     SCST_BIO r1_3:bbfs2_ndsu   096 PQ: 0 ANSI: 4
Jul 28 14:40:53 init-test kernel: sd 5:0:0:0: [sdd] 2147483648 512-byte hardware sectors (1099512 MB)
Jul 28 14:40:53 init-test kernel: scsi(5): Trying Fabric Login w/loop id 0x0083 for port 6c0f00.
Jul 28 14:40:53 init-test kernel: sd 5:0:0:0: [sdd] Write Protect is off
Jul 28 14:40:53 init-test kernel: sd 5:0:0:0: [sdd] Write cache: enabled, read cache: enabled, supports DPO and FUA
Jul 28 14:40:53 init-test kernel: qla2xxx 0000:04:00.1: iIDMA adjusted to 2 GB/s on 210100e08bbee923.
Jul 28 14:40:53 init-test kernel: scsi(5): LOOP READY
Jul 28 14:40:53 init-test kernel: scsi(5): qla2x00_loop_resync - end
Jul 28 14:40:53 init-test kernel: sd 5:0:0:0: [sdd] 2147483648 512-byte hardware sectors (1099512 MB)
Jul 28 14:40:53 init-test kernel: sd 5:0:0:0: [sdd] Write Protect is off
Jul 28 14:40:53 init-test kernel: sd 5:0:0:0: [sdd] Write cache: enabled, read cache: enabled, supports DPO and FUA
Jul 28 14:40:53 init-test kernel:  sdd:<6>qla2xxx 0000:04:00.1: 
Jul 28 14:40:53 init-test kernel:  QLogic Fibre Channel HBA Driver: 8.02.01-k4-debug
Jul 28 14:40:53 init-test kernel:   QLogic QLE2462 - PCI-Express to 4Gb FC, Dual Channel
Jul 28 14:40:53 init-test kernel:   ISP2432: PCIe (2.5GT/s x4) @ 0000:04:00.1 hdma-, host#=5, fw=4.04.00 [IP] [84XX] 
Jul 28 14:40:53 init-test kernel:  sdd1
Jul 28 14:40:53 init-test kernel: sd 5:0:0:0: [sdd] Attached SCSI disk
Jul 28 14:40:53 init-test kernel: sd 5:0:0:0: Attached scsi generic sg4 type 0
Jul 28 14:40:53 init-test kernel: scsi(5): qla2x00_reset_marker()
Jul 28 14:41:55 init-test wettstei: **** Rebooting SCST arget.
Jul 28 14:42:35 init-test kernel: scsi(4): Asynchronous RSCR UPDATE.
Jul 28 14:42:35 init-test kernel: scsi(4): RSCN database changed -- 00d6 0400 0000.
Jul 28 14:42:35 init-test kernel: scsi(4): qla2x00_loop_resync()
Jul 28 14:42:35 init-test kernel: scsi(4): F/W Ready - OK 
Jul 28 14:42:35 init-test kernel: scsi(4): fw_state=3 curr time=9718.
Jul 28 14:42:35 init-test kernel: scsi(4): Configure loop -- dpc flags =0xa0
Jul 28 14:42:35 init-test kernel: scsi(4): RSCN queue entry[0] = [00/d60400].
Jul 28 14:42:35 init-test kernel: scsi(4): GID_PT entry - nn 200000e08b9db189 pn 210000e08b9db189 portid=d61d00.
Jul 28 14:42:35 init-test kernel: scsi(4): GPSC ext entry - fpn 218a000dbd5c2d40 speeds=0000 speed=4000.
Jul 28 14:42:35 init-test kernel: scsi(4): device wrap (d61d00)
Jul 28 14:42:35 init-test kernel: scsi(4): LOOP READY
Jul 28 14:42:35 init-test kernel: scsi(4): qla2x00_loop_resync - end
Jul 28 14:42:35 init-test kernel: scsi(4): fcport-0 - port retry count: 29 remaining
Jul 28 14:42:36 init-test kernel: scsi(4): fcport-0 - port retry count: 28 remaining
Jul 28 14:42:37 init-test kernel: scsi(4): fcport-0 - port retry count: 27 remaining
Jul 28 14:42:38 init-test kernel: scsi(4): fcport-0 - port retry count: 26 remaining
Jul 28 14:42:39 init-test kernel: scsi(4): fcport-0 - port retry count: 25 remaining
Jul 28 14:42:40 init-test kernel: scsi(4): fcport-0 - port retry count: 24 remaining
Jul 28 14:42:41 init-test kernel: scsi(4): fcport-0 - port retry count: 23 remaining
Jul 28 14:42:42 init-test kernel: scsi(4): fcport-0 - port retry count: 22 remaining
Jul 28 14:42:43 init-test kernel: scsi(4): fcport-0 - port retry count: 21 remaining
Jul 28 14:42:44 init-test kernel: scsi(4): fcport-0 - port retry count: 20 remaining
Jul 28 14:42:45 init-test kernel: scsi(4): fcport-0 - port retry count: 19 remaining
Jul 28 14:42:46 init-test kernel: scsi(4): fcport-0 - port retry count: 18 remaining
Jul 28 14:42:47 init-test kernel: scsi(4): fcport-0 - port retry count: 17 remaining
Jul 28 14:42:48 init-test kernel: scsi(4): fcport-0 - port retry count: 16 remaining
Jul 28 14:42:49 init-test kernel: scsi(4): fcport-0 - port retry count: 15 remaining
Jul 28 14:42:50 init-test kernel: scsi(4): fcport-0 - port retry count: 14 remaining
Jul 28 14:42:51 init-test kernel: scsi(4): fcport-0 - port retry count: 13 remaining
Jul 28 14:42:52 init-test kernel: scsi(4): fcport-0 - port retry count: 12 remaining
Jul 28 14:42:53 init-test kernel: scsi(4): fcport-0 - port retry count: 11 remaining
Jul 28 14:42:54 init-test kernel: scsi(4): fcport-0 - port retry count: 10 remaining
Jul 28 14:42:55 init-test kernel: scsi(4): fcport-0 - port retry count: 9 remaining
Jul 28 14:42:56 init-test kernel: scsi(4): fcport-0 - port retry count: 8 remaining
Jul 28 14:42:57 init-test kernel: scsi(4): fcport-0 - port retry count: 7 remaining
Jul 28 14:42:58 init-test kernel: scsi(4): fcport-0 - port retry count: 6 remaining
Jul 28 14:42:59 init-test kernel: scsi(4): fcport-0 - port retry count: 5 remaining
Jul 28 14:43:00 init-test kernel: scsi(4): fcport-0 - port retry count: 4 remaining
Jul 28 14:43:01 init-test kernel: scsi(4): fcport-0 - port retry count: 3 remaining
Jul 28 14:43:02 init-test kernel: scsi(4): fcport-0 - port retry count: 2 remaining
Jul 28 14:43:03 init-test kernel: scsi(4): fcport-0 - port retry count: 1 remaining
Jul 28 14:43:04 init-test kernel: scsi(4): fcport-0 - port retry count: 0 remaining
Jul 28 14:43:10 init-test kernel:  rport-4:0-0: blocked FC remote port time out: removing target and saving binding
Jul 28 14:43:10 init-test kernel: sd 4:0:0:0: [sdc] Synchronizing SCSI cache
Jul 28 14:43:10 init-test kernel: sd 4:0:0:0: [sdc] Result: hostbyte=DID_NO_CONNECT driverbyte=DRIVER_OK,SUGGEST_OK
Jul 28 14:43:16 init-test kernel: scsi(4): Asynchronous RSCR UPDATE.
Jul 28 14:43:16 init-test kernel: scsi(4): RSCN database changed -- 00d6 0400 0000.
Jul 28 14:43:16 init-test kernel: scsi(4): qla2x00_loop_resync()
Jul 28 14:43:16 init-test kernel: scsi(4): F/W Ready - OK 
Jul 28 14:43:16 init-test kernel: scsi(4): fw_state=3 curr time=bf65.
Jul 28 14:43:16 init-test kernel: scsi(4): Configure loop -- dpc flags =0xa0
Jul 28 14:43:16 init-test kernel: scsi(4): RSCN queue entry[1] = [00/d60400].
Jul 28 14:43:16 init-test kernel: scsi(4): GID_PT entry - nn 200000e08b8665b5 pn 210000e08b8665b5 portid=d60400.
Jul 28 14:43:16 init-test kernel: scsi(4): GID_PT entry - nn 200000e08b9db189 pn 210000e08b9db189 portid=d61d00.
Jul 28 14:43:16 init-test kernel: scsi(4): GPSC ext entry - fpn 21c7000dbd5c2d40 speeds=0000 speed=4000.
Jul 28 14:43:16 init-test kernel: scsi(4): GPSC ext entry - fpn 218a000dbd5c2d40 speeds=0000 speed=4000.
Jul 28 14:43:16 init-test kernel: scsi(4): device wrap (d61d00)
Jul 28 14:43:16 init-test kernel: scsi(4): Trying Fabric Login w/loop id 0x0081 for port d60400.
Jul 28 14:43:16 init-test kernel: qla2xxx 0000:04:00.0: iIDMA adjusted to 2 GB/s on 210000e08b8665b5.
Jul 28 14:43:16 init-test kernel: scsi(4): LOOP READY
Jul 28 14:43:16 init-test kernel: scsi(4): qla2x00_loop_resync - end
Jul 28 14:43:38 init-test kernel: scsi(4): Asynchronous PORT UPDATE ignored 0081/0004/0600.
Jul 28 14:43:38 init-test kernel: scsi(4): Asynchronous PORT UPDATE ignored 0081/0004/0600.
Jul 28 14:45:49 init-test wettstei: **** Loading isp_mod driver on target.
Jul 28 14:46:18 init-test kernel: scsi(4): Asynchronous RSCR UPDATE.
Jul 28 14:46:18 init-test kernel: scsi(4): RSCN database changed -- 00d6 0400 0000.
Jul 28 14:46:18 init-test kernel: scsi(4): qla2x00_loop_resync()
Jul 28 14:46:18 init-test kernel: scsi(4): F/W Ready - OK 
Jul 28 14:46:18 init-test kernel: scsi(4): fw_state=3 curr time=170a3.
Jul 28 14:46:18 init-test kernel: scsi(4): Configure loop -- dpc flags =0xa0
Jul 28 14:46:18 init-test kernel: scsi(4): RSCN queue entry[2] = [00/d60400].
Jul 28 14:46:18 init-test kernel: scsi(4): GID_PT entry - nn 200000e08b9db189 pn 210000e08b9db189 portid=d61d00.
Jul 28 14:46:18 init-test kernel: scsi(4): GPSC ext entry - fpn 218a000dbd5c2d40 speeds=0000 speed=4000.
Jul 28 14:46:18 init-test kernel: scsi(4): device wrap (d61d00)
Jul 28 14:46:18 init-test kernel: scsi(4): LOOP READY
Jul 28 14:46:18 init-test kernel: scsi(4): qla2x00_loop_resync - end
Jul 28 14:46:45 init-test wettstei: **** Enabling target mode in isp_mod driver on target.
Jul 28 14:46:53 init-test kernel:  rport-4:0-0: blocked FC remote port time out: removing target and saving binding
Jul 28 14:47:10 init-test kernel: scsi(4): Asynchronous RSCR UPDATE.
Jul 28 14:47:10 init-test kernel: scsi(4): RSCN database changed -- 00d6 0400 0000.
Jul 28 14:47:10 init-test kernel: scsi(4): qla2x00_loop_resync()
Jul 28 14:47:10 init-test kernel: scsi(4): F/W Ready - OK 
Jul 28 14:47:10 init-test kernel: scsi(4): fw_state=3 curr time=1a43a.
Jul 28 14:47:10 init-test kernel: scsi(4): Configure loop -- dpc flags =0xa0
Jul 28 14:47:10 init-test kernel: scsi(4): RSCN queue entry[3] = [00/d60400].
Jul 28 14:47:10 init-test kernel: scsi(4): GID_PT entry - nn 200000e08b8665b5 pn 210000e08b8665b5 portid=d60400.
Jul 28 14:47:10 init-test kernel: scsi(4): GID_PT entry - nn 200000e08b9db189 pn 210000e08b9db189 portid=d61d00.
Jul 28 14:47:10 init-test kernel: scsi(4): GPSC ext entry - fpn 21c7000dbd5c2d40 speeds=0000 speed=4000.
Jul 28 14:47:10 init-test kernel: scsi(4): GPSC ext entry - fpn 218a000dbd5c2d40 speeds=0000 speed=4000.
Jul 28 14:47:10 init-test kernel: scsi(4): device wrap (d61d00)
Jul 28 14:47:10 init-test kernel: scsi(4): Trying Fabric Login w/loop id 0x0081 for port d60400.
Jul 28 14:47:10 init-test kernel: qla2xxx 0000:04:00.0: iIDMA adjusted to 2 GB/s on 210000e08b8665b5.
Jul 28 14:47:10 init-test kernel: scsi(4): LOOP READY
Jul 28 14:47:10 init-test kernel: scsi(4): qla2x00_loop_resync - end
Jul 28 14:48:00 init-test wettstei: **** Activating LIP.
Jul 28 14:48:19 init-test kernel: scsi(4): Asynchronous LIP RESET (f7f7).
Jul 28 14:48:19 init-test kernel: qla2xxx 0000:04:00.0: LIP reset occured (f7f7).
Jul 28 14:48:19 init-test kernel: scsi(4): qla2x00_reset_marker()
Jul 28 14:48:19 init-test kernel: scsi(4): Asynchronous P2P MODE received.
Jul 28 14:48:19 init-test kernel: scsi(4): Asynchronous PORT UPDATE.
Jul 28 14:48:19 init-test kernel: scsi(4): Port database changed ffff 0006 0000.
Jul 28 14:48:19 init-test kernel: scsi(4): qla2x00_reset_marker()
Jul 28 14:48:19 init-test kernel: scsi(4): qla2x00_loop_resync()
Jul 28 14:48:19 init-test kernel: scsi(4): F/W Ready - OK 
Jul 28 14:48:19 init-test kernel: scsi(4): fw_state=3 curr time=1e71d.
Jul 28 14:48:19 init-test kernel: scsi(4): Configure loop -- dpc flags =0x4080060
Jul 28 14:48:19 init-test kernel: scsi(4): RSCN queue entry[4] = [00/000000].
Jul 28 14:48:19 init-test kernel: scsi(4): device_resync: rscn overflow.
Jul 28 14:48:19 init-test kernel: scsi(4): RHBA exiting normally.
Jul 28 14:48:19 init-test kernel: scsi(4): RPA failed, rejected request:
Jul 28 14:48:19 init-test kernel:  0   1   2   3   4   5   6   7   8   9  Ah  Bh  Ch  Dh  Eh  Fh
Jul 28 14:48:19 init-test kernel: --------------------------------------------------------------
Jul 28 14:48:19 init-test kernel: 01  00  00  00  fa  10  00  00  80  01  00  00  00  09  00  00
Jul 28 14:48:19 init-test kernel: scsi(4): RFT_ID exiting normally.
Jul 28 14:48:19 init-test kernel: scsi(4): RFF_ID exiting normally.
Jul 28 14:48:19 init-test kernel: scsi(4): RNN_ID exiting normally.
Jul 28 14:48:19 init-test kernel: scsi(4): RSNN_NN exiting normally.
Jul 28 14:48:19 init-test kernel: scsi(4): GID_PT entry - nn 200000e08b8665b5 pn 210000e08b8665b5 portid=d60400.
Jul 28 14:48:19 init-test kernel: scsi(4): GID_PT entry - nn 200000e08b9db189 pn 210000e08b9db189 portid=d61d00.
Jul 28 14:48:19 init-test kernel: scsi(4): GPSC ext entry - fpn 21c7000dbd5c2d40 speeds=0000 speed=4000.
Jul 28 14:48:19 init-test kernel: scsi(4): GPSC ext entry - fpn 218a000dbd5c2d40 speeds=0000 speed=4000.
Jul 28 14:48:19 init-test kernel: scsi(4): device wrap (d61d00)
Jul 28 14:48:19 init-test kernel: scsi(4): Trying Fabric Login w/loop id 0x0081 for port d60400.
Jul 28 14:48:19 init-test kernel: qla2xxx 0000:04:00.0: iIDMA adjusted to 2 GB/s on 210000e08b8665b5.
Jul 28 14:48:19 init-test kernel: scsi(4): LOOP READY
Jul 28 14:48:19 init-test kernel: scsi(4): qla2x00_loop_resync - end
Jul 28 14:48:54 init-test kernel:  rport-4:0-0: blocked FC remote port time out: no longer a FCP target, removing starget
---------------------------------------------------------------------------

> Thanks, AV

Hopefully the above information is useful.  Let me know if additional
information or testing is needed.

Have a good evening.

}-- End of excerpt from Andrew Vasquez

As always,
Dr. G.W. Wettstein, Ph.D.   Enjellic Systems Development, LLC.
4206 N. 19th Ave.           Specializing in information infra-structure
Fargo, ND  58102            development.
PH: 701-281-1686
FAX: 701-281-3949           EMAIL: greg@enjellic.com
------------------------------------------------------------------------------
"Given a choice between a complex, difficult-to-understand,
 disconcerting explanation and a simplistic, comforting one, many
 prefer simplistic comfort if it's remotely plausible, especially if it
 involves blaming someone else for their problems."
                                -- Bob Lewis
                                   _Infoworld_


* Re: Poisoning of Linux initiators on SCST reboot.
  2008-07-28 20:40 greg
@ 2008-07-29  1:34 ` Andrew Vasquez
  2008-07-29 18:19   ` Andrew Vasquez
  0 siblings, 1 reply; 15+ messages in thread
From: Andrew Vasquez @ 2008-07-29  1:34 UTC (permalink / raw)
  To: greg; +Cc: scst-devel, linux-driver, linux-scsi, stf_xl, vst

On Mon, 28 Jul 2008, greg@enjellic.com wrote:

....
> SCSI host 4 is being used to access the target which is undergoing the
> reboot test.  PWWN of the target is 210000e08b8665b5.
> 
> ---------------------------------------------------------------------------
> Jul 28 14:39:00 init-test syslogd 1.4.1: restart.
> Jul 28 14:40:01 init-test crond[4312]: (root) CMD (/usr/lib/sa/sa1 1 1)
> Jul 28 14:40:49 init-test kernel: QLogic Fibre Channel HBA Driver: 8.02.01-k4-debug
> Jul 28 14:40:49 init-test kernel: ACPI: PCI Interrupt 0000:04:00.0[A] -> GSI 16 (level, low) -> IRQ 16
> Jul 28 14:40:49 init-test kernel: qla2xxx 0000:04:00.0: Found an ISP2432, irq 16, iobase 0xf8858000
...

These logs are good, I don't have a (formal) fix yet, but I believe
we've reproduced this locally based on your detailed description and
logs...  I hope to have some more details tomorrow...

Thanks,
Andrew Vasquez


* Re: Poisoning of Linux initiators on SCST reboot.
  2008-07-29  1:34 ` Andrew Vasquez
@ 2008-07-29 18:19   ` Andrew Vasquez
  0 siblings, 0 replies; 15+ messages in thread
From: Andrew Vasquez @ 2008-07-29 18:19 UTC (permalink / raw)
  To: greg; +Cc: scst-devel, linux-driver, linux-scsi, stf_xl, vst

On Mon, 28 Jul 2008, Andrew Vasquez wrote:

> On Mon, 28 Jul 2008, greg@enjellic.com wrote:
> 
> ....
> > SCSI host 4 is being used to access the target which is undergoing the
> > reboot test.  PWWN of the target is 210000e08b8665b5.
> > 
> > ---------------------------------------------------------------------------
> > Jul 28 14:39:00 init-test syslogd 1.4.1: restart.
> > Jul 28 14:40:01 init-test crond[4312]: (root) CMD (/usr/lib/sa/sa1 1 1)
> > Jul 28 14:40:49 init-test kernel: QLogic Fibre Channel HBA Driver: 8.02.01-k4-debug
> > Jul 28 14:40:49 init-test kernel: ACPI: PCI Interrupt 0000:04:00.0[A] -> GSI 16 (level, low) -> IRQ 16
> > Jul 28 14:40:49 init-test kernel: qla2xxx 0000:04:00.0: Found an ISP2432, irq 16, iobase 0xf8858000
> ...
> 
> These logs are good, I don't have a (formal) fix yet, but I believe
> we've reproduced this locally based on your detailed description and
> logs...  I hope to have some more details tomorrow...

Ok, here's a patch against Linus' latest git tree which corrects the
issue locally.  Could you try it out on your test configuration?  We'd
like to perform some additional sanity-testing locally before we
submit the fix formally for upstream consumption.

Sidenote: we'll also need to backport some form of this fix to stable,
but qla2xxx in kernels up to 2.6.26 didn't utilize the
terminate-rport-io infrastructure:

	commit 5f3a9a207f1fccde476dd31b4c63ead2967d934f
	Refs: v2.6.26-7259-g5f3a9a2
	Author:     Seokmann Ju <seokmann.ju@qlogic.com>
	AuthorDate: Thu Jul 10 16:55:47 2008 -0700
	Commit:     James Bottomley <James.Bottomley@HansenPartnership.com>
	CommitDate: Sat Jul 26 15:14:36 2008 -0400

	    [SCSI] qla2xxx: Add dev_loss_tmo_callbk/terminate_rport_io callback support.

	    Signed-off-by: Seokmann Ju <seokmann.ju@qlogic.com>
	    Signed-off-by: Andrew Vasquez <andrew.vasquez@qlogic.com>
	    Signed-off-by: James Bottomley <James.Bottomley@HansenPartnership.com>

so the stable changes won't be as 'straightforward'...

In any case, here's the patch...

thanks, av

---

diff --git a/drivers/scsi/qla2xxx/qla_attr.c b/drivers/scsi/qla2xxx/qla_attr.c
index 2e2bf53..690a4ef 100644
--- a/drivers/scsi/qla2xxx/qla_attr.c
+++ b/drivers/scsi/qla2xxx/qla_attr.c
@@ -1082,6 +1082,17 @@ qla2x00_terminate_rport_io(struct fc_rport *rport)
 {
 	fc_port_t *fcport = *(fc_port_t **)rport->dd_data;
 
+	/*
+	 * At this point all fcport's software-states are cleared.  Perform any
+	 * final firmware cleanup of resources (PCBs and XCBs).
+	 */
+	if (fcport->loop_id != FC_NO_LOOP_ID) {
+		fcport->ha->isp_ops->fabric_logout(fcport->ha, fcport->loop_id,
+		    fcport->d_id.b.domain, fcport->d_id.b.area,
+		    fcport->d_id.b.al_pa);
+		fcport->loop_id = FC_NO_LOOP_ID;
+	}
+
 	qla2x00_abort_fcport_cmds(fcport);
 	scsi_target_unblock(&rport->dev);
 }
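
The ordering the patch introduces can be modelled in user space.  This
is only a sketch, not driver code: struct model_fcport,
MODEL_NO_LOOP_ID and the two counters are hypothetical stand-ins for
fc_port_t, FC_NO_LOOP_ID and the firmware calls.  It shows the logout
being issued once, before outstanding commands are aborted, with the
loop ID invalidated so that a repeated call does not log out again.

```c
#include <assert.h>

/* Hypothetical sentinel playing the role of FC_NO_LOOP_ID. */
#define MODEL_NO_LOOP_ID 0xffffu

/* Hypothetical stand-in for the driver's fc_port_t. */
struct model_fcport {
	unsigned int loop_id;	/* firmware handle, or MODEL_NO_LOOP_ID */
	int logouts_issued;	/* simulated fabric_logout calls */
	int aborts_issued;	/* simulated abort passes */
};

/* Simulated firmware logout; the patch calls isp_ops->fabric_logout here. */
void model_fabric_logout(struct model_fcport *fcport)
{
	fcport->logouts_issued++;
}

/* Simulated abort of outstanding commands for this port. */
void model_abort_cmds(struct model_fcport *fcport)
{
	fcport->aborts_issued++;
}

/* Same shape as the patched function: logout first, then abort. */
void model_terminate_rport_io(struct model_fcport *fcport)
{
	if (fcport->loop_id != MODEL_NO_LOOP_ID) {
		model_fabric_logout(fcport);
		fcport->loop_id = MODEL_NO_LOOP_ID;
	}
	/* The handle is always invalid by the time commands are aborted. */
	assert(fcport->loop_id == MODEL_NO_LOOP_ID);
	model_abort_cmds(fcport);
}
```

Calling model_terminate_rport_io() twice on a port that starts with
loop_id 0x81 should leave logouts_issued at 1 and aborts_issued at 2.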


* Re: Poisoning of Linux initiators on SCST reboot.
@ 2008-08-01 20:22 greg
  2008-08-05 13:52 ` Andrew Vasquez
  0 siblings, 1 reply; 15+ messages in thread
From: greg @ 2008-08-01 20:22 UTC (permalink / raw)
  To: Andrew Vasquez; +Cc: scst-devel, linux-driver, linux-scsi, stf_xl, vst

On Jul 29, 11:19am, Andrew Vasquez wrote:
} Subject: Re: Poisoning of Linux initiators on SCST reboot.

Good afternoon to everyone, hope the week is ending well.

> Ok, here's a patch against Linus' latest git tree which corrects the
> issue locally.  Could you try it out on your test configuration?
> We'd like to perform some additional sanity-testing locally before
> we submit the fix formally for upstream consumption.

The patch worked perfectly on our test configuration.  After the SCST
target reboot the client initiator picked up on the presence of the
storage and properly discovered it.

The kernel I tested against was from my local git clone of Linus' tree,
which has 94ad374a0751f40d25e22e036c37f7263569d24c at its head.  The
qla2xxx driver has evidently changed significantly since you sent the
patch, because it didn't apply cleanly against this tree.  It was
trivial, however, to update the patch.

> Sidenote: we'll also need to backport some form of this fix to
> stable, but qla2xxx in kernels up to 2.6.26 didn't utilize the
> terminate-rport-io infrastructure:

That would seem to be the case.  It doesn't appear 2.6.26 even has the
necessary infrastructure to support the patch.

Any idea on how tough a backport to 2.6.26 or earlier would be?
Running 'head-of-git' from Linus' tree would be somewhat of a stretch
for production applications.... :-)(

Thanks for the timely response.  Let me know if we can help out with
any additional testing.

Best wishes for a pleasant weekend.

}-- End of excerpt from Andrew Vasquez

As always,
Dr. G.W. Wettstein, Ph.D.   Enjellic Systems Development, LLC.
4206 N. 19th Ave.           Specializing in information infra-structure
Fargo, ND  58102            development.
PH: 701-281-1686
FAX: 701-281-3949           EMAIL: greg@enjellic.com
------------------------------------------------------------------------------
"There's no such thing as a wizard who minds his own business."
                                -- Berengis the Black Court Mage
                                   to the Earls Caeline


* Re: Poisoning of Linux initiators on SCST reboot.
  2008-08-01 20:22 greg
@ 2008-08-05 13:52 ` Andrew Vasquez
  0 siblings, 0 replies; 15+ messages in thread
From: Andrew Vasquez @ 2008-08-05 13:52 UTC (permalink / raw)
  To: greg; +Cc: scst-devel, linux-driver, linux-scsi, stf_xl, vst

On Fri, 01 Aug 2008, greg@enjellic.com wrote:

> On Jul 29, 11:19am, Andrew Vasquez wrote:
> } Subject: Re: Poisoning of Linux initiators on SCST reboot.
> 
> Good afternoon to everyone, hope the week is ending well.
> 
> > Ok, here's a patch against Linus' latest git tree which corrects the
> > issue locally.  Could you try it out on your test configuration?
> > We'd like to perform some additional sanity-testing locally before
> > we submit the fix formally for upstream consumption.
> 
> The patch worked perfectly on our test configuration.  After the SCST
> target reboot the client initiator picked up on the presence of the
> storage and properly discovered it.
> 
> The kernel I tested against was from my local git clone of Linus' tree,
> which has 94ad374a0751f40d25e22e036c37f7263569d24c at its head.  The
> qla2xxx driver has evidently changed significantly since you sent the
> patch, because it didn't apply cleanly against this tree.  It was
> trivial, however, to update the patch.

Thank you for the update.  The patch is already in our patch-queue for
our next -rcX submission.

> > Sidenote: we'll also need to backport some form of this fix to
> > stable, but qla2xxx in kernels up to 2.6.26 didn't utilize the
> > terminate-rport-io infrastructure:
> 
> That would seem to be the case.  It doesn't appear 2.6.26 even has the
> necessary infrastructure to support the patch.
> 
> Any idea on how tough a backport to 2.6.26 or earlier would be?
> Running 'head-of-git' from Linus' tree would be somewhat of a stretch
> for production applications.... :-)(

We are still looking to finalize a migration path.  As it stands now,
the least invasive mechanism looks like backporting:

[SCSI] qla2xxx: Add dev_loss_tmo_callbk/terminate_rport_io callback support.
http://git.kernel.org/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commitdiff;h=5f3a9a207f1fccde476dd31b4c63ead2967d934f

[SCSI] qla2xxx: Set an rport's dev_loss_tmo value in a consistent manner.
http://git.kernel.org/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commitdiff;h=85821c906cf3563a00a3d98fa380a2581a7a5ff1

then adding this patch should suffice.  We'll do some more testing
locally.

Regards,
Andrew Vasquez


* Re: Poisoning of Linux initiators on SCST reboot.
@ 2008-08-08 20:10 greg
  2008-08-14  5:28 ` Andrew Vasquez
  0 siblings, 1 reply; 15+ messages in thread
From: greg @ 2008-08-08 20:10 UTC (permalink / raw)
  To: Andrew Vasquez; +Cc: scst-devel, linux-driver, linux-scsi, stf_xl, vst

On Aug 5,  6:52am, Andrew Vasquez wrote:
} Subject: Re: Poisoning of Linux initiators on SCST reboot.

Hi Andrew, thanks for the follow up.

> > Any idea on how tough a backport to 2.6.26 or earlier would be?
> > Running 'head-of-git' from Linus' tree would be somewhat of a stretch
> > for production applications.... :-)(
> 
> We are still looking to finalize a migration path.  As it stands now,
> the least invasive mechanism looks like backporting:
> 
> [SCSI] qla2xxx: Add dev_loss_tmo_callbk/terminate_rport_io callback support.
> http://git.kernel.org/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commitdiff;h=5f3a9a207f1fccde476dd31b4c63ead2967d934f
> 
> [SCSI] qla2xxx: Set an rport's dev_loss_tmo value in a consistent manner.
> http://git.kernel.org/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commitdiff;h=85821c906cf3563a00a3d98fa380a2581a7a5ff1
> 
> then adding this patch should suffice.  We'll do some more testing
> locally.

I checked out the v2.6.26 tag into a branch and ported the patch
series.  Everything went in with minimal difficulty; I'm including
a composite patch below.

The modules seem to build fine but I haven't had time to hitch them up
to the harness to see whether they explode or not.  Will let everyone
know what we find.

---------------------------------------------------------------------------
diff --git a/drivers/scsi/qla2xxx/qla_attr.c b/drivers/scsi/qla2xxx/qla_attr.c
index 8dd88fc..dd9d493 100644
--- a/drivers/scsi/qla2xxx/qla_attr.c
+++ b/drivers/scsi/qla2xxx/qla_attr.c
@@ -972,26 +972,50 @@ qla2x00_get_starget_port_id(struct scsi_target *starget)
 }
 
 static void
-qla2x00_get_rport_loss_tmo(struct fc_rport *rport)
+qla2x00_set_rport_loss_tmo(struct fc_rport *rport, uint32_t timeout)
 {
-	struct Scsi_Host *host = rport_to_shost(rport);
-	scsi_qla_host_t *ha = shost_priv(host);
-
-	rport->dev_loss_tmo = ha->port_down_retry_count + 5;
+	if (timeout)
+		rport->dev_loss_tmo = timeout;
+	else
+		rport->dev_loss_tmo = 1;
 }
 
 static void
-qla2x00_set_rport_loss_tmo(struct fc_rport *rport, uint32_t timeout)
+qla2x00_dev_loss_tmo_callbk(struct fc_rport *rport)
 {
 	struct Scsi_Host *host = rport_to_shost(rport);
-	scsi_qla_host_t *ha = shost_priv(host);
+	fc_port_t *fcport = *(fc_port_t **)rport->dd_data;
+
+	qla2x00_abort_fcport_cmds(fcport);
+
+	/*
+	 * Transport has effectively 'deleted' the rport, clear
+	 * all local references.
+	 */
+	spin_lock_irq(host->host_lock);
+	fcport->rport = NULL;
+	*((fc_port_t **)rport->dd_data) = NULL;
+	spin_unlock_irq(host->host_lock);
+}
 
-	if (timeout)
-		ha->port_down_retry_count = timeout;
-	else
-		ha->port_down_retry_count = 1;
+static void
+qla2x00_terminate_rport_io(struct fc_rport *rport)
+{
+	fc_port_t *fcport = *(fc_port_t **)rport->dd_data;
+
+	/*
+	 * At this point all fcport's software-states are cleared.  Perform any
+	 * final firmware cleanup of resources (PCBs and XCBs).
+	 */
+	if (fcport->loop_id != FC_NO_LOOP_ID) {
+		fcport->ha->isp_ops->fabric_logout(fcport->ha, fcport->loop_id,
+		    fcport->d_id.b.domain, fcport->d_id.b.area,
+		    fcport->d_id.b.al_pa);
+		fcport->loop_id = FC_NO_LOOP_ID;
+	}
 
-	rport->dev_loss_tmo = ha->port_down_retry_count + 5;
+	qla2x00_abort_fcport_cmds(fcport);
+	scsi_target_unblock(&rport->dev);
 }
 
 static int
@@ -1248,11 +1272,12 @@ struct fc_function_template qla2xxx_transport_functions = {
 	.get_starget_port_id  = qla2x00_get_starget_port_id,
 	.show_starget_port_id = 1,
 
-	.get_rport_dev_loss_tmo = qla2x00_get_rport_loss_tmo,
 	.set_rport_dev_loss_tmo = qla2x00_set_rport_loss_tmo,
 	.show_rport_dev_loss_tmo = 1,
 
 	.issue_fc_host_lip = qla2x00_issue_lip,
+	.dev_loss_tmo_callbk = qla2x00_dev_loss_tmo_callbk,
+	.terminate_rport_io = qla2x00_terminate_rport_io,
 	.get_fc_host_stats = qla2x00_get_fc_host_stats,
 
 	.vport_create = qla24xx_vport_create,
@@ -1291,11 +1316,12 @@ struct fc_function_template qla2xxx_transport_vport_functions = {
 	.get_starget_port_id  = qla2x00_get_starget_port_id,
 	.show_starget_port_id = 1,
 
-	.get_rport_dev_loss_tmo = qla2x00_get_rport_loss_tmo,
 	.set_rport_dev_loss_tmo = qla2x00_set_rport_loss_tmo,
 	.show_rport_dev_loss_tmo = 1,
 
 	.issue_fc_host_lip = qla2x00_issue_lip,
+	.dev_loss_tmo_callbk = qla2x00_dev_loss_tmo_callbk,
+	.terminate_rport_io = qla2x00_terminate_rport_io,
 	.get_fc_host_stats = qla2x00_get_fc_host_stats,
 };
 
diff --git a/drivers/scsi/qla2xxx/qla_def.h b/drivers/scsi/qla2xxx/qla_def.h
index 8dd6000..7b0ddc8 100644
--- a/drivers/scsi/qla2xxx/qla_def.h
+++ b/drivers/scsi/qla2xxx/qla_def.h
@@ -1544,7 +1544,6 @@ typedef struct fc_port {
 	int login_retry;
 	atomic_t port_down_timer;
 
-	spinlock_t rport_lock;
 	struct fc_rport *rport, *drport;
 	u32 supported_classes;
 
diff --git a/drivers/scsi/qla2xxx/qla_gbl.h b/drivers/scsi/qla2xxx/qla_gbl.h
index 9b4bebe..5a50fb7 100644
--- a/drivers/scsi/qla2xxx/qla_gbl.h
+++ b/drivers/scsi/qla2xxx/qla_gbl.h
@@ -71,6 +71,8 @@ extern int qla2x00_post_aen_work(struct scsi_qla_host *, enum
 extern int qla2x00_post_hwe_work(struct scsi_qla_host *, uint16_t , uint16_t,
     uint16_t, uint16_t);
 
+extern void qla2x00_abort_fcport_cmds(fc_port_t *);
+
 /*
  * Global Functions in qla_mid.c source file.
  */
diff --git a/drivers/scsi/qla2xxx/qla_init.c b/drivers/scsi/qla2xxx/qla_init.c
index bbbc5a6..c7388fa 100644
--- a/drivers/scsi/qla2xxx/qla_init.c
+++ b/drivers/scsi/qla2xxx/qla_init.c
@@ -1864,12 +1864,11 @@ qla2x00_rport_del(void *data)
 {
 	fc_port_t *fcport = data;
 	struct fc_rport *rport;
-	unsigned long flags;
 
-	spin_lock_irqsave(&fcport->rport_lock, flags);
+	spin_lock_irq(fcport->ha->host->host_lock);
 	rport = fcport->drport;
 	fcport->drport = NULL;
-	spin_unlock_irqrestore(&fcport->rport_lock, flags);
+	spin_unlock_irq(fcport->ha->host->host_lock);
 	if (rport)
 		fc_remote_port_delete(rport);
 }
@@ -1898,7 +1897,6 @@ qla2x00_alloc_fcport(scsi_qla_host_t *ha, gfp_t flags)
 	atomic_set(&fcport->state, FCS_UNCONFIGURED);
 	fcport->flags = FCF_RLC_SUPPORT;
 	fcport->supported_classes = FC_COS_UNSPECIFIED;
-	spin_lock_init(&fcport->rport_lock);
 
 	return fcport;
 }
@@ -2243,28 +2241,24 @@ qla2x00_reg_remote_port(scsi_qla_host_t *ha, fc_port_t *fcport)
 {
 	struct fc_rport_identifiers rport_ids;
 	struct fc_rport *rport;
-	unsigned long flags;
 
 	if (fcport->drport)
 		qla2x00_rport_del(fcport);
-	if (fcport->rport)
-		return;
 
 	rport_ids.node_name = wwn_to_u64(fcport->node_name);
 	rport_ids.port_name = wwn_to_u64(fcport->port_name);
 	rport_ids.port_id = fcport->d_id.b.domain << 16 |
 	    fcport->d_id.b.area << 8 | fcport->d_id.b.al_pa;
 	rport_ids.roles = FC_RPORT_ROLE_UNKNOWN;
-	rport = fc_remote_port_add(ha->host, 0, &rport_ids);
+	fcport->rport = rport = fc_remote_port_add(ha->host, 0, &rport_ids);
 	if (!rport) {
 		qla_printk(KERN_WARNING, ha,
 		    "Unable to allocate fc remote port!\n");
 		return;
 	}
-	spin_lock_irqsave(&fcport->rport_lock, flags);
-	fcport->rport = rport;
+	spin_lock_irq(fcport->ha->host->host_lock);
 	*((fc_port_t **)rport->dd_data) = fcport;
-	spin_unlock_irqrestore(&fcport->rport_lock, flags);
+	spin_unlock_irq(fcport->ha->host->host_lock);
 
 	rport->supported_classes = fcport->supported_classes;
 
diff --git a/drivers/scsi/qla2xxx/qla_os.c b/drivers/scsi/qla2xxx/qla_os.c
index 48eaa3b..047ee64 100644
--- a/drivers/scsi/qla2xxx/qla_os.c
+++ b/drivers/scsi/qla2xxx/qla_os.c
@@ -388,7 +388,7 @@ qla2x00_queuecommand(struct scsi_cmnd *cmd, void (*done)(struct scsi_cmnd *))
 	}
 
 	/* Close window on fcport/rport state-transitioning. */
-	if (!*(fc_port_t **)rport->dd_data) {
+	if (fcport->drport) {
 		cmd->result = DID_IMM_RETRY << 16;
 		goto qc_fail_command;
 	}
@@ -455,7 +455,7 @@ qla24xx_queuecommand(struct scsi_cmnd *cmd, void (*done)(struct scsi_cmnd *))
 	}
 
 	/* Close window on fcport/rport state-transitioning. */
-	if (!*(fc_port_t **)rport->dd_data) {
+	if (fcport->drport) {
 		cmd->result = DID_IMM_RETRY << 16;
 		goto qc24_fail_command;
 	}
@@ -617,6 +617,40 @@ qla2x00_wait_for_loop_ready(scsi_qla_host_t *ha)
 	return (return_status);
 }
 
+void
+qla2x00_abort_fcport_cmds(fc_port_t *fcport)
+{
+	int cnt;
+	unsigned long flags;
+	srb_t *sp;
+	scsi_qla_host_t *ha = fcport->ha;
+	scsi_qla_host_t *pha = to_qla_parent(ha);
+
+	spin_lock_irqsave(&pha->hardware_lock, flags);
+	for (cnt = 1; cnt < MAX_OUTSTANDING_COMMANDS; cnt++) {
+		sp = pha->outstanding_cmds[cnt];
+		if (!sp)
+			continue;
+		if (sp->fcport != fcport)
+			continue;
+
+		spin_unlock_irqrestore(&pha->hardware_lock, flags);
+		if (ha->isp_ops->abort_command(ha, sp)) {
+			DEBUG2(qla_printk(KERN_WARNING, ha,
+			    "Abort failed --  %lx\n", sp->cmd->serial_number));
+		} else {
+			if (qla2x00_eh_wait_on_command(ha, sp->cmd) !=
+			    QLA_SUCCESS)
+				DEBUG2(qla_printk(KERN_WARNING, ha,
+				    "Abort failed while waiting --  %lx\n",
+				    sp->cmd->serial_number));
+
+		}
+		spin_lock_irqsave(&pha->hardware_lock, flags);
+	}
+	spin_unlock_irqrestore(&pha->hardware_lock, flags);
+}
+
 static void
 qla2x00_block_error_handler(struct scsi_cmnd *cmnd)
 {
@@ -1073,7 +1107,7 @@ qla2xxx_slave_configure(struct scsi_device *sdev)
 	else
 		scsi_deactivate_tcq(sdev, ha->max_q_depth);
 
-	rport->dev_loss_tmo = ha->port_down_retry_count + 5;
+	rport->dev_loss_tmo = ha->port_down_retry_count;
 
 	return 0;
 }
@@ -1813,7 +1847,6 @@ static inline void
 qla2x00_schedule_rport_del(struct scsi_qla_host *ha, fc_port_t *fcport,
     int defer)
 {
-	unsigned long flags;
 	struct fc_rport *rport;
 
 	if (!fcport->rport)
@@ -1821,19 +1854,13 @@ qla2x00_schedule_rport_del(struct scsi_qla_host *ha, fc_port_t *fcport,
 
 	rport = fcport->rport;
 	if (defer) {
-		spin_lock_irqsave(&fcport->rport_lock, flags);
+		spin_lock_irq(ha->host->host_lock);
 		fcport->drport = rport;
-		fcport->rport = NULL;
-		*(fc_port_t **)rport->dd_data = NULL;
-		spin_unlock_irqrestore(&fcport->rport_lock, flags);
+		spin_unlock_irq(ha->host->host_lock);
 		set_bit(FCPORT_UPDATE_NEEDED, &ha->dpc_flags);
-	} else {
-		spin_lock_irqsave(&fcport->rport_lock, flags);
-		fcport->rport = NULL;
-		*(fc_port_t **)rport->dd_data = NULL;
-		spin_unlock_irqrestore(&fcport->rport_lock, flags);
+		qla2xxx_wake_dpc(ha);
+	} else
 		fc_remote_port_delete(rport);
-	}
 }
 
 /*
---------------------------------------------------------------------------
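
For anyone reading the composite patch, the deferred-deletion window
that the queuecommand hunks now test for can be modelled in user space.
This is only a sketch under stated assumptions: the model_* names are
hypothetical, the host_lock locking is left out, and rport deletion is
reduced to setting a flag.  It shows that, once deletion is deferred,
new commands are bounced for retry until the worker clears drport.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical outcomes standing in for "queued" and DID_IMM_RETRY. */
enum model_result { MODEL_QUEUED, MODEL_IMM_RETRY };

/* Hypothetical stand-in for struct fc_rport. */
struct model_rport {
	int deleted;	/* set where fc_remote_port_delete() would run */
};

/* Hypothetical stand-in for fc_port_t, reduced to the two pointers. */
struct model_fcport {
	struct model_rport *rport;	/* registered remote port */
	struct model_rport *drport;	/* remote port awaiting deletion */
};

/* Defer branch marks the rport; the other branch deletes immediately. */
void model_schedule_rport_del(struct model_fcport *fcport, int defer)
{
	if (!fcport->rport)
		return;
	if (defer)
		fcport->drport = fcport->rport;
	else
		fcport->rport->deleted = 1;
}

/* Worker side: take the pending rport, clear the marker, then delete. */
void model_rport_del(struct model_fcport *fcport)
{
	struct model_rport *rport = fcport->drport;

	fcport->drport = NULL;
	assert(fcport->drport == NULL);
	if (rport)
		rport->deleted = 1;
}

/* Submission side: refuse new commands while a deletion is pending. */
enum model_result model_queuecommand(const struct model_fcport *fcport)
{
	return fcport->drport ? MODEL_IMM_RETRY : MODEL_QUEUED;
}
```

With a port whose rport is registered, model_schedule_rport_del(&p, 1)
should make model_queuecommand(&p) return MODEL_IMM_RETRY until
model_rport_del(&p) has run.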

> Regards,
> Andrew Vasquez

Best wishes for a pleasant weekend to everyone.

}-- End of excerpt from Andrew Vasquez

As always,
Dr. G.W. Wettstein, Ph.D.   Enjellic Systems Development, LLC.
4206 N. 19th Ave.           Specializing in information infra-structure
Fargo, ND  58102            development.
PH: 701-281-1686
FAX: 701-281-3949           EMAIL: greg@enjellic.com
------------------------------------------------------------------------------
"... then the day came when the risk to remain tight in a bud was more
 painful than the risk it took to blossom."
                                -- Anais Nin


* Re: Poisoning of Linux initiators on SCST reboot.
  2008-08-08 20:10 Poisoning of Linux initiators on SCST reboot greg
@ 2008-08-14  5:28 ` Andrew Vasquez
  2008-08-14 13:53   ` James Bottomley
  0 siblings, 1 reply; 15+ messages in thread
From: Andrew Vasquez @ 2008-08-14  5:28 UTC (permalink / raw)
  To: greg, James Bottomley; +Cc: scst-devel, linux-driver, linux-scsi, stf_xl, vst

On Fri, 08 Aug 2008, greg@enjellic.com wrote:

> On Aug 5,  6:52am, Andrew Vasquez wrote:
> } Subject: Re: Poisoning of Linux initiators on SCST reboot.
> 
> Hi Andrew, thanks for the follow up.
> 
> > > Any idea on how tough a backport to 2.6.26 or earlier would be?
> > > Running 'head-of-git' from Linus' tree would be somewhat of a stretch
> > > for production applications.... :-)(
> > 
> > We are still looking to finalize a migration path.  As it stands now,
> > the least invasive mechanism looks like backporting:
> > 
> > [SCSI] qla2xxx: Add dev_loss_tmo_callbk/terminate_rport_io callback support.
> > http://git.kernel.org/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commitdiff;h=5f3a9a207f1fccde476dd31b4c63ead2967d934f
> > 
> > [SCSI] qla2xxx: Set an rport's dev_loss_tmo value in a consistent manner.
> > http://git.kernel.org/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commitdiff;h=85821c906cf3563a00a3d98fa380a2581a7a5ff1
> > 
> > then adding this patch should suffice.  We'll do some more testing
> > locally.
> 
> I checked out the v2.6.26 tag into a branch and ported the patch
> series.  Everything went in with minimal difficulty; I'm including
> a composite patch below.
> 
> The modules seem to build fine but I haven't had time to hitch them up
> to the harness to see whether they explode or not.  Will let everyone
> know what we find.

Ok, we've verified and backported the three changes through to 2.6.24.
The patches in this order:

 [SCSI] qla2xxx: Add dev_loss_tmo_callbk/terminate_rport_io callback support.
 http://git.kernel.org/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commitdiff;h=5f3a9a207f1fccde476dd31b4c63ead2967d934f

 [SCSI] qla2xxx: Set an rport's dev_loss_tmo value in a consistent manner.
 http://git.kernel.org/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commitdiff;h=85821c906cf3563a00a3d98fa380a2581a7a5ff1

 [PATCH 2/8] qla2xxx: Correct synchronization of software/firmware fcport states.
 http://article.gmane.org/gmane.linux.scsi/43971

apply cleanly to 2.6.26 (git-am clean), and with minor 'fuzz' (git-am
warns) while applying the first patch against 2.6.25 and 2.6.24.

James B., quick question, from what I recall of the 'stable' process,
I'll need to wait till the 3rd patch hits Linus' tree before I can
submit to stable@kernel.org.  Since I can't 'CC to stable' the first
two patches (they are already in Linus' tree), what's the best way to
accommodate this disjoint submission?

Thanks, av


* Re: Poisoning of Linux initiators on SCST reboot.
  2008-08-14  5:28 ` Andrew Vasquez
@ 2008-08-14 13:53   ` James Bottomley
  2008-08-14 16:14     ` Andrew Vasquez
  0 siblings, 1 reply; 15+ messages in thread
From: James Bottomley @ 2008-08-14 13:53 UTC (permalink / raw)
  To: Andrew Vasquez; +Cc: greg, scst-devel, linux-driver, linux-scsi, stf_xl, vst

On Wed, 2008-08-13 at 22:28 -0700, Andrew Vasquez wrote:
> On Fri, 08 Aug 2008, greg@enjellic.com wrote:
> 
> > On Aug 5,  6:52am, Andrew Vasquez wrote:
> > } Subject: Re: Poisoning of Linux initiators on SCST reboot.
> > 
> > Hi Andrew, thanks for the follow up.
> > 
> > > > Any idea on how tough a backport to 2.6.26 or earlier would be?
> > > > Running 'head-of-git' from Linus' tree would be somewhat of a stretch
> > > > for production applications.... :-)(
> > > 
> > > We are still looking to finalize a migration path.  As it stands now,
> > > the least invasive mechanism looks like backporting:
> > > 
> > > [SCSI] qla2xxx: Add dev_loss_tmo_callbk/terminate_rport_io callback support.
> > > http://git.kernel.org/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commitdiff;h=5f3a9a207f1fccde476dd31b4c63ead2967d934f
> > > 
> > > [SCSI] qla2xxx: Set an rport's dev_loss_tmo value in a consistent manner.
> > > http://git.kernel.org/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commitdiff;h=85821c906cf3563a00a3d98fa380a2581a7a5ff1
> > > 
> > > then adding this patch should suffice.  We'll do some more testing
> > > locally.
> > 
> > I checked out the v2.6.26 tag into a branch and ported the patch
> > series.  Everything went in with minimal difficulties, I'm including
> > a composite patch below
> > 
> > The modules seem to build fine but I haven't had time to hitch them up
> > to the harness to see whether they explode or not.  Will let everyone
> > know what we find.
> 
> Ok, we've verified and backported the three changes through to 2.6.24.
> The patches in this order:
> 
>  [SCSI] qla2xxx: Add dev_loss_tmo_callbk/terminate_rport_io callback support.
>  http://git.kernel.org/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commitdiff;h=5f3a9a207f1fccde476dd31b4c63ead2967d934f
> 
>  [SCSI] qla2xxx: Set an rport's dev_loss_tmo value in a consistent manner.
>  http://git.kernel.org/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commitdiff;h=85821c906cf3563a00a3d98fa380a2581a7a5ff1
> 
>  [PATCH 2/8] qla2xxx: Correct synchronization of software/firmware fcport states.
>  http://article.gmane.org/gmane.linux.scsi/43971
> 
> apply cleanly to 2.6.26 (git-am clean), and with minor 'fuzz' (git-am
> warns) while applying the first patch against 2.6.25 and 2.6.24.
> 
> James B., quick question, from what I recall of the 'stable' process,
> I'll need to wait till the 3rd patch hits Linus' tree before I can
> submit to stable@kernel.org.  Since I can't 'CC to stable' the first
> two patches (they are already in Linus' tree), what's the best way to
> accommodate this disjoint submission?

Assuming they're OK to apply individually, you just send the first two
to stable now quoting the upstream commit id and I'll add the CC: stable
tag on the third which should cause it to submit automatically as soon
as Linus pulls it.

James




* Re: Poisoning of Linux initiators on SCST reboot.
  2008-08-14 13:53   ` James Bottomley
@ 2008-08-14 16:14     ` Andrew Vasquez
  0 siblings, 0 replies; 15+ messages in thread
From: Andrew Vasquez @ 2008-08-14 16:14 UTC (permalink / raw)
  To: James Bottomley; +Cc: greg, scst-devel, linux-driver, linux-scsi, stf_xl, vst

On Thu, 14 Aug 2008, James Bottomley wrote:

> On Wed, 2008-08-13 at 22:28 -0700, Andrew Vasquez wrote:
> > Ok, we've verified and backported the three changes through to 2.6.24.
> > The patches in this order:
> > 
> >  [SCSI] qla2xxx: Add dev_loss_tmo_callbk/terminate_rport_io callback support.
> >  http://git.kernel.org/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commitdiff;h=5f3a9a207f1fccde476dd31b4c63ead2967d934f
> > 
> >  [SCSI] qla2xxx: Set an rport's dev_loss_tmo value in a consistent manner.
> >  http://git.kernel.org/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commitdiff;h=85821c906cf3563a00a3d98fa380a2581a7a5ff1
> > 
> >  [PATCH 2/8] qla2xxx: Correct synchronization of software/firmware fcport states.
> >  http://article.gmane.org/gmane.linux.scsi/43971
> > 
> > apply cleanly to 2.6.26 (git-am clean), and with minor 'fuzz' (git-am
> > warns) while applying the first patch against 2.6.25 and 2.6.24.
> > 
> > James B., quick question, from what I recall of the 'stable' process,
> > I'll need to wait till the 3rd patch hits Linus' tree before I can
> > submit to stable@kernel.org.  Since I can't 'CC to stable' the first
> > two patches (they are already in Linus' tree), what's the best way to
> > accommodate this disjoint submission?
> 
> Assuming they're OK to apply individually, you just send the first two
> to stable now quoting the upstream commit id

As I mentioned, the first two do in fact apply cleanly to 2.6.26.
There's a bit of fuzz with .25/24 application of the first patch, so a
basic git-am may fail, whereas a 'patch -p1' works...
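
For anyone scripting the backport, the "try git-am, fall back to
'patch -p1'" behaviour described above can be sketched roughly as
follows.  This is only an illustration: the function names and the
injectable run() hook are hypothetical and not part of any existing
tool, and the patch paths are assumed to be absolute (or relative to
the kernel tree).

```python
import subprocess
from pathlib import Path
from typing import Callable, List, Sequence

# A runner takes a command and the kernel tree, and returns an exit status.
Runner = Callable[[Sequence[str], Path], int]


def run_in_tree(cmd: Sequence[str], tree: Path) -> int:
    """Run cmd inside the kernel tree and return its exit status."""
    return subprocess.run(list(cmd), cwd=tree).returncode


def apply_series(patches: Sequence[str], tree: Path,
                 run: Runner = run_in_tree) -> List[str]:
    """Apply patches in order and report which method worked for each.

    'git am' is tried first.  If it fails, the half-applied mailbox is
    aborted and the same file is retried with 'patch -p1', which
    tolerates the kind of fuzz seen on 2.6.25/2.6.24.
    """
    methods = []
    for patch in patches:
        if run(["git", "am", str(patch)], tree) == 0:
            methods.append("git-am")
            continue
        # Clean up the failed git-am state before falling back.
        run(["git", "am", "--abort"], tree)
        if run(["patch", "-p1", "-i", str(patch)], tree) != 0:
            raise RuntimeError(
                f"{patch} applies with neither git-am nor patch -p1")
        methods.append("patch -p1")
    return methods
```

Note that a patch applied through the 'patch -p1' fallback only
changes the working tree; it still has to be committed by hand, which
the sketch deliberately leaves out.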

> and I'll add the CC: stable
> tag on the third which should cause it to submit automatically as soon
> as Linus pulls it.

Ok, thanks.  I'll send the patches over to stable@...


* Re: Poisoning of Linux initiators on SCST reboot.
@ 2008-08-20 21:13 greg
  2008-08-20 22:30 ` Andrew Vasquez
  0 siblings, 1 reply; 15+ messages in thread
From: greg @ 2008-08-20 21:13 UTC (permalink / raw)
  To: Andrew Vasquez, greg, James Bottomley
  Cc: scst-devel, linux-driver, linux-scsi, vst

On Aug 13, 10:28pm, Andrew Vasquez wrote:
} Subject: Re: Poisoning of Linux initiators on SCST reboot.

Good afternoon to everyone, hope the day is going well.

> Ok, we've verified and backported the three changes through to 2.6.24.
> The patches in this order:
> 
>  [SCSI] qla2xxx: Add dev_loss_tmo_callbk/terminate_rport_io callback support.
>  http://git.kernel.org/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commitdiff;h=5f3a9a207f1fccde476dd31b4c63ead2967d934f
> 
>  [SCSI] qla2xxx: Set an rport's dev_loss_tmo value in a consistent manner.
>  http://git.kernel.org/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commitdiff;h=85821c906cf3563a00a3d98fa380a2581a7a5ff1
> 
>  [PATCH 2/8] qla2xxx: Correct synchronization of software/firmware fcport states.
>  http://article.gmane.org/gmane.linux.scsi/43971
> 
> apply cleanly to 2.6.26 (git-am clean), and with minor 'fuzz' (git-am
> warns) while applying the first patch against 2.6.25 and 2.6.24.

We ran into an issue today which I wanted to bounce off everyone since
it may be related.  If not there may be another issue to look at.

We were transitioning storage on a pair of our production boxes from
an existing Linux SCSI target solution to SCST.  Previously the
storage was being accessed as target 0/LUN1.  Under SCST the storage
would be accessed as target 0/LUN0.

The target machine was upgraded and rebooted.  SCST loaded and
initialized.  The MDS indicated the initiator and target were both
logged into the zone.   So there would seem to be connectivity at the
link layer between the initiator/target and the switch.

Unfortunately we cannot get a session established on the target for
the initiator(s).  The initiators are running stock RHEL5 2.6.18
kernels.

Enabling/disabling the interface on the target server results in the
following messages on the initiators:

Aug 20 14:54:27 initiator kernel: rport-4:0-1: blocked FC remote port
time out: saving binding

The following are also noted in the output of dmesg on the initiators:

scsi 4:0:0:0: timing out command, waited 22s
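
To watch for these two messages across several initiators, something
along the following lines could be used.  It is a minimal sketch that
assumes the message wording is exactly as quoted above (other kernel
versions may word them differently); the summarize() helper is
hypothetical.

```python
import re

# Patterns taken from the two messages quoted above.  \s+ tolerates the
# line wrap introduced by mail clients.
RPORT_RE = re.compile(
    r"rport-(\d+):(\d+)-(\d+): blocked FC remote port\s+time out")
SCSI_RE = re.compile(
    r"scsi (\d+):\d+:\d+:\d+: timing out command, waited \d+s")


def summarize(log_text):
    """Group blocked rports and command timeouts by SCSI host number."""
    hosts = {}
    for m in RPORT_RE.finditer(log_text):
        host, channel, rport = m.groups()
        entry = hosts.setdefault(int(host), {"rports": set(), "timeouts": 0})
        entry["rports"].add(f"rport-{host}:{channel}-{rport}")
    for m in SCSI_RE.finditer(log_text):
        entry = hosts.setdefault(int(m.group(1)),
                                 {"rports": set(), "timeouts": 0})
        entry["timeouts"] += 1
    return hosts


if __name__ == "__main__":
    sample = (
        "Aug 20 14:54:27 initiator kernel: rport-4:0-1: blocked FC remote "
        "port time out: saving binding\n"
        "scsi 4:0:0:0: timing out command, waited 22s\n"
    )
    print(summarize(sample))
```

A host number that shows up with both a blocked rport and command
timeouts is a reasonable candidate for the 'poisoned' state, though
that inference is an assumption rather than something the driver
reports directly.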

There is a remote port defined for the target server.  The port WWN
and FCID match previous values.  The only difference is the LUN on
which the storage is being delivered.
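
The remote port's WWN, FCID and state can be read from the FC
transport class in sysfs.  The sketch below assumes the usual
/sys/class/fc_remote_ports layout with the port_name, port_id,
port_state, roles and dev_loss_tmo attributes; which attributes are
present depends on the kernel, so missing ones are reported as None.
The read_rports() helper is hypothetical.

```python
from pathlib import Path

# Attributes commonly exposed per remote port by the FC transport class.
ATTRS = ("port_name", "port_id", "port_state", "roles", "dev_loss_tmo")


def read_rports(root="/sys/class/fc_remote_ports"):
    """Return {rport name: {attribute: value or None}} for every rport."""
    rports = {}
    base = Path(root)
    if not base.is_dir():
        # No FC transport class on this machine (or wrong path).
        return rports
    for entry in sorted(base.glob("rport-*")):
        info = {}
        for attr in ATTRS:
            try:
                info[attr] = (entry / attr).read_text().strip()
            except OSError:
                # Attribute missing or unreadable on this kernel.
                info[attr] = None
        rports[entry.name] = info
    return rports


if __name__ == "__main__":
    for name, info in read_rports().items():
        print(name, info)
```

Comparing this output before and after the target is rebooted would
show whether the initiator still considers the rport blocked even
though the WWN and FCID are unchanged.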

We tore down the SCST storage definition on the target and re-mapped
the storage as LUN 1 but this had no effect on the situation.  That
isn't really surprising since the problem appears to be secondary to the
initiator and target being unable to establish an N_PORT relationship.

I would be interested in any thoughts the group might have.  From the
perspective of the initiators the behavior seems somewhat identical to
what we experienced earlier.  The Qlogic driver is essentially
'poisoned' with respect to its ability to access the remote port which
has seen a change in configuration.

I should note that it doesn't appear there was an attempt by the
target's HBA to log into the fabric as an initiator.  So this would
seem to be a different scenario than what we noted before when the
target transitioned to an initiator role and back to a target role
from the perspective of the initiator.

> Thanks, av

We are in the process of scheduling an outage to reboot the initiators
to see if we can clear the situation.  Holler quickly if anyone has
any additional testing they would like conducted and I will try to get
that done before the outage.

Have a good evening.

}-- End of excerpt from Andrew Vasquez

As always,
Dr. G.W. Wettstein, Ph.D.   Enjellic Systems Development, LLC.
4206 N. 19th Ave.           Specializing in information infra-structure
Fargo, ND  58102            development.
PH: 701-281-1686
FAX: 701-281-3949           EMAIL: greg@enjellic.com
------------------------------------------------------------------------------
"Any intelligent fool can make things bigger and more complex... It
 takes a touch of genius - and a lot of courage to move in the opposite
 direction."
                                -- Albert Einstein


* Re: Poisoning of Linux initiators on SCST reboot.
  2008-08-20 21:13 greg
@ 2008-08-20 22:30 ` Andrew Vasquez
  0 siblings, 0 replies; 15+ messages in thread
From: Andrew Vasquez @ 2008-08-20 22:30 UTC (permalink / raw)
  To: greg
  Cc: James Bottomley, scst-devel, linux-driver, linux-scsi, vst,
	Marcus Barrow

On Wed, 20 Aug 2008, greg@enjellic.com wrote:

> 
> On Aug 13, 10:28pm, Andrew Vasquez wrote:
> } Subject: Re: Poisoning of Linux initiators on SCST reboot.
> 
> Good afternoon to everyone, hope the day is going well.
> 
> > Ok, we've verified and backported the three changes through to 2.6.24.
> > The patches in this order:
> > 
> >  [SCSI] qla2xxx: Add dev_loss_tmo_callbk/terminate_rport_io callback support.
> >  http://git.kernel.org/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commitdiff;h=5f3a9a207f1fccde476dd31b4c63ead2967d934f
> > 
> >  [SCSI] qla2xxx: Set an rport's dev_loss_tmo value in a consistent manner.
> >  http://git.kernel.org/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commitdiff;h=85821c906cf3563a00a3d98fa380a2581a7a5ff1
> > 
> >  [PATCH 2/8] qla2xxx: Correct synchronization of software/firmware fcport states.
> >  http://article.gmane.org/gmane.linux.scsi/43971
> > 
> > apply cleanly to 2.6.26 (git-am clean), and with minor 'fuzz' (git-am
> > warns) while applying the first patch against 2.6.25 and 2.6.24.
> 
> We ran into an issue today which I wanted to bounce off everyone since
> it may be related.  If not there may be another issue to look at.
> 
> We were transitioning storage on a pair of our production boxes from
> an existing Linux SCSI target solution to SCST.  Previously the
> storage was being accessed as target 0/LUN1.  Under SCST the storage
> would be accessed as target 0/LUN0.
> 
> The target machine was upgraded and rebooted.  SCST loaded and
> initialized.  The MDS indicated the initiator and target were both
> logged into the zone.   So there would seem to be connectivity at the
> link layer between the initiator/target and the switch.
> 
> Unfortunately we cannot get a session established on the target for
> the initiator(s).  The initiators are running stock RHEL5 2.6.18
> kernels.
> 
> Enabling/disabling the interface on the target server results in the
> following messages on the initiators:
> 
> Aug 20 14:54:27 initiator kernel: rport-4:0-1: blocked FC remote port
> time out: saving binding
> 
> The following are also noted in the output of dmesg on the initiators:
> 
> scsi 4:0:0:0: timing out command, waited 22s
> 
> There is a remote port defined for the target server.  The port WWN
> and FCID match previous values.  The only difference is the LUN on
> which the storage is being delivered.
> 
> We tore down the SCST storage definition on the target and re-mapped
> the storage as LUN 1 but this had no effect on the situation.  That
> isn't really surprising since the problem appears to be secondary to the
> initiator and target being unable to establish an N_PORT relationship.
> 
> I would be interested in any thoughts the group might have.  From the
> perspective of the initiators the behavior seems somewhat identical to
> what we experienced earlier.  The Qlogic driver is essentially
> 'poisoned' with respect to its ability to access the remote port which
> has seen a change in configuration.

These upstream changes are in the queue of updates to be pushed for
RHEL5.3.

Regards,
Andrew Vasquez


* Re: Poisoning of Linux initiators on SCST reboot.
@ 2008-08-20 23:44 greg
  0 siblings, 0 replies; 15+ messages in thread
From: greg @ 2008-08-20 23:44 UTC (permalink / raw)
  To: Andrew Vasquez
  Cc: James Bottomley, scst-devel, linux-driver, linux-scsi, vst,
	Marcus Barrow

On Aug 20,  3:30pm, Andrew Vasquez wrote:
} Subject: Re: Poisoning of Linux initiators on SCST reboot.

Hi Andrew, thanks for the note.

> > There is a remote port defined for the target server.  The port WWN
> > and FCID match previous values.  The only difference is the LUN on
> > which the storage is being delivered.
> > 
> > We tore down the SCST storage definition on the target and re-mapped
> > the storage as LUN 1 but this had no effect on the situation.  That
> > isn't really surprising since the problem appears to be secondary to the
> > initiator and target being unable to establish an N_PORT relationship.
> > 
> > I would be interested in any thoughts the group might have.  From the
> > perspective of the initiators the behavior seems somewhat identical to
> > what we experienced earlier.  The Qlogic driver is essentially
> > 'poisoned' with respect to its ability to access the remote port which
> > has seen a change in configuration.

> These upstream changes are in the queue of updates to be pushed for
> RHEL5.3.

Very good to hear, the client guys will be glad to hear that.

In the meantime is our analysis correct that this is the same problem,
only caused by the LUN change rather than the role changes?  Knowing
that would be very helpful.

If so we are hosed and will need to proceed with our outage to recover
the initiators.  It will also impact the next round of migrations we
need to do.

> Regards,
> Andrew Vasquez

Thanks for the information, will look forward to confirmation on the above.

Have a good evening.

}-- End of excerpt from Andrew Vasquez

As always,
Dr. G.W. Wettstein, Ph.D.   Enjellic Systems Development, LLC.
4206 N. 19th Ave.           Specializing in information infra-structure
Fargo, ND  58102            development.
PH: 701-281-1686
FAX: 701-281-3949           EMAIL: greg@enjellic.com
------------------------------------------------------------------------------
"How appropriate, you fight like a cow."
                                -- Guybrush Threepwood

