Re: Poisoning of Linux initiators on SCST reboot.

From: Andrew Vasquez <andrew.vasquez@qlogic.com>
To: greg@enjellic.com
Cc: scst-devel@lists.sourceforge.net, linux-driver@qlogic.com,
	linux-scsi@vger.kernel.org, stf_xl@wp.pl, vst@vlnb.net
Subject: Re: Poisoning of Linux initiators on SCST reboot.
Date: Fri, 25 Jul 2008 06:45:43 -0700
Message-ID: <20080725134543.GN14959@plap4-2.local>
In-Reply-To: <200807241750.m6OHophm028516@wind.enjellic.com>

On Thu, 24 Jul 2008, greg@enjellic.com wrote:

> Good morning to everyone, hope your respective days are going well.
> Sorry for the wide cast on this but I wanted to get what would seem to
> be the concerned parties on this issue in the loop.
<snip>
> The targets are using Qlogic 2462 cards using the isp_mod driver.  The
> client initiators are using Qlogic 2342 cards with the qla2xxx driver.
> 
> Failure mode is as follows:
> 
>         1.) Configure SCST based storage for an initiator (vdisk
>             based).
> 
>         2.) Activate initiator.  Initiator logs into fabric and
>             discovers SCST based storage.
> 
>         3.) Force SCST target failure by rebooting or pulling power.
> 
>         4.) SCST target returns to service and logs into zone.
> 
>         5.) Initiator picks up RSCN but re-activates the rport for
>             SCST server as an INITIATOR rather than TARGET role.
> 
> After this point in time the initiator is effectively 'poisoned'.
> 
> Nothing short of unloading and reloading the Qlogic 2xxx driver on the
> client initiator will allow the initiator to recognize the SCST server
> as a target device.  A driver unload/reload of course is not an option
> to restore connectivity since it would take the remaining live side of
> the mirror off-line as well.
> 
> We finally figured out what seems to be happening by watching the logs
> on the client and comparing what was going on there to the FLOGI login
> status on the fabric.
> 
> When the SCST target server reboots the initiator times out the remote
> port and places it into 'unknown' state.  The qla2xxx driver,
> according to the source code, maintains the previous rport state in
> driver internal data.

Ok, thanks for the detailed description of the problem...

> The 2462 card in the target on boot logs into the fabric with an
> initiator role, I'm assuming in support of BIOS based SAN booting. The
> client initiator picks up on this and re-activates the rport as being
> in an INITIATOR role.

Yes, BIOS would FLOGI into the switch...  RSCN received on initiator
side, and the role registered for the rport would have been migrated
from target->initiator... (Step 1)

> Loading the isp_mod driver causes the 2462 card in the target to be
> shutdown.  The client initiator picks up on this and times out the
> rport retaining the last rport state as INITIATOR.

Ok, I would have expected this to at least start when the BIOS FLOGI'd
into the switch above...

> Enabling target mode on the 2462 causes it to log back into the
> fabric.  The client initiator picks up on the RSCN but refuses to
> transition the rport from INITIATOR to TARGET state.

Ok, so on the initiator side, I'd expect an RSCN, then PLOGI and PRLI
to the target side, the bits processed from the PRLI response, the role
migrated from UNKNOWN during fc_remote_port_add(), then to TARGET
during fc_remote_port_rolechg(). (Step 2)

> Without going
> into TARGET state the remote port won't have SCSI device discovery
> initiated against it and hence the SCST based storage is inaccessible.

Ok, could you provide the kernel log of the full failure with the
qla2xxx driver loaded with the ql2xextended_error_logging module
parameter set to 1?

> Activating a LIP on the client initiates a new fabric login attempt
> which completes with the following message:
> 
> Jul 24 02:53:59 init-test kernel: rport-2:0-0: blocked FC remote port
> time out: no longer a FCP target, removing starget
> 
> Which from a review of the source code seems consistent with our
> analysis of the problem.
> 
> The culprit is the following code from drivers/scsi/scsi_transport_fc.c:
> 
>         if ((rport->port_state == FC_PORTSTATE_ONLINE) &&
>             (rport->scsi_target_id != -1) &&
>             !(rport->roles & FC_PORT_ROLE_FCP_TARGET)) {
>                 dev_printk(KERN_ERR, &rport->dev,
>                         "blocked FC remote port time out: no longer"
>                         " a FCP target, removing starget\n");
>                 spin_unlock_irqrestore(shost->host_lock, flags);
>                 scsi_target_unblock(&rport->dev);
>                 fc_queue_work(shost, &rport->stgt_delete_work);
>                 return;
>         }

I would have expected this during step 1 (guess it depends on timing
latency during reboot/BIOS-flogi_plogi/isp_mod-load)...
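
For reference, the condition in the quoted block can be restated as a
stand-alone predicate. This is a minimal user-space sketch, not the
transport code: the struct, the MODEL_* constants and their values, and
the function name would_remove_starget() are hypothetical stand-ins for
the fields tested in the quoted if statement.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical role bits; the values are illustrative only. */
#define MODEL_ROLE_FCP_TARGET    0x01u
#define MODEL_ROLE_FCP_INITIATOR 0x02u

/* Hypothetical subset of port states, enough for the check below. */
enum model_port_state { MODEL_PORTSTATE_BLOCKED, MODEL_PORTSTATE_ONLINE };

/* Hypothetical stand-in for the rport fields the quoted check reads. */
struct model_rport {
    enum model_port_state port_state;
    int scsi_target_id;   /* -1 means no target id was ever assigned */
    uint32_t roles;       /* last roles recorded for the remote port */
};

/*
 * Mirrors the three-part condition in the quoted block: the port is
 * online, it had a SCSI target id, and its recorded roles no longer
 * include the FCP target bit.
 */
static bool would_remove_starget(const struct model_rport *rport)
{
    return rport->port_state == MODEL_PORTSTATE_ONLINE &&
           rport->scsi_target_id != -1 &&
           !(rport->roles & MODEL_ROLE_FCP_TARGET);
}
```

With this model, an online port that already holds a target id and whose
recorded roles are INITIATOR only satisfies the predicate, which matches
the "no longer a FCP target" message in the log above.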

> The above gets executed in response to the LIP on the initiator.  The
> value in rport->roles is being populated with what the remote target
> was INITIATOR rather than its current TARGET state.

Ok, so at step-1, the fc_remote_port_add() should have fallen into
this code:

        ...
        /* was a target, not in roles */
        if ((rport->scsi_target_id != -1) &&
            (!(ids->roles & FC_PORT_ROLE_FCP_TARGET)))
                return rport;

since role is unknown... then transitioned to initiator during
rolechg().

then at step-2, the same process during fc_remote_port_add() (since
the transition to target is again deferred to rolechg()), then during
rolechg() I'd expect the transport would fall into the 'else if' here:

        spin_lock_irqsave(shost->host_lock, flags);
        if (roles & FC_PORT_ROLE_FCP_TARGET) {
                if (rport->scsi_target_id == -1) {
                        rport->scsi_target_id = fc_host->next_target_id++;
                        create = 1;
                } else if (!(rport->roles & FC_PORT_ROLE_FCP_TARGET))
                        create = 1;

Hmm, but that doesn't seem to be the case here...

Let's start with the driver logs, just so I get a full picture of at
least what's happening with qla2xxx at the wire-side.

Thanks, AV
