Poisoning of Linux initiators on SCST reboot.

public inbox for linux-scsi@vger.kernel.org

* Poisoning of Linux initiators on SCST reboot.
@ 2008-07-24 17:50 greg
  2008-07-25 10:40 ` [Scst-devel] " Vladislav Bolkhovitin
                   ` (2 more replies)
  0 siblings, 3 replies; 5+ messages in thread
From: greg @ 2008-07-24 17:50 UTC (permalink / raw)
  To: scst-devel; +Cc: linux-driver, linux-scsi, stf_xl, vst

Good morning to everyone, hope your respective days are going well.
Sorry for the wide cast on this but I wanted to get what would seem to
be the concerned parties on this issue in the loop.

We have been putting SCST through an extensive round of pre-production
testing.  I wanted to start following up on some of the issues we have
noted.

We will be putting SCST into service to support mirrored storage from
client initiators to two separate data-centers.  The filesystems on
the client initiators access storage at the two data-centers via a
Linux MD RAID1 device.  The SAN architecture is based on Cisco
MDS-9509 switches.

Just as an aside for people considering use of SCST.  The core engine
has been rock solid.  Our testing rounds consist of driving around
1/8 of a petabyte of widely disparate I/O types from multiple
initiators to a pair of targets in the two data-centers.  SCST hasn't
missed a beat so far, so kudos to Vlad and everyone involved in its
development.

As we began forced failure testing one issue has come up that I wanted
to advise people of.  A hard reboot of an SCST target server results
in the 'poisoning' of Linux based initiators.  We verified the issue
as being present on client initiators running the stock RHEL5 kernel
up through 2.6.26.

The targets are using Qlogic 2462 cards using the isp_mod driver.  The
client initiators are using Qlogic 2342 cards with the qla2xxx driver.

Failure mode is as follows:

        1.) Configure SCST based storage for an initiator (vdisk
            based).

        2.) Activate initiator.  Initiator logs into fabric and
            discovers SCST based storage.

        3.) Force SCST target failure by rebooting or pulling power.

        4.) SCST target returns to service and logs into zone.

        5.) Initiator picks up RSCN but re-activates the rport for
            SCST server as an INITIATOR rather than TARGET role.

After this point in time the initiator is effectively 'poisoned'.

Nothing short of unloading and reloading the Qlogic 2xxx driver on the
client initiator will allow the initiator to recognize the SCST server
as a target device.  A driver unload/reload of course is not an option
to restore connectivity since it would take the remaining live side of
the mirror off-line as well.

We finally figured out what seems to be happening by watching the logs
on the client and comparing what was going on there to the FLOGI login
status on the fabric.

When the SCST target server reboots the initiator times out the remote
port and places it into 'unknown' state.  The qla2xxx driver,
according to the source code, maintains the previous rport state in
driver internal data.

The 2462 card in the target on boot logs into the fabric with an
initiator role, I'm assuming in support of BIOS based SAN booting. The
client initiator picks up on this and re-activates the rport as being
in an INITIATOR role.

Loading the isp_mod driver causes the 2462 card in the target to be
shut down.  The client initiator picks up on this and times out the
rport, retaining the last rport state as INITIATOR.

Enabling target mode on the 2462 causes it to log back into the
fabric.  The client initiator picks up on the RSCN but refuses to
transition the rport from INITIATOR to TARGET state.  Without going
into TARGET state the remote port won't have SCSI device discovery
initiated against it and hence the SCST based storage is inaccessible.
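
The sequence described above can be replayed as a toy model.  This is
only a sketch of the symptom, not code from qla2xxx or the FC
transport: every name in it (toy_role, toy_event, toy_sticky,
toy_fresh, toy_replay, toy_reboot) is hypothetical, and the "sticky"
rule simply restates the observed behaviour, since the real cause had
not been identified at this point.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical names for this sketch; none of them come from qla2xxx. */
enum toy_role { TOY_ROLE_UNKNOWN, TOY_ROLE_INITIATOR, TOY_ROLE_TARGET };

/* What the initiator learns about the remote port at each step. */
enum toy_event { TOY_EV_PORT_GONE, TOY_EV_LOGIN_INITIATOR, TOY_EV_LOGIN_TARGET };

/* "Sticky" rule: the symptom as described, a cached INITIATOR role
 * survives a later login in the TARGET role. */
static enum toy_role toy_sticky(enum toy_role cached, enum toy_event ev)
{
    switch (ev) {
    case TOY_EV_LOGIN_INITIATOR:
        return TOY_ROLE_INITIATOR;
    case TOY_EV_LOGIN_TARGET:
        return cached == TOY_ROLE_INITIATOR ? cached : TOY_ROLE_TARGET;
    case TOY_EV_PORT_GONE:
    default:
        return cached;          /* a timeout keeps the last known role */
    }
}

/* "Fresh" rule: every login replaces the cached role. */
static enum toy_role toy_fresh(enum toy_role cached, enum toy_event ev)
{
    switch (ev) {
    case TOY_EV_LOGIN_INITIATOR:
        return TOY_ROLE_INITIATOR;
    case TOY_EV_LOGIN_TARGET:
        return TOY_ROLE_TARGET;
    case TOY_EV_PORT_GONE:
    default:
        return cached;
    }
}

typedef enum toy_role (*toy_rule)(enum toy_role, enum toy_event);

/* Replay a sequence of events, starting from a port known as a target. */
static enum toy_role toy_replay(toy_rule rule, const enum toy_event *evs,
                                size_t n)
{
    enum toy_role role = TOY_ROLE_TARGET;

    assert(rule != NULL && (evs != NULL || n == 0));
    for (size_t i = 0; i < n; i++)
        role = rule(role, evs[i]);
    return role;
}

/* Target reboot as described: port lost, BIOS login as initiator,
 * driver load shuts the card down, target mode is enabled. */
static const enum toy_event toy_reboot[] = {
    TOY_EV_PORT_GONE, TOY_EV_LOGIN_INITIATOR,
    TOY_EV_PORT_GONE, TOY_EV_LOGIN_TARGET,
};
```

Replaying toy_reboot with toy_sticky leaves the port in
TOY_ROLE_INITIATOR, which is the 'poisoned' state; with toy_fresh it
ends in TOY_ROLE_TARGET, which matches what the Windows initiators
appear to do.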

Activating a LIP on the client initiates a new fabric login attempt
which completes with the following message:

Jul 24 02:53:59 init-test kernel: rport-2:0-0: blocked FC remote port
time out: no longer a FCP target, removing starget

Which from a review of the source code seems consistent with our
analysis of the problem.

The culprit is the following code from drivers/scsi/scsi_transport_fc.c:

        if ((rport->port_state == FC_PORTSTATE_ONLINE) &&
            (rport->scsi_target_id != -1) &&
            !(rport->roles & FC_PORT_ROLE_FCP_TARGET)) {
                dev_printk(KERN_ERR, &rport->dev,
                        "blocked FC remote port time out: no longer"
                        " a FCP target, removing starget\n");
                spin_unlock_irqrestore(shost->host_lock, flags);
                scsi_target_unblock(&rport->dev);
                fc_queue_work(shost, &rport->stgt_delete_work);
                return;
        }

The above gets executed in response to the LIP on the initiator.  The
value in rport->roles reflects the remote port's earlier INITIATOR
role rather than its current TARGET role.

Windows client initiators running against the SCST targets get the
transition and login sequence correct.  When the SCST target is
re-activated after the cold boot those clients immediately re-discover
their storage while the Linux clients issue error messages about loss
of the remote target.

While all this doesn't seem to be technically a bug with SCST it
certainly is a problematic usage scenario.  It may also explain why
some individuals may have had problems getting SCST clients to access
their storage.

If a test SCST server was plugged into an active zone and turned on it
would immediately poison any Linux clients.  No amount of proper
configuration on the target would allow the client to access storage
until the client was rebooted or its drivers reloaded.

Any suggestions on how to move forward would be appreciated.  We've
got a pretty extensive test environment and would be happy to test run
any suggested changes or patches.

Once again a thank you to everyone who has contributed to SCST
development.  Other than this and a few additional glitches, which I
will follow up on in separate e-mails, it is presenting itself as a
very solid platform for storage delivery.

Best wishes for a pleasant weekend to everyone.

As always,
Dr. G.W. Wettstein, Ph.D.   Enjellic Systems Development, LLC.
4206 N. 19th Ave.           Specializing in information infra-structure
Fargo, ND  58102            development.
PH: 701-281-1686
FAX: 701-281-3949           EMAIL: greg@enjellic.com
------------------------------------------------------------------------------
"Much work remains to be done before we can announce our total failure
 to make any progress."
                                -- Mike Kelly


* Re: [Scst-devel] Poisoning of Linux initiators on SCST reboot.
  2008-07-24 17:50 Poisoning of Linux initiators on SCST reboot greg
@ 2008-07-25 10:40 ` Vladislav Bolkhovitin
  2008-07-25 10:43   ` Vladislav Bolkhovitin
  2008-07-25 13:11 ` Stanislaw Gruszka
  2008-07-25 13:45 ` Andrew Vasquez
  2 siblings, 1 reply; 5+ messages in thread
From: Vladislav Bolkhovitin @ 2008-07-25 10:40 UTC (permalink / raw)
  To: greg; +Cc: scst-devel, linux-driver, linux-scsi

greg@enjellic.com wrote:
> Good morning to everyone, hope your respective days are going well.
> Sorry for the wide cast on this but I wanted to get what would seem to
> be the concerned parties on this issue in the loop.
> 
> We have been putting SCST through an extensive round of pre-production
> testing.  I wanted to start following up on some of the issues we have
> noted.
> 
> We will be putting SCST into service to support mirrored storage from
> client initiators to two separate data-centers.  The filesystems on
> the client initiators access storage at the two data-centers via a
> Linux MD RAID1 device.  The SAN architecture is based on Cisco
> MDS-9509 switches.
> 
> Just as an aside for people considering use of SCST.  The core engine
> has been rock solid.  Our testing rounds consist of driving around
> 1/8 of a petabyte of widely disparate I/O types from multiple
> initiators to a pair of targets in the two data-centers.  SCST hasn't
> missed a beat so far, so kudos to Vlad and everyone involved in its
> development.

Thanks! Your support is very much appreciated and comes at exactly the
right time, because I'm going to submit SCST patches for review and
inclusion in the kernel next week.

> As we began forced failure testing one issue has come up that I wanted
> to advise people of.  A hard reboot of an SCST target server results
> in the 'poisoning' of Linux based initiators.  We verified the issue
> as being present on client initiators running the stock RHEL5 kernel
> up through 2.6.26.
> 
> The targets are using Qlogic 2462 cards using the isp_mod driver.  The
> client initiators are using Qlogic 2342 cards with the qla2xxx driver.
> 
> Failure mode is as follows:
> 
>         1.) Configure SCST based storage for an initiator (vdisk
>             based).
> 
>         2.) Activate initiator.  Initiator logs into fabric and
>             discovers SCST based storage.
> 
>         3.) Force SCST target failure by rebooting or pulling power.
> 
>         4.) SCST target returns to service and logs into zone.
> 
>         5.) Initiator picks up RSCN but re-activates the rport for
>             SCST server as an INITIATOR rather than TARGET role.
> 
> After this point in time the initiator is effectively 'poisoned'.
> 
> Nothing short of unloading and reloading the Qlogic 2xxx driver on the
> client initiator will allow the initiator to recognize the SCST server
> as a target device.  A driver unload/reload of course is not an option
> to restore connectivity since it would take the remaining live side of
> the mirror off-line as well.
> 
> We finally figured out what seems to be happening by watching the logs
> on the client and comparing what was going on there to the FLOGI login
> status on the fabric.
> 
> When the SCST target server reboots the initiator times out the remote
> port and places it into 'unknown' state.  The qla2xxx driver,
> according to the source code, maintains the previous rport state in
> driver internal data.
> 
> The 2462 card in the target on boot logs into the fabric with an
> initiator role, I'm assuming in support of BIOS based SAN booting. The
> client initiator picks up on this and re-activates the rport as being
> in an INITIATOR role.

You should be able to switch off this behavior by disabling the SAN 
booting in the card's BIOS.

> Loading the isp_mod driver causes the 2462 card in the target to be
> shut down.  The client initiator picks up on this and times out the
> rport, retaining the last rport state as INITIATOR.
> 
> Enabling target mode on the 2462 causes it to log back into the
> fabric.  The client initiator picks up on the RSCN but refuses to
> transition the rport from INITIATOR to TARGET state.  Without going
> into TARGET state the remote port won't have SCSI device discovery
> initiated against it and hence the SCST based storage is inaccessible.
> 
> Activating a LIP on the client initiates a new fabric login attempt
> which completes with the following message:
> 
> Jul 24 02:53:59 init-test kernel: rport-2:0-0: blocked FC remote port
> time out: no longer a FCP target, removing starget
> 
> Which from a review of the source code seems consistent with our
> analysis of the problem.
> 
> The culprit is the following code from drivers/scsi/scsi_transport_fc.c:
> 
>         if ((rport->port_state == FC_PORTSTATE_ONLINE) &&
>             (rport->scsi_target_id != -1) &&
>             !(rport->roles & FC_PORT_ROLE_FCP_TARGET)) {
>                 dev_printk(KERN_ERR, &rport->dev,
>                         "blocked FC remote port time out: no longer"
>                         " a FCP target, removing starget\n");
>                 spin_unlock_irqrestore(shost->host_lock, flags);
>                 scsi_target_unblock(&rport->dev);
>                 fc_queue_work(shost, &rport->stgt_delete_work);
>                 return;
>         }
> 
> The above gets executed in response to the LIP on the initiator.  The
> value in rport->roles reflects the remote port's earlier INITIATOR
> role rather than its current TARGET role.
> 
> Windows client initiators running against the SCST targets get the
> transition and login sequence correct.  When the SCST target is
> re-activated after the cold boot those clients immediately re-discover
> their storage while the Linux clients issue error messages about loss
> of the remote target.
> 
> While all this doesn't seem to be technically a bug with SCST it
> certainly is a problematic usage scenario.  It may also explain why
> some individuals may have had problems getting SCST clients to access
> their storage.
> 
> If a test SCST server was plugged into an active zone and turned on it
> would immediately poison any Linux clients.  No amount of proper
> configuration on the target would allow the client to access storage
> until the client was rebooted or its drivers reloaded.
> 
> Any suggestions on how to move forward would be appreciated.  We've
> got a pretty extensive test environment and would be happy to test run
> any suggested changes or patches.

I have also seen the Linux Qlogic qla2xxx driver "lose" remote ports
many times. That was on the target side, though, and I was never able
to pin down an exact test case for it. We also found a workaround that
is good enough for the target: using the INITIATOR PORT NAME field in
the ATIO IOCB for the lost ports.

So the qla2xxx driver definitely has problem(s) in this area. The fact
that Windows works well in this scenario only confirms that. But I'm
afraid the only way for you to deal with it is to fix the qla2xxx
driver itself. In my experience with Andrew Vasquez, the driver's
maintainer, you need something more compelling than problems with some
home-brewed target to get him interested. Otherwise your questions
will simply be ignored.


* Re: [Scst-devel] Poisoning of Linux initiators on SCST reboot.
  2008-07-25 10:40 ` [Scst-devel] " Vladislav Bolkhovitin
@ 2008-07-25 10:43   ` Vladislav Bolkhovitin
  0 siblings, 0 replies; 5+ messages in thread
From: Vladislav Bolkhovitin @ 2008-07-25 10:43 UTC (permalink / raw)
  To: greg; +Cc: linux-driver, scst-devel, linux-scsi

Vladislav Bolkhovitin wrote:
> [snip]
> 
> I have also seen the Linux Qlogic qla2xxx driver "lose" remote ports
> many times. That was on the target side, though, and I was never able
> to pin down an exact test case for it. We also found a workaround that
> is good enough for the target: using the INITIATOR PORT NAME field in
> the ATIO IOCB for the lost ports.
> 
> So the qla2xxx driver definitely has problem(s) in this area. The fact
> that Windows works well in this scenario only confirms that. But I'm
> afraid the only way for you to deal with it is to fix the qla2xxx
> driver itself.

Sorry, I meant "yourself".

> In my experience with Andrew Vasquez, the driver's maintainer, you
> need something more compelling than problems with some home-brewed
> target to get him interested. Otherwise your questions will simply be
> ignored.

* Re: Poisoning of Linux initiators on SCST reboot.
  2008-07-24 17:50 Poisoning of Linux initiators on SCST reboot greg
  2008-07-25 10:40 ` [Scst-devel] " Vladislav Bolkhovitin
@ 2008-07-25 13:11 ` Stanislaw Gruszka
  2008-07-25 13:45 ` Andrew Vasquez
  2 siblings, 0 replies; 5+ messages in thread
From: Stanislaw Gruszka @ 2008-07-25 13:11 UTC (permalink / raw)
  To: greg; +Cc: scst-devel, linux-driver, linux-scsi, vst

On Thursday 24 July 2008 19:50:51 greg@enjellic.com wrote:
> As we began forced failure testing one issue has come up that I wanted
> to advise people of.  A hard reboot of an SCST target server results
> in the 'poisoning' of Linux based initiators.  We verified the issue
> as being present on client initiators running the stock RHEL5 kernel
> up through 2.6.26.
>
> The targets are using Qlogic 2462 cards using the isp_mod driver.  The
> client initiators are using Qlogic 2342 cards with the qla2xxx driver.
[snip]
> Any suggestions on how to move forward would be appreciated.  We've
> got a pretty extensive test environment and would be happy to test run
> any suggested changes or patches.
You could try the isp_mod driver in initiator mode (ideally compiled
for the initiator role only) to see how it copes with this situation.

Cheers
Stanislaw Gruszka

* Re: Poisoning of Linux initiators on SCST reboot.
  2008-07-24 17:50 Poisoning of Linux initiators on SCST reboot greg
  2008-07-25 10:40 ` [Scst-devel] " Vladislav Bolkhovitin
  2008-07-25 13:11 ` Stanislaw Gruszka
@ 2008-07-25 13:45 ` Andrew Vasquez
  2 siblings, 0 replies; 5+ messages in thread
From: Andrew Vasquez @ 2008-07-25 13:45 UTC (permalink / raw)
  To: greg; +Cc: scst-devel, linux-driver, linux-scsi, stf_xl, vst

On Thu, 24 Jul 2008, greg@enjellic.com wrote:

> Good morning to everyone, hope your respective days are going well.
> Sorry for the wide cast on this but I wanted to get what would seem to
> be the concerned parties on this issue in the loop.
<snip>
> The targets are using Qlogic 2462 cards using the isp_mod driver.  The
> client initiators are using Qlogic 2342 cards with the qla2xxx driver.
> 
> Failure mode is as follows:
> 
>         1.) Configure SCST based storage for an initiator (vdisk
>             based).
> 
>         2.) Activate initiator.  Initiator logs into fabric and
>             discovers SCST based storage.
> 
>         3.) Force SCST target failure by rebooting or pulling power.
> 
>         4.) SCST target returns to service and logs into zone.
> 
>         5.) Initiator picks up RSCN but re-activates the rport for
>             SCST server as an INITIATOR rather than TARGET role.
> 
> After this point in time the initiator is effectively 'poisoned'.
> 
> Nothing short of unloading and reloading the Qlogic 2xxx driver on the
> client initiator will allow the initiator to recognize the SCST server
> as a target device.  A driver unload/reload of course is not an option
> to restore connectivity since it would take the remaining live side of
> the mirror off-line as well.
> 
> We finally figured out what seems to be happening by watching the logs
> on the client and comparing what was going on there to the FLOGI login
> status on the fabric.
> 
> When the SCST target server reboots the initiator times out the remote
> port and places it into 'unknown' state.  The qla2xxx driver,
> according to the source code, maintains the previous rport state in
> driver internal data.

Ok, thanks for the detailed description of the problem...

> The 2462 card in the target on boot logs into the fabric with an
> initiator role, I'm assuming in support of BIOS based SAN booting. The
> client initiator picks up on this and re-activates the rport as being
> in an INITIATOR role.

Yes, BIOS would FLOGI into the switch...  RSCN received on initiator
side, and the role registered for the rport would have been migrated
from target->initiator... (Step 1)

> Loading the isp_mod driver causes the 2462 card in the target to be
> shut down.  The client initiator picks up on this and times out the
> rport, retaining the last rport state as INITIATOR.

Ok, I would have expected this to at least start when the BIOS FLOGI'd
into the switch above...

> Enabling target mode on the 2462 causes it to log back into the
> fabric.  The client initiator picks up on the RSCN but refuses to
> transition the rport from INITIATOR to TARGET state.

Ok, so on the initiator side, I'd expect an RSCN, then PLOGI and PRLI
to the target side, the bits from the PRLI response processed, the
role starting as UNKNOWN during fc_remote_port_add() and then moving
to TARGET during fc_remote_port_rolechg(). (Step 2)

> Without going
> into TARGET state the remote port won't have SCSI device discovery
> initiated against it and hence the SCST based storage is inaccessible.

Ok, could you provide the kernel log of the full failure with the
qla2xxx driver loaded with the ql2xextended_error_logging module
parameter set to 1?

> Activating a LIP on the client initiates a new fabric login attempt
> which completes with the following message:
> 
> Jul 24 02:53:59 init-test kernel: rport-2:0-0: blocked FC remote port
> time out: no longer a FCP target, removing starget
> 
> Which from a review of the source code seems consistent with our
> analysis of the problem.
> 
> The culprit is the following code from drivers/scsi/scsi_transport_fc.c:
> 
>         if ((rport->port_state == FC_PORTSTATE_ONLINE) &&
>             (rport->scsi_target_id != -1) &&
>             !(rport->roles & FC_PORT_ROLE_FCP_TARGET)) {
>                 dev_printk(KERN_ERR, &rport->dev,
>                         "blocked FC remote port time out: no longer"
>                         " a FCP target, removing starget\n");
>                 spin_unlock_irqrestore(shost->host_lock, flags);
>                 scsi_target_unblock(&rport->dev);
>                 fc_queue_work(shost, &rport->stgt_delete_work);
>                 return;
>         }

I would have expected this during step 1 (guess it depends on timing
latency during reboot/BIOS-flogi_plogi/isp_mod-load)...

> The above gets executed in response to the LIP on the initiator.  The
> value in rport->roles reflects the remote port's earlier INITIATOR
> role rather than its current TARGET role.

Ok, so at step-1, the fc_remote_port_add() should have fallen into
this code:

        ...
        /* was a target, not in roles */
        if ((rport->scsi_target_id != -1) &&
            (!(ids->roles & FC_PORT_ROLE_FCP_TARGET)))
                return rport;

since role is unknown... then transitioned to initiator during
rolechg().

then at step-2, the same process during fc_remote_port_add() (since
the transition to target is again deferred to rolechg()), then during
rolechg() I'd expect the transport would fall into the 'else if' here:

        spin_lock_irqsave(shost->host_lock, flags);
        if (roles & FC_PORT_ROLE_FCP_TARGET) {
                if (rport->scsi_target_id == -1) {
                        rport->scsi_target_id = fc_host->next_target_id++;
                        create = 1;
                } else if (!(rport->roles & FC_PORT_ROLE_FCP_TARGET))
                        create = 1;

Hmm, but that doesn't seem to be the case here...

Let's start with the driver logs, just so I get a full picture of at
least what's happening with qla2xxx on the wire side.

Thanks, AV
