Re: [SRP] [RFC] Needed changes to support fail-over drivers

public inbox for linux-scsi@vger.kernel.org

From: Roland Dreier <rdreier@cisco.com>
To: Ishai Rabinovitz <ishai@mellanox.co.il>
Cc: linux-scsi@vger.kernel.org, openib-general@openib.org,
	Roland Dreier <rolandd@cisco.com>,
	vu@mellanox.com
Subject: Re: [SRP] [RFC] Needed changes to support fail-over drivers
Date: Mon, 24 Jul 2006 15:34:14 -0700
Message-ID: <ada8xmiiqe1.fsf@cisco.com>
In-Reply-To: <20060724165602.GA8600@mellanox.co.il> (Ishai Rabinovitz's message of "Mon, 24 Jul 2006 19:56:02 +0300")

[CC'ing linux-scsi as well -- I think we'll get better insight from there]

 > The current SRP initiator code cannot work with several fail-over mechanisms. 
 > 
 > The current srp driver's behavior when a target goes offline and then comes
 > back online:
 > 1) The target goes offline.
 > 2) The initiator tries to reconnect and fails.
 > 3) The initiator calls srp_remove_work, which removes the scsi_host.
 > 4) The target comes back online.
 > 5) The user (or the ibsrpdm daemon) is expected to execute a new add_target.
 > 6) This creates a new scsi_host for this target, with new device names and a
 > new index in the scsi_host directory in sysfs.
 > 
 > Fail-over drivers (e.g., MPP, used by Engenio, and XVM, used by SGI) have
 > problems with this behavior (item 3). They need the scsi_host to continue to
 > exist, returning errors in the meantime, until the connection to the target
 > resumes.

OK, but is this a valid assumption?  What happens for iSCSI and/or iSER?

 > In addition, removing and re-allocating the scsi_host is a "heavy" operation
 > compared with only disconnecting and reconnecting the connection.
 > 
 > To support these tools, I propose the following changes, which allow the user
 > to move the srp initiator to a disconnected state (when the target leaves the
 > fabric) and reconnect it later (when the target returns to the fabric).

Seems OK but see below...

 > Once these changes are in the ib_srp module, the ibsrpdm daemon will be able
 > to monitor the presence of targets in the fabric and use this interface when
 > targets leave or rejoin the fabric.

How does the daemon know when something is gone for good vs. when it
might come back?

 > Here is a description of the new design (I have already implemented most of
 > the code):
 > 
 > 1) Split the function srp_reconnect_target into two functions:
 > _srp_disconnect_target and _srp_reconnect_target 
 > 
 > 2) Add two new states: SRP_TARGET_DISCONNECTED (the state after
 > _srp_disconnect_target has run and before _srp_reconnect_target runs) and
 > SRP_TARGET_DISCONNECTING (the state while in srp_remove_target).
 > 
 > 3) Add new input files in sysfs:
 > /sys/class/scsi_host/host?/{disconnect_target,connect_target,erase_target}
 > 
 > 4) Writing the string "remove" to /sys/class/scsi_host/host?/disconnect_target
 > calls srp_disconnect_target, which closes the CM connection, resets all
 > pending requests, and moves the corresponding target to the
 > SRP_TARGET_DISCONNECTED state. From then on, when the SCSI mid-layer calls
 > queuecommand for this host, the result is DID_NO_CONNECT. This causes the
 > mid-layer to return an I/O error to the user without starting the SCSI error
 > recovery chain.

Why does userspace need to be able to disconnect a connection?

 > 5) Writing anything to /sys/class/scsi_host/host?/reconnect_target calls
 > _srp_reconnect_target, which moves the target back to the SRP_TARGET_LIVE
 > state.
 > 
 > 6) Writing "erase" to /sys/class/scsi_host/host?/erase_target calls
 > srp_remove_work, which removes the scsi_host.

Why the asymmetry here?  In other words, why does anything work for
reconnect_target but only the literal "erase" work for erase_target?
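To make the asymmetry concrete, here is a small model of how the three writes in items 4-6 would be interpreted under the rules exactly as written. It is a hypothetical sketch, not the driver's store handlers: model_parse_write(), model_matches() and the MODEL_ACTION_* values are invented names. Item 3 names the file connect_target, while items 5 and 7 use reconnect_target; the model uses reconnect_target. It also assumes that one trailing newline, as appended by echo, is ignored.

```c
#include <assert.h>
#include <string.h>

/* Hypothetical actions a sysfs store handler could dispatch to. */
enum model_action {
	MODEL_ACTION_NONE,        /* write is ignored */
	MODEL_ACTION_DISCONNECT,  /* item 4: srp_disconnect_target */
	MODEL_ACTION_RECONNECT,   /* item 5: _srp_reconnect_target */
	MODEL_ACTION_ERASE,       /* item 6: srp_remove_work */
};

/* A store method receives the written bytes, which usually end with the
 * newline that echo appends; drop one trailing newline, then compare. */
static int model_matches(const char *buf, const char *word)
{
	size_t n = strlen(buf);

	if (n > 0 && buf[n - 1] == '\n')
		n--;
	return n == strlen(word) && strncmp(buf, word, n) == 0;
}

/* Map (attribute name, written string) to an action per items 4-6. */
static enum model_action model_parse_write(const char *attr, const char *buf)
{
	assert(attr != NULL && buf != NULL);
	if (strcmp(attr, "disconnect_target") == 0)
		return model_matches(buf, "remove") ? MODEL_ACTION_DISCONNECT
						    : MODEL_ACTION_NONE;
	if (strcmp(attr, "reconnect_target") == 0)
		return MODEL_ACTION_RECONNECT;   /* any contents are accepted */
	if (strcmp(attr, "erase_target") == 0)
		return model_matches(buf, "erase") ? MODEL_ACTION_ERASE
						   : MODEL_ACTION_NONE;
	return MODEL_ACTION_NONE;
}
```

In this model, a stray write to reconnect_target always triggers a reconnect. The other two files act only on a specific keyword. Treating all three files the same way, whether by always requiring a keyword or never requiring one, would remove the asymmetry.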

 > 7) Add output files in sysfs that show the HCA and port the initiator used to
 > connect to the target. Using these files and the target GUID, ibsrpdm can
 > determine on which scsi_host to perform reconnect_target.
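The lookup described in item 7 could look roughly like the following. This is a sketch under stated assumptions. struct model_host_info and model_find_host() are hypothetical. The fields stand for the proposed HCA and port attributes plus the target GUID, already read from sysfs. The match includes the port because the same target reachable through two local ports would appear as two separate scsi_hosts, and a multipath setup needs to keep them apart.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Per-scsi_host data ibsrpdm could read from the proposed sysfs files;
 * field names are hypothetical. */
struct model_host_info {
	int host_no;            /* N in /sys/class/scsi_host/hostN */
	const char *hca;        /* local HCA name, e.g. "mthca0" */
	unsigned int port;      /* local HCA port number */
	uint64_t target_guid;   /* GUID of the target this host connected to */
};

/* Return the host number to reconnect when a target reappears on
 * (hca, port), or -1 if no existing scsi_host matches. */
static int model_find_host(const struct model_host_info *hosts, size_t n,
			   const char *hca, unsigned int port, uint64_t guid)
{
	assert(hosts != NULL || n == 0);
	for (size_t i = 0; i < n; i++) {
		if (hosts[i].target_guid == guid &&
		    hosts[i].port == port &&
		    strcmp(hosts[i].hca, hca) == 0)
			return hosts[i].host_no;
	}
	return -1;
}
```

A result of -1 would presumably mean the target is new, in which case the daemon falls back to add_target as it does today.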
