From mboxrd@z Thu Jan 1 00:00:00 1970
From: Mike Christie
Subject: Re: fc_remote_port_delete and returning SCSI commands from LLD
Date: Wed, 21 Oct 2009 13:11:15 -0500
Message-ID: <4ADF4EC3.6010506@cs.wisc.edu>
References: <20091020144027.GA17717@schmichrtp.de.ibm.com>
Mime-Version: 1.0
Content-Type: text/plain; charset=ISO-8859-1; format=flowed
Content-Transfer-Encoding: 7bit
Return-path:
Received: from sabe.cs.wisc.edu ([128.105.6.20]:47222 "EHLO sabe.cs.wisc.edu"
	rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP
	id S1754057AbZJUSLS (ORCPT ); Wed, 21 Oct 2009 14:11:18 -0400
In-Reply-To: <20091020144027.GA17717@schmichrtp.de.ibm.com>
Sender: linux-scsi-owner@vger.kernel.org
List-Id: linux-scsi@vger.kernel.org
To: Christof Schmitt
Cc: linux-scsi@vger.kernel.org

Christof Schmitt wrote:
> I am looking again at how and when a FC LLD should call
> fc_remote_port_delete. Some help would be welcome to cover all
> requirements and to plug the holes...
>
> One scenario i am looking at: The connection to the HBA has been
> temporarily lost and the LLD has to return all pending I/O requests to
> the upper layers, so they can be retried later. Now with the SCSI
> device being part of a multipath device, the first failed I/O request
> triggers path failover:
>
> multipath_end_io
> do_end_io
> fail_path
> queue_work(kmultipathd, &pgpath->deactivate_path);
>
> which then marks the following returned requests as timed out:
>
> deactivate_path
> blk_abort_queue
> blk_abort_request
> blk_rq_timed_out
> scsi_times_out
> fc_timed_out
>
> If the remote_port status is not BLOCKED, this will trigger the SCSI
> midlayer error handling which cannot do much during the interruption
> to the hardware and will mark the SCSI devices 'offline'.
> In order to prevent this, the rule would be: First call
> fc_remote_port_delete to set the remote port (or in the case of an
> HBA interruption all remote ports) to BLOCKED, and only after this
> step call scsi_done to pass the SCSI commands back to the upper
> layers.
>

One other note when doing this: for problems where you are deleting the
rport, it is best to use something like DID_TRANSPORT_DISRUPTED to fail
the cmd if you are failing it right away.

If a driver blocks the rport and then fails commands immediately with
DID_TRANSPORT_DISRUPTED, the commands are not actually failed to the
block/mpath layer until the fast io fail timeout has fired. This
prevents very short problems from triggering the multipath path
offlining code.

If your driver deletes the rport and does not fail the cmd immediately
(so it can recover within the command, or for some other reason, like
the fw just works that way), then by the time the fast io fail timer
fires and the terminate_rport_io callback is run, you could actually use
any error code. At that point, when an IO is sent to queuecommand, the
driver will call fc_remote_port_chkready and the IO will be failed
immediately with DID_TRANSPORT_FAILFAST.
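
To make the ordering concrete, here is a rough user-space model in C of
the behaviour described above. It is only a sketch and not kernel code:
every type and function name in it (rport_model, model_rport_delete,
model_complete, model_chkready, and so on) is made up for illustration,
and the enum values are placeholders rather than the real host byte
codes. It also assumes multipath I/O that is not retried by the
midlayer, and the HB_RETRY_LATER case is an assumption that the mail
above does not cover.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-ins for the port states and host bytes in the mail. */
enum port_state { PORT_ONLINE, PORT_BLOCKED };
enum host_byte { HB_OK, HB_RETRY_LATER, HB_TRANSPORT_DISRUPTED,
                 HB_TRANSPORT_FAILFAST };
enum outcome { OUT_DONE, OUT_HELD_FOR_RETRY, OUT_FAILED_TO_MPATH };

struct rport_model {
    enum port_state state;
    bool fast_io_fail_fired;
};

/* Models the effect of fc_remote_port_delete(): the port becomes BLOCKED. */
static void model_rport_delete(struct rport_model *rp)
{
    assert(rp != NULL);
    rp->state = PORT_BLOCKED;
}

/* Models the fast io fail timer expiring while the port is still blocked. */
static void model_fast_io_fail(struct rport_model *rp)
{
    assert(rp != NULL);
    rp->fast_io_fail_fired = true;
}

/* What the upper layers see when the LLD completes a command with 'hb'. */
static enum outcome model_complete(const struct rport_model *rp,
                                   enum host_byte hb)
{
    assert(rp != NULL);
    if (hb == HB_OK)
        return OUT_DONE;
    /* Blocked port + DISRUPTED: held back until fast io fail fires. */
    if (hb == HB_TRANSPORT_DISRUPTED && rp->state == PORT_BLOCKED &&
        !rp->fast_io_fail_fired)
        return OUT_HELD_FOR_RETRY;
    /* Simplification: anything else goes straight up to multipath. */
    return OUT_FAILED_TO_MPATH;
}

/* Models the check a driver makes at queuecommand time. */
static enum host_byte model_chkready(const struct rport_model *rp)
{
    assert(rp != NULL);
    if (rp->state != PORT_BLOCKED)
        return HB_OK;
    /* After the timer, new I/O is failed fast, as described in the mail. */
    if (rp->fast_io_fail_fired)
        return HB_TRANSPORT_FAILFAST;
    return HB_RETRY_LATER; /* assumption: not described in the mail */
}

/* HBA interruption: returns what the upper layers see for one pending
 * command, depending on whether the port is blocked before completion. */
static enum outcome model_hba_interruption(bool block_first)
{
    struct rport_model rp = { PORT_ONLINE, false };
    enum outcome seen;

    if (block_first)
        model_rport_delete(&rp);
    seen = model_complete(&rp, HB_TRANSPORT_DISRUPTED);
    if (!block_first)
        model_rport_delete(&rp);
    return seen;
}
```

In this model, model_hba_interruption(true) yields OUT_HELD_FOR_RETRY,
while model_hba_interruption(false) yields OUT_FAILED_TO_MPATH, which is
the case where the first returned command already triggers path
failover.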