Re: [PATCH] 0/3: Fix EH problems in libsas and implement more error handling


From: Douglas Gilbert <dougg@torque.net>
To: "Darrick J. Wong" <djwong@us.ibm.com>
Cc: linux-scsi <linux-scsi@vger.kernel.org>,
	Linux Kernel Mailing List <linux-kernel@vger.kernel.org>,
	Alexis Bruemmer <alexisb@us.ibm.com>
Subject: Re: [PATCH] 0/3: Fix EH problems in libsas and implement more error handling
Date: Mon, 30 Oct 2006 20:57:01 -0500
Message-ID: <4546AD6D.3080605@torque.net>
In-Reply-To: <45468845.20400@us.ibm.com>

Darrick J. Wong wrote:
> Hi all,
> 
> The following three patches are early drafts of a series of patches to
> fix error handling in libsas so that the scsi_eh_* functions are called
> so that we can attempt to retry failed commands later.  There is also a
> patch to aic94xx to make sure escb errors are detected correctly,
> REQ_TASK_ABORT is handled, and the beginnings of a handler for
> REQ_DEVICE_RESET.
> 
> However, there are a number of issues with these patches that I wish to
> bring to the attention of this mailing list for further input:
> 
> First, the aic94xx sequencer can send back an ESCB with an error code of
> "REQ_TASK_ABORT", which means that the kernel has to send an ABORT TASK
> TMF to sequencer to unjam things.  Until this happens, the sequencer
> neither services commands nor sends back completions.  If we want to
> wait for the error handler to send the ABORT TASK, we end up waiting for
> _all_ pending commands to time out so that the EH can wake up.  This
> effectively stalls the system for 30 seconds every time we see
> REQ_TASK_ABORT.
> 
> On the assumption that we'd like to get on with things sooner than
> later, the current iteration of these patches aborts the task as soon as
> possible so that the other pending commands will flush out on their own.
>  However, this also necessitates the addition of a new sas_task flag
> (SAS_TASK_INITIATOR_ABORTED) to indicate "Task aborted, but still
> waiting for the EH to call task_done."  From what I can tell,
> SAS_TASK_STATE_ABORTED means that the task will be lldd_abort_task'd by
> the EH at some point, but does not indicate if that has been done yet,
> and SAS_TASK_STATE_DONE is set after everything is done.
> 
> The second issue is the manual decrementing of shost->host_failed in the
> error handler.  So long as we use the scsi_eh_* commands this value is
> decremented automatically--however, it appears that sas_scsi_clear_* is
> pulling scsi_cmnds off the error queue and ... dropping them so that
> they never go through the error handler.  Is this a desirable behavior,
> or am I reading the code incorrectly?  Or...?
> 
> The third pertains to REQ_DEVICE_RESET: I've not yet figured out how to
> reset a device port as has been hinted that I must do.  I don't know if
> a phy reset is sufficient or if I'm barking up the wrong tree.

Darrick,
REQ_DEVICE_RESET would seem to translate in SAS to a
hard reset (which is a specialization of link reset).
Hard reset has the effect of resetting the target device
and all logical units attached to that target.

Hard resets are sent by telling the phy attached to
the SSP target in question to do a hard reset.
There are two cases:
  a) the SSP target device (to reset) is attached to an HBA
  b) the SSP target device is attached to an expander

In case a) you need to get a phy in the HBA to do
a hard reset. Is that functionality available in the SAS
transport layer? [If not it should be.]

In case b) you need to send an SMP PHY CONTROL function
with phy_operation=hard_reset to the appropriate phy
on the expander.
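
A minimal sketch of the request frame for case b), assuming the
SAS-1.1 PHY CONTROL layout as I read it (frame type 40h, function 91h,
phy identifier in byte 9, phy operation in byte 10, hard reset = 02h).
The names build_phy_control and phy_op are made up for illustration and
are not libsas symbols; the CRC is left zero on the assumption that the
HBA or transport fills it in.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Assumed values, per my reading of the SAS-1.1 SMP PHY CONTROL request. */
#define SMP_FRAME_TYPE_REQUEST 0x40
#define SMP_FN_PHY_CONTROL     0x91
#define PHY_CONTROL_REQ_LEN    44	/* 40 bytes of request + 4-byte CRC */

/* Hypothetical names for the phy operation codes. */
enum phy_op {
	PHY_OP_NOP        = 0x00,
	PHY_OP_LINK_RESET = 0x01,
	PHY_OP_HARD_RESET = 0x02,
};

/*
 * Fill 'frame' with a PHY CONTROL request aimed at phy 'phy_id' of the
 * expander. All other fields, including the trailing CRC, stay zero.
 */
static void build_phy_control(uint8_t frame[PHY_CONTROL_REQ_LEN],
			      uint8_t phy_id, enum phy_op op)
{
	assert(frame != NULL);

	memset(frame, 0, PHY_CONTROL_REQ_LEN);
	frame[0]  = SMP_FRAME_TYPE_REQUEST;
	frame[1]  = SMP_FN_PHY_CONTROL;
	frame[9]  = phy_id;		/* expander phy facing the target */
	frame[10] = (uint8_t)op;	/* PHY_OP_HARD_RESET for this case */
}
```

Calling build_phy_control(frame, phy_id, PHY_OP_HARD_RESET) and handing
the buffer to whatever SMP request path the transport provides would be
the intended use; swapping in PHY_OP_LINK_RESET gives the milder link
reset.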

For wide links the hard reset can be sent on any phy
that is part of the wide link.
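
Putting the two cases and the wide link rule together, the choice of
where to send the hard reset could be sketched roughly as below. The
types target_link and reset_plan and the function plan_hard_reset are
hypothetical, invented for this illustration only; they do not
correspond to existing libsas or SAS transport structures.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical topology description, for illustration only. */
enum parent_kind { PARENT_HBA, PARENT_EXPANDER };

struct target_link {
	enum parent_kind parent;	/* what the SSP target is attached to */
	const int *phys;		/* parent-side phy ids forming the link */
	size_t nphys;			/* more than 1 means a wide link */
};

enum reset_path { RESET_VIA_LOCAL_PHY, RESET_VIA_SMP_PHY_CONTROL };

struct reset_plan {
	enum reset_path path;		/* case a) or case b) */
	int phy_id;			/* phy that is told to hard reset */
};

/* Decide how to deliver a hard reset to the target behind 'link'. */
static struct reset_plan plan_hard_reset(const struct target_link *link)
{
	struct reset_plan plan;

	assert(link != NULL && link->nphys > 0);

	plan.path = (link->parent == PARENT_HBA)
			? RESET_VIA_LOCAL_PHY
			: RESET_VIA_SMP_PHY_CONTROL;
	/* Any member phy of a wide link will do, so take the first. */
	plan.phy_id = link->phys[0];
	return plan;
}
```

Taking the first phy is an arbitrary choice in this sketch; a real
implementation might prefer a phy that is known to be up.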


As for link resets, as far as I can see they perturb
the lower-level SAS state machines at both ends of
a physical link without having an impact on the
higher-level SAS state machines.

Doug Gilbert
