Re: [PATCH] scsi: Allow error handling timeout to be specified

public inbox for linux-scsi@vger.kernel.org

From: Jeremy Linton <jlinton@tributary.com>
To: Hannes Reinecke <hare@suse.de>
Cc: Baruch Even <baruch@ev-en.org>, emilne <emilne@redhat.com>,
	"Martin K. Petersen" <martin.petersen@oracle.com>,
	linux-scsi <linux-scsi@vger.kernel.org>,
	michaelc <michaelc@cs.wisc.edu>
Subject: Re: [PATCH] scsi: Allow error handling timeout to be specified
Date: Mon, 13 May 2013 09:40:14 -0500
Message-ID: <5190FB4E.4000900@tributary.com>
In-Reply-To: <51907E45.7010409@suse.de>

On 5/13/2013 12:46 AM, Hannes Reinecke wrote:

> True. But at the end of the day, we _do_ want to recover the failed LUN.
> If we were to disable that faulty LUN and continue running with the others
> we won't have a chance of _ever_ recovering that one LUN.

	I don't buy this. Especially for FC devices, the vast majority of errors I see
are related to zoning, SFP, and cabling problems. Once one of those happens you
tend to get a lot of shotgun debugging, which injects all kinds of further
errors. None of these errors are fixed by the Linux error recovery paths.

	That said, if the admin fixes something, for FC/SAS (and potentially others)
you _WILL_ get notification that the device is online again.


> SET when the link is down). So we basically _have_ to escalate it to the
> next level. Even though that will mean to stop I/O to other, hitherto
> unaffected instances.

	And a single failure turns into performance bubbles and further errors on
other devices, particularly if the functional devices are stateful and the
error recovery mechanism isn't sufficiently intelligent about that state (see
tape drives). Think about what happens when a marginal SFP on a target causes
a device to repeatedly drop off and reappear at some random point in the future.


	Anyway, it is possible to make a determination about the topology and make
decisions about the likelihood of any given portion being at fault. For
example, if one LUN on a target has failed and the remainder continue to work,
then it's unlikely that, if abort and LUN reset fail, anything higher up in
the stack is going to succeed.

	I feel pretty strongly that at that point you're better off providing good
diagnostics about the failure and expecting user interaction rather than
muddying the waters by causing other device interruptions. If the user tries
everything and determines that an HBA reset is the right choice, provide that
option; don't do it for them.

	If every device attached to the HBA fails, then resetting the HBA is a valid
choice, but not before. The same goes for I_T.
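
	To make the policy above concrete, here is a rough sketch in C. It is not
SCSI midlayer code: the enum, the struct, and both function names are
hypothetical, and it assumes the caller can already count how many devices
behind a target (I_T) and behind the HBA are currently failing.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical escalation levels, ordered from narrowest to widest scope. */
enum eh_level {
	EH_ABORT,
	EH_LUN_RESET,
	EH_TARGET_RESET,	/* I_T scope */
	EH_HOST_RESET		/* HBA scope */
};

/* Hypothetical health counts for one scope, gathered by the caller. */
struct scope_health {
	unsigned int total;	/* devices inside this scope */
	unsigned int failed;	/* how many of them are currently failing */
};

/* A scope-wide reset is only justified when everything in the scope fails. */
static bool scope_all_failed(const struct scope_health *s)
{
	assert(s->failed <= s->total);
	return s->total > 0 && s->failed == s->total;
}

/*
 * Return the widest recovery step that may run without user interaction.
 * Abort and LUN reset only touch the failing LUN, so they are always allowed.
 */
static enum eh_level max_auto_level(const struct scope_health *target,
				    const struct scope_health *host)
{
	if (!scope_all_failed(target))
		return EH_LUN_RESET;	/* working siblings: stop here */
	if (!scope_all_failed(host))
		return EH_TARGET_RESET;	/* whole target down, HBA still useful */
	return EH_HOST_RESET;		/* nothing behind the HBA works */
}
```

	With one failed LUN out of four on a target, max_auto_level() stops at
EH_LUN_RESET, and anything wider is left for the admin to trigger by hand.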
