From: Hannes Reinecke <hare@suse.de>
To: Baruch Even <baruch@ev-en.org>
Cc: emilne <emilne@redhat.com>,
"Martin K. Petersen" <martin.petersen@oracle.com>,
linux-scsi <linux-scsi@vger.kernel.org>,
michaelc <michaelc@cs.wisc.edu>
Subject: Re: [PATCH] scsi: Allow error handling timeout to be specified
Date: Mon, 13 May 2013 07:46:45 +0200
Message-ID: <51907E45.7010409@suse.de>
In-Reply-To: <CAC9+anKxnDBYh15uwQQoTUzGZkwUe6wuV=8wf6NUVsC4+_TUgw@mail.gmail.com>
On 05/10/2013 09:27 PM, Baruch Even wrote:
> On Fri, May 10, 2013 at 11:18 PM, Hannes Reinecke <hare@suse.de> wrote:
>> On 05/10/2013 07:51 PM, Baruch Even wrote:
>>>
>>> The error handling I have in mind (admittedly, not fully thought out)
>>> should work for both FC and SAS. Currently the error recovery
>>> progresses at the host level regardless of if the errors are on one
>>> device or all of them, it also stops the IOs on all devices and LUNs.
>>> It would be nice if that was taken into account. My ideas may be more
>>> suitable to the environment I work in (enterprise storage devices
>>> rather than hosts) but I believe the same approach would benefit the
>>> hosts as well.
>>>
>>> It would be interesting to see what approach the new error handling will
>>> take.
>>>
>> So, my general idea is this:
>>
>> 1) Send command aborts from scsi_times_out(). There is no requirement
>> on stopping I/O on the host simply because a single command times
>> out. And as scsi_times_out() is run from a separate thread anyway
>> we should be able to send ABORT TASK TMFs without a problem
>> 2) Modify recovery sequence.
>> One of the major pitfalls of the current scsi_eh is that it
>> spills over onto unrelated LUNs for higher levels. So for the
>> new EH we should be using a sequence of
>> - ABORT TASK
>> - ABORT TASK SET
>> - (Terminate I_T nexus)
>> - (Host reset)
>> 'Terminate I_T nexus' for FibreChannel is equivalent to a LOGO.
>> 'Host reset' is the current host reset function.
>> 3) Finegrained recovery setting.
>> There is no need to stop the entire host when doing a recovery;
>> it should be sufficient to stop I/O to the unit
>> (LUN, I_T nexus, host) when the error recovery is at the
>> respective level.
>
> This looks great and much in line with what I'm thinking.
>
> What about not going to the higher level if not everything at that
> level had failed?
> I mean that if at the target not all LUNs failed it will be quite
> troublesome to other LUNs if I-T-Nexus is terminated and that at the
> host level if there are still targets that are functioning it will
> kill them too to reset the host.
>
True. But at the end of the day, we _do_ want to recover the failed
LUN. If we were to disable that faulty LUN and continue running with
the others, we wouldn't have a chance of _ever_ recovering that one LUN.
Plus we have to keep in mind that the attempted error recovery may
have failed for totally unrelated reasons (e.g. sending an ABORT TASK
SET when the link is down). So we basically _have_ to escalate it
to the next level, even though that will mean stopping I/O to other,
hitherto unaffected instances.
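
To make the escalation concrete, here is a rough sketch in plain C.
This is not kernel code: every name in it (eh_level, eh_scope,
eh_block_scope(), eh_recover(), eh_demo_try()) is made up for
illustration, and the mapping from recovery level to blocked scope is
only one possible reading of point 3) above.

```c
#include <stdbool.h>

/* Hypothetical recovery levels, in the order proposed in this thread. */
enum eh_level {
	EH_ABORT_TASK,		/* abort the single timed-out command */
	EH_ABORT_TASK_SET,	/* abort all commands on the LUN */
	EH_TERMINATE_NEXUS,	/* terminate the I_T nexus (LOGO on FC) */
	EH_HOST_RESET,		/* reset the whole host */
	EH_GIVE_UP		/* nothing left to try */
};

/* Unit whose I/O has to be stopped while a given level is running. */
enum eh_scope { SCOPE_NONE, SCOPE_LUN, SCOPE_NEXUS, SCOPE_HOST };

/* Assumed mapping: stop I/O only for the unit being recovered. */
static enum eh_scope eh_block_scope(enum eh_level level)
{
	switch (level) {
	case EH_ABORT_TASK_SET:
		return SCOPE_LUN;
	case EH_TERMINATE_NEXUS:
		return SCOPE_NEXUS;
	case EH_HOST_RESET:
		return SCOPE_HOST;
	default:
		/* A single ABORT TASK does not need to stop anything. */
		return SCOPE_NONE;
	}
}

/* Callback that attempts one recovery level; true means it worked. */
typedef bool (*eh_try_fn)(enum eh_level level, void *ctx);

/*
 * Walk up the levels until one succeeds. A failed level always
 * escalates, because the failure may have an unrelated cause.
 * Returns the level that succeeded, or EH_GIVE_UP if none did.
 */
static enum eh_level eh_recover(eh_try_fn try_level, void *ctx)
{
	int level;

	for (level = EH_ABORT_TASK; level < EH_GIVE_UP; level++) {
		if (try_level((enum eh_level)level, ctx))
			return (enum eh_level)level;
	}
	return EH_GIVE_UP;
}

/* Demo stub: every level below *ctx fails, everything else succeeds. */
static bool eh_demo_try(enum eh_level level, void *ctx)
{
	enum eh_level first_ok = *(enum eh_level *)ctx;

	return level >= first_ok;
}
```

With eh_demo_try() and a context of EH_TERMINATE_NEXUS, eh_recover()
tries ABORT TASK and ABORT TASK SET first and only then reaches the
nexus level; eh_block_scope() for that level names the I_T nexus, not
the whole host.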
Cheers,
Hannes
--
Dr. Hannes Reinecke zSeries & Storage
hare@suse.de +49 911 74053 688
SUSE LINUX Products GmbH, Maxfeldstr. 5, 90409 Nürnberg
GF: J. Hawn, J. Guild, F. Imendörffer, HRB 16746 (AG Nürnberg)