From: Hannes Reinecke <hare@suse.de>
To: Baruch Even <baruch@ev-en.org>
Cc: emilne <emilne@redhat.com>,
"Martin K. Petersen" <martin.petersen@oracle.com>,
linux-scsi <linux-scsi@vger.kernel.org>,
michaelc <michaelc@cs.wisc.edu>
Subject: Re: [PATCH] scsi: Allow error handling timeout to be specified
Date: Mon, 13 May 2013 07:46:45 +0200
Message-ID: <51907E45.7010409@suse.de>
In-Reply-To: <CAC9+anKxnDBYh15uwQQoTUzGZkwUe6wuV=8wf6NUVsC4+_TUgw@mail.gmail.com>

On 05/10/2013 09:27 PM, Baruch Even wrote:
> On Fri, May 10, 2013 at 11:18 PM, Hannes Reinecke <hare@suse.de> wrote:
>> On 05/10/2013 07:51 PM, Baruch Even wrote:
>>>
>>> The error handling I have in mind (admittedly, not fully thought out)
>>> should work for both FC and SAS. Currently the error recovery
>>> progresses at the host level regardless of if the errors are on one
>>> device or all of them, it also stops the IOs on all devices and LUNs.
>>> It would be nice if that was taken into account. My ideas may be more
>>> suitable to the environment I work in (enterprise storage devices
>>> rather than hosts) but I believe the same approach would benefit the
>>> hosts as well.
>>>
>>> It would be interesting to see what approach the new error handling will
>>> take.
>>>
>> So, my general idea is this:
>>
>> 1) Send command aborts from scsi_times_out(). There is no requirement
>> on stopping I/O on the host simply because a single command times
>> out. And as scsi_times_out() is run from a separate thread anyway
>> we should be able to send ABORT TASK TMFs without a problem
>> 2) Modify recovery sequence.
>> One of the major pitfalls of the current scsi_eh is that it
>> spills over onto unrelated LUNs for higher levels. So for the
>> new EH we should be using a sequence of
>> - ABORT TASK
>> - ABORT TASK SET
>> - (Terminate I_T nexus)
>> - (Host reset)
>> 'Terminate I_T nexus' for FibreChannel is equivalent to a LOGO.
>> 'Host reset' is the current host reset function.
>> 3) Finegrained recovery setting.
>> There is no need to stop the entire host when doing a recovery;
>> it should be sufficient to stop I/O to the unit
>> (LUN, I_T nexus, host) when the error recovery is at the
>> respective level.
>
> This looks great and much in line with what I'm thinking.
>
> What about not going to the higher level if not everything at that
> level had failed?
> I mean that if at the target not all LUNs failed it will be quite
> troublesome to other LUNs if I-T-Nexus is terminated and that at the
> host level if there are still targets that are functioning it will
> kill them too to reset the host.
>
True. But at the end of the day, we _do_ want to recover the failed
LUN. If we were to disable that faulty LUN and continue running with
the others, we wouldn't have a chance of _ever_ recovering that one LUN.
Plus we have to keep in mind that the attempted error recovery might
not have succeeded for totally unrelated reasons (e.g. sending an
ABORT TASK SET when the link is down). So we basically _have_ to
escalate it to the next level, even though that will mean stopping
I/O to other, hitherto unaffected instances.
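
To make the escalation argument concrete, here is a minimal sketch in
plain userspace C. It is not kernel code and not a patch; every name in
it (eh_level, eh_scope, eh_escalate, eh_simulated_try) is hypothetical
and only illustrates the ladder discussed above: try each level in
order, stop I/O only for the unit that level affects, and move up one
level when the attempt fails, whatever the reason for the failure.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical recovery levels, in the escalation order proposed above. */
enum eh_level {
	EH_ABORT_TASK,		/* one command */
	EH_ABORT_TASK_SET,	/* one LUN */
	EH_TERMINATE_NEXUS,	/* one I_T nexus (a LOGO on FibreChannel) */
	EH_HOST_RESET,		/* the whole host */
	EH_NR_LEVELS
};

/* Hypothetical unit whose I/O is stopped while a level is running. */
enum eh_scope { SCOPE_NONE, SCOPE_LUN, SCOPE_NEXUS, SCOPE_HOST };

/* Map a recovery level to the unit that has to be quiesced for it. */
static enum eh_scope eh_scope_for(enum eh_level level)
{
	switch (level) {
	case EH_ABORT_TASK:		return SCOPE_NONE;
	case EH_ABORT_TASK_SET:		return SCOPE_LUN;
	case EH_TERMINATE_NEXUS:	return SCOPE_NEXUS;
	default:			return SCOPE_HOST;
	}
}

/* Attempts recovery at one level; returns true if it succeeded. */
typedef bool (*eh_try_fn)(enum eh_level level, void *ctx);

/*
 * Walk the ladder from ABORT TASK upwards and stop at the first level
 * that succeeds. Returns that level, or EH_NR_LEVELS if even the host
 * reset failed. *widest is set to the widest unit that was quiesced.
 */
static enum eh_level eh_escalate(eh_try_fn try_level, void *ctx,
				 enum eh_scope *widest)
{
	int level;

	*widest = SCOPE_NONE;
	for (level = EH_ABORT_TASK; level < EH_NR_LEVELS; level++) {
		*widest = eh_scope_for((enum eh_level)level);
		if (try_level((enum eh_level)level, ctx))
			return (enum eh_level)level;
		/* Failed: we cannot tell why, so escalate to the next level. */
	}
	return EH_NR_LEVELS;
}

/*
 * Stand-in for real recovery: ctx points to the lowest level at which
 * recovery is assumed to work; everything below it fails.
 */
static bool eh_simulated_try(enum eh_level level, void *ctx)
{
	const enum eh_level *first_ok = ctx;

	return level >= *first_ok;
}
```

Calling eh_escalate(eh_simulated_try, &first_ok, &widest) with first_ok
set to EH_TERMINATE_NEXUS models the link-down case: ABORT TASK and
ABORT TASK SET fail, the nexus is terminated, and only then are the
other LUNs behind that I_T nexus affected. The host itself is never
stopped in that run.
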
Cheers,
Hannes
--
Dr. Hannes Reinecke zSeries & Storage
hare@suse.de +49 911 74053 688
SUSE LINUX Products GmbH, Maxfeldstr. 5, 90409 Nürnberg
GF: J. Hawn, J. Guild, F. Imendörffer, HRB 16746 (AG Nürnberg)
--
To unsubscribe from this list: send the line "unsubscribe linux-scsi" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html