From: Hannes Reinecke
Subject: Re: error handler scheduling
Date: Wed, 27 Mar 2013 15:35:26 +0100
Message-ID: <515303AE.3060605@suse.de>
References: <51525560.3000008@emulex.com>
In-Reply-To: <51525560.3000008@emulex.com>
List-Id: linux-scsi@vger.kernel.org
To: James.Smart@emulex.com
Cc: "linux-scsi@vger.kernel.org"

On 03/27/2013 03:11 AM, James Smart wrote:
> In looking through the error handler, if a command times out and is
> added to the eh_cmd_q for the shost, the error handler is only
> awakened once shost->host_busy (total number of i/os posted to the
> shost) is equal to shost->host_failed (number of i/os that have been
> failed and put on the eh_cmd_q). Which means any other i/o that
> was outstanding must either complete or have its timeout fire.
> Additionally, as all further i/o is held off at the block layer as
> the shost is in recovery, new i/o cannot be submitted until the
> error handler runs and resolves the errored i/os.
>
> Is this true?
>
Yes.

> I take it it is also true that the midlayer thus expects every i/o
> to have an i/o timeout. True?
>
Yes. But this is guaranteed by the block layer:

void blk_add_timer(struct request *req)
{
	struct request_queue *q = req->q;
	unsigned long expiry;

	if (!q->rq_timed_out_fn)
		return;

	BUG_ON(!list_empty(&req->timeout_list));
	BUG_ON(test_bit(REQ_ATOM_COMPLETE, &req->atomic_flags));

	/*
	 * Some LLDs, like scsi, peek at the timeout to prevent a
	 * command from being retried forever.
	 */
	if (!req->timeout)
		req->timeout = q->rq_timeout;

So every request will have a timeout, either the default
request_queue timeout or an individual one.

> The crux of this point is that when the recovery thread runs to
> abort the timed-out i/os, it is at the mercy of the last command to
> complete or time out. Additionally, as all further i/o is held off at
> the block layer as the shost is in recovery, new i/o cannot be
> submitted until the error handler runs and resolves the errored
> i/os. So all I/O on the host is stopped until that last i/o
> completes/times out. The timeouts may be eons later. Consider
> SCSI format commands or verify commands that can take hours to
> complete.
>
Yes, that's true. Unfortunately.

> Specifically, I'm in a situation currently where an application is
> using sg to send a command to a target. The app selected no-timeout
> by setting the timeout to MAX_INT. Effectively it's so large it's
> infinite. This I/O was one of those "lost" on the storage fabric.
> There was another command that long ago timed out and is sitting on
> the error handler's queue. But nothing is happening, neither new i/o
> nor the error handler resolving the failed i/o, until that infinite
> i/o completes.
>
Hehe. No timeout != MAX_INT.

It's easy to apply a timeout if none is set. But how do we determine
what constitutes a valid timeout?
As mentioned, some commands can literally take forever _and_ be
fully legitimate. So who are we to decide?

> I'm hoping I hear that I just misunderstand things. If not, is
> there a suggestion for how to resolve this predicament? IMHO,
> I'm surprised we stop all i/o for error handling, and that it can be
> so long later... I would assume there's a minimum bound we would
> wait in the error handler (30s?) before we unconditionally run it
> and abort anything that was outstanding.
>
Ah, the joys of error recovery.
Incidentally, that'll be one of the topics I'll be discussing at
LSF; I've been bitten by this on various other occasions.

AFAIK the reasoning behind the current error recovery strategy is
that it's modelled after parallel SCSI behaviour, where you
basically have to stop the entire bus, figure out which state it's
in, and then take corrective action.
And you typically don't have any LUNs to deal with.
_And_ SPI is essentially single-threaded when it comes to target
access, so in effect you cannot send commands over the bus when
resetting a target.
So there it makes sense.

Less so for modern fabrics, where target access is governed by an
I_T nexus, each of which is largely independent of the others.

Actually there is another issue with the error handler:
the commands will only be released after EH is done.
If you look at the EH sequence

-> eh_abort
-> eh_lun_reset
-> eh_target_reset
-> eh_bus_reset
-> eh_host_reset

the command itself is only meaningful until lun_reset() has
completed; after lun_reset() the command is invalidated.
Every other stage still uses the scsi command as an argument,
but only as a placeholder to figure out which device it should act
upon.

So we _could_ speed things up quite a lot if we were able to
call ->done() on the command after lun reset; then the command would
be returned to the upper layers.
And things like multipath could kick in and move I/O to other devices.

However, this is a daunting task. I've tried, and it's far from easy.
_Especially_ due to some FC HBAs insisting on using scmds for sending
TARGET RESET TMFs.
If we just could do a LOGO for target reset things would become so
much easier ...

Cheers,

Hannes
-- 
Dr. Hannes Reinecke		      zSeries & Storage
hare@suse.de			      +49 911 74053 688
SUSE LINUX Products GmbH, Maxfeldstr. 5, 90409 Nürnberg
GF: J. Hawn, J. Guild, F.
Imendörffer, HRB 16746 (AG Nürnberg)
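
For illustration only, the wake-up rule discussed above (the error
handler runs only once host_busy equals host_failed) can be modelled
in user space. This is a rough sketch, not kernel code: struct
sketch_cmd, SKETCH_NEVER and the sketch_* functions are hypothetical
names invented for this example, and times are plain seconds. It
assumes each outstanding command either completes at a known time or
is lost, and that a lost command counts as failed when its timeout
fires.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical marker for "this never happens" (a lost command). */
#define SKETCH_NEVER (~0UL)

/* Hypothetical model of one outstanding command; not a kernel type. */
struct sketch_cmd {
	unsigned long completes_at;	/* completion time, or SKETCH_NEVER */
	unsigned long timeout;		/* time at which its timeout fires */
};

/* A command fails (lands on the EH queue) if its timeout fires first. */
static int sketch_times_out(const struct sketch_cmd *c)
{
	return c->timeout < c->completes_at;
}

/* Time at which the command no longer counts as busy-but-not-failed. */
static unsigned long sketch_settle_time(const struct sketch_cmd *c)
{
	return sketch_times_out(c) ? c->timeout : c->completes_at;
}

/*
 * Earliest time at which host_busy == host_failed would hold, i.e.
 * when every command has either completed or timed out. Returns
 * SKETCH_NEVER if no command fails (EH is not needed) or if some
 * command never settles.
 */
static unsigned long sketch_eh_wake_time(const struct sketch_cmd *cmds,
					 size_t n)
{
	unsigned long wake = 0;
	size_t failed = 0;
	size_t i;

	assert(cmds != NULL || n == 0);

	for (i = 0; i < n; i++) {
		unsigned long t = sketch_settle_time(&cmds[i]);

		if (sketch_times_out(&cmds[i]))
			failed++;
		if (t > wake)
			wake = t;	/* the slowest command gates EH */
	}
	return failed ? wake : SKETCH_NEVER;
}
```

With one command completing at t=5, one lost command with a 30 s
timeout, and one lost command with a 24 h timeout, the function
returns 86400: the command that failed at t=30 sits on the queue
until the last timeout fires, which is the predicament James
describes.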