From mboxrd@z Thu Jan 1 00:00:00 1970 From: Hannes Reinecke Subject: Re: Debugging scsi abort handling ? Date: Thu, 28 Aug 2014 14:10:39 +0200 Message-ID: <53FF1C3F.1070409@suse.de> References: <53F8AAA8.8040407@redhat.com> <53FAE3CA.6060603@redhat.com> <53FAF80D.2070209@redhat.com> <53FB0FE3.80603@acm.org> <53FB1ACD.1040208@redhat.com> <53FB1D4F.5070606@redhat.com> <53FB2083.9000208@redhat.com> <1408981299.17968.27.camel@jarvis.lan> <53FC4193.4010808@redhat.com> <1409078078.3812.25.camel@jarvis.lan> <53FCDDD7.1020608@redhat.com> Mime-Version: 1.0 Content-Type: text/plain; charset=UTF-8; format=flowed Content-Transfer-Encoding: QUOTED-PRINTABLE Return-path: Received: from cantor2.suse.de ([195.135.220.15]:54036 "EHLO mx2.suse.de" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1750926AbaH1MKl (ORCPT ); Thu, 28 Aug 2014 08:10:41 -0400 In-Reply-To: <53FCDDD7.1020608@redhat.com> Sender: linux-scsi-owner@vger.kernel.org List-Id: linux-scsi@vger.kernel.org To: Hans de Goede , James Bottomley Cc: Paolo Bonzini , Bart Van Assche , SCSI development list On 08/26/2014 09:19 PM, Hans de Goede wrote: > Hi, > > On 08/26/2014 08:34 PM, James Bottomley wrote: >> On Tue, 2014-08-26 at 10:13 +0200, Hans de Goede wrote: >>> Hi, >>> >>> On 08/25/2014 05:41 PM, James Bottomley wrote: >>>> On Mon, 2014-08-25 at 13:39 +0200, Paolo Bonzini wrote: >>>>> Il 25/08/2014 13:26, Hans de Goede ha scritto: >>>>>> Thanks Bart and Paolo, your insights into this are greatly >>>>>> appreciated. >>>>>> >>>>>> So with uas there are separate usb transaction for cmd, data >>>>>> in, data out >>>>>> and sense for each tag. At the time of abort, usually one of >>>>>> data in / data >>>>>> out and a sense usb transaction will be outstanding. >>>>>> >>>>>> There already is logic in the driver to kill the data in / out >>>>>> transactions >>>>>> if a sense gets returned (usually with an error) before they >>>>>> are done. >>>>>> >>>>>> So if I'm reading this correctly, then on a successful abort, >>>>>> the sense >>>>>> transaction (if not already completed by the target) should be >>>>>> cancelled as >>>>>> it will never complete, correct ? >>>>> >>>>> Yes. More precisely, once the response IU comes back for the >>>>> abort, >>>>> there should be no more IUs for that task. Figure 13 >>>>> ("Multiple Command >>>>> Example") in the UAS spec actually shows that. >>>>> >>>>> At least, that's what it should be like in theory. I suspect >>>>> firmware >>>>> bugs will abound in this area, but at least you shouldn't be >>>>> actively >>>>> expecting an IU for an aborted task. >>>> >>>> Just a note on this. The abort area in all devices is where we >>>> have >>>> scope for spec compliance issues. Also in the old days abort >>>> itself >>>> could trigger a firmware crash on some devices (including >>>> arrays). The >>>> problem is actually that the system is usually groaning because >>>> it's out >>>> of memory within the controller. That actually means that >>>> sending in >>>> another task (the abort) causes greater pressure. For this >>>> reason, some >>>> device drivers choose to skip the abort step and head straight >>>> to reset. >>>> For reset, you guarantee that all outstanding tasks for the unit >>>> are >>>> killed. >>> >>> Hmm, I like this idea, given the finickiness of the abort path in >>> the uas >>> driver, and that: >>> >>> 1) We really have no proper way to test this >>> 2) We already have some known issues there (we don't kill sense >>> urbs atm, >>> and I've a note somewhere about a double free on some corner >>> case >>> where an urb submit fails) >>> 3) >>> 3) In all the cases where I've managed to trip op an uas device >>> the only >>> thing which actually worked to recover things was doing a usb >>> reset >>> 4) Aborts should not happen in the first place, so using a big >>> hammer >>> when they do is not really a big problem, instead we should >>> focus >>> on figuring out why they happen and fix the cause >>> >>> I think that just dropping abort handling altogether is a good >>> idea actually, >>> so if there are no objections I'm just going to do that. >>> >>> I can simply not set eh_abort_handler (aka set it to NULL), right ? >> >> Just so you know, if you do this, error handling becomes much more >> painful. The abort handler can now handle aborting and retrying >> without >> pausing the host. The reset definitely stops the host and causes >> a big >> and noticeable burp in processing. However, there are hosts which >> have >> to do it. > > I understand, but shouldn't aborts be something which rarely happens. > I guess that with a faulty drive they happen more often, but at this > point in time the uas driver's error handling problems are mostly wit= h > dealing with timeouts during probing, which are likely caused by > the device not grokking some command we've send, and responding to > this in a bad manner (hanging mostly). > > So I'm tempted to just remove the abort handling, and first focus on > getting uas stable under all conditions. Once it is stable we can > look into making it deal more graciously with errors. > Sounds like a reasonable plan. [ .. ] >> >> Wait, could this also be your abort problem? In the running abort >> handler, we could get a scenario where we abort a bunch of >> commands at the same time. > > I don't think so, the uas code does not return from eh_abort_handler > until the abort has either succeeded or timedout (for which a 3 secon= d > timeout is used). AFAIK the scsi core will always issue multiple abor= ts > sequentially, right. And if the abort succeeds then the tag is free > again for the next abort. Only if an abort fails (times out, which is > what I've seen in all the error conditions which I've encountered so > far), > then will subsequent aborts, and the logical unit reset, fail due to > there not being a free tag to use for them. > If the logic for command/task abort and logical unit reset is=20 similar then there is no point in implementing both. So by what I've seen I would first implement target reset (and host=20 reset :-), and worry about finer grained error control later on. Cheers, Hannes --=20 Dr. Hannes Reinecke zSeries & Storage hare@suse.de +49 911 74053 688 SUSE LINUX Products GmbH, Maxfeldstr. 5, 90409 N=C3=BCrnberg GF: J. Hawn, J. Guild, F. Imend=C3=B6rffer, HRB 16746 (AG N=C3=BCrnberg= ) -- To unsubscribe from this list: send the line "unsubscribe linux-scsi" i= n the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html