Re: [SUMMARY] libata EH

linux-ide.vger.kernel.org archive mirror
 help / color / mirror / Atom feed

From: Tejun Heo <htejun@gmail.com>
To: Jeff Garzik <jgarzik@pobox.com>
Cc: linux-ide@vger.kernel.org, albertcc@tw.ibm.com, mlord@pobox.com,
	lkosewsk@gmail.com, luben_tuikov@adaptec.com,
	Alan Cox <alan@lxorguk.ukuu.org.uk>,
	Bartlomiej Zolnierkiewicz <bzolnier@gmail.com>,
	Jens Axboe <axboe@suse.de>
Subject: Re: [SUMMARY] libata EH
Date: Sun, 21 Aug 2005 02:31:14 +0900	[thread overview]
Message-ID: <430768E2.60808@gmail.com> (raw)
In-Reply-To: <4306B62F.7090509@pobox.com>


  Hello, guys.

Jeff Garzik wrote:
> 
> (added Alan, Bart and Jens to CC)
> 
> Tejun Heo wrote:
> 
>>  Hello, Jeff & libata developers.
>>
>>  I've been spending some time on libata EH and am trying to make a
>> list of things which can be improved.  If you have something, please
>> add to the list.
>>
>> 1. Errors are handled in multiple paths.
>>
>>     * ATA errors are handled directly in intr context
> 
> 
> Simple stuff like "command aborted" (invalid command) can be handled 
> immediately, no need to kick in the error handling.
> 
> But as long as the right hardware interrupts are acknowledged, I don't 
> mind if all error handling is moved to the thread.
> 

  My preference is toward unifying into single path as long as 
performance penalty is acceptable for the sake of simplicity.

> 
>>     * timeouts are handled by ->eng_timeout via SCSI EH
>>
>>     * ATAPI errors are forwarded to ->eng_timeout in somewhat hackish way
> 
> 
> ->eng_timeout is somewhat misnamed.  A more correct name would be 
> "error_hander".  ATAPI errors are intentionally forwarded to SCSI EH via 
> the normal check-condition handling paths, which cause the SCSI EH to be 
> invoked.
> 
> 
>> 2. Synchronization
>>
>>     * SCSI EH entrance is not synchronized with normal processing.
>>       ATAPI error handling/timeout handling can run concurrently
>>       with normal command processing.  Albert, I think it's the
>>       same problem you're trying to solve by moving ATA_QCFLAG_ACTIVE
>>       clearing.
>>
>>       http://marc.theaimsgroup.com/?l=linux-ide&m=112417360223374&w=2
> 
> 
> The SCSI layer stops all command processing before calling 
> ->eh_strategy_handler().  Where do you see that it runs concurrently 
> with normal command processing?  That should definitely -not- be happening.
> 

  There are currently two problems.

  * As we don't grab host_set lock on entry to ata_scsi_error(), we can
    run concurrently with latter part of ata_qc_complete().  This race is
    addressed by the following patches I've just posted.

    http://marc.theaimsgroup.com/?l=linux-ide&m=112454734102242&w=2

  * After entering EH, normal command completion or spurious interrupt
    can occur.  We currently don't peg those interrupts, so interrupt
    handling can interfere with EH.

> 
>>     * SCSI EH entrance is not synchronized with polling tasks.
> 
> 
> Yes, this definitely needs fixing.
> 
> Luckily the polling task is very rarely used, by normal users.
> 
> 
>> 3. Error handling too weak
>>
>>     * We need to check the device responds to commands (say, w/
>>       IDENTIFY or CHK_POWER) after an error, and then reset if it
>>           doens't.  To do this, we need to handle all errors in EH.
> 
> 
> First you need to classify the error, then handle based on that 
> classification.
> 
> - DMA and PCI bus errors are usually indicated via status bits in 
> controller registers, such as ATA_DMA_ERR on standard PCI IDE controllers.
> 
> - Device errors will be indicated via the ATA Status and Error registers.
> 
> 
> - PCI bus errors should be handled by resetting the host controller (if 
> possible), and then retrying the command [NOTE: better suggestions welcome]
> 
> - DMA errors should be handled by hueristics:  If more than $N (3?) DMA 
> errors happen in 15 minutes,
> * decrease SATA PHY speed.  if speed cannot be decreased,
> * decrease UDMA xfer speed.  if at UDMA0, switch to PIO4
> * decrease PIO xfer speed.  if at PIO3, complain, but continue
> 
> Commands should be retried after DMA errors.
> 
> - Device errors should handled as per ATA specs.  Usually these error do 
> not cause a retry, but they may require additional inquiry such as READ 
> LOG EXT if its a media error.
> 
> 
>> 4. Better error reporting
>>
>>     * We currently depend on ATA_STAT and ATA_ERR register values
>>       to check for and report errors.  As Jeff said, this is way
>>       too crude.  We need better error reporting.  I think having
>>       unified code path for error handling will help implementing
>>       this.
> 
> 
> This will fall naturally out of better error handling (stuff I described 
> above).
> 
> 
>> 5. EH is currently holding off other improvements
>>
>>     * NCQ controllers (or any other non-legacy ATA interface based
>>           ones) don't fit nicely into current ATA error handling in
>>           interrupt scheme.  As NCQ errors require issuing read log
>>       command, we need to be in EH context to handle these errors.
>>
>>     * To properly implement hotplug, we need to have solid error
>>           handling.  IMHO, implementing hotplug with current EH will
>>       be quite fragile.
> 
> 
> Correct.  Both NCQ and hotplug really want decent error handling -- 
> particularly hotplug.  Hotplug will likely travel the error handling 
> paths, though at that point we prefer the English word "exception" 
> rather than "error"  :)
> 
> 
>>  Whether we choose to stay with ->eh_strategy_handler or move over to
>> fine-grained SCSI EH, IMHO, libata EH needs some work.  So... how
>> should we proceed?
> 
> 
> Well, moving over to the fine-grained hooks should make the remaining 
> problems Mark sees go away, as well as guaranteeing that we get command 
> completion right.
> 
> Staying with ->eh_strategy_handler() means auditing the entire SCSI 
> layer to check and see what gets set on error, which libata must then 
> manually reset.  Obviously there are still some bugs in this area, as 
> Mark has demonstrated.  The big reason why I don't like 
> ->eh_strategy_handler() is that it continues to be an unknown quantity, 
> in terms of, we don't still have a complete list of things-to-do during 
> error handling.

  As I wrote previously, I'm not very sure if Mark and I are looking at 
the same problem, but I think that once we know how the system is locked 
up, the debugging shouldn't be that difficult.

  As there are concerns regarding semantics of ->eh_strategy_handler and 
it's a less-used and less-charted territory, I'm gonna try to write a 
document describing the following.

  * How SCSI EH works and commands flow through it with the default
    fine-grained hooks.
  * From above, extract what ->eh_strategy_handler() should do.
  * What libata error conditions are there and how qc's should be
    handle.
  * How to integrate libata EH into SCSI EH without losing commands.

  I don't how good the doc will turn out (don't expect too much), but I 
hope it could serve as a basis for discussion if nothing else.

> If you are interested in tackling the work, you're more than welcome to 
> choose either path.  Either way, the work won't go to waste.
> 
>     Jeff

  After writing above mentioned doc, I'll try to improve/revise and 
break down my previously posted EH patchset and explain how they conform 
to above yet-to-be-written document such that it can be better 
understood and easier to review/debug.

  Thanks a lot.

-- 
tejun

next prev parent reply	other threads:[~2005-08-20 17:34 UTC|newest]

Thread overview: 5+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2005-08-20  2:33 [SUMMARY] libata EH Tejun Heo
2005-08-20  4:48 ` Jeff Garzik
2005-08-20 17:31   ` Tejun Heo [this message]
2005-08-20 20:02     ` Alan Cox
2005-08-21  4:09     ` Jeff Garzik

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=430768E2.60808@gmail.com \
    --to=htejun@gmail.com \
    --cc=alan@lxorguk.ukuu.org.uk \
    --cc=albertcc@tw.ibm.com \
    --cc=axboe@suse.de \
    --cc=bzolnier@gmail.com \
    --cc=jgarzik@pobox.com \
    --cc=linux-ide@vger.kernel.org \
    --cc=lkosewsk@gmail.com \
    --cc=luben_tuikov@adaptec.com \
    --cc=mlord@pobox.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

Be sure your reply has a Subject: header at the top and a blank line before the message body.

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).