Re: [PATCHSET #upstream] libata: improve FLUSH error handling

linux-ide.vger.kernel.org archive mirror

From: Mark Lord <liml@rtr.ca>
To: Tejun Heo <htejun@gmail.com>
Cc: jeff@garzik.org, linux-ide@vger.kernel.org, alan@lxorguk.ukuu.org.uk
Subject: Re: [PATCHSET #upstream] libata: improve FLUSH error handling
Date: Thu, 27 Mar 2008 22:33:26 -0400
Message-ID: <47EC58F6.3070601@rtr.ca>
In-Reply-To: <47EC5079.5020105@gmail.com>

Tejun Heo wrote:
> Hello, Mark.
> 
> Mark Lord wrote:
>> Speaking of which.. these are all WRITEs.
>>
>> In 18 years of IDE/ATA development,
>> I have *never* seen a hard disk drive report a WRITE error.
>>
>> Which makes sense, if you think about it -- it's rewriting the sector
>> with new ECC info, so it *should* succeed.  The only case where it won't
>> is if the sector has been marked as "bad" internally, and the drive is
>> too dumb to try anyway after it runs out of remap space.
>>
>> In which case we've already lost data, and taking more than a hundred
>> and twenty seconds isn't going to make a serious difference.
> 
> Yeah, the disk must be knee deep in shit to report WRITE failure.  I
> don't really expect the code to be exercised often but was mainly trying
> to fill the loophole in libata error handling, as this type of behavior
> is what the spec requires on FLUSH errors.
> 
> I didn't add global timeout because retries are done iff the drive is
> reporting progress.
> 
> 1. Drives genuinely deep in shit and getting lots of WRITE errors would
> report a different sector on each FLUSH, and we NEED to keep retrying.
> That's what the spec requires, and the FLUSH could be from shutdown; if
> so, that would be the last chance to write the cached data to the media.
> 
> 2. There are other issues causing the command to fail (e.g. timeout, HSM
> violation or somesuch).  This is the case where EH can take a really long
> time if it keeps retrying, but the posted code doesn't retry in this case.
> 
> 3. The drive is crazy and reporting errors for no good reason.  Unless
> the drive is really anti-social and raises such error conditions only
> after tens of seconds, this shouldn't take too long.  Also, if the LBA
> doesn't change between retries, the tries count is halved.
> 
> So, I think the code should be safe.  Do you still think we need a
> global timeout?  It is easy to add.  I'm just not sure whether we need
> it or not.
..
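A minimal user-space sketch of the retry policy described in the quoted points, under stated assumptions: the names (`run_flush_policy`, `FLUSH_TRIES`, the result enum) and the 20-try budget are hypothetical and not libata's actual code, and a scripted array of outcomes stands in for the device. It retries only while the device reports a media error, stops on any other failure, and halves the remaining budget when the failing LBA repeats.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical outcome of one FLUSH attempt (not libata's real types). */
enum flush_result {
	FLUSH_OK,        /* cache flushed successfully */
	FLUSH_MEDIA_ERR, /* a WRITE failed and the device reported its LBA */
	FLUSH_OTHER_ERR, /* timeout, HSM violation or similar: not retried */
};

struct flush_attempt {
	enum flush_result result;
	uint64_t failed_lba; /* meaningful only for FLUSH_MEDIA_ERR */
};

/* Hypothetical initial retry budget. */
#define FLUSH_TRIES 20

/*
 * Replays a scripted sequence of FLUSH outcomes and returns how many
 * FLUSH commands were issued before the policy stopped.
 */
static int run_flush_policy(const struct flush_attempt *script, int n)
{
	int tries = FLUSH_TRIES;
	int issued = 0;
	int have_prev = 0;
	uint64_t prev_lba = 0;

	assert(n >= 0);
	while (tries > 0 && issued < n) {
		const struct flush_attempt *a = &script[issued++];

		tries--; /* every issued FLUSH consumes one try */
		if (a->result != FLUSH_MEDIA_ERR)
			break; /* success, or an error class we don't retry */

		/* No progress (same failing LBA): shrink the budget. */
		if (have_prev && a->failed_lba == prev_lba)
			tries /= 2;

		prev_lba = a->failed_lba;
		have_prev = 1;
	}
	return issued;
}
```

With a scripted device that always fails at the same LBA, the remaining budget goes 19, 9, 4, 1, 0, so only 5 FLUSHes are issued; a device that reports a new LBA every time is bounded only by the 20-try budget.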

With EH becoming more and more capable and complex,
a global deadline for FLUSH looks like a reasonable thing.
People who have no backups can leave it at the default "near-infinity" setting
that is there now, and folks with RAID1 (or better) can set it to a much
shorter value -- so that their system-recovery reboot doesn't take 3 hours
to get past the FLUSH_CACHE on the failing drive.  :)
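A minimal sketch of how a global deadline could sit on top of an existing per-attempt retry decision. Everything here is hypothetical: `flush_may_retry`, `count_flushes`, the millisecond timestamps and the 30-second per-attempt figure in the usage note are illustrative assumptions, not libata code; `UINT64_MAX` stands in for the "near-infinity" default.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical tunable default: effectively "no deadline". */
#define FLUSH_DEADLINE_DEFAULT_MS UINT64_MAX

/*
 * Returns true if another FLUSH may be issued.
 * per_attempt_ok: what the per-attempt retry policy decided.
 * start_ms/now_ms: monotonic timestamps of the first FLUSH and now.
 */
static bool flush_may_retry(bool per_attempt_ok, uint64_t start_ms,
			    uint64_t now_ms, uint64_t deadline_ms)
{
	assert(now_ms >= start_ms);
	if (!per_attempt_ok)
		return false;
	/* The global deadline caps total time across all retries. */
	return now_ms - start_ms < deadline_ms;
}

/*
 * Counts FLUSH commands issued when every attempt fails after
 * attempt_ms and the per-attempt policy alone would allow max_tries.
 */
static unsigned count_flushes(unsigned max_tries, uint64_t attempt_ms,
			      uint64_t deadline_ms)
{
	const uint64_t start = 0;
	uint64_t now = start;
	unsigned issued = 0;

	do {
		now += attempt_ms; /* one failed FLUSH */
		issued++;
	} while (flush_may_retry(issued < max_tries, start, now, deadline_ms));
	return issued;
}
```

With 360 failing attempts of 30 s each (about 3 hours), `count_flushes(360, 30000, FLUSH_DEADLINE_DEFAULT_MS)` issues all 360, while a 2-minute deadline (`120000`) stops after 4.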

Cheers
