From: Mark Lord <liml@rtr.ca>
To: Tejun Heo <htejun@gmail.com>
Cc: jeff@garzik.org, linux-ide@vger.kernel.org, alan@lxorguk.ukuu.org.uk
Subject: Re: [PATCHSET #upstream] libata: improve FLUSH error handling
Date: Thu, 27 Mar 2008 22:33:26 -0400
Message-ID: <47EC58F6.3070601@rtr.ca>
In-Reply-To: <47EC5079.5020105@gmail.com>

Tejun Heo wrote:
> Hello, Mark.
> 
> Mark Lord wrote:
>> Speaking of which.. these are all WRITEs.
>>
>> In 18 years of IDE/ATA development,
>> I have *never* seen a hard disk drive report a WRITE error.
>>
>> Which makes sense, if you think about it -- it's rewriting the sector
>> with new ECC info, so it *should* succeed.  The only case where it won't,
>> is if the sector has been marked as "bad" internally, and the drive is
>> too dumb to try anyways after it runs out of remap space.
>>
>> In which case we've already lost data, and taking more than a hundred
>> and twenty seconds isn't going to make a serious difference.
> 
> Yeah, the disk must be knee deep in shit to report WRITE failure.  I
> don't really expect the code to be exercised often but was mainly trying
> to fill the loophole in libata error handling as this type of behavior is
> what the spec requires on FLUSH errors.
> 
> I didn't add global timeout because retries are done iff the drive is
> reporting progress.
> 
> 1. Drives genuinely deep in shit and getting lots of WRITE errors would
> report different sectors on each FLUSH and we NEED to keep retrying.
> That's what the spec requires and the FLUSH could be from shutdown and
> if so that would be the drive's last chance to write data to the drive.
> 
> 2. There are other issues causing the command to fail (e.g. timeout, HSM
> violation or somesuch).  This is the case where EH can take a really long
> time if it keeps retrying, but the posted code doesn't retry in this case.
> 
> 3. The drive is crazy and reporting errors for no good reason.  Unless
> the drive is really anti-social and raises such an error condition only
> after tens of seconds, this shouldn't take too long.  Also, if the LBA
> doesn't change between retries, the tries count is halved.
> 
> So, I think the code should be safe.  Do you still think we need a
> global timeout?  It is easy to add.  I'm just not sure whether we need
> it or not.
..

With EH becoming more and more capable and complex,
a global deadline for FLUSH looks like a reasonable thing.
People who have no backups can leave it at the default "near-infinity" setting
that is there now, and folks with RAID1 (or better) can set it to a much 
shorter number -- so that their system-recovery reboot doesn't take 3 hours
to get past the FLUSH_CACHE on the failing drive.  :)
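
A minimal sketch of how such a deadline could sit on top of the retry
rules quoted above. This is not the posted libata code: every name here
(flush_next, struct flush_state, deadline_ms and so on) is hypothetical,
and leaving the retry budget untouched when the reported LBA changes is
only one reading of the quoted description.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical outcome of one FLUSH attempt. */
enum flush_result { FLUSH_OK, FLUSH_DEV_ERR, FLUSH_OTHER_ERR };

struct flush_attempt {
	enum flush_result result;
	uint64_t failed_lba;     /* meaningful only for FLUSH_DEV_ERR */
	unsigned int elapsed_ms; /* time this attempt took */
};

enum flush_verdict { VERDICT_DONE, VERDICT_RETRY, VERDICT_GIVE_UP };

struct flush_state {
	unsigned int tries_left; /* budget for retries without progress */
	bool have_last_lba;
	uint64_t last_lba;
	uint64_t elapsed_ms;     /* total time spent on this FLUSH */
	uint64_t deadline_ms;    /* 0 means "no deadline" (assumed default) */
};

/* Decide what to do after one FLUSH attempt has completed. */
static enum flush_verdict flush_next(struct flush_state *st,
				     const struct flush_attempt *a)
{
	assert(st && a);

	st->elapsed_ms += a->elapsed_ms;

	if (a->result == FLUSH_OK)
		return VERDICT_DONE;

	/* Case 2: timeout, HSM violation etc. are never retried. */
	if (a->result != FLUSH_DEV_ERR)
		return VERDICT_GIVE_UP;

	/* Global deadline: stop even if the drive still reports progress. */
	if (st->deadline_ms && st->elapsed_ms >= st->deadline_ms)
		return VERDICT_GIVE_UP;

	/* Case 3: same LBA as last time means no progress, halve the budget. */
	if (st->have_last_lba && a->failed_lba == st->last_lba)
		st->tries_left /= 2;

	st->have_last_lba = true;
	st->last_lba = a->failed_lba;

	/* Case 1: a new LBA leaves the budget alone, so we keep retrying. */
	return st->tries_left ? VERDICT_RETRY : VERDICT_GIVE_UP;
}
```

With deadline_ms left at 0 the behaviour is the "near-infinity" default:
only the halving rule ends the retries. A RAID1 setup could set it to
something like 30000 and give up on the failing drive after 30 seconds.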

Cheers
