Re: faulty disk testing

From: Tejun Heo <htejun@gmail.com>
To: Ric Wheeler <ric@emc.com>
Cc: Linux-ide <linux-ide@vger.kernel.org>,
	Mark Lord <mlord@pobox.com>, Jeff Garzik <jgarzik@pobox.com>
Subject: Re: faulty disk testing
Date: Tue, 05 Sep 2006 13:57:30 +0200
Message-ID: <44FD662A.6060404@gmail.com>
In-Reply-To: <44FCD328.3020800@emc.com>

Hello, Ric.

[cc'ing Jeff]

Ric Wheeler wrote:
> Hi Tejun,
> 
> We have been trying to inject some errors on some drives & validate that 
> the new error handling kicks out drives.
> 
> Using 2.6.18rc3 on a box with 4 drives - 3 good & one with an 
> artificially created ecc error in the 4-way MD RAID1 partition.
> 
> The error handling worked through the various transitions, but did not 
> give up on the drive well enough to let the boot continue using the 
> other 3.

I suppose the introduced errors are transient and some sectors complete
IO successfully between errors, right?  As long as the drive responds to
the recovery actions (it provides its signature on reset, returns ID data
on IDENTIFY and responds to SETFEATURES), libata assumes the error
condition is transient and lets the drive continue operating.
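
The rule described above can be modelled roughly as follows.  This is
only an illustrative user-space sketch in Python, not libata code: the
names RecoveryResult and keep_device are hypothetical, and the three
fields simply mirror the three recovery steps listed in the paragraph.

```python
from dataclasses import dataclass


@dataclass
class RecoveryResult:
    """Outcome of one recovery attempt (hypothetical model, not a libata type)."""
    signature_on_reset: bool  # device presented its signature after reset
    identify_ok: bool         # IDENTIFY returned ID data
    setfeatures_ok: bool      # device responded to SETFEATURES


def keep_device(result: RecoveryResult) -> bool:
    """Return True if the device would be left in service.

    The error is treated as transient whenever every recovery step got a
    response; how often the device has failed before is not considered.
    """
    return (result.signature_on_reset
            and result.identify_ok
            and result.setfeatures_ok)
```

Note that nothing in this model looks at error frequency, which is why
a drive that keeps failing but always recovers is never dropped.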

So, no, libata won't drop a drive unless it fails to respond to the
recovery sequence.  libata just doesn't have enough information about how
devices are used to determine whether a device is failing too often to be
useful.  For example, there is a very big difference between a hard drive
serving rootfs by itself and a drive in an md array with several spares.

> I plan to look at the state of the drive with an analyzer tomorrow to 
> make sure that the drive is not holding the bus or something & try your 
> latest "new init" git tree code.

The new init stuff won't change anything regarding this.

> What it looks like is a soft hang - maybe the box is stuck in 
> ata_port_wait_eh() which never seems to timeout on a bus that does not 
> recover?

It seems like we need a separate mechanism here to implement policy for
longer-term handling of frequently failing devices.  Providing some
monitoring sysfs nodes should probably do it: some error history with a
record of recovery times and so on, so that a user-space management
process can decide to pull the plug if that seems appropriate.
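
To make the idea concrete, here is a minimal sketch of what such a
management process might do with an error history.  Everything in it is
an assumption for illustration: the sysfs nodes do not exist yet, so the
sketch takes events directly instead of reading them from anywhere, and
ErrorEvent, FailurePolicy and the threshold parameters are hypothetical
names.

```python
from collections import deque
from dataclasses import dataclass


@dataclass
class ErrorEvent:
    timestamp: float      # seconds; when the error was observed
    recovery_time: float  # seconds the recovery took


class FailurePolicy:
    """Sliding-window policy over a device's recent error history (hypothetical)."""

    def __init__(self, window: float, max_errors: int, max_recovery: float):
        self.window = window              # how far back to look, in seconds
        self.max_errors = max_errors      # tolerated errors inside the window
        self.max_recovery = max_recovery  # tolerated total recovery seconds
        self.events = deque()

    def record(self, event: ErrorEvent) -> None:
        """Add an event and forget events that fell out of the window."""
        self.events.append(event)
        cutoff = event.timestamp - self.window
        while self.events and self.events[0].timestamp < cutoff:
            self.events.popleft()

    def should_drop(self) -> bool:
        """True if the device fails too often or spends too long recovering."""
        total_recovery = sum(e.recovery_time for e in self.events)
        return (len(self.events) > self.max_errors
                or total_recovery > self.max_recovery)
```

The thresholds are where the usage context comes in: a member of an md
array with spares could be given a small max_errors, while a lone drive
serving rootfs would get much more lenient values, since dropping it
takes the system down.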

Thanks.

-- 
tejun
