* When does a disk get flagged as bad?
From: Alberto Alonso @ 2007-05-25 4:16 UTC
To: linux-raid
OK, let's see if I can understand how a disk gets flagged
as bad and removed from an array. I was under the impression
that any read or write operation failure flags the drive as
bad and it gets removed automatically from the array.
However, as I indicated in a prior post, I am having problems
where the array is never degraded. Does an error of type:
end_request: I/O error, dev sdb, sector ....
not count as a read/write error?
Thanks,
Alberto
--
Alberto Alonso Global Gate Systems LLC.
(512) 351-7233 http://www.ggsys.net
Hardware, consulting, sysadmin, monitoring and remote backups
* Re: When does a disk get flagged as bad?
From: Mike Accetta @ 2007-05-31 2:28 UTC
To: Alberto Alonso; +Cc: linux-raid
Alberto Alonso writes:
> OK, lets see if I can understand how a disk gets flagged
> as bad and removed from an array. I was under the impression
> that any read or write operation failure flags the drive as
> bad and it gets removed automatically from the array.
>
> However, as I indicated in a prior post I am having problems
> where the array is never degraded. Does an error of type:
> end_request: I/O error, dev sdb, sector ....
> not count as a read/write error?
I was also under the impression that any read or write error would
fail the drive out of the array, but some recent experiments with error
injection seem to indicate otherwise, at least for raid1. My working
hypothesis is that only write errors fail the drive; read errors appear
to just redirect the sector to a different mirror.
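To put the hypothesis in rough code form (purely illustrative user-space
C, not the actual md/raid1 logic -- every name below is invented):

    /* Sketch of the hypothesized raid1 policy: write errors fail the
     * member, read errors are merely redirected to another mirror.
     * Illustrative only -- not the real md code; all names invented. */
    enum io_kind { IO_READ, IO_WRITE };

    struct member { int faulty; };

    static void redirect_read_to_other_mirror(struct member *m)
    {
        (void)m;   /* the real code would retry the read on a healthy mirror */
    }

    static void handle_member_error(struct member *m, enum io_kind kind)
    {
        if (kind == IO_WRITE)
            m->faulty = 1;                       /* write error: fail the member */
        else
            redirect_read_to_other_mirror(m);    /* read error: member stays in */
    }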
I actually ran across what looks like a bug in the raid1
recovery/check/repair read error logic that I posted about
last week but which hasn't generated any response yet (cf.
http://article.gmane.org/gmane.linux.raid/15354). This bug results in
sending a zero-length write request down to the underlying device driver.
A consequence of issuing a zero-length write is that it fails at the
device level, which raid1 sees as a write failure, which then fails the
array. The fix I proposed actually has the effect of *not* failing the
array in this case since the spurious failing write is never generated.
I'm not sure what is actually supposed to happen in this case. Hopefully,
someone more knowledgeable will comment soon.
--
Mike Accetta
ECI Telecom Ltd.
Data Networking Division (previously Laurel Networks)
* Re: When does a disk get flagged as bad?
From: Alberto Alonso @ 2007-05-31 2:49 UTC
To: linux-raid
On Wed, 2007-05-30 at 22:28 -0400, Mike Accetta wrote:
> Alberto Alonso writes:
> > OK, lets see if I can understand how a disk gets flagged
> > as bad and removed from an array. I was under the impression
> > that any read or write operation failure flags the drive as
> > bad and it gets removed automatically from the array.
> >
> > However, as I indicated in a prior post I am having problems
> > where the array is never degraded. Does an error of type:
> > end_request: I/O error, dev sdb, sector ....
> > not count as a read/write error?
>
> I was also under the impression that any read or write error would
> fail the drive out of the array but some recent experiments with error
> injecting seem to indicate otherwise at least for raid1. My working
> hypothesis is that only write errors fail the drive. Read errors appear
> to just redirect the sector to a different mirror.
>
> I actually ran across what looks like a bug in the raid1
> recovery/check/repair read error logic that I posted about
> last week but which hasn't generated any response yet (cf.
> http://article.gmane.org/gmane.linux.raid/15354). This bug results in
> sending a zero length write request down to the underlying device driver.
> A consequence of issuing a zero length write is that it fails at the
> device level, which raid1 sees as a write failure, which then fails the
> array. The fix I proposed actually has the effect of *not* failing the
> array in this case since the spurious failing write is never generated.
> I'm not sure what is actually supposed to happen in this case. Hopefully,
> someone more knowledgeable will comment soon.
> --
> Mike Accetta
I was starting to think that nobody got my posts; I know there are
plenty of people on this list who understand RAID, yet I didn't get any
answers to any of my related posts.
After thinking about your post, I guess I can see some logic behind
not failing on the read, although I would say that after a certain
number of read failures a drive should be kicked out no matter what.
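To make the kind of policy I mean concrete: something like a per-member
error counter with a cut-off. This is a purely illustrative sketch, not
anything that exists in md as far as I know, and the threshold is made up:

    /* Illustrative only: kick a member out after too many read errors,
     * even if each individual error was recovered from another mirror. */
    #define MAX_TOLERATED_READ_ERRORS 20   /* arbitrary threshold */

    struct flaky_member {
        int read_errors;
        int faulty;
    };

    static void note_corrected_read_error(struct flaky_member *m)
    {
        if (++m->read_errors > MAX_TOLERATED_READ_ERRORS)
            m->faulty = 1;   /* enough is enough: fail the drive anyway */
    }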
In my case I believe the errors are during writes, which is still
confusing.
Unfortunately, I've never written any kind of disk I/O code, so I am
afraid of looking at the code and getting completely lost.
Alberto
* Re: When does a disk get flagged as bad?
From: Neil Brown @ 2007-05-31 6:10 UTC
To: Alberto Alonso; +Cc: linux-raid
On Wednesday May 30, alberto@ggsys.net wrote:
>
> After thinking about your post, I guess I can see some logic behind
> not failing on the read, although I would say that after x amount of
> read failures a drive should be kicked out no matter what.
When md gets a read error, it collects the correct data from elsewhere
and tries to write it to the drive that apparently failed.
If that succeeds, it tries to read it back again. If that succeeds as
well, it assumes that the problem has been fixed. Otherwise it fails
the drive.
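A toy model of that sequence, just to make the order of operations
explicit (a user-space sketch with invented names, not the actual md
source):

    /* Read error handling as described above: rewrite the bad sector with
     * good data from another mirror, read it back, and fail the member
     * only if the rewrite or the verify read fails. */
    struct disk { int faulty; };

    static int fix_read_error(struct disk *d, long long sector,
                              const char *good_data,
                              int (*write_sector)(struct disk *, long long, const char *),
                              int (*read_sector)(struct disk *, long long, char *))
    {
        char verify[512];

        if (write_sector(d, sector, good_data) != 0 ||
            read_sector(d, sector, verify) != 0) {
            d->faulty = 1;     /* rewrite or re-read failed: fail the drive */
            return -1;
        }
        return 0;              /* assume the sector has been repaired */
    }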
NeilBrown
* Re: When does a disk get flagged as bad?
From: Bill Davidsen @ 2007-06-02 0:07 UTC
To: Neil Brown; +Cc: Alberto Alonso, linux-raid
Neil Brown wrote:
> On Wednesday May 30, alberto@ggsys.net wrote:
>
>> After thinking about your post, I guess I can see some logic behind
>> not failing on the read, although I would say that after x amount of
>> read failures a drive should be kicked out no matter what.
>>
>
> When md gets a read error, it collects the correct data from elsewhere
> and tries to write it to the drive that apparently failed.
> If that succeeds, it tries to read it back again. If that succeeds as
> well, it assumes that the problem has been fixed. Otherwise it fails
> the drive.
>
That's the way it should work, but hopefully the error is logged
somewhere other than as a SMART relocation?
--
bill davidsen <davidsen@tmr.com>
CTO TMR Associates, Inc
Doing interesting things with small computers since 1979
* Re: When does a disk get flagged as bad?
From: Alberto Alonso @ 2007-06-02 15:50 UTC
To: Neil Brown; +Cc: linux-raid
So, what kind of error is:
end_request: I/O error, dev sdb, sector 42644555
end_request: I/O error, dev sdb, sector 124365763
...
I am still trying to figure out why errors like that just make one of
my servers unresponsive.
Is there a way to have the md code kick that drive out of the array?
The datacenter people are starting to get impatient, having to reboot
it every other day.
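(As an aside, independent of whatever md does automatically, a suspect
member can normally be forced out by hand with mdadm. The array and
partition names below are only examples and need to match the actual
setup:)

    # mark the member faulty, then remove it from the array
    mdadm /dev/md0 --fail /dev/sdb1
    mdadm /dev/md0 --remove /dev/sdb1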
Thanks,
Alberto
On Thu, 2007-05-31 at 16:10 +1000, Neil Brown wrote:
> On Wednesday May 30, alberto@ggsys.net wrote:
> >
> > After thinking about your post, I guess I can see some logic behind
> > not failing on the read, although I would say that after x amount of
> > read failures a drive should be kicked out no matter what.
>
> When md gets a read error, it collects the correct data from elsewhere
> and tries to write it to the drive that apparently failed.
> If that succeeds, it tries to read it back again. If that succeeds as
> well, it assumes that the problem has been fixed. Otherwise it fails
> the drive.
>
>
> NeilBrown
--
Alberto Alonso Global Gate Systems LLC.
(512) 351-7233 http://www.ggsys.net
Hardware, consulting, sysadmin, monitoring and remote backups