* good drive / bad drive (maxtor topic)
@ 2004-11-25 11:46 Mark Klarzynski
2004-11-27 18:06 ` Guy
0 siblings, 1 reply; 6+ messages in thread
From: Mark Klarzynski @ 2004-11-25 11:46 UTC (permalink / raw)
To: linux-raid
In the world of hardware RAID we decide that a drive has failed based on
various criteria, one of which is the obvious 'has the drive responded
within a set time'. This time varies depending on the drive, the
application, the load, etc. In practice this 'timeout' value is
between 6 and 10 seconds. There is no real formula, just lots of
experience. Set it too short and drives will look failed too often; set
it too long and you risk allowing a suspect drive to continue.
Once we detect a timeout we have to decide what to do about it. In SCSI we
issue a SCSI bus reset (a hardware reset on the bus). The reason we do
this (as do all hardware RAID manufacturers) is because life is just that
way: drives do lock up. We issue up to 3 resets, and then fail the drive.
This is extremely effective and does exactly what it is supposed to do.
Often the drive will never cause an issue again; if it is faulty, the
problem will escalate and the drive will fail.
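The escalation policy described above can be sketched in a few lines. This is only an illustration of the logic, not any vendor's firmware; `issue_command` and `bus_reset` are hypothetical stand-ins for the real controller calls:

```python
MAX_RESETS = 3
TIMEOUT_S = 8  # realistic values fall between 6 and 10 seconds

def handle_command(issue_command, bus_reset):
    """Return 'ok' if the command ever completes in time, or 'failed'
    once MAX_RESETS bus resets have been spent without success."""
    for attempt in range(MAX_RESETS + 1):
        if issue_command(timeout=TIMEOUT_S):  # True => completed in time
            return "ok"
        if attempt < MAX_RESETS:
            bus_reset()  # hardware reset of the bus, then retry
    return "failed"
```

A drive that locks up once, gets reset, and then responds normally comes back as "ok" and never causes an issue again; a genuinely faulty drive exhausts the 3 resets and is failed.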
We have used countless SATA drives, and timeouts are by far the most
significant failure we see on SATA (although it is hard to tell much
else on SATA), so it is imperative that the timeout values
are correct for the drive and the application.
But the point is that we do not see anywhere near the failure rates on
the Maxtors that you guys are mentioning. Also, if we trial SATA drives on
different hardware RAIDs we see differing failure rates (i.e. ICP
comes in higher than 3ware, which is higher than the host-independent
RAIDs we have tested, and so on).
So I am wondering if it is worth thinking about the timeout values? And
what do you do once the drive has timed out?
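For what it's worth, on Linux the analogous knob is the SCSI layer's per-device command timer. A minimal sketch of reading and adjusting it, assuming the sd driver's sysfs `timeout` attribute (in seconds); the `base` parameter is only there so the sketch can be exercised outside a real `/sys`, and writing requires root:

```python
from pathlib import Path

SYS_BLOCK = "/sys/block"  # sd devices expose a per-command timeout here

def get_scsi_timeout(dev: str, base: str = SYS_BLOCK) -> int:
    """Read the kernel's per-command timeout, in seconds, for one disk."""
    return int(Path(base, dev, "device", "timeout").read_text())

def set_scsi_timeout(dev: str, seconds: int, base: str = SYS_BLOCK) -> None:
    """Write a new per-command timeout for one disk."""
    Path(base, dev, "device", "timeout").write_text(str(seconds))
```

For example, `set_scsi_timeout("sda", 8)` would bring a disk down into the 6-10 second window discussed above.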
I am seeing some tremendous work going on in this group, and without a
doubt this community is going to propel MD to enterprise-level RAID one
day. So this is honestly meant as constructive criticism, based on way too
many years designing RAID solutions; i.e. I'm not looking to start an
argument, simply offering some information.
^ permalink raw reply [flat|nested] 6+ messages in thread
* RE: good drive / bad drive (maxtor topic)
2004-11-25 11:46 good drive / bad drive (maxtor topic) Mark Klarzynski
@ 2004-11-27 18:06 ` Guy
2004-11-28 6:16 ` Brad Campbell
0 siblings, 1 reply; 6+ messages in thread
From: Guy @ 2004-11-27 18:06 UTC (permalink / raw)
To: 'Mark Klarzynski', linux-raid
It would be handy if someone would do an extended test of all of the disk
drives. Consumer Reports does this type of thing all the time, just not on
disk drives. I don't think they do any extended tests on any computer
hardware. The testing should continue for 5 years. And it could be mostly
automated. No user interaction unless something goes wrong. Now we need
someone with money! :)
Guy
* Re: good drive / bad drive (maxtor topic)
2004-11-27 18:06 ` Guy
@ 2004-11-28 6:16 ` Brad Campbell
2004-11-28 6:56 ` Guy
0 siblings, 1 reply; 6+ messages in thread
From: Brad Campbell @ 2004-11-28 6:16 UTC (permalink / raw)
To: Guy; +Cc: 'Mark Klarzynski', linux-raid
Guy wrote:
> It would be handy if someone would do an extended test of all of the disk
> drives. Consumer Reports does this type of thing all the time, just not on
> disk drives. I don't think they do any extended tests on any computer
> hardware. The testing should continue for 5 years. And it could be mostly
> automated. No user interaction unless something goes wrong. Now we need
> someone with money! :)
>
Problem with this is that by the time the test has any sort of meaningful data, the drives have been
obsoleted/EOL'd.
For the record, I have 12 Maxtor Maxline-II 250GB drives here which have 200 days on the clock and
nary a hiccup. I don't doubt people have trouble with Maxtor drives; hell, prior to the release of
the Maxline-II drives I would not have looked at them sideways: too many previous bad experiences.
Having said that, I have had just as many Seagate drives fail, and I won't touch them either. I had a
bad run with WD and the dodgy firmware that failed when used in a RAID, and I have had a couple of
them fail recently.
The only drives I have had long enough to consider a good sample were Quantum Fireballs, and I ran
those for 5 years with no failures. A friend of mine, however, was not so lucky and had a huge
failure rate with them.
Besides the IBM Deathstar fiasco, there appears to be little rhyme or reason to
patterns of failure. Bad batches, lousy operating conditions, different usage patterns, and bad
wholesaler/transport handling all play a part.
I'm just crossing my fingers and making sure the drives are well cooled, have minimised temperature
cycling (not shutting the box down unless I have to) and are well monitored.
I'm about to purchase another 25 Maxline-II's so that might help with the sample distribution.
(At least Maxtor have a decent RMA process; WD's international RMA process sucks.)
--
Brad
/"\
Save the Forests \ / ASCII RIBBON CAMPAIGN
Burn a Greenie. X AGAINST HTML MAIL
/ \
* RE: good drive / bad drive (maxtor topic)
2004-11-28 6:16 ` Brad Campbell
@ 2004-11-28 6:56 ` Guy
0 siblings, 0 replies; 6+ messages in thread
From: Guy @ 2004-11-28 6:56 UTC (permalink / raw)
To: 'Brad Campbell', linux-raid
I am glad you are having good luck with Maxtor. I hope they and others
improve the life span/quality of the disk drives.
Seems like you will have a good sample, so keep us up to date!
I know TiVo (and/or ReplayTV) use Maxtor, but the drive only has a 30- or 90-day
warranty!
Really risky if you buy the non-transferable (to a new unit) lifetime
subscription. You must read the fine print!
Guy
* good drive / bad drive (maxtor topic)
@ 2004-11-24 17:36 Mark Klarzynski
2004-11-24 22:27 ` David Greaves
0 siblings, 1 reply; 6+ messages in thread
From: Mark Klarzynski @ 2004-11-24 17:36 UTC (permalink / raw)
To: linux-ide
[Same message body as the 2004-11-25 posting to linux-raid above.]
* Re: good drive / bad drive (maxtor topic)
2004-11-24 17:36 Mark Klarzynski
@ 2004-11-24 22:27 ` David Greaves
0 siblings, 0 replies; 6+ messages in thread
From: David Greaves @ 2004-11-24 22:27 UTC (permalink / raw)
To: Mark Klarzynski; +Cc: linux-ide
Well, I don't class a drive as failed unless it fails the drive
manufacturer's test suite.
One other problem with Maxtor is that the test application needs DOS and
won't run on VIA or Silicon Image motherboard controllers, so you'd
better have a test rig with a decent SATA controller in it, just waiting to
test failed Maxtors ;)
Last time a drive failed it had an obvious write problem, so I just sent it back.
Next time I'll see about creating a boot CD with FreeDOS and the Maxtor
test app.
I now have 6 Maxtor drives in my array; one failed, was accepted
as faulty by the vendor, and was replaced.
Another threw a fault that hasn't recurred. Maybe it remapped a block
noisily? Maybe it was a timeout?
It was inconvenient because you go into a 'fragile' mode. Now of course
I have a hot spare, but I'm wondering about raid6.
As for enterprise RAID, have you seen EVMS recently?
I installed and configured it but reverted to raw md/lvm2 after:
* dm bbr claimed that I had 320 bad blocks on *all* my disks - hmmm
* the lvm2 manager wouldn't _quite_ build properly so I had to do it
manually
* the raid5 manager wouldn't let me create a degraded array (which I
needed to do)
And I have enough stability issues without another 'almost there' component.
It's looking damned good though.
sorry for wandering a bit OT :)
David
end of thread, other threads:[~2004-11-28 6:56 UTC | newest]
Thread overview: 6+ messages
2004-11-25 11:46 good drive / bad drive (maxtor topic) Mark Klarzynski
2004-11-27 18:06 ` Guy
2004-11-28 6:16 ` Brad Campbell
2004-11-28 6:56 ` Guy
-- strict thread matches above, loose matches on Subject: below --
2004-11-24 17:36 Mark Klarzynski
2004-11-24 22:27 ` David Greaves