linux-raid.vger.kernel.org archive mirror
* Busted disks caused healthy ones to fail
@ 2004-12-14  6:42 comsatcat
  2004-12-14  6:55 ` Guy
  0 siblings, 1 reply; 16+ messages in thread
From: comsatcat @ 2004-12-14  6:42 UTC (permalink / raw)
  To: linux-raid

An odd thing happened this weekend.  We were doing some heavy I/O when
one of our servers had two drives in two separate raid1 mirrors pop.
That was not odd, as these drives are old and the batch they came from
has been failing on other boxen as well.  What is odd is that our brand
new disks, which the OS resides on (2 drives in raid 1), half busted.

There are 4 md devices

md/0  
md/1
md/2
md/3

md3, md2, and md1 all lost the 2nd drive in the array (sdh3, sdh6, and
sdh5).  md0, however, was fine, and sdh1 reported no errors.  Why would
losing disks cause a seemingly healthy disk to go astray?

P.S. I have pulled tons of syslog entries showing the two bad disks
failing, if that would help.


Thanks,
Ben


^ permalink raw reply	[flat|nested] 16+ messages in thread

* RE: Busted disks caused healthy ones to fail
  2004-12-14  6:42 Busted disks caused healthy ones to fail comsatcat
@ 2004-12-14  6:55 ` Guy
  2004-12-14  8:28   ` comsatcat
  0 siblings, 1 reply; 16+ messages in thread
From: Guy @ 2004-12-14  6:55 UTC (permalink / raw)
  To: comsatcat, linux-raid

Did the disks that failed have anything in common?

SCSI:
If you have disks on 1 SCSI bus, a single failed disk can affect other
disks.  By removing the bad disk you correct the problems with the others.

IDE:  (or whatever they call it today)
With 2 disks on 1 bus, 1 drive failing will cause the other to fail most
of the time.

Power supply:
If you have external disks, they will have their own power supply.  If you
have problems with that power supply, they could all be affected.  Even a
shared power cable can cause multi-drive failures.

Temperature:
Disks getting too hot can cause failures.

Kids:
Someone turned the disk cabinet off?

I am sure this list is not complete.  But it may help.

Guy



^ permalink raw reply	[flat|nested] 16+ messages in thread

* RE: Busted disks caused healthy ones to fail
  2004-12-14  6:55 ` Guy
@ 2004-12-14  8:28   ` comsatcat
  2004-12-14 14:11     ` Michael Stumpf
  2004-12-14 15:22     ` Guy
  0 siblings, 2 replies; 16+ messages in thread
From: comsatcat @ 2004-12-14  8:28 UTC (permalink / raw)
  To: Guy; +Cc: linux-raid

The two disks that were actually dead were both on a different bus.  The
OS disk that died was on scsi0.

Is there a way around this behavior (i.e., kernel params that can be
adjusted, such as timeout values and queuing)?  The system never really
recovered correctly after the disks died; a manual reboot was required.
Applications which were using the failed devices would hang forever (I'm
assuming they were waiting for queued commands to complete).

IDE: not in use
Power: 14 internal drives, no external
Temp: just fine
Kids: Upstairs taking tech calls.


Thanks,
Ben
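
For reference, a minimal sketch of where those knobs live on a sysfs-enabled
(2.6-era or later) kernel.  The device names, timeout, and queue depth below
are illustrative assumptions, not this box's actual layout, and note that
raising the timeout only delays the error report; it does not stop md from
kicking a member once the low-level driver gives up.

#!/usr/bin/env python3
# Hypothetical sketch: inspect and raise the SCSI command timeout (and cap
# the queue depth) for the surviving members of a degraded array.
from pathlib import Path

DISKS = ["sda", "sdh"]        # assumed device names
NEW_TIMEOUT_SECS = 60         # assumed value; the default is commonly 30
NEW_QUEUE_DEPTH = 4           # assumed value; shortens the pile of queued commands

for disk in DISKS:
    dev = Path(f"/sys/block/{disk}/device")
    timeout = dev / "timeout"
    depth = dev / "queue_depth"
    if not timeout.exists():
        print(f"{disk}: no sysfs timeout attribute (older kernel?)")
        continue
    print(f"{disk}: timeout was {timeout.read_text().strip()}s")
    timeout.write_text(str(NEW_TIMEOUT_SECS))      # needs root
    if depth.exists():
        print(f"{disk}: queue_depth was {depth.read_text().strip()}")
        depth.write_text(str(NEW_QUEUE_DEPTH))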




^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: Busted disks caused healthy ones to fail
  2004-12-14  8:28   ` comsatcat
@ 2004-12-14 14:11     ` Michael Stumpf
  2004-12-14 22:34       ` comsatcat
  2004-12-14 15:22     ` Guy
  1 sibling, 1 reply; 16+ messages in thread
From: Michael Stumpf @ 2004-12-14 14:11 UTC (permalink / raw)
  To: comsatcat, linux-raid

14 internal drives on a single power supply plus the mb/cpu/etc?  Oy;
I've got 15 plus a P2-400 spinning between 2 550W power supplies, and I'm
worried it is getting overloaded.  I might be paranoid, but I had some
flakiness that was pretty much impossible to debug, so I took broad
steps and overestimated.  I figured that maybe a heavily loaded supply
could hiccup under an unusual condition if too many drives were attached
to one.  And, while anecdotal, my once-a-month drive hiccup (requiring a
re-add to the array, nothing else) did go away when I added a power
supply.
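
A minimal sketch of that "re-add to array, nothing else" step, assuming a
raid1 member that was kicked but is physically healthy; the array and
partition names are placeholders, not anyone's real layout.

#!/usr/bin/env python3
# Hypothetical sketch: put a kicked-but-healthy member back into an md
# mirror with mdadm and let it resync against the surviving half.
import subprocess

ARRAY = "/dev/md1"      # placeholder array
MEMBER = "/dev/sdh5"    # placeholder partition that was marked faulty

# Clear the faulty slot first, then add the device back; md resyncs it.
subprocess.run(["mdadm", ARRAY, "--remove", MEMBER], check=True)
subprocess.run(["mdadm", ARRAY, "--add", MEMBER], check=True)
subprocess.run(["mdadm", "--detail", ARRAY], check=True)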





^ permalink raw reply	[flat|nested] 16+ messages in thread

* RE: Busted disks caused healthy ones to fail
  2004-12-14  8:28   ` comsatcat
  2004-12-14 14:11     ` Michael Stumpf
@ 2004-12-14 15:22     ` Guy
  2004-12-14 20:13       ` Brad Campbell
  1 sibling, 1 reply; 16+ messages in thread
From: Guy @ 2004-12-14 15:22 UTC (permalink / raw)
  To: comsatcat; +Cc: linux-raid

14 drives in 1 case?  That's a big box!

Did you ask your kids for help?  :)

Guy



^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: Busted disks caused healthy ones to fail
  2004-12-14 15:22     ` Guy
@ 2004-12-14 20:13       ` Brad Campbell
  2004-12-14 21:47         ` Guy
  2004-12-14 21:49         ` Jim Paris
  0 siblings, 2 replies; 16+ messages in thread
From: Brad Campbell @ 2004-12-14 20:13 UTC (permalink / raw)
  To: Guy; +Cc: comsatcat, linux-raid

Guy wrote:
> 14 drives in 1 case?  That's a big box!
> 

It's not that hard.
I have 4 drives loaded in the rear bays and 2 x 5-way SATA hotswap bays in the 6 front 5.25-inch
bays: 14 drives. And yes, they are on a single 420W PSU along with the motherboard, an Athlon XP
2600+, and 5 80mm fans. Not much else though.

I'm working my way towards a 15-drive box now. Just waiting for the new enclosures to arrive.

-- 
Brad
                    /"\
Save the Forests   \ /     ASCII RIBBON CAMPAIGN
Burn a Greenie.     X      AGAINST HTML MAIL
                    / \

^ permalink raw reply	[flat|nested] 16+ messages in thread

* RE: Busted disks caused healthy ones to fail
  2004-12-14 20:13       ` Brad Campbell
@ 2004-12-14 21:47         ` Guy
  2004-12-14 23:54           ` Alvin Oga
  2004-12-14 21:49         ` Jim Paris
  1 sibling, 1 reply; 16+ messages in thread
From: Guy @ 2004-12-14 21:47 UTC (permalink / raw)
  To: 'Brad Campbell'; +Cc: comsatcat, linux-raid

My disk drives are rated to use 19.1 watts while active.
My disk drives also get very hot without forced air movement.
14 such disks use 267.4 watts.  That would give you 152.6 watts for
everything else.  I bet your disks use less power than mine.  I agree with
Michael: you may be exceeding the power rating of your power supply.

Also, I would not want to push a power supply to the max rating.  It should
be sized 50 watts or more beyond what you need.

Guy



^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: Busted disks caused healthy ones to fail
  2004-12-14 20:13       ` Brad Campbell
  2004-12-14 21:47         ` Guy
@ 2004-12-14 21:49         ` Jim Paris
  2004-12-14 22:13           ` Guy
  2004-12-15  4:46           ` Brad Campbell
  1 sibling, 2 replies; 16+ messages in thread
From: Jim Paris @ 2004-12-14 21:49 UTC (permalink / raw)
  To: Brad Campbell; +Cc: Guy, comsatcat, linux-raid

> It's not that hard.
> I have 4 drives loaded in the rear bays and 2 x 5 Way SATA Hotswap bays in 
> the 6 front 5.25 inch bays. 14 Drives. And yes, they are on a single 420w 
> PSU along with the motherboard, Athlon XP 2600+. and 5 80mm fans. Not much 
> else though.

!!!!!!  Holy crap!

Let's pick a random typical hard drive, a Seagate 120GB SATA:
http://www.mittoni.com.au/catalog/product_info.php/products_id/1690
It lists maximum current draw as 2.8 A on the +12V line.
Multiply that by 14 drives and we get __39.2 amps__.

Now let's pick a random 420W power supply:
http://www.newegg.com/app/viewproductdesc.asp?submit=Go&description=N82E16817103445
Note how its +12V output is rated for only __15 amps__.

Your numbers might differ a bit.  But it is NO surprise that your
drives are failing.  The surprising part is that they and your power
supply have worked this long.

-jim
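
Spelling out that arithmetic with the figures quoted above (these are the
thread's example numbers, not measurements from any particular box):

# Back-of-envelope +12V budget using the example figures quoted above.
DRIVES = 14
SPINUP_AMPS_12V = 2.8      # the example Seagate 120GB SATA's max +12V draw
RAIL_RATING_AMPS = 15.0    # the example 420W supply's +12V rating

demand = DRIVES * SPINUP_AMPS_12V
print(f"simultaneous spin-up: {demand:.1f} A against a {RAIL_RATING_AMPS:.0f} A rail")
# -> 39.2 A against a 15 A rail, roughly 2.6x over its rating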

^ permalink raw reply	[flat|nested] 16+ messages in thread

* RE: Busted disks caused healthy ones to fail
  2004-12-14 21:49         ` Jim Paris
@ 2004-12-14 22:13           ` Guy
  2004-12-15  4:46           ` Brad Campbell
  1 sibling, 0 replies; 16+ messages in thread
From: Guy @ 2004-12-14 22:13 UTC (permalink / raw)
  To: 'Jim Paris', 'Brad Campbell'; +Cc: comsatcat, linux-raid

If all of my disks were to start at the same time (assuming I had 14), they
would require 483 watts while starting up!

My disks use 2.5 amps on the 12V line at startup, but only .98 amps when
idle, plus .90 amps on the 5V line.  I have my system delay the starting
of the disks based on SCSI ID, so only one starts at a time.  If your disks
all start at the same time, then OUCH!  How can it work?

You should find the specs on your disks and do some math just to be sure.

If you want to compare, specs for my disks:
http://www.seagate.com/support/disc/specs/scsi/st118202lc.html

Guy
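
Doing that math with the figures above (2.5 A on +12V and .90 A on +5V at
start-up, .98 A on +12V at idle) and a 14-drive count borrowed from earlier
in the thread; illustrative only:

# Spin-up wattage: everything at once versus staggered by SCSI ID.
DRIVES = 14
SPINUP_W = 2.5 * 12 + 0.90 * 5     # ~34.5 W per disk while spinning up
IDLE_W = 0.98 * 12 + 0.90 * 5      # ~16.3 W per disk once it is spinning

print(f"all at once: {DRIVES * SPINUP_W:.0f} W")                      # ~483 W
print(f"staggered:   {SPINUP_W + (DRIVES - 1) * IDLE_W:.0f} W peak")  # ~246 W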



^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: Busted disks caused healthy ones to fail
  2004-12-14 14:11     ` Michael Stumpf
@ 2004-12-14 22:34       ` comsatcat
  0 siblings, 0 replies; 16+ messages in thread
From: comsatcat @ 2004-12-14 22:34 UTC (permalink / raw)
  To: mjstumpf; +Cc: linux-raid

Not 1; 3 power supplies.



^ permalink raw reply	[flat|nested] 16+ messages in thread

* RE: Busted disks caused healthy ones to fail
  2004-12-14 21:47         ` Guy
@ 2004-12-14 23:54           ` Alvin Oga
  2004-12-15  1:03             ` Guy
  0 siblings, 1 reply; 16+ messages in thread
From: Alvin Oga @ 2004-12-14 23:54 UTC (permalink / raw)
  To: Guy; +Cc: 'Brad Campbell', comsatcat, linux-raid



On Tue, 14 Dec 2004, Guy wrote:

> My disk drives are rated to use 19.1 watts while active.
> My disk drives also get very hot without forced air movement.
> 14 such disks use 267.4 watts.  That would give you 152.6 watts for
> everything else.  I bet your disks use less power than mine.  I agree with
> Michael, you may be exceeding the power rating of your power supply.

an overworked power supply will fail "faster"

for "wattage" ... using a sharp needle, "pop" ... that is NOT how "wattage
works"

	- for a given voltage, you should be operating at 1/2 of its rated
	amperage

	- at 12v ... what are the disks rated at for power up ( spinup )
	vs ambient normal operation
		- typically 1A to power up and 0.5A for normal spinning
		operation

	for a power supply rated at 12V for 10A .. you can run 5 disks

	if you run more disks ... your power supply will die faster
	and/or your data gets corrupted during power up when the
	system and the drives think you're in "normal operation"
	and enable writes while, in fact, it is still in its bootup

> Also, I would not want to push a power supply to the max rating.  It should
> be sized 50 watts or more beyond what you need.

make that 2x the total wattage needed by the system .. NOT 50W
	- add up the amps needed on the 3.3V, 5V, and 12V lines
	- use the MAX needed for operation, which is the power up sequence,
	rather than some "random number" of operating current ..

	Total watts == 12 * 12V(maxcurrent) + 5 * 5V(maxcurrent)
			+ 3.3 * 3.3V(maxcurrent)

	- normal operation usually does NOT need as much current

- switched power supplies can output higher current than they are rated
  at, but they will not be able to sustain that "extra current load"
  for more than a few seconds before the over-current circuitry kicks
  in to shut the supply down

	- you will be bouncing up and down in current on each power
	line till the system is all booted

	- put a digital storage scope on the 12V and 5V and 3.3V line
	and an ampmeter on each power line and watch it go bonkers

c ya
alvin
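
Alvin's sizing rule written out; the rail currents passed in the example
call are placeholders, not a real parts list.

# Worst-case wattage across the three rails, then ~2x headroom on top.
def psu_watts(amps_12v, amps_5v, amps_3v3, headroom=2.0):
    total = 12.0 * amps_12v + 5.0 * amps_5v + 3.3 * amps_3v3
    return total, total * headroom

needed, recommended = psu_watts(amps_12v=25.0, amps_5v=15.0, amps_3v3=10.0)
print(f"worst-case draw ~{needed:.0f} W, so shop for roughly {recommended:.0f} W")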


^ permalink raw reply	[flat|nested] 16+ messages in thread

* RE: Busted disks caused healthy ones to fail
  2004-12-14 23:54           ` Alvin Oga
@ 2004-12-15  1:03             ` Guy
  2004-12-15  1:23               ` Alvin Oga
  0 siblings, 1 reply; 16+ messages in thread
From: Guy @ 2004-12-15  1:03 UTC (permalink / raw)
  To: 'Alvin Oga'; +Cc: 'Brad Campbell', comsatcat, linux-raid

Just an FYI...
My disks take 2.5 amps on the 12V line (30 watts) to start.
Someone else said his require 2.9 amps (34.8 watts).
The above does not include the 5V line.

2x the startup wattage or amperage seems excessive.
The power supply would be delivering 50% of its rating during startup, and
25% while the system is in use.  Seems wrong to me.

Guy



^ permalink raw reply	[flat|nested] 16+ messages in thread

* RE: Busted disks caused healthy ones to fail
  2004-12-15  1:03             ` Guy
@ 2004-12-15  1:23               ` Alvin Oga
  0 siblings, 0 replies; 16+ messages in thread
From: Alvin Oga @ 2004-12-15  1:23 UTC (permalink / raw)
  To: Guy; +Cc: 'Alvin Oga', 'Brad Campbell', comsatcat,
	linux-raid



On Tue, 14 Dec 2004, Guy wrote:

> Just an FYI...
> My disks take 2.5 amps on the 12V line (30 watts) to start.
> Someone else said his require 2.9 amps (34.8 watts).

the max rated startup current is NOT necessarily the real
current
	- use a digital storage scope to get "REAL" numbers

> The above does not include the 5V line.
> 
> 2 X startup wattage or amperage seems excessive.

comes from the last 50 years of "rule of thumb",
when things were not as reliable.

another rule of thumb ...

	- things will die 2x as fast if the average sustained operating
	temp goes up by 10C

	- it works out just about right .. when you take the
	1,000,000 hour MTBF and factor in all the temps and
	24x7 operation and you find out that the little disk
	dies 1 month after the warranty expired :-)

c ya
alvin
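
That rule of thumb as a quick calculation; the 25 C rating point and 45 C
operating temperature are assumed numbers for illustration.

# "Every 10 C above the rating point roughly halves the expected life."
def derated_mtbf(rated_hours, rated_temp_c, actual_temp_c):
    return rated_hours / (2 ** ((actual_temp_c - rated_temp_c) / 10.0))

hours = derated_mtbf(rated_hours=1_000_000, rated_temp_c=25, actual_temp_c=45)
print(f"~{hours:,.0f} h left of that 1,000,000 hour MTBF, "
      f"about {hours / (24 * 365):.0f} years of 24x7 running")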


^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: Busted disks caused healthy ones to fail
  2004-12-14 21:49         ` Jim Paris
  2004-12-14 22:13           ` Guy
@ 2004-12-15  4:46           ` Brad Campbell
  2004-12-15  5:04             ` Guy
  1 sibling, 1 reply; 16+ messages in thread
From: Brad Campbell @ 2004-12-15  4:46 UTC (permalink / raw)
  To: Jim Paris; +Cc: Guy, comsatcat, linux-raid

Jim Paris wrote:
>>It's not that hard.
>>I have 4 drives loaded in the rear bays and 2 x 5 Way SATA Hotswap bays in 
>>the 6 front 5.25 inch bays. 14 Drives. And yes, they are on a single 420w 
>>PSU along with the motherboard, Athlon XP 2600+. and 5 80mm fans. Not much 
>>else though.
> 
> 
> !!!!!!  Holy crap!
> 
> Let's pick a random typical hard drive, a Seagate 120GB SATA:
> http://www.mittoni.com.au/catalog/product_info.php/products_id/1690
> It lists maximum current draw as 2.8 A on the +12V line.
> Multiply that by 14 drives and we get __39.2 amps__.

Now, let's actually pick my hard drives, shall we?

Max current draw on the 12V line is 1.56A at spinup, dropping to 600mA at seek and 556mA at idle.

So the _worst_ case is 21.84A for about 2 seconds (which does actually exceed the PSU rating by
nearly 3 amps). This machine only gets power cycled about once every three months, and I did
monitor the 12V rail with a CRO to check specs and ripple, and they never budged.

Worst case running load is 8.4A, which leaves ~10A on my 12V rail for my motherboard. Ample.


> Now let's pick a random 420W power supply:
> http://www.newegg.com/app/viewproductdesc.asp?submit=Go&description=N82E16817103445
> Note how it's +12V output is rated for only __15 amps__.

Now let's pick my power supply: http://www.wasp.net.au/~brad/p1000256.jpg

So yes, on spinup I'm exceeding my 12V rail by 3 amps for about 1.5 seconds (which this supply has
amply proven itself capable of handling). Outside of that, I don't see an issue.

> Your numbers might differ a bit.  But it is NO surprise that your
> drives are failing.  The surprising part is that they and your power
> supply have worked this long.

I never said anything about failing disks! In fact, if you check back you will see me commenting
that I have a bucketload of Maxtor Maxline-II drives in there that have been flawless to date. (In
fact, I have just ordered 25 more, 15 for me and 10 for a mate. That should increase the sample a
little.) They all sit below 40 degrees C and the PSU remains quite cool. (I'm an electronics
technician by trade and have several thermocouples I use to verify measurements.)

Here is the reason the drives stay nice and cool. http://www.wasp.net.au/~brad/p1000250.jpg

-- 
Brad
                    /"\
Save the Forests   \ /     ASCII RIBBON CAMPAIGN
Burn a Greenie.     X      AGAINST HTML MAIL
                    / \

^ permalink raw reply	[flat|nested] 16+ messages in thread

* RE: Busted disks caused healthy ones to fail
  2004-12-15  4:46           ` Brad Campbell
@ 2004-12-15  5:04             ` Guy
  2004-12-15  5:22               ` Brad Campbell
  0 siblings, 1 reply; 16+ messages in thread
From: Guy @ 2004-12-15  5:04 UTC (permalink / raw)
  To: 'Brad Campbell', 'Jim Paris'; +Cc: comsatcat, linux-raid

Maxtor drives....and no problems....You must be crazy!  :)

Well then, uh, it could...., humm....  That only leaves crazy!  :)

I give up!  You defended yourself well.  I have no idea.

Guy



^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: Busted disks caused healthy ones to fail
  2004-12-15  5:04             ` Guy
@ 2004-12-15  5:22               ` Brad Campbell
  0 siblings, 0 replies; 16+ messages in thread
From: Brad Campbell @ 2004-12-15  5:22 UTC (permalink / raw)
  To: Guy; +Cc: 'Jim Paris', comsatcat, linux-raid

Guy wrote:
> Maxtor drives....and no problems....You must be crazy!  :)
> 
> Well then, uh, it could...., humm....  That only leaves crazy!  :)
> 
> I give up!  You defended yourself well.  I have no idea.
> 

Note that it was not me who had the failing disk in the first place. I was just responding to the
comment that 14 drives in a case sounded ludicrous. If the setup is reasonably well thought out and
well cooled, I see no issue. Yes, I do exceed my PSU's continuous rating every time I spin up, but
as I said, I timed that overload at about 1.5 seconds and the PSU does not make any form of
complaint.

I hope I'm more than just lucky with the Maxtor drives, but honestly I have had just as bad a run
with every brand except Quantum (and who owns them now?). I figure that by keeping the drives as
cool as practical, in a stable environment with minimal temperature fluctuation and power cycling
(this machine is a 24x7 server), I'm probably fairly likely to get a better than average lifetime.

Sure, with 29 identical Maxtor drives I expect failures. I have a cold spare on standby just in
case, and the new 15-drive box will run RAID-6. In addition, this is a home entertainment system;
it's not mission critical. Just a bit of fun on the weekends.

It's also a good test of the md and libata drivers. All up, between the 29 drives I now have 7
Promise SATA150TX4 controllers. Looking forward to hotswap :p)

-- 
Brad
                    /"\
Save the Forests   \ /     ASCII RIBBON CAMPAIGN
Burn a Greenie.     X      AGAINST HTML MAIL
                    / \

^ permalink raw reply	[flat|nested] 16+ messages in thread

end of thread, other threads:[~2004-12-15  5:22 UTC | newest]

Thread overview: 16+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2004-12-14  6:42 Busted disks caused healthy ones to fail comsatcat
2004-12-14  6:55 ` Guy
2004-12-14  8:28   ` comsatcat
2004-12-14 14:11     ` Michael Stumpf
2004-12-14 22:34       ` comsatcat
2004-12-14 15:22     ` Guy
2004-12-14 20:13       ` Brad Campbell
2004-12-14 21:47         ` Guy
2004-12-14 23:54           ` Alvin Oga
2004-12-15  1:03             ` Guy
2004-12-15  1:23               ` Alvin Oga
2004-12-14 21:49         ` Jim Paris
2004-12-14 22:13           ` Guy
2004-12-15  4:46           ` Brad Campbell
2004-12-15  5:04             ` Guy
2004-12-15  5:22               ` Brad Campbell

This is a public inbox; see mirroring instructions
for how to clone and mirror all data and code used for this inbox,
as well as URLs for NNTP newsgroup(s).