* raid5 - failed disks
@ 2005-04-01 10:08 Alvin Oga
2005-04-01 10:33 ` Frank Wittig
` (2 more replies)
0 siblings, 3 replies; 14+ messages in thread
From: Alvin Oga @ 2005-04-01 10:08 UTC (permalink / raw)
To: linux-raid
hi ya raiders ..
we(they) have 14x 72GB scsi disks config'd as raid5,
( no hot spare .. )
- if 1 disk dies, no problem ... ez to recover
- my dumb question is,
- if 2 disks die at the same time, i
assume the entire raid5 is basically hosed,
in that it won't reassemble and resync from
the point where it last was before the crash ??
- i assume that a similar 2 disk failure
also applies to hw raid controllers, but it'd
be more dependent upon the raid controller's
firmware and its ability to recover from
2 of 14 simultaneous disk failures
( let's say the dell powervault 2205 series )
- i think 4x 300GB ide disks is better ( less likely to fail ?? )
and yes it has already crashed twice with
2 different disks running at 78F at nights
and weekends when the air conditioning is off
- i wish i could just change things but... it's
not yet my call :-)
c ya
alvin
* Re: raid5 - failed disks
2005-04-01 10:08 raid5 - failed disks Alvin Oga
@ 2005-04-01 10:33 ` Frank Wittig
2005-04-01 10:56 ` Alvin Oga
2005-04-01 10:55 ` raid5 - failed disks Andy Smith
2005-04-01 11:05 ` Gordon Henderson
2 siblings, 1 reply; 14+ messages in thread
From: Frank Wittig @ 2005-04-01 10:33 UTC (permalink / raw)
To: Alvin Oga; +Cc: linux-raid
Alvin Oga wrote:
> hi ya raiders ..
>
> we(they) have 14x 72GB scsi disks config'd as raid5,
> ( no hot spare .. )
>
> - if 1 disk dies, no problem ... ez to recover
right. if one dies, raid does exactly what it's supposed to do...
> - my dumb question is,
> - if 2 disks die at the same time, i
if 2 disks fail at the same time your data is lost.
if you have raid5 with 5 hot spares and a second disk dies before a hot
spare is synced into the array (it will be listed as a spare until the sync has
finished), the same applies - data gone.
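as a rough illustration, whether a rebuild is still running - and which
members are only spares - can be read from /proc/mdstat; a minimal perl
sketch (the exact /proc/mdstat layout varies with kernel version, so treat
the parsing as approximate):

#!/usr/bin/perl -w
# rough sketch: list spare members and any resync/recovery in progress
use strict;

open(my $mdstat, '<', '/proc/mdstat') or die "cannot read /proc/mdstat: $!\n";
while (my $line = <$mdstat>) {
    # array status lines look roughly like: "md0 : active raid5 sdc1[2] sdb1[3](S) ..."
    if ($line =~ /^(md\d+)\s*:\s*(\S+)\s+(\S+)\s+(.*)$/) {
        my ($dev, $state, $level, $members) = ($1, $2, $3, $4);
        my @spares = ($members =~ /(\S+)\(S\)/g);   # members flagged (S) are spares
        print "$dev ($level, $state) spares: @spares\n" if @spares;
    }
    # a rebuild shows up as a progress line containing "resync" or "recovery"
    if ($line =~ /(resync|recovery)\s*=\s*([\d.]+)%/) {
        print "rebuild in progress: $1 at $2%\n";
    }
}
close($mdstat);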
> - i think 4x 300GB ide disks is better ( less likely to fail ?? )
it's a simple calculation.
if one disk fails with a probability of 1 in 100,000, then the chance that
one of 10 disks fails is roughly 10 in 100,000 - which means that having
10 disks is about twice as dangerous as having 5 disks.
if i had an array with as many disks as you have i would choose raid
level 6 (not sure if md supports this by now).
raid level 6 survives two disks failing at once.
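to put rough numbers on that scaling, here is a small perl sketch; the
per-disk failure probability p is purely illustrative (not a real MTBF or
AFR figure) and failures are assumed independent:

#!/usr/bin/perl -w
# back-of-the-envelope: chance of losing a raid5 (2+ failures) or raid6 (3+ failures)
use strict;

my $p = 0.01;   # assumed chance that any one disk fails during the window of interest
my $n = 14;     # number of disks in the array

my $p0 = (1 - $p) ** $n;                                       # no failures
my $p1 = $n * $p * (1 - $p) ** ($n - 1);                       # exactly one failure
my $p2 = ($n * ($n - 1) / 2) * $p**2 * (1 - $p) ** ($n - 2);   # exactly two failures

printf "P(at least 1 of %d disks fails)     = %.4f\n", $n, 1 - $p0;
printf "P(at least 2 fail: raid5 data loss) = %.4f\n", 1 - $p0 - $p1;
printf "P(at least 3 fail: raid6 data loss) = %.6f\n", 1 - $p0 - $p1 - $p2;

with p = 0.01 and 14 disks that works out to roughly a 13% chance of at
least one failure, under 1% for the double failure that kills a raid5, and
only a few hundredths of a percent for the triple failure that kills a raid6.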
btw: what reason is there to switch off the air conditioning for your it
equipment? temperature changes are poison to these fragile things.
greetings,
frank
* Re: raid5 - failed disks
2005-04-01 10:08 raid5 - failed disks Alvin Oga
2005-04-01 10:33 ` Frank Wittig
@ 2005-04-01 10:55 ` Andy Smith
2005-04-01 11:04 ` Alvin Oga
2005-04-01 11:05 ` Gordon Henderson
2 siblings, 1 reply; 14+ messages in thread
From: Andy Smith @ 2005-04-01 10:55 UTC (permalink / raw)
To: linux-raid
On Fri, Apr 01, 2005 at 02:08:21AM -0800, Alvin Oga wrote:
>
> hi ya raiders ..
>
> we(they) have 14x 72GB scsi disks config'd as raid5,
> ( no hot spare .. )
This seems like an awful lot of disks to have in a raid 5 with no
hot spares, to me, but then I am fairly new to RAID issues so maybe
I am wrong.. but I would much rather have raid 10.
> - if 1 disk dies, no problem ... ez to recover
>
> - my dumb question is,
> - if 2 disks die at the same time, i
> assume the entire raid5 is basically hosed,
> in that it won't reassemble and resync from
> the point where it last was before the crash ??
Technically it's screwed but it could be possible to recover it with
some losses.. I've fortunately never yet had to do that, maybe
someone who has could answer more fully.
> - i think 4x 300GB ide disks is better ( less likely to fail ?? )
Hard to say.. the typical IDE disk is usually regarded as less
reliable than the typical SCSI disk, and also there are then fewer
spindles per array so the performance may be worse.
I think I would still be happier with 14x72GB SCSI in a RAID-10
(504GB usable) than 14x72GB in RAID-5, although the RAID-10 would
give only a bit more than half the capacity. Also if I felt I
needed the performance of a 14 disk SCSI RAID-5 then I probably
wouldn't want to go down to a 4 disk IDE RAID-5.
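The usable capacities being weighed here are easy to tabulate; a quick perl
sketch using the 14 x 72GB array from the original post (nominal sizes,
ignoring formatting overhead):

#!/usr/bin/perl -w
# usable capacity of the layouts under discussion, for n disks of a given size
use strict;

my ($disks, $size_gb) = (14, 72);

my %usable = (
    'raid5'  => ($disks - 1) * $size_gb,        # one disk's worth of parity
    'raid6'  => ($disks - 2) * $size_gb,        # two disks' worth of parity
    'raid10' => int($disks / 2) * $size_gb,     # mirrored pairs
);

printf "%-7s %4d GB usable\n", $_, $usable{$_} for sort keys %usable;

That gives about 936GB usable for raid5, 864GB for raid6 and the 504GB
quoted above for raid10.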
> and yes it has already crashed twice with
> 2 different disks running at 78F at nights
> and weekends when the air conditioning is off
If you're running the disks in an environment that is too hot for
them then I think you are wasting money by just throwing more disks
(of any sort) at it.
* Re: raid5 - failed disks
2005-04-01 10:33 ` Frank Wittig
@ 2005-04-01 10:56 ` Alvin Oga
2005-04-01 11:09 ` Gordon Henderson
0 siblings, 1 reply; 14+ messages in thread
From: Alvin Oga @ 2005-04-01 10:56 UTC (permalink / raw)
To: linux-raid
hi ya frank
On Fri, 1 Apr 2005, Frank Wittig wrote:
> > - my dumb question is,
> > - if 2 disks die at the same time, i
>
> if 2 disks fail at the same time your data is lost.
> if you have raid5 with 5 hot spares and a second disk dies before a hot
> spare is synced into the array (it will be listed as a spare until the sync has
> finished), the same applies - data gone.
yup about failure before the resync completes..
> > - i think 4x 300GB ide disks is better ( less likely to fail ?? )
>
> it's a simple calculation.
> if one disk fails with a probability of 1 in 100,000, then the chance that
> one of 10 disks fails is roughly 10 in 100,000 - which means that having
> 10 disks is about twice as dangerous as having 5 disks.
thanx ... that's a good reaffirmation too ( that i'm not the only
one to scale MTBF by n - or 1/n, depending on point of view )
> if i had an array with as many disks as you have i would choose raid
> level 6 (not sure if md supports this by now).
> raid level 6 survives two disks failing at once.
ah .. but it's not my choice yet ... stuff one inherits, including
the finger for somebody else's decisions to buy xx vs yy for nn reasons
> btw: what reason is there to switch off the air conditioning for your it
> equipment? temperature changes are poison to these fragile things.
out here.. some building owners are greedy with their nickels, and
it's NOT uncommon to have the air conditioning turned off on nights and
weekends
- and for those thinking colo, the colos too turn the air conditioning
off till people complain it's too hot ( temp inching upward till
people complain )
- ambient temp should be 65F or less
and disk operating temp ( hddtemp ) should be 35 or less
c ya
alvin
* Re: raid5 - failed disks
2005-04-01 10:55 ` raid5 - failed disks Andy Smith
@ 2005-04-01 11:04 ` Alvin Oga
0 siblings, 0 replies; 14+ messages in thread
From: Alvin Oga @ 2005-04-01 11:04 UTC (permalink / raw)
To: Andy Smith; +Cc: linux-raid
On Fri, 1 Apr 2005, Andy Smith wrote:
> This seems like an awful lot of disks to have in a raid 5 with no
> hot spares, to me, but then I am fairly new to RAID issues so maybe
> I am wrong.. but I would much rather have raid 10.
i'd say it's overkill .. but that's what they have ..
> Technically it's screwed but it could be possible to recover it with
> some losses.. I've fortunately never yet had to do that, maybe
> someone who has could answer more fully.
unfortunately for me, i'm always inheriting broken systems
and told to "fix it" with no budget, etc, etc..
> > - i think 4x 300GB ide disks is better ( less likely to fail ?? )
>
> Hard to say.. the typical IDE disk is usually regarded as less
> reliable than the typical SCSI disk, and also there are then fewer
> spindles per array so the performance may be worse.
i always try to have 2 live copies of the same data on
different servers... different cities if possible ...
- the bigger the data, 1TB, 10TB, 50TB, the more
copies i would have locally
for me, i've had good luck with ide disks
for me, most ( say 75% ) of my failed disk subsystems have been scsi
... and i kept all the dead disks to show for it
- and for fairness, we'll have to ignore the superbad
ibm deathstars
> If you're running the disks in an environment that is too hot for
> them then I think you are wasting money by just throwing more disks
> (of any sort) at it.
well ... yup.. but some people like "name brand" and live with the
"bad" rules handed out by "name brand", vs using something better
from a not-so-big name brand - including office/colo spaces
that are better
- it's a bad office bldg in terms of a computer's livelihood
- give it time ...
c ya
alvin
* Re: raid5 - failed disks
2005-04-01 10:08 raid5 - failed disks Alvin Oga
2005-04-01 10:33 ` Frank Wittig
2005-04-01 10:55 ` raid5 - failed disks Andy Smith
@ 2005-04-01 11:05 ` Gordon Henderson
2005-04-01 17:01 ` Mike Hardy
2 siblings, 1 reply; 14+ messages in thread
From: Gordon Henderson @ 2005-04-01 11:05 UTC (permalink / raw)
To: linux-raid
On Fri, 1 Apr 2005, Alvin Oga wrote:
>
> hi ya raiders ..
>
> we(they) have 14x 72GB scsi disks config'd as raid5,
> ( no hot spare .. )
>
> - if 1 disk dies, no problem ... ez to recover
>
> - my dumb question is,
> - if 2 disks die at the same time, i
> assume the entire raid5 is basically hosed,
> in that it won't reassemble and resync from
> the point where it last was before the crash ??
It's possible to recover it - IF one of the failed disks hasn't really
failed, i.e. no genuine bad sectors or lost data.
I had a 6-year-old 8-disk array a while back that had been retired after 5
years of trouble-free operation, but was subsequently pressed into use on
a different server - it had some dodginess about it - it would
occasionally fail a disk because the sun was in the wrong place, or the
moon was full, or something - never got to the bottom of it - the disks
would always surface check OK afterwards - they may have been remapping
sectors, but I never observed data or file system corruption, and I did
occasionally get a 2-disk failure, but I was always able to resurrect it
using the last disk to fail as part of the array. Fortunately the stop-gap
it was filling has been replaced by something new now!
> - i assume that a similar 2 disk failure
> also applies to hw raid controllers, but it'd
> be more dependent upon the raid controller's
> firmware and its ability to recover from
> 2 of 14 simultaneous disk failures
> ( let's say the dell powervault 2205 series )
>
> - i think 4x 300GB ide disks is better ( less likely to fail ?? )
Who knows. With the H/W solution, you really are at the mercy of the
hardware's supporting software. Fewer disks might mean less risk of failure
though. Some modern disks haven't been getting good press recently
either (e.g. Maxtor). I've switched to RAID-6 now, even for a 4-disk system
I built recently. Disks are cheap enough now. (Unless you have to buy them
from Dull or Stun!!!)
> and yes it has already crashed twice with
> 2 different disks running at 78F at nights
> and weekends when the air conditioning is off
Um - that's only 25C. Well inside the limits, I'd have thought. I have some
disks (Maxtors!) that are happily running at 50C (although for how much
longer, I don't know - they have survived 15 months so far, and they
are in a fairly stable temperature environment - at the top of a lift
shaft!)
By comparison, I have another box (same config & age) that's effectively
outside, the temperature cycles are very visible, and it's just had a
disk fail )-:
Gordon
* Re: raid5 - failed disks
2005-04-01 10:56 ` Alvin Oga
@ 2005-04-01 11:09 ` Gordon Henderson
2005-04-01 11:22 ` raid5 - failed disks - i'm confusing Alvin Oga
0 siblings, 1 reply; 14+ messages in thread
From: Gordon Henderson @ 2005-04-01 11:09 UTC (permalink / raw)
To: linux-raid
On Fri, 1 Apr 2005, Alvin Oga wrote:
> - ambient temp should be 65F or less
> and disk operating temp ( hddtemp ) should be 35 or less
Are we confusing F and C here?
hddtemp typically reports temperatures in C. 35F is bloody cold!
65F is barely room temperature. (18C)
Gordon
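For reference, the conversions being argued about; a throwaway perl sketch
(the comments show the rounded results):

#!/usr/bin/perl -w
# quick sanity check on the temperatures mentioned in this thread
use strict;

sub f_to_c { return ($_[0] - 32) * 5 / 9 }
sub c_to_f { return $_[0] * 9 / 5 + 32 }

printf "65F = %.1fC (suggested ambient)\n", f_to_c(65);                  # ~18.3C
printf "78F = %.1fC (the warm nights/weekends)\n", f_to_c(78);           # ~25.6C
printf "35C = %.1fF (the hddtemp figure, read as celsius)\n", c_to_f(35); # 95.0F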
* Re: raid5 - failed disks - i'm confusing
2005-04-01 11:09 ` Gordon Henderson
@ 2005-04-01 11:22 ` Alvin Oga
2005-04-04 18:59 ` Doug Ledford
0 siblings, 1 reply; 14+ messages in thread
From: Alvin Oga @ 2005-04-01 11:22 UTC (permalink / raw)
To: Gordon Henderson; +Cc: linux-raid
hi ya gordon
On Fri, 1 Apr 2005, Gordon Henderson wrote:
> On Fri, 1 Apr 2005, Alvin Oga wrote:
>
> > - ambient temp should be 65F or less
> > and disk operating temp ( hddtemp ) should be 35 or less
>
> Are we confusing F and C here?
65F was for normal server room environment
( some folks use 72F for office )
and i changed units to 35C for hd operating temp vs 25C
- most of my ide disks run at under 30C
- p4-2.xG cpu temps under 40C
> hddtemp typically reports temperatures in C. 35F is bloody cold!
nah ... i like my disks cold to the touch ... ( 2 fans per disk )
> 65F is barely room temperature. (18C)
yup ...
thanx
alvin
* Re: raid5 - failed disks
2005-04-01 11:05 ` Gordon Henderson
@ 2005-04-01 17:01 ` Mike Hardy
0 siblings, 0 replies; 14+ messages in thread
From: Mike Hardy @ 2005-04-01 17:01 UTC (permalink / raw)
To: linux-raid
Gordon Henderson wrote:
> sectors, but I never observed data or file system corruption, and I did
> occasionally get a 2-disk failure, but I was always able to resurrect it
> using the last disk to fail as part of the array. Fortunately the stop-gap
This should do the trick. If you're curious whether you lost any data or
not, then before you start doing things that change data you can use this
script to scan the physical disks using the linux left-asymmetric
algorithm and see whether the parity is consistent or not.
This is the same as the one I posted previously, with the exception of
an error that's been fixed by Matthias Julius and patched in.
-Mike
[-- Attachment #2: raid5calc.pl --]
[-- Type: text/plain, Size: 7305 bytes --]
#!/usr/bin/perl -w
#
# raid5 perl utility
# Copyright (C) 2005 Mike Hardy <mike@mikehardy.net>
#
# This script understands the default linux raid5 disk layout,
# and can be used to check parity in an array stripe, or to calculate
# the data that should be present in a chunk with a read error.
#
# Constructive criticism, detailed bug reports, patches, etc gladly accepted!
#
# Thanks to Ashford Computer Consulting Service for their handy RAID information:
# http://www.accs.com/p_and_p/RAID/index.html
#
# Thanks also to the various linux kernel hackers that have worked on 'md',
# the header files and source code were quite informative when writing this.
#
# This program is free software; you can redistribute it and/or modify
# it under the terms of the GNU General Public License as published by
# the Free Software Foundation; either version 2, or (at your option)
# any later version.
#
# You should have received a copy of the GNU General Public License
# (for example /usr/src/linux/COPYING); if not, write to the Free
# Software Foundation, Inc., 675 Mass Ave, Cambridge, MA 02139, USA.
#
use strict;
my @array_components = (
"/dev/loop0",
"/dev/loop1",
"/dev/loop2",
"/dev/loop3",
"/dev/loop4",
"/dev/loop5",
"/dev/loop6",
"/dev/loop7"
);
my $chunk_size = 64 * 1024; # chunk size is 64K
my $sectors_per_chunk = $chunk_size / 512; # 128 sectors of 512 bytes each
my %xor_devices; # xor input index => device path
my $data; # scratch buffer for chunk reads
# Problem - I have a bad sector on one disk in an array
my %component = (
"sector" => 2032,
"device" => "/dev/loop3"
);
# 1) Get the array-related info for that sector
# 2) See if it was the parity disk or not
# 2a) If it was the parity disk, calculate the parity
# 2b) If it was not the parity disk, calculate its value from parity
# 3) Write the data back into the sector
(
$component{"array_chunk"},
$component{"chunk_offset"},
$component{"stripe"},
$component{"parity_device"}
) = &getInfoForComponentAddress($component{"sector"}, $component{"device"});
foreach my $KEY (keys(%component)) {
print $KEY . " => " . $component{$KEY} . "\n";
}
# We started with the information on the bad sector, and now we know how it fits into the array
# Lets see if we can fix the bad sector with the information at hand
# Build up the list of devices to xor in order to derive our value
my $xor_count = -1;
for (my $i = 0; $i <= $#array_components; $i++) {
# skip ourselves as we roll through
next if ($component{"device"} eq $array_components[$i]);
# skip the parity chunk as we roll through
next if ($component{"parity_device"} eq $array_components[$i]);
$xor_devices{++$xor_count} = $array_components[$i];
print
"Adding xor device " .
$array_components[$i] . " as xor device " .
$xor_count . "\n";
}
# If we are not the parity device, put the parity device at the end
if (!($component{"device"} eq $component{"parity_device"})) {
$xor_devices{++$xor_count} = $component{"parity_device"};
print
"Adding parity device " .
$component{"parity_device"} . " as xor device " .
$xor_count . "\n";
}
# pre-calculate the device offset, and initialize the xor buffer
my $device_offset = $component{"stripe"} * $sectors_per_chunk;
my $xor_result = "0" x ($sectors_per_chunk * 512);
# Read in the chunks and feed them into the xor buffer
for (my $i = 0; $i <= $xor_count; $i++) {
print
"Reading in chunk on stripe " .
$component{"stripe"} . " (sectors " .
$device_offset . " - " .
($device_offset + $sectors_per_chunk) . ") of device " .
$xor_devices{$i} . "\n";
# Open the device and read this chunk in
open(DEVICE, "<" . $xor_devices{$i})
|| die "Unable to open device " . $xor_devices{$i} . ": " . $! . "\n";
# $device_offset is in sectors; seek() works in bytes
seek(DEVICE, $device_offset * 512, 0)
|| die "Unable to seek to sector " . $device_offset . " of device " . $xor_devices{$i} . ": " . $! . "\n";
read(DEVICE, $data, ($sectors_per_chunk * 512))
|| die "Unable to read device " . $xor_devices{$i} . ": " . $! . "\n";
close(DEVICE);
# Convert binary to hex for printing
my $hexdata = unpack("H*", $data);
#print "Got data '" . $hexdata . "' from device " . $xor_devices{$i} . "\n";
# xor the data in there
$xor_result ^= $data;
}
my $hex_xor_result = unpack("H*", $xor_result);
#print "got hex xor result '" . $hex_xor_result . "'\n";
#########################################################################################
# Testing only -
# Check to see if the result I got is the same as what is in the block
open (DEVICE, "<" . $component{"device"})
|| die "Unable to open device " . $compoent{"device"} . ": " . $! . "\n";
seek(DEVICE, $device_offset, 0)
|| die "Unable to seek to " . $device_offset . " device " . $xor_devices{$i} . ": " . $! . "\n";
read(DEVICE, $data, ($sectors_per_chunk * 512))
|| die "Unable to read device " . $xor_devices{$1} . ": " . $! . "\n";
close(DEVICE);
# Convert binary to hex for printing
my $hexdata = unpack("H*", $data);
#print "Got data '" . $hexdata . "' from device " . $component{"device"} . "\n";
# Do the comparison, and report what we've got
if (!($hexdata eq $hex_xor_result)) {
print "The value from the device, and the computed value from parity are inequal for some reason...\n";
}
else {
print "Device value matches what we computed from other devices. Score!\n";
}
#########################################################################################
# Given an array component, and a sector address in that component, we want
# 1) the disk/sector combination for the start of its stripe
# 2) the disk/sector combination for the start of its parity
sub getInfoForComponentAddress() {
# Get our arguments into (hopefully) well-named variables
my $sector = shift();
my $device = shift();
print "determining info for sector "
. $sector . " on "
. $device . "\n";
# Get the stripe number
my $stripe = int($sector / $sectors_per_chunk);
print "stripe number is " . $stripe . "\n";
# Get the offset in the stripe
my $chunk_offset = $sector % $sectors_per_chunk;
print "chunk offset is " . $chunk_offset . "\n";
# See what device index our device is
my $device_index = 0;
for (my $i = 0; $i <= $#array_components; $i++) {
if ($device eq $array_components[$i]) {
$device_index = $i;
print "This disk is device " . $device_index . " in the array\n";
}
}
# Figure out which disk holds parity for this stripe
# FIXME only handling the default left-asymmetric style right now
# (left-asymmetric: parity starts on the last disk and moves down one disk per stripe)
my $parity_device_index = ($#array_components) - ($stripe % scalar(@array_components));
print "parity device index for stripe " . $stripe . " is " . $parity_device_index . "\n";
my $parity_device = $array_components[$parity_device_index];
# Figure out which chunk of the array this is
# FIXME only handling the default left-asymmetric style right now
my $array_chunk = $stripe * (scalar(@array_components) - 1) + $device_index;
if ($device_index > $parity_device_index) {
$array_chunk--;
}
# Check for the special case where this device *is* the parity device and return special
if ($device_index == $parity_device_index) {
$array_chunk = -1;
}
return (
$array_chunk,
$chunk_offset,
$stripe,
$parity_device
);
}
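For readers unfamiliar with the trick the script relies on: raid5 parity is
just the XOR of the data chunks in a stripe, so any single missing chunk can
be rebuilt by XOR-ing the parity with the surviving chunks. A minimal,
self-contained perl illustration with toy strings in place of real chunks:

#!/usr/bin/perl -w
# toy demonstration of xor parity and single-chunk reconstruction
use strict;

my @chunks = ("data chunk A", "data chunk B", "data chunk C");   # equal-length "chunks"
my $parity = "\0" x length($chunks[0]);
$parity ^= $_ for @chunks;              # parity = A xor B xor C (perl string xor)

# pretend chunk B (index 1) was lost; rebuild it from parity and the survivors
my $rebuilt = $parity;
$rebuilt ^= $chunks[$_] for (0, 2);
print "rebuilt: '$rebuilt'\n";          # prints: rebuilt: 'data chunk B'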
* Re: raid5 - failed disks - i'm confusing
2005-04-01 11:22 ` raid5 - failed disks - i'm confusing Alvin Oga
@ 2005-04-04 18:59 ` Doug Ledford
2005-04-04 19:46 ` Richard Scobie
2005-04-04 22:51 ` Alvin Oga
0 siblings, 2 replies; 14+ messages in thread
From: Doug Ledford @ 2005-04-04 18:59 UTC (permalink / raw)
To: Alvin Oga; +Cc: Gordon Henderson, linux-raid
On Fri, 2005-04-01 at 03:22 -0800, Alvin Oga wrote:
> On Fri, 1 Apr 2005, Gordon Henderson wrote:
> > On Fri, 1 Apr 2005, Alvin Oga wrote:
> >
> > > - ambient temp should be 65F or less
> > > and disk operating temp ( hddtemp ) should be 35 or less
> >
> > Are we confusing F and C here?
>
> 65F was for normal server room environment
> ( some folks use 72F for office )
>
> and i changed units to 35C for hd operating temp vs 25C
> - most of my ide disks run at under 30C
> - p4-2.xG cpu temps under 40C
>
> > hddtemp typically reports temperatures in C. 35F is bloody cold!
>
> nah ... i like my disks cold to the touch ... ( 2 fans per disk )
Just for the record, second guessing mechanical engineers with
thermodynamics background training and an eye towards differing material
expansion rates and the like can be risky. This is like saying "Nah, I
like the engine in my car to run cold, so I use no thermostat and two
fans on the radiator." It might sound like a good idea to you, but
proper cylinder to piston wall clearance is obtained at a specific
temperature (cylinder sleeves are typically some sort of iron or steel
compound and expand in diameter slower than the aluminum pistons when
heated to operating temperature, so the pistons are made smaller in
diameter at room temperature so that when both the sleeve and the piston
are at operating temperature the clearance will be correct). Running an
engine at a lower temperature increases that clearance and can result in
premature piston failure.
As far as hard drive internals are concerned, I'm not positive whether
or not they are subject to the same sort of thermal considerations, but
just looking at the outside of a hard drive shows a very common case of
an aluminum cast frame and some sort of iron/steel based top plate.
These are going to expand at different rates with temperature and for
all I know if you run the drive overly cool, you may be placing undue
stress on the seal between these two parts of the drive (consider the
case of both the aluminum frame and the top plate having a channel for a
rubber o-ring, and until the drive reaches operating temp. the channels
may not line up perfectly, resulting in stress on the o-ring).
Anyway, it might or might not hurt the drives to run them well below
their designed operating temperature, I don't have schematics and
materials lists in front of me to tell for sure. But second guessing
mechanical engineers that likely have compensated for thermal issues at
a given, specific common operating temperature is usually risky. Most
people think "Heat kills" and therefore like to keep things as cool as
possible. For mechanical devices anyway, it's not so much that heat
kills, as it is operating outside of the designed temperature range,
either above or below, that reduces overall life expectancy. Keep your
drives from overheating, but don't try to freeze them would be my
advice.
--
Doug Ledford <dledford@redhat.com>
http://people.redhat.com/dledford
* Re: raid5 - failed disks - i'm confusing
2005-04-04 18:59 ` Doug Ledford
@ 2005-04-04 19:46 ` Richard Scobie
2005-04-04 23:12 ` Alvin Oga
2005-04-04 22:51 ` Alvin Oga
1 sibling, 1 reply; 14+ messages in thread
From: Richard Scobie @ 2005-04-04 19:46 UTC (permalink / raw)
To: linux-raid
Doug Ledford wrote:
> Anyway, it might or might not hurt the drives to run them well below
> their designed operating temperature, I don't have schematics and
> materials lists in front of me to tell for sure. But second guessing
> mechanical engineers that likely have compensated for thermal issues at
> a given, specific common operating temperature is usually risky. Most
> people think "Heat kills" and therefore like to keep things as cool as
> possible. For mechanical devices anyway, it's not so much that heat
> kills, as it is operating outside of the designed temperature range,
> either above or below, that reduces overall life expectancy. Keep your
> drives from overheating, but don't try to freeze them would be my
> advice.
Indeed. This paper
http://www.hitachigst.com/hdd/technolo/drivetemp/drivetemp.htm
shows some of the factors you mention, and for what it's worth, Hitachi's
recommended operating range is 5 - 55 C for their 15K SCSI.
Regards,
Richard
* Re: raid5 - failed disks - i'm confusing
2005-04-04 18:59 ` Doug Ledford
2005-04-04 19:46 ` Richard Scobie
@ 2005-04-04 22:51 ` Alvin Oga
2005-04-05 1:02 ` Doug Ledford
1 sibling, 1 reply; 14+ messages in thread
From: Alvin Oga @ 2005-04-04 22:51 UTC (permalink / raw)
To: Doug Ledford; +Cc: linux-raid
On Mon, 4 Apr 2005, Doug Ledford wrote:
> Anyway, it might or might not hurt the drives to run them well below
> their designed operating temperature, I don't have schematics and
> materials lists in front of me to tell for sure.
ez enough to do ... it's called "specs" on the various manufacturers'
websites ... similarly for the operating temp of the ICs on the
disk controllers ..
you're welcome to run your disks hot ...
i prefer to run them cool to the finger-touch test, with the server
room at 65F
and it's a known "fact" for 40+ years ... "heat kills" electromechanical
items; car engines are a different animal for different reasons
c ya
alvin
- feel free to second guess my reasons :-) ... there's no specs on that
> But second guessing
> mechanical engineers that likely have compensated for thermal issues at
> a given, specific common operating temperature is usually risky. Most
> people think "Heat kills" and therefore like to keep things as cool as
> possible. For mechanical devices anyway, it's not so much that heat
> kills, as it is operating outside of the designed temperature range,
> either above or below, that reduces overall life expectancy. Keep your
> drives from overheating, but don't try to freeze them would be my
> advice.
* Re: raid5 - failed disks - i'm confusing
2005-04-04 19:46 ` Richard Scobie
@ 2005-04-04 23:12 ` Alvin Oga
0 siblings, 0 replies; 14+ messages in thread
From: Alvin Oga @ 2005-04-04 23:12 UTC (permalink / raw)
To: Richard Scobie; +Cc: linux-raid
On Tue, 5 Apr 2005, Richard Scobie wrote:
> http://www.hitachigst.com/hdd/technolo/drivetemp/drivetemp.htm
>
> shows some of the factors you mention and for what it's worth Hitachi's
> recommended operating range is 5 - 55 C for their 15K SCSI.
those specs talk mostly about the "SMART" monitoring systems
the specs of a drive ... say ..
http://www.hitachigst.com/hdd/support/10k300/10k300.htm
but, not many will tell you/us how temp affects their mtbf
and warranty ...
- even if the disk is covered under warranty repair,
are you willing to leave that disk missing for a few
days or weeks while the drive is in warranty repair?
sample effects of temp vs reliability ( degradation still applies to hard
disks )
http://www.Linux-1U.net/CPU/
it's easy to see the effects of temp vs reliability, by
noting the number of disk problems one has over the past 5-10 years
out of hundreds or thousands of disks, remembering too that disk
manufacturers are in the business of selling new disks
c ya
alvin
* Re: raid5 - failed disks - i'm confusing
2005-04-04 22:51 ` Alvin Oga
@ 2005-04-05 1:02 ` Doug Ledford
0 siblings, 0 replies; 14+ messages in thread
From: Doug Ledford @ 2005-04-05 1:02 UTC (permalink / raw)
To: Alvin Oga; +Cc: linux-raid
On Mon, 2005-04-04 at 15:51 -0700, Alvin Oga wrote:
>
> On Mon, 4 Apr 2005, Doug Ledford wrote:
>
> > Anyway, it might or might not hurt the drives to run them well below
> > their designed operating temperature, I don't have schematics and
> > materials lists in front of me to tell for sure.
>
> ez enough to do ... it's called "specs" on the various manufacturers'
> websites ... similarly for the operating temp of the ICs on the
> disk controllers ..
>
> you're welcome to run your disks hot ...
I didn't say to run them hot, just at design temp. Overheating is bad,
just like you mentioned.
> i prefer to run them cool to the finger-touch test, with the server
> room at 65F
>
> and it's a known "fact" for 40+ years ... "heat kills" electromechanical
> items; car engines are a different animal for different reasons
Yes it does, and my point wasn't to say that it doesn't, just to say
that for the mechanical portion of electromechanical devices, excessive
cooling can be bad as well.
--
Doug Ledford <dledford@redhat.com>
http://people.redhat.com/dledford
end of thread
Thread overview: 14 messages
2005-04-01 10:08 raid5 - failed disks Alvin Oga
2005-04-01 10:33 ` Frank Wittig
2005-04-01 10:56 ` Alvin Oga
2005-04-01 11:09 ` Gordon Henderson
2005-04-01 11:22 ` raid5 - failed disks - i'm confusing Alvin Oga
2005-04-04 18:59 ` Doug Ledford
2005-04-04 19:46 ` Richard Scobie
2005-04-04 23:12 ` Alvin Oga
2005-04-04 22:51 ` Alvin Oga
2005-04-05 1:02 ` Doug Ledford
2005-04-01 10:55 ` raid5 - failed disks Andy Smith
2005-04-01 11:04 ` Alvin Oga
2005-04-01 11:05 ` Gordon Henderson
2005-04-01 17:01 ` Mike Hardy