From: Bill Davidsen
Subject: Re: disks becoming slow but not explicitly failing anyone?
Date: Thu, 04 May 2006 20:45:23 -0400
Message-ID: <445AA023.6070406@tmr.com>
In-Reply-To: <874q0irdah.fsf@hades.wkstn.nix>
References: <874q0irdah.fsf@hades.wkstn.nix>
To: Nix
Cc: Mark Hahn, linux-raid@vger.kernel.org

Nix wrote:
>On 23 Apr 2006, Mark Hahn stipulated:
>
>>>I've seen a lot of cheap disks say (generally deep in the data sheet
>>>that's only available online after much searching and that nobody ever
>>>reads) that they are only reliable if used for a maximum of twelve hours
>>>a day, or 90 hours a week, or something of that nature. Even server
>>
>>I haven't, and I read lots of specs. they _will_ sometimes say that
>>non-enterprise drives are "intended" or "designed" for a 8x5 desktop-like
>>usage pattern.
>
>That's the phrasing, yes: foolish me assumed that meant `if you leave it
>on for much longer than that, things will go wrong'.
>
>>to the normal way of thinking about reliability, this would
>>simply mean a factor of 4.2x lower reliability - say from 1M to 250K hours
>>MTBF. that's still many times lower rate of failure than power supplies or
>>fans.
>
>Ah, right, it's not a drastic change.
>
>>>It still stuns me that anyone would ever voluntarily buy drives that
>>>can't be left switched on (which is perhaps why the manufacturers hide
>>
>>I've definitely never seen any spec that stated that the drive had to be
>>switched off. the issue is really just "what is the designed duty-cycle?"
>
>I see. So it's just `we didn't try to push the MTBF up as far as we would
>on other sorts of disks'.
>
>>I run a number of servers which are used as compute clusters. load is
>>definitely 24x7, since my users always keep the queues full. but the servers
>>are not maxed out 24x7, and do work quite nicely with desktop drives
>>for years at a time. it's certainly also significant that these are in a
>>decent machineroom environment.
>
>Yeah; i.e., cooled. I don't have a cleanroom in my house so the RAID
>array I run there is necessarily uncooled, and the alleged aircon in the
>room housing work's array is permanently on the verge of total collapse
>(I think it lowers the temperature, but not by much).
>
>>it's unfortunate that disk vendors aren't more forthcoming with their drive
>>stats. for instance, it's obvious that "wear" in MTBF terms would depend
>>nonlinearly on the duty cycle. it's important for a customer to know where
>>that curve bends, and to try to stay in the low-wear zone. similarly, disk
>
>Agreed! I tend to assume that non-laptop disks hate being turned on and
>hate temperature changes, so just keep them running 24x7. This seems to be OK,
>with the only disks this has ever killed being Hitachi server-class disks in
>a very expensive Sun server which was itself meant for 24x7 operation; the
>cheaper disks in my home systems were quite happy. (Go figure...)
>
>>specs often just give a max operating temperature (often 60C!), which is
>>almost disingenuous, since temperature has a superlinear effect on reliability.
>
>I'll say.
>I'm somewhat twitchy about the uncooled 37C disks in one of my
>machines: but one of the other disks ran at well above 60C for *years*
>without incident: it was an old one with no onboard temperature sensing,
>and it was perhaps five years after startup that I opened that machine
>for the first time in years and noticed that the disk housing nearly
>burned me when I touched it. The guy who installed it said that yes, it
>had always run that hot, and was that important? *gah*
>
>I got a cooler for that disk in short order.
>
>>a system designer needs to evaluate the expected duty cycle when choosing
>>disks, as well as many other factors which are probably more important.
>>for instance, an earlier thread concerned a vast amount of read traffic
>>to disks resulting from atime updates.
>
>Oddly, I see a steady pulse of write traffic, ~100Kb/s, to one dm device
>(translating into read+write on the underlying disks) even when the
>system is quiescent, all daemons killed, and all fsen mounted with
>noatime. One of these days I must fish out blktrace and see what's
>causing it (but that machine is hard to quiesce like that: it's in heavy
>use).
>
>>simply using more disks also decreases the load per disk, though this is
>>clearly only a win if it's the difference in staying out of the disks
>>"duty-cycle danger zone" (since more disks divide system MTBF).
>
>Well, yes, but if you have enough more you can make some of them spares
>and push up the MTBF again (and the cooling requirements, and the power
>consumption: I wish there was a way to spin down spares until they were
>needed, but non-laptop controllers don't often seem to provide a way to
>spin anything down at all that I know of).

hdparm will let you set the spindown time. I have all mine set that way
for power and heat reasons, since they tend to be in burst use. It dropped
the computer-room temp by enough to notice, but I still need some more
local cooling for that room.

-- 
bill davidsen
CTO TMR Associates, Inc
Doing interesting things with small computers since 1979
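
For anyone who wants to script that spindown setting across several drives,
here is a minimal sketch, assuming a Linux box with hdparm installed and run
as root; the device names and the 10-minute timeout are placeholders, not
anything from the thread:

  #!/usr/bin/env python3
  """Minimal sketch: set a standby (spindown) timeout on a set of drives."""
  import subprocess

  # Placeholder devices: the spare or burst-use disks you want to idle down.
  DRIVES = ["/dev/sdb", "/dev/sdc"]

  # hdparm -S: values 1-240 are multiples of 5 seconds, so 120 means the
  # drive spins down after 10 minutes idle; 0 disables the timeout.
  STANDBY = "120"

  for dev in DRIVES:
      # check=True raises CalledProcessError if hdparm reports a failure.
      subprocess.run(["hdparm", "-S", STANDBY, dev], check=True)

Whether a given drive actually honours the timeout depends on the drive and
controller, so it is worth checking its power state afterwards with
hdparm -C <device> once the idle period has passed.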