* raid5 software vs hardware: parity calculations?
@ 2007-01-11 22:44 James Ralston
From: James Ralston @ 2007-01-11 22:44 UTC (permalink / raw)
To: linux-raid

I'm having a discussion with a coworker concerning the cost of md's
raid5 implementation versus hardware raid5 implementations.
Specifically, he states:
> The performance [of raid5 in hardware] is so much better with the
> write-back caching on the card and the offload of the parity, it
> seems to me that the minor increase in work of having to upgrade the
> firmware if there's a buggy one is a highly acceptable trade-off to
> the increased performance. The md driver still commits you to
> longer run queues since IO calls to disk, parity calculator and the
> subsequent kflushd operations are non-interruptible in the CPU. A
> RAID card with write-back cache releases the IO operation virtually
> instantaneously.
It would seem that his comments have merit, as there appears to be
work underway to move stripe operations outside of the spinlock:
http://lwn.net/Articles/184102/
What I'm curious about is this: for real-world situations, how much
does this matter? In other words, how hard do you have to push md
raid5 before doing dedicated hardware raid5 becomes a real win?
James

* Re: raid5 software vs hardware: parity calculations?
@ 2007-01-12 17:39 ` dean gaudet
From: dean gaudet @ 2007-01-12 17:39 UTC (permalink / raw)
To: James Ralston; +Cc: linux-raid

On Thu, 11 Jan 2007, James Ralston wrote:

> I'm having a discussion with a coworker concerning the cost of md's
> raid5 implementation versus hardware raid5 implementations.
>
> Specifically, he states:
>
> > The performance [of raid5 in hardware] is so much better with the
> > write-back caching on the card and the offload of the parity, it
> > seems to me that the minor increase in work of having to upgrade the
> > firmware if there's a buggy one is a highly acceptable trade-off to
> > the increased performance.  The md driver still commits you to
> > longer run queues since IO calls to disk, parity calculator and the
> > subsequent kflushd operations are non-interruptible in the CPU.  A
> > RAID card with write-back cache releases the IO operation virtually
> > instantaneously.
>
> It would seem that his comments have merit, as there appears to be
> work underway to move stripe operations outside of the spinlock:
>
> http://lwn.net/Articles/184102/
>
> What I'm curious about is this: for real-world situations, how much
> does this matter?  In other words, how hard do you have to push md
> raid5 before doing dedicated hardware raid5 becomes a real win?

hardware with battery backed write cache is going to beat the software
at small write traffic latency essentially all the time, but it's got
nothing to do with the parity computation.

-dean

* Re: raid5 software vs hardware: parity calculations?
@ 2007-01-12 20:34 ` James Ralston
From: James Ralston @ 2007-01-12 20:34 UTC (permalink / raw)
To: linux-raid

On 2007-01-12 at 09:39-08 dean gaudet <dean@arctic.org> wrote:

> On Thu, 11 Jan 2007, James Ralston wrote:
>
> > I'm having a discussion with a coworker concerning the cost of
> > md's raid5 implementation versus hardware raid5 implementations.
> > [...]
> > What I'm curious about is this: for real-world situations, how
> > much does this matter?  In other words, how hard do you have to
> > push md raid5 before doing dedicated hardware raid5 becomes a
> > real win?
>
> hardware with battery backed write cache is going to beat the
> software at small write traffic latency essentially all the time but
> it's got nothing to do with the parity computation.

I'm not convinced that's true.  What my coworker is arguing is that md
raid5 code spinlocks while it is performing this sequence of
operations:

1. executing the write
2. reading the blocks necessary for recalculating the parity
3. recalculating the parity
4. updating the parity block

My [admittedly cursory] read of the code, coupled with the link above,
leads me to believe that my coworker is correct, which is why I was
trolling for [informed] opinions about how much of a performance hit
the spinlock causes.

* Re: raid5 software vs hardware: parity calculations?
@ 2007-01-13  9:20 ` Dan Williams
From: Dan Williams @ 2007-01-13 9:20 UTC (permalink / raw)
To: James Ralston; +Cc: linux-raid

On 1/12/07, James Ralston <qralston+ml.linux-raid@andrew.cmu.edu> wrote:
> On 2007-01-12 at 09:39-08 dean gaudet <dean@arctic.org> wrote:
>
> > hardware with battery backed write cache is going to beat the
> > software at small write traffic latency essentially all the time
> > but it's got nothing to do with the parity computation.
>
> I'm not convinced that's true.

No, it's true.  md implements a write-through cache to ensure that
data reaches the disk.

> What my coworker is arguing is that md raid5 code spinlocks while it
> is performing this sequence of operations:
>
> 1. executing the write

not performed under the lock

> 2. reading the blocks necessary for recalculating the parity

not performed under the lock

> 3. recalculating the parity
> 4. updating the parity block
>
> My [admittedly cursory] read of the code, coupled with the link
> above, leads me to believe that my coworker is correct, which is why
> I was trolling for [informed] opinions about how much of a
> performance hit the spinlock causes.

The spinlock is not a source of performance loss; the reason for
moving parity calculations outside the lock is to maximize the benefit
of using asynchronous xor+copy engines.

The hardware vs software raid trade-offs are well documented here:
http://linux.yyz.us/why-software-raid.html

Regards,
Dan

* Re: raid5 software vs hardware: parity calculations?
@ 2007-01-13 17:32 ` Bill Davidsen
From: Bill Davidsen @ 2007-01-13 17:32 UTC (permalink / raw)
To: Dan Williams; +Cc: James Ralston, linux-raid

Dan Williams wrote:
> [...]
> The spinlock is not a source of performance loss; the reason for
> moving parity calculations outside the lock is to maximize the
> benefit of using asynchronous xor+copy engines.
>
> The hardware vs software raid trade-offs are well documented here:
> http://linux.yyz.us/why-software-raid.html

There have been several recent threads on the list regarding software
RAID-5 performance.  The reference might be updated to reflect the
poor write performance of RAID-5 until/unless significant tuning is
done.  Read that as tuning obscure parameters and throwing a lot of
memory into the stripe cache.  The reasons for hardware RAID should
include "performance of RAID-5 writes is usually much better than
software RAID-5 with default tuning."

-- 
bill davidsen <davidsen@tmr.com>
  CTO TMR Associates, Inc
  Doing interesting things with small computers since 1979

* Re: raid5 software vs hardware: parity calculations?
@ 2007-01-13 23:23 ` Robin Bowes
From: Robin Bowes @ 2007-01-13 23:23 UTC (permalink / raw)
To: linux-raid

Bill Davidsen wrote:
>
> There have been several recent threads on the list regarding software
> RAID-5 performance.  The reference might be updated to reflect the
> poor write performance of RAID-5 until/unless significant tuning is
> done.  Read that as tuning obscure parameters and throwing a lot of
> memory into the stripe cache.  The reasons for hardware RAID should
> include "performance of RAID-5 writes is usually much better than
> software RAID-5 with default tuning."

Could you point me at a source of documentation describing how to
perform such tuning?

Specifically, I have 8x500GB WD SATA drives on a Supermicro PCI-X
8-port SATA card configured as a single RAID6 array (~3TB available
space).

Thanks,

R.

* Re: raid5 software vs hardware: parity calculations?
@ 2007-01-14  3:16 ` dean gaudet
From: dean gaudet @ 2007-01-14 3:16 UTC (permalink / raw)
To: Robin Bowes; +Cc: linux-raid

On Sat, 13 Jan 2007, Robin Bowes wrote:

> Could you point me at a source of documentation describing how to
> perform such tuning?
>
> Specifically, I have 8x500GB WD SATA drives on a Supermicro PCI-X
> 8-port SATA card configured as a single RAID6 array (~3TB available
> space).

linux sw raid6 small write performance is bad because it reads the
entire stripe, merges the small write, and writes back the changed
disks.  unlike raid5, where a small write can get away with a partial
stripe read (i.e. the smallest raid5 write will read the target disk,
read the parity, write the target, and write the updated parity)...
afaik this optimization hasn't been implemented in raid6 yet.
depending on your use model you might want to go with raid5+spare.
benchmark if you're not sure.

for raid5/6 i always recommend experimenting with moving your fs
journal to a raid1 device instead (on separate spindles -- such as
your root disks).

if this is for a database or fs requiring lots of small writes then
raid5/6 are generally a mistake... raid10 is the only way to get
performance.  (hw raid5/6 with nvram support can help a bit in this
area, but you just can't beat raid10 if you need lots of writes/s.)

beyond those config choices you'll want to become friendly with
/sys/block and all the myriad of subdirectories and options under
there.  in particular:

    /sys/block/*/queue/scheduler
    /sys/block/*/queue/read_ahead_kb
    /sys/block/*/queue/nr_requests
    /sys/block/mdX/md/stripe_cache_size

for * = any of the component disks or the mdX itself... some systems
have an /etc/sysfs.conf you can place these settings in to have them
take effect on reboot.  (sysfsutils package on debian/ubuntu)

-dean
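
To make the knobs listed above concrete, here is a minimal sketch of
setting them by hand; the device names (sda..sdh, md0) and the values
are illustrative assumptions, not recommendations, and need to be
benchmarked against the actual workload:

    # illustrative values only; benchmark before keeping any of them
    for d in /sys/block/sd[a-h]; do
        echo deadline > $d/queue/scheduler
        echo 512      > $d/queue/read_ahead_kb
        echo 256      > $d/queue/nr_requests
    done

    # raid5/6 stripe cache, counted in pages (4KB) per member device
    echo 4096 > /sys/block/md0/md/stripe_cache_size

To keep such settings across reboots with sysfsutils, the equivalent
line in /etc/sysfs.conf would be, for example,
"block/md0/md/stripe_cache_size = 4096".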

* Re: raid5 software vs hardware: parity calculations?
@ 2007-01-15 11:48 ` Michael Tokarev
From: Michael Tokarev @ 2007-01-15 11:48 UTC (permalink / raw)
To: dean gaudet; +Cc: Robin Bowes, linux-raid

dean gaudet wrote:
[]
> if this is for a database or fs requiring lots of small writes then
> raid5/6 are generally a mistake... raid10 is the only way to get
> performance.  (hw raid5/6 with nvram support can help a bit in this
> area, but you just can't beat raid10 if you need lots of writes/s.)

A small nitpick.  At least some databases never do "small"-sized I/O,
at least not against the datafiles.  For example, Oracle uses a fixed
I/O block size, specified at database (or tablespace) creation time --
by default it's 4Kb or 8Kb, but it may be 16Kb or 32Kb as well.

Now, if you make your raid array's stripe size match the block size of
the database, *and* ensure the files are aligned on disk properly, it
will just work, without needless reads to calculate parity blocks
during writes.

But the problem is that this is nearly impossible to do.

First, even if the db writes in 32Kb blocks, the stripe size has to be
32Kb, which is only achievable with a 3-disk raid5 using a 16Kb chunk
size, or a 5-disk raid5 using an 8Kb chunk size (this last variant is
quite bad, because an 8Kb chunk is too small).  In other words, only a
very limited set of configurations is more-or-less good.

And second, most filesystems used for databases don't care about
"correct" file placement.  For example, ext[23]fs, with a maximum
block size of 4Kb, will align files to 4Kb, not to the stripe size --
so a 32Kb database block can land with its first 4Kb on one stripe and
the remaining 28Kb on the next, which means both stripes need the full
read-modify-write cycle to update their parity blocks -- the very
thing we tried to avoid by choosing the sizes in the previous step.
Only xfs so far (of the filesystems I've checked) pays attention to
the stripe size and tries to align files to it.  (Yes, I know about
mke2fs's stride=xxx parameter, but it only affects metadata, not
data.)

That's why all of the above is a "small nitpick" -- i.e., in theory it
IS possible to use raid5 for a database workload in certain cases, but
due to all the gory details it's nearly impossible to get right.

/mjt
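
As a sketch of the alignment described above (the 3-disk raid5 with a
16Kb chunk is Michael's own example, giving 32Kb of data per stripe;
the device names are hypothetical), creation might look like:

    # 3-disk raid5, 16Kb chunk => 2 data disks x 16Kb = 32Kb per stripe
    mdadm --create /dev/md0 --level=5 --raid-devices=3 --chunk=16 \
          /dev/sda1 /dev/sdb1 /dev/sdc1

    # xfs can be told the geometry so file data starts on stripe boundaries
    mkfs.xfs -d su=16k,sw=2 /dev/md0

    # ext2/3's stride option, by contrast, only spreads metadata:
    #   mke2fs -b 4096 -E stride=4 /dev/md0    (stride = chunk / block size)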

* Re: raid5 software vs hardware: parity calculations?
@ 2007-01-15 15:29 ` Bill Davidsen
From: Bill Davidsen @ 2007-01-15 15:29 UTC (permalink / raw)
To: Robin Bowes; +Cc: linux-raid

Robin Bowes wrote:
> Bill Davidsen wrote:
> > [...]
>
> Could you point me at a source of documentation describing how to
> perform such tuning?

No.  There has been a lot of discussion of this topic on this list,
and a trip through the archives of the last 60 days or so will let you
pull out a number of tuning tips which allow very good performance.
My concern was writing large blocks of data, 1MB per write, to RAID-5,
and didn't involve the overhead of small blocks at all; that leads
through other code and behavior.  I suppose while it's fresh in my
mind I should write a script to rerun the whole write test suite and
generate some graphs, lists of parameters, etc.

If you are writing a LOT of data, you may find that tuning the dirty_*
parameters results in better system response, perhaps at the cost of a
little total write throughput, although I didn't notice anything
significant when I tried them.

> Specifically, I have 8x500GB WD SATA drives on a Supermicro PCI-X
> 8-port SATA card configured as a single RAID6 array (~3TB available
> space).

No hot spare(s)?

-- 
bill davidsen <davidsen@tmr.com>
  CTO TMR Associates, Inc
  Doing interesting things with small computers since 1979

* Re: raid5 software vs hardware: parity calculations?
@ 2007-01-15 16:22 ` Robin Bowes
From: Robin Bowes @ 2007-01-15 16:22 UTC (permalink / raw)
To: Bill Davidsen; +Cc: linux-raid

Bill Davidsen wrote:
> Robin Bowes wrote:
> > Could you point me at a source of documentation describing how to
> > perform such tuning?
>
> No.  There has been a lot of discussion of this topic on this list,
> and a trip through the archives of the last 60 days or so will let
> you pull out a number of tuning tips which allow very good
> performance.  My concern was writing large blocks of data, 1MB per
> write, to RAID-5, and didn't involve the overhead of small blocks at
> all; that leads through other code and behavior.

Actually Bill, I'm running RAID6 (my mistake for not mentioning it
explicitly before) - I found some material relating to RAID5 but
nothing on RAID6.

Are the concepts similar, or is RAID6 a different beast altogether?

> > Specifically, I have 8x500GB WD SATA drives on a Supermicro PCI-X
> > 8-port SATA card configured as a single RAID6 array (~3TB
> > available space).
>
> No hot spare(s)?

I'm running RAID6 instead of RAID5+1 - I've had a couple of instances
where a drive has failed in a RAID5+1 array and a second has failed
during the rebuild after the hot-spare had kicked in.

R.

* Re: raid5 software vs hardware: parity calculations?
@ 2007-01-15 17:37 ` Bill Davidsen
From: Bill Davidsen @ 2007-01-15 17:37 UTC (permalink / raw)
To: Robin Bowes; +Cc: linux-raid

Robin Bowes wrote:
> Actually Bill, I'm running RAID6 (my mistake for not mentioning it
> explicitly before) - I found some material relating to RAID5 but
> nothing on RAID6.
>
> Are the concepts similar, or is RAID6 a different beast altogether?

You mentioned that before, and I think the concepts covered in the
RAID-5 discussion apply to RAID-6 as well.  I don't have enough unused
drives to really test anything beyond RAID-5, so I have no particular
tuning information to share.  Testing on system drives introduces too
much jitter to trust the results.

> I'm running RAID6 instead of RAID5+1 - I've had a couple of instances
> where a drive has failed in a RAID5+1 array and a second has failed
> during the rebuild after the hot-spare had kicked in.

-- 
bill davidsen <davidsen@tmr.com>
  CTO TMR Associates, Inc
  Doing interesting things with small computers since 1979

* Re: raid5 software vs hardware: parity calculations?
@ 2007-01-15 21:25 ` dean gaudet
From: dean gaudet @ 2007-01-15 21:25 UTC (permalink / raw)
To: Robin Bowes; +Cc: Bill Davidsen, linux-raid

On Mon, 15 Jan 2007, Robin Bowes wrote:

> I'm running RAID6 instead of RAID5+1 - I've had a couple of instances
> where a drive has failed in a RAID5+1 array and a second has failed
> during the rebuild after the hot-spare had kicked in.

if the failures were read errors without losing the entire disk (the
typical case) then new kernels are much better -- on a read error md
will reconstruct the sectors from the other disks and attempt to write
them back.

you can also run monthly "checks"...

    echo check >/sys/block/mdX/md/sync_action

it'll read the entire array (parity included) and correct read errors
as they're discovered.

-dean
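
For a hypothetical /dev/md0, the check cycle above looks like this
sketch; the scrub runs in the background and its progress is visible
in /proc/mdstat:

    echo check > /sys/block/md0/md/sync_action   # start a background scrub
    cat /proc/mdstat                             # progress while it runs
    cat /sys/block/md0/md/sync_action            # "check" while running, "idle" when done
    echo idle  > /sys/block/md0/md/sync_action   # abort early if needed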

* Re: raid5 software vs hardware: parity calculations?
@ 2007-01-15 21:32 ` Gordon Henderson
From: Gordon Henderson @ 2007-01-15 21:32 UTC (permalink / raw)
To: linux-raid

On Mon, 15 Jan 2007, dean gaudet wrote:

> you can also run monthly "checks"...
>
>     echo check >/sys/block/mdX/md/sync_action
>
> it'll read the entire array (parity included) and correct read errors
> as they're discovered.

A-Ha ... I've not been keeping up with the list for a bit - what's the
minimum kernel version for this to work?

Cheers,

Gordon

* Re: raid5 software vs hardware: parity calculations?
@ 2007-01-16  0:35 ` berk walker
From: berk walker @ 2007-01-16 0:35 UTC (permalink / raw)
To: dean gaudet; +Cc: Robin Bowes, Bill Davidsen, linux-raid

dean gaudet wrote:
> if the failures were read errors without losing the entire disk (the
> typical case) then new kernels are much better -- on a read error md
> will reconstruct the sectors from the other disks and attempt to
> write them back.
>
> you can also run monthly "checks"...
>
>     echo check >/sys/block/mdX/md/sync_action
>
> it'll read the entire array (parity included) and correct read errors
> as they're discovered.
>
> -dean

Could I get a pointer as to how I can do this "check" in my FC5 [BLAG]
system?  I can find no appropriate "check", nor "md", available to me.
It would be a "good thing" if I were able to find potentially weak
spots, rewrite them to good, and know that it might be time for a new
drive.

All of my arrays have drives of approximately the same manufacture
date, so the possibility of more than one showing bad at the same time
cannot be ignored.

Thanks,
b-

* Re: raid5 software vs hardware: parity calculations?
@ 2007-01-16  0:48 ` dean gaudet
From: dean gaudet @ 2007-01-16 0:48 UTC (permalink / raw)
To: berk walker; +Cc: Robin Bowes, Bill Davidsen, linux-raid

On Mon, 15 Jan 2007, berk walker wrote:

> dean gaudet wrote:
> >     echo check >/sys/block/mdX/md/sync_action
> >
> > it'll read the entire array (parity included) and correct read
> > errors as they're discovered.
>
> Could I get a pointer as to how I can do this "check" in my FC5
> [BLAG] system?  I can find no appropriate "check", nor "md",
> available to me.

it should just be:

    echo check >/sys/block/mdX/md/sync_action

if you don't have a /sys/block/mdX/md/sync_action file then your
kernel is too old... or you don't have /sys mounted... (or you didn't
replace X with the raid number :)

iirc there were kernel versions which had the sync_action file but
didn't yet support the "check" action (i think possibly even as recent
as 2.6.17 had a small bug initiating one of the sync_actions, but i
forget which one).  if you can upgrade to 2.6.18.x it should work.

debian unstable (and i presume etch) will do this for all your arrays
automatically once a month.

-dean
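
The monthly automation mentioned above amounts to something like the
following sketch run from cron; Debian's actual checkarray script is
more careful, this is only the idea:

    #!/bin/sh
    # start a scrub on every md array that is currently idle
    for f in /sys/block/md*/md/sync_action; do
        [ -f "$f" ] && [ "$(cat "$f")" = "idle" ] && echo check > "$f"
    done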

* Re: raid5 software vs hardware: parity calculations?
@ 2007-01-16  3:41 ` Mr. James W. Laferriere
From: Mr. James W. Laferriere @ 2007-01-16 3:41 UTC (permalink / raw)
To: dean gaudet; +Cc: linux-raid maillist

Hello Dean,

On Mon, 15 Jan 2007, dean gaudet wrote:
...snip...
> it should just be:
>
>     echo check >/sys/block/mdX/md/sync_action
>
> if you don't have a /sys/block/mdX/md/sync_action file then your
> kernel is too old... or you don't have /sys mounted... (or you
> didn't replace X with the raid number :)
>
> debian unstable (and i presume etch) will do this for all your
> arrays automatically once a month.
>
> -dean

Being able to run a 'check' is a good thing (tm).  But without a
method to get status and data back from the check, it seems rather
bland.  Is there a tool or file to poll where that data and status can
be acquired?

Tia,
JimL

-- 
+-----------------------------------------------------------------+
| James W. Laferriere    | System Techniques   | Give me VMS      |
| Network Engineer       | 663 Beaumont Blvd   |  Give me Linux   |
| babydr@baby-dragons.com | Pacifica, CA. 94044 | only on AXP     |
+-----------------------------------------------------------------+

* Re: raid5 software vs hardware: parity calculations?
@ 2007-01-16  4:16 ` dean gaudet
From: dean gaudet @ 2007-01-16 4:16 UTC (permalink / raw)
To: Mr. James W. Laferriere; +Cc: linux-raid maillist

On Mon, 15 Jan 2007, Mr. James W. Laferriere wrote:

> Being able to run a 'check' is a good thing (tm).  But without a
> method to get status and data back from the check, it seems rather
> bland.  Is there a tool or file to poll where that data and status
> can be acquired?

i'm not 100% certain what you mean, but i generally just monitor dmesg
for the md read error message (mind you, the message pre-2.6.19 or .20
isn't very informative, but it's obvious enough).

there is also a file mismatch_cnt in the same directory as
sync_action... the Documentation/md.txt (in 2.6.18) refers to it
incorrectly as mismatch_count... but anyhow, why don't i just repaste
the relevant portion of md.txt.

-dean

...

Active md devices for levels that support data redundancy (1,4,5,6)
also have

   sync_action
     a text file that can be used to monitor and control the rebuild
     process.  It contains one word which can be one of:
       resync  - redundancy is being recalculated after unclean
                 shutdown or creation
       recover - a hot spare is being built to replace a
                 failed/missing device
       idle    - nothing is happening
       check   - A full check of redundancy was requested and is
                 happening.  This reads all blocks and checks them.  A
                 repair may also happen for some raid levels.
       repair  - A full check and repair is happening.  This is
                 similar to 'resync', but was requested by the user,
                 and the write-intent bitmap is NOT used to optimise
                 the process.

     This file is writable, and each of the strings that could be read
     are meaningful for writing.

      'idle' will stop an active resync/recovery etc.  There is no
      guarantee that another resync/recovery may not be automatically
      started again, though some event will be needed to trigger this.
      'resync' or 'recovery' can be used to restart the corresponding
      operation if it was stopped with 'idle'.
      'check' and 'repair' will start the appropriate process provided
      the current state is 'idle'.

   mismatch_count
     When performing 'check' and 'repair', and possibly when
     performing 'resync', md will count the number of errors that are
     found.  The count in 'mismatch_cnt' is the number of sectors that
     were re-written, or (for 'check') would have been re-written.  As
     most raid levels work in units of pages rather than sectors, this
     may be larger than the number of actual errors by a factor of the
     number of sectors in a page.
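
Putting those files together, a rough status poll for a hypothetical
md0 might look like the sketch below (mismatch_cnt is the actual file
name; the md.txt heading above is the documentation's slip):

    sys=/sys/block/md0/md
    echo check > $sys/sync_action
    while [ "$(cat $sys/sync_action)" != "idle" ]; do
        grep -A 2 '^md0' /proc/mdstat    # progress bar and speed
        sleep 60
    done
    echo "mismatch_cnt after check: $(cat $sys/mismatch_cnt)"
    dmesg | grep -i 'read error'         # md's read-error messages (wording varies by kernel)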

* Re: raid5 software vs hardware: parity calculations?
@ 2007-01-16  5:06 ` Bill Davidsen
From: Bill Davidsen @ 2007-01-16 5:06 UTC (permalink / raw)
To: berk walker; +Cc: dean gaudet, Robin Bowes, linux-raid

berk walker wrote:
> dean gaudet wrote:
> > you can also run monthly "checks"...
> >
> >     echo check >/sys/block/mdX/md/sync_action
> >
> > it'll read the entire array (parity included) and correct read
> > errors as they're discovered.
>
> Could I get a pointer as to how I can do this "check" in my FC5
> [BLAG] system?  I can find no appropriate "check", nor "md",
> available to me.  It would be a "good thing" if I were able to find
> potentially weak spots, rewrite them to good, and know that it might
> be time for a new drive.

Grab a recent mdadm source; it's a part of that.

> All of my arrays have drives of approximately the same manufacture
> date, so the possibility of more than one showing bad at the same
> time cannot be ignored.

Never can, but it is highly unlikely, given the MTBF of modern drives.
And when you consider total failures as opposed to bad sectors, it
gets even smaller.  There is no perfect way to avoid ever losing data,
just ways to reduce the chance, balancing the cost of data loss
against hardware.  Current Linux will rewrite bad sectors; whole-drive
failures are an argument for spares.

-- 
bill davidsen <davidsen@tmr.com>
  CTO TMR Associates, Inc
  Doing interesting things with small computers since 1979