Problems with RAID 6 across 15 disks

Linux RAID subsystem development

* Problems with RAID 6 across 15 disks
@ 2010-04-01 13:23 Max Eaves
  2010-04-01 13:49 ` Doug Ledford
  0 siblings, 1 reply; 13+ messages in thread
From: Max Eaves @ 2010-04-01 13:23 UTC (permalink / raw)
  To: linux-raid

Hi there,

I hope this gets through....my first posting on this dist.list.

I am running CentOS 5.4 with a 2.6.18-164.15.1.el5 (x86_64) kernel on a
rather "homebrew" Backblaze system (http://blog.backblaze.com/).

The mdadm version is: mdadm - v2.6.9 - 10th March 2009

It uses a number of Silicon Image 3124 (sIL 3124) cards and a number of 
port multiplier cards (sIL3132) to read a large number of disks.

I have 45 disks arranged into 3 mdadm RAID sets of 15 disks.  Each set of 
15 disks is a RAID6 array.

The problem I have is this:

At random times, the RAID decides that it needs to resynchronise 
/dev/md10, /dev/md11 and /dev/md12.  There is no error or log event in 
/var/log/messages, but the first thing I notice is that the performance 
of the RAID array drops, and checking out "cat /proc/mdstat" shows all 
three RAID arrays resynchronising themselves.

ARRAY /dev/md0 level=raid1 num-devices=2 
uuid=7d7b19e6:56cc90cc:3cb166bd:b8086f29 (system boot) (not a problem)
ARRAY /dev/md1 level=raid1 num-devices=2 
uuid=3782d93d:a491ffd4:f32c1014:94a2b3f7 (system LVM) (not a problem)
ARRAY /dev/md10 level=raid6 num-devices=15 
uuid=5ca86e2a-3b86-4c0b-9a7a-59143bdcd0f1 (partition 1) (problem)
ARRAY /dev/md11 level=raid6 num-devices=15 
uuid=61188c90-4825-44c5-8fac-9bc82a5799fe (partition 2) (problem)
ARRAY /dev/md12 level=raid6 num-devices=15 
uuid=fa939816-1d0f-4eaa-98dd-c131449c3921 (partition 3) (problem)

These re-synchronisation events take about a week to complete (the RAID 
is 18TB a pop)

I know that the performance of this system is not great, but I wonder if 
this resynchronisation is occurring because of some I/O time-out.

Oddly enough, a restart of the server fixes the problem for a couple of 
days, and then the problem occurs again (hmm - not good).

I'm happy to post logs etc....just let me know what you need.

Thanks

Max


* Re: Problems with RAID 6 across 15 disks
  2010-04-01 13:23 Problems with RAID 6 across 15 disks Max Eaves
@ 2010-04-01 13:49 ` Doug Ledford
  2010-04-01 14:07   ` Max Eaves
  0 siblings, 1 reply; 13+ messages in thread
From: Doug Ledford @ 2010-04-01 13:49 UTC (permalink / raw)
  To: max; +Cc: linux-raid

On 04/01/2010 09:23 AM, Max Eaves wrote:
> At random times, the RAID decides that it needs to resynchronise
> /dev/md10, /dev/md11 and /dev/md12.  There is no error or log event in
> /var/log/messages, but the first thing I notice is that the performance
> of the RAID array drops, and checking out "cat /proc/mdstat" shows all
> three RAID arrays resynchronising themselves.

Disable /etc/cron.weekly/99-raid-check.  They aren't resynchronizing,
they are actually just checking themselves for consistency, but because
the 2.6.18 kernel didn't have a different word for it in the output of
/proc/mdstat it just looks that way.  I can't remember if the version of
mdadm in CentOS 5.4 has the /etc/sysconfig/raid-check config file, but
if it does, it's easy to disable the weekly check there.
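
One rough way to tell a scheduled check apart from a genuine resync is to
read the md "sync_action" attribute instead of /proc/mdstat.  The sketch
below assumes the kernel exposes /sys/block/<array>/md/sync_action (the
mainline md documentation describes it; whether this particular CentOS 5.4
kernel does has not been verified here) and that Python 3 is available.
The function names are made up for illustration.

```python
from pathlib import Path

# Meanings of the sync_action values described in the kernel md docs.
DESCRIPTIONS = {
    "idle": "no sync activity",
    "check": "read-only consistency scrub",
    "repair": "scrub that also rewrites inconsistent stripes",
    "resync": "redundancy being recalculated (e.g. after unclean shutdown)",
    "recover": "rebuilding onto a spare or replacement device",
}


def sync_action(array, sysfs_root="/sys/block"):
    """Return the current sync_action string for an array such as 'md10'."""
    path = Path(sysfs_root) / array / "md" / "sync_action"
    return path.read_text().strip()


def describe(action):
    """Map a sync_action value to a short human-readable description."""
    return DESCRIPTIONS.get(action, "unrecognised state: " + action)
```

For example, describe(sync_action("md10")) would report a scrub rather
than a resync while the weekly raid-check job is running.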


-- 
Doug Ledford <dledford@redhat.com>
              GPG KeyID: CFBFF194
	      http://people.redhat.com/dledford

Infiniband specific RPMs available at
	      http://people.redhat.com/dledford/Infiniband



* Re: Problems with RAID 6 across 15 disks
  2010-04-01 13:49 ` Doug Ledford
@ 2010-04-01 14:07   ` Max Eaves
  2010-04-01 20:43     ` Neil Brown
  0 siblings, 1 reply; 13+ messages in thread
From: Max Eaves @ 2010-04-01 14:07 UTC (permalink / raw)
  To: linux-raid; +Cc: Doug Ledford

Doug,

Thank you very much for that; a great relief off my shoulders.

You are right - there is a config file located in 
/etc/sysconfig/raid-check.  I've changed ENABLED to no.

Amazing - I've learnt something today.

Thanks once again.

Max


* Re: Problems with RAID 6 across 15 disks
  2010-04-01 14:07   ` Max Eaves
@ 2010-04-01 20:43     ` Neil Brown
  2010-04-01 22:46       ` Piergiorgio Sartor
  2010-04-02  5:55       ` responsiveness during raid check (Was: Problems with RAID 6 across 15 disks) Luca Berra
  0 siblings, 2 replies; 13+ messages in thread
From: Neil Brown @ 2010-04-01 20:43 UTC (permalink / raw)
  To: max; +Cc: linux-raid, Doug Ledford

On Thu, 01 Apr 2010 15:07:27 +0100
Max Eaves <max@maxeaves.co.uk> wrote:

> Doug,
> 
> Thank you very much for that; a great relief off my shoulders.
> 
> You are right - there is a config file located in 
> /etc/sysconfig/raid-check.  I've changed ENABLED to no.

However there is real value in doing that check, at least occasionally.  It
catches latent read errors.

You might want to run it only every couple of months, and you might want to
wind down one or both of the /proc/sys/dev/raid/speed_limit_* numbers so
there is minimal impact on your system.

But not scrubbing at all is not advisable.
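
A minimal sketch of inspecting and lowering those limits, assuming Python 3
and root privileges.  The two file names, speed_limit_min and
speed_limit_max, are the standard md sysctls and hold a rate in KiB/s per
device; the helper names and the value 5000 used in the note below are
arbitrary illustrations, not recommendations from the thread.

```python
from pathlib import Path

RAID_SYSCTL_DIR = "/proc/sys/dev/raid"
LIMIT_NAMES = ("speed_limit_min", "speed_limit_max")


def read_speed_limits(base=RAID_SYSCTL_DIR):
    """Return the current md sync speed limits (KiB/s per device)."""
    base = Path(base)
    return {name: int((base / name).read_text()) for name in LIMIT_NAMES}


def set_speed_limit(name, kib_per_sec, base=RAID_SYSCTL_DIR):
    """Write a new value to one of the two speed limit files."""
    if name not in LIMIT_NAMES:
        raise ValueError("unknown limit: " + name)
    if kib_per_sec <= 0:
        raise ValueError("limit must be a positive number of KiB/s")
    (Path(base) / name).write_text("%d\n" % kib_per_sec)
```

Calling set_speed_limit("speed_limit_max", 5000) before a scrub would cap
its rate; the change does not survive a reboot unless it is also set in
/etc/sysctl.conf or similar.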

NeilBrown



* Re: Problems with RAID 6 across 15 disks
  2010-04-01 20:43     ` Neil Brown
@ 2010-04-01 22:46       ` Piergiorgio Sartor
  2010-04-01 22:58         ` Jools Wills
  2010-04-02  5:55       ` responsiveness during raid check (Was: Problems with RAID 6 across 15 disks) Luca Berra
  1 sibling, 1 reply; 13+ messages in thread
From: Piergiorgio Sartor @ 2010-04-01 22:46 UTC (permalink / raw)
  To: Neil Brown; +Cc: max, linux-raid, Doug Ledford

Hi,

> However there is real value in doing that check, at least occasionally.  It
> catches latent read errors.

but since it is not possible to correct those errors,
there is no point in doing it... :-)

Sorry, I couldn't resist... ;-)

bye,

-- 

piergiorgio


* Re: Problems with RAID 6 across 15 disks
  2010-04-01 22:46       ` Piergiorgio Sartor
@ 2010-04-01 22:58         ` Jools Wills
  2010-04-01 23:04           ` Piergiorgio Sartor
  0 siblings, 1 reply; 13+ messages in thread
From: Jools Wills @ 2010-04-01 22:58 UTC (permalink / raw)
  To: Piergiorgio Sartor; +Cc: Neil Brown, max, linux-raid, Doug Ledford

On Fri, 2010-04-02 at 00:46 +0200, Piergiorgio Sartor wrote:
> Hi,
> 
> > However there is real value in doing that check, at least occasionally.  It
> > catches latent read errors.
> 
> but since it is not possible to correct those errors,
> there is no point in doing it... :-)

Well it can. It can try and rewrite the block based on the data from the
other disks, and if the drive needs to, it can remap the bad block.

Best Regards

Jools

Jools Wills
-- 
IT Consultant
Oxford Inspire - http://www.oxfordinspire.co.uk - be inspired
t: 01235 519446 m: 07966 577498
jools@oxfordinspire.co.uk



* Re: Problems with RAID 6 across 15 disks
  2010-04-01 22:58         ` Jools Wills
@ 2010-04-01 23:04           ` Piergiorgio Sartor
  2010-04-01 23:46             ` Michael Evans
  2010-04-02  1:40             ` Jools Wills
  0 siblings, 2 replies; 13+ messages in thread
From: Piergiorgio Sartor @ 2010-04-01 23:04 UTC (permalink / raw)
  To: Jools Wills; +Cc: Piergiorgio Sartor, Neil Brown, max, linux-raid, Doug Ledford

Hi,

> > but since it is not possible to correct those errors,
> > there is no point in doing it... :-)
> 
> Well it can. It can try and rewrite the block based on the data from the
> other disks, and if the drive needs to, it can remap the bad block.

you might be unaware of the repeated neverending
discussions about this topic.

It is *possible* to do it, but, as of today, it
cannot do it.

I mean, there is no functionality, in the RAID-6, to
detect and correct those errors using the available
double parity.

Consider that the RAID check returns only how many
mismatches are present, not where they are, i.e. on
which disks.

bye,

-- 

piergiorgio


* Re: Problems with RAID 6 across 15 disks
  2010-04-01 23:04           ` Piergiorgio Sartor
@ 2010-04-01 23:46             ` Michael Evans
  2010-04-02  1:40             ` Jools Wills
  1 sibling, 0 replies; 13+ messages in thread
From: Michael Evans @ 2010-04-01 23:46 UTC (permalink / raw)
  To: Piergiorgio Sartor; +Cc: Jools Wills, Neil Brown, max, linux-raid, Doug Ledford

On Thu, Apr 1, 2010 at 4:04 PM, Piergiorgio Sartor
<piergiorgio.sartor@nexgo.de> wrote:
> Hi,
>
>> > but since it is not possible to correct those errors,
>> > there is no point in doing it... :-)
>>
>> Well it can. It can try and rewrite the block based on the data from the
>> other disks, and if the drive needs to, it can remap the bad block.
>
> you might be unaware of the repeated neverending
> discussions about this topic.
>
> It is *possible* to do it, but, as of today, it
> cannot do it.
>
> I mean, there is no functionality, in the RAID-6, to
> detect and correct those errors using the available
> double parity.
>
> Consider that the RAID check returns only how many
> mismatch are present, not where they are, i.e. on
> which disks.

You are correct in that /silent/ errors cannot be detected.  However,
drives typically do not verify writes, and if for whatever reason a
sector that was written cannot be read back, the drive will
/eventually/ return an error.  At this point a re-write is issued
based on the data recovered from the other drives.  Only if that fails
is the drive kicked from the array.

* Re: Problems with RAID 6 across 15 disks
  2010-04-01 23:04           ` Piergiorgio Sartor
  2010-04-01 23:46             ` Michael Evans
@ 2010-04-02  1:40             ` Jools Wills
  2010-04-02  5:03               ` Neil Brown
  1 sibling, 1 reply; 13+ messages in thread
From: Jools Wills @ 2010-04-02  1:40 UTC (permalink / raw)
  To: Piergiorgio Sartor; +Cc: Neil Brown, max, linux-raid, Doug Ledford

On Fri, 2010-04-02 at 01:04 +0200, Piergiorgio Sartor wrote:
> you might be unaware of the repeated neverending
> discussions about this topic.

yup :)

> It is *possible* to do it, but, as of today, it
> cannot do it.
> I mean, there is no functionality, in the RAID-6, to
> detect and correct those errors using the available
> double parity.

Is this the same for raid 5 or specifically a raid 6 issue on linux ?

I had assumed that with my raid5 array, if the raid check finds an error
it will attempt to rewrite back to the disk, and then read again, and
carry on if everything is ok.

Best Regards

Jools

Jools Wills
-- 
IT Consultant
Oxford Inspire - http://www.oxfordinspire.co.uk - be inspired
t: 01235 519446 m: 07966 577498
jools@oxfordinspire.co.uk



* Re: Problems with RAID 6 across 15 disks
  2010-04-02  1:40             ` Jools Wills
@ 2010-04-02  5:03               ` Neil Brown
  2010-04-02  8:22                 ` Piergiorgio Sartor
  2010-04-02 10:21                 ` Max Eaves
  0 siblings, 2 replies; 13+ messages in thread
From: Neil Brown @ 2010-04-02  5:03 UTC (permalink / raw)
  To: jools; +Cc: Piergiorgio Sartor, max, linux-raid, Doug Ledford

On Fri, 02 Apr 2010 02:40:13 +0100
Jools Wills <jools@oxfordinspire.co.uk> wrote:

> On Fri, 2010-04-02 at 01:04 +0200, Piergiorgio Sartor wrote:
> > you might be unaware of the repeated neverending
> > discussions about this topic.
> 
> yup :)
> 
> > It is *possible* to do it, but, as of today, it
> > cannot do it.
> > I mean, there is no functionality, in the RAID-6, to
> > detect and correct those errors using the available
> > double parity.
> 
> Is this the same for raid 5 or specifically a raid 6 issue on linux ?
> 
> I had assumed that with my raid5 array, if the raid check finds an error
> it will attempt to rewrite back to the disk, and then read again, and
> carry on if everything is ok.

Piergiorgio is confusing you.  Maybe he is confused himself.

The most likely cause of error on modern drives is a media problem.  Maybe the
data wasn't stored well, or maybe the charge in the media decayed.
When you have trillions of bytes on a drive, the chance of something going
wrong becomes quite significant.

When this happens the drive will notice while reading and will report an
error (after trying a few times).  It detects an error because an
error-detecting code (CRC?) reported an error.

When this happens on a non-degraded array (RAID 1,10,4,5,6) md will recover
the data from elsewhere and write out good data, which will normally fix the
problem.

Of course md cannot do this if it never reads the data, and on a terabyte
drive there is probably lots of data that won't be read often.

So a regular check pass to 'scrub' the device is a good idea as it will find
these sleeping bad blocks by reading every single block.
It doesn't have to be weekly, or even monthly.  But regular is important.

You need to find a frequency and speed that matches your storage size and
throughput requirements, and how cautious you feel.

The situation which Piergiorgio is referring to is quite different.
It is conceivably possible for wrong data to be written and a matching CRC to
be written with it.  In this case the drive doesn't notice so md doesn't
notice.
If you know the source of the error, or catch it before any write happens on
the same stripe, then it is possible on RAID6 or RAID1 with >2 drives to
work out with high probability which block has wrong data, and to fix it.
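
To illustrate how a single wrong block can be located with RAID6's double
parity, here is a small sketch.  It assumes the usual RAID-6 construction
(P is the XOR of the data bytes, Q is the sum of g^i * D_i over GF(2^8)
with polynomial 0x11d and g = 2) and works on one byte per disk for
brevity.  The function names are invented for this example, and this is
not something md does during a check.

```python
# GF(2^8) log/antilog tables for generator g = 2, polynomial 0x11d.
EXP = [0] * 510
LOG = [0] * 256
value = 1
for power in range(255):
    EXP[power] = value
    LOG[value] = power
    value <<= 1
    if value & 0x100:
        value ^= 0x11D
for power in range(255, 510):
    EXP[power] = EXP[power - 255]


def gf_mul(a, b):
    """Multiply two field elements using the log tables."""
    if a == 0 or b == 0:
        return 0
    return EXP[LOG[a] + LOG[b]]


def syndromes(data):
    """Compute the (P, Q) parity bytes for one byte from each data disk."""
    p = q = 0
    for index, byte in enumerate(data):
        p ^= byte
        q ^= gf_mul(EXP[index], byte)
    return p, q


def locate_single_error(data, p_stored, q_stored):
    """Return None if consistent, 'P' or 'Q' if only that parity byte is
    off, the index of the suspect data disk, or 'unknown' otherwise."""
    p, q = syndromes(data)
    p_diff, q_diff = p ^ p_stored, q ^ q_stored
    if p_diff == 0 and q_diff == 0:
        return None
    if p_diff == 0:
        return "Q"
    if q_diff == 0:
        return "P"
    # A single bad data byte D_z gives q_diff = g^z * p_diff, so solve for z.
    z = (LOG[q_diff] - LOG[p_diff]) % 255
    return z if z < len(data) else "unknown"
```

The answer is only trustworthy if exactly one block in the stripe is
wrong, which is why the result is "with high probability" rather than
certain.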

This sort of problem is much rarer, and is very likely to be accompanied
by other errors that could well lead to general system failure:
bad memory, bit flips on a bus that is not ECC protected, things like that.

As I said, it only makes sense to attempt to 'correct' this if you know that
the stripe has not been written to since the error occurred.  You can only
really know this if you check for errors before every write.  We don't do
that and it would be a significant performance impact (I expect) to do so.

It does not make sense to try to fix these extremely rare possible errors on a
regular scan.  It does make sense to report them with more detail than we
currently do.  Patches always welcome.

http://neil.brown.name/blog/20100211050355

NeilBrown


* Re: Problems with RAID 6 across 15 disks
  2010-04-02  5:03               ` Neil Brown
@ 2010-04-02  8:22                 ` Piergiorgio Sartor
  2010-04-02 10:21                 ` Max Eaves
  1 sibling, 0 replies; 13+ messages in thread
From: Piergiorgio Sartor @ 2010-04-02  8:22 UTC (permalink / raw)
  To: Neil Brown; +Cc: jools, Piergiorgio Sartor, max, linux-raid, Doug Ledford

Hi,

as usual very precise... :-)

It's only that I like the topic, maybe someday
someone will provide some patches, if there is
a regular "scrubbing"... ;-)

bye,

-- 

piergiorgio


* Re: Problems with RAID 6 across 15 disks
  2010-04-02  5:03               ` Neil Brown
  2010-04-02  8:22                 ` Piergiorgio Sartor
@ 2010-04-02 10:21                 ` Max Eaves
  1 sibling, 0 replies; 13+ messages in thread
From: Max Eaves @ 2010-04-02 10:21 UTC (permalink / raw)
  To: linux-raid

Dear all,

Thank you all very much for your replies over the past 24 hours 
on this.  It did make me chuckle at how I seem to have wandered into a 
hornets' nest and given it a jolly good stir.

So - I have decided that what I will do is make the checking script a 
bi-monthly process (it runs every other month), in a new folder on my 
server called /etc/cron.bimonthly, referenced in /etc/crontab.

I feel what should really happen is a more sensible way of checking the 
RAID arrays: one at a time, instead of scanning every single array at the 
same time (not good for my I/O).  My RAID cards are connected to a slow 
PCI-X 133MHz bus, so I feel that this is the way forward.  
I'll see what I can do in that direction.
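
A rough sketch of that one-array-at-a-time idea, assuming Python 3, root
privileges, and a kernel that exposes /sys/block/<array>/md/sync_action
and accepts "check" written to it.  The function name, the polling
interval and the injectable sleep parameter are illustrative choices, not
part of the raid-check script.

```python
import time
from pathlib import Path


def scrub_one_at_a_time(arrays, sysfs_root="/sys/block",
                        poll_seconds=60, sleep=time.sleep):
    """Start a check on each array in turn, waiting for it to finish.

    Arrays that are not idle when their turn comes are skipped rather
    than interrupted.  Returns the list of arrays that were checked.
    """
    checked = []
    for array in arrays:
        action = Path(sysfs_root) / array / "md" / "sync_action"
        if action.read_text().strip() != "idle":
            continue
        action.write_text("check\n")
        # Poll until the kernel reports the array as idle again.
        while action.read_text().strip() != "idle":
            sleep(poll_seconds)
        checked.append(array)
    return checked
```

Run from a cron job as scrub_one_at_a_time(["md10", "md11", "md12"]), this
would keep only one 15-disk array busy on the shared bus at any time.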

Thanks

Max


* responsiveness during raid check (Was: Problems with RAID 6 across 15 disks)
  2010-04-01 20:43     ` Neil Brown
  2010-04-01 22:46       ` Piergiorgio Sartor
@ 2010-04-02  5:55       ` Luca Berra
  1 sibling, 0 replies; 13+ messages in thread
From: Luca Berra @ 2010-04-02  5:55 UTC (permalink / raw)
  To: linux-raid

On Fri, Apr 02, 2010 at 07:43:25AM +1100, Neil Brown wrote:
>On Thu, 01 Apr 2010 15:07:27 +0100
>Max Eaves <max@maxeaves.co.uk> wrote:
>
>> Doug,
>> 
>> Thank you very much for that; a great relief off my shoulders.
>> 
>> You are right - there is a config file located in 
>> /etc/sysconfig/raid-check.  I've changed ENABLED to no.
>
>However there is real value in doing that check, at least occasionally.  It
>catches latent read errors.
>
>You might want to run it only every couple of months, and you might want to
>wind down one or both of the /proc/sys/dev/raid/speed_limit_* numbers so
>there is minimal impact on your system.
>
Sorry if I am hijacking, but I got a report from one user that the
scheduled scrubbing is severely impacting responsiveness.  Lowering the
speed_limits seems to help a bit, but he reports it is still sluggish.
I always believed the check should use idle time and not impact
performance that much.  Could it be scheduler related?

regards,
L.

-- 
Luca Berra -- bluca@comedia.it
         Communication Media & Services S.r.l.
  /"\
  \ /     ASCII RIBBON CAMPAIGN
   X        AGAINST HTML MAIL
  / \
