linux-raid.vger.kernel.org archive mirror
* RAID-6 question.
@ 2008-11-10 11:56 Justin Piszcz
  2008-11-10 16:02 ` Justin Piszcz
  0 siblings, 1 reply; 6+ messages in thread
From: Justin Piszcz @ 2008-11-10 11:56 UTC (permalink / raw)
  To: Linux-Raid

I ran a 'check' on a RAID-6 array and my entire machine was timing out SSH
connections, etc., until the check was just about finished.  I never
experienced this with RAID-5.  Any comments?

$ cat /sys/block/md3/md/sync_speed_min
1000 (system)
$ cat /sys/block/md3/md/sync_speed_max
200000 (system)

md3 : active raid6 sdl1[9] sdk1[8] sdj1[7] sdi1[6] sdh1[5] sdg1[4] sdf1[3] sde1[2] sdd1[1] sdc1[0]
       2344263680 blocks level 6, 1024k chunk, algorithm 2 [10/10] [UUUUUUUUUU]
       [===================>.]  resync = 96.5% (283046144/293032960) finish=2.3min speed=71092K/sec

# dd if=/dev/zero of=disk bs=1M   # first, write out the entire raid6 device
# then echo check > /sys/block/mdX/md/sync_action for every array with parity
# (I do this regularly with RAID-1 and RAID-5 and have never seen a slowdown
# like the one I saw with RAID-6): /app/jp-mystuff/bin/check_mdraid.sh
Mon Nov 10 06:00:54 EST 2008: Parity check(s) running, sleeping 10 minutes...
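
For reference, a minimal sketch of a periodic parity-check script along these
lines (the array list and sleep interval below are placeholders; this is not
the actual check_mdraid.sh):

   #!/bin/sh
   # Sketch: kick off a 'check' on each parity array, then wait for it to end.
   ARRAYS="md3"                               # assumed list of parity arrays

   for md in $ARRAYS; do
       echo check > /sys/block/$md/md/sync_action
   done

   for md in $ARRAYS; do
       while [ "$(cat /sys/block/$md/md/sync_action)" != "idle" ]; do
           echo "$(date): Parity check(s) running, sleeping 10 minutes..."
           sleep 600
       done
       # mismatch_cnt reports how many blocks disagreed during the check
       echo "$md mismatch_cnt: $(cat /sys/block/$md/md/mismatch_cnt)"
   done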



* Re: RAID-6 question.
  2008-11-10 11:56 RAID-6 question Justin Piszcz
@ 2008-11-10 16:02 ` Justin Piszcz
  2008-11-17  0:29   ` H. Peter Anvin
  0 siblings, 1 reply; 6+ messages in thread
From: Justin Piszcz @ 2008-11-10 16:02 UTC (permalink / raw)
  To: Linux-Raid



On Mon, 10 Nov 2008, Justin Piszcz wrote:

> I ran a 'check' on a RAID-6 array and my entire machine was timing out SSH
> connections, etc., until the check was just about finished.  I never
> experienced this with RAID-5.  Any comments?
>
> $ cat /sys/block/md3/md/sync_speed_min
> 1000 (system)
> $ cat /sys/block/md3/md/sync_speed_max
> 200000 (system)
>
> md3 : active raid6 sdl1[9] sdk1[8] sdj1[7] sdi1[6] sdh1[5] sdg1[4] sdf1[3] sde1[2] sdd1[1] sdc1[0]
>        2344263680 blocks level 6, 1024k chunk, algorithm 2 [10/10] [UUUUUUUUUU]
>        [===================>.]  resync = 96.5% (283046144/293032960) finish=2.3min speed=71092K/sec
>
> # dd if=/dev/zero of=disk bs=1M   # first, write out the entire raid6 device
> # then echo check > /sys/block/mdX/md/sync_action for every array with parity
> # (I do this regularly with RAID-1 and RAID-5 and have never seen a slowdown
> # like the one I saw with RAID-6): /app/jp-mystuff/bin/check_mdraid.sh
> Mon Nov 10 06:00:54 EST 2008: Parity check(s) running, sleeping 10 minutes...
>
>

During a RAID-6 resync (recovering from 2 failed disks):
1. Manually failed 2 drives.
2. Added one drive; it started rebuilding and processes seemed OK.
3. Added the second drive while the first was still rebuilding.
4. The exact commands run are shown below:

   501  mdadm /dev/md3 --fail /dev/sdg1
   502  mdadm /dev/md3 -r /dev/sdg1
   503  mdadm /dev/md3 -a /dev/sdg1
   504  mdadm /dev/md3 --fail /dev/sdh1
   507  mdadm -D /dev/md3
   508  mdadm /dev/md3 -r /dev/sdh1
   517  mdadm -D /dev/md3
   518  mdadm /dev/md3 -a /dev/sdh1
   522  mdadm -D /dev/md3

During this rebuild, this is what the process stats look like:
--------------------------------------------------------------
   PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND
19108 root      15  -5     0    0    0 R  100  0.0 296:36.31 md3_raid5
25676 root      15  -5     0    0    0 D   41  0.0   4:13.48 md3_resync

It also appears to 'starve' my md root filesystem (RAID-1), such that regular
processes go into D-state.  This does not appear to happen during a RAID-5 resync.
---------------------------------------------------------------------------
root     18954  1.3  0.0      0     0 ?        D    Nov09  12:34 [pdflush]
root     18246  0.0  0.0   5904   668 ?        Ds   Nov09   0:00 /sbin/syslogd -r
postfix  25761  0.0  0.0  43720  3128 ?        D    10:43   0:00 cleanup -z -t unix -u -c
jpiszcz  25411  0.0  0.0  69020  6544 pts/35   Dl+  10:32   0:00 alpine -i

During this time, I cannot ssh to the host:

md3 : active raid6 sdh1[10](S) sdg1[11] sdj1[7] sdl1[9] sdk1[8] sdi1[6] sdf1[3] sde1[2] sdd1[1] sdc1[0]
       2344263680 blocks level 6, 1024k chunk, algorithm 2 [10/8] [UUUU__UUUU]
       [=========>...........]  recovery = 47.1% (138049268/293032960) finish=24.6min speed=104740K/sec

After I lowered the speed a little bit, the system came back:

# echo 90000 > /sys/block/md3/md/sync_speed_max

The minimum/maximum were at their defaults:

# cat /sys/block/md3/md/sync_speed_min
1000 (system)

The sync_speed_max was at its default as well until I changed it; once I
lowered the speed, the system was functional again.  By default it's set
quite high, and that appeared to be the root cause of the problem.
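
Putting it together, a minimal sketch of the workaround: cap the rebuild rate,
wait for the recovery to finish, and then restore the default ceiling (writing
the string "system" reverts sync_speed_max to the system-wide default):

   # cap the rebuild at roughly 90 MB/s while the box needs to stay responsive
   echo 90000 > /sys/block/md3/md/sync_speed_max

   # wait for the recovery to finish
   while [ "$(cat /sys/block/md3/md/sync_action)" != "idle" ]; do
       sleep 60
   done

   # revert to the system-wide default ceiling
   echo system > /sys/block/md3/md/sync_speed_max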

Justin.



* Re: RAID-6 question.
  2008-11-10 16:02 ` Justin Piszcz
@ 2008-11-17  0:29   ` H. Peter Anvin
  2008-11-17  1:01     ` NeilBrown
  2008-11-17  9:42     ` Justin Piszcz
  0 siblings, 2 replies; 6+ messages in thread
From: H. Peter Anvin @ 2008-11-17  0:29 UTC (permalink / raw)
  To: Justin Piszcz; +Cc: Linux-Raid

Justin Piszcz wrote:
> 
> The sync_speed_max was at its default as well until I changed it; once I
> lowered the speed, the system was functional again.  By default it's set
> quite high, and that appeared to be the root cause of the problem.
> 

You probably ran out of CPU.  2-disk RAID-6 recovery is very CPU intensive.
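
If you want to confirm that, a quick sanity check is to watch per-core usage
while the recovery runs and see whether the md threads pin a single core
(mpstat comes from the sysstat package; adjust the md3 name to your array):

   # per-core CPU usage, refreshed every 2 seconds
   mpstat -P ALL 2

   # CPU time of the md threads for this array
   top -b -n 1 | grep -E 'md3_(raid|resync)'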

	-hpa


* Re: RAID-6 question.
  2008-11-17  0:29   ` H. Peter Anvin
@ 2008-11-17  1:01     ` NeilBrown
  2008-11-17  9:42     ` Justin Piszcz
  1 sibling, 0 replies; 6+ messages in thread
From: NeilBrown @ 2008-11-17  1:01 UTC (permalink / raw)
  To: H. Peter Anvin; +Cc: Justin Piszcz, Linux-Raid

On Mon, November 17, 2008 11:29 am, H. Peter Anvin wrote:
> Justin Piszcz wrote:
>>
>> The sync_speed_max was at its default as well until I changed it; once I
>> lowered the speed, the system was functional again.  By default it's set
>> quite high, and that appeared to be the root cause of the problem.
>>
>
> You probably ran out of CPU.  2-disk RAID-6 recovery is very CPU
> intensive.

So maybe we need to back off the resync when the CPU is busy??

I wonder how you measure "am I taking CPU from something else important"??

Maybe we just set the scheduling priority quite low.  Maybe

   set_user_nice(current, 15);

near the top of md_do_sync().
??
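
As a rough userspace approximation of the same idea (just an experiment, not
something md does today), the md threads can be reniced by hand to see
whether priority alone makes a difference:

   # md threads run at nice -5 by default; push the resync thread down to 15
   renice -n 15 -p "$(pgrep md3_resync)"
   # (the md3_raid5 thread could be reniced the same way)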

NeilBrown



* Re: RAID-6 question.
  2008-11-17  0:29   ` H. Peter Anvin
  2008-11-17  1:01     ` NeilBrown
@ 2008-11-17  9:42     ` Justin Piszcz
  2008-11-17 16:09       ` H. Peter Anvin
  1 sibling, 1 reply; 6+ messages in thread
From: Justin Piszcz @ 2008-11-17  9:42 UTC (permalink / raw)
  To: H. Peter Anvin; +Cc: Linux-Raid



On Sun, 16 Nov 2008, H. Peter Anvin wrote:

> Justin Piszcz wrote:
>>
>> The sync_speed_max was at its default as well until I changed it; once I
>> lowered the speed, the system was functional again.  By default it's set
>> quite high, and that appeared to be the root cause of the problem.
>>
>
> You probably ran out of CPU.  2-disk RAID-6 recovery is very CPU intensive.
>
> 	-hpa
>

Yeah, one core was pegged at 100% for the resync, and another (md_raidX)
process was at ~35% on the second core.  The third and fourth cores were
unused, but as soon as I started the resync it definitely slowed the system
down quite a bit and hurt interactivity.  As I mentioned before, *without*
specifying a rebuild speed lower than the system's maximum I/O, the system is
unusable until the rebuild completes.  Perhaps it's a combination of using
all of one core for the resync where RAID-5 may not?  I recall that with
RAID-5 I did not have to lower the maximum rebuild speed (KiB/s).

Justin.


* Re: RAID-6 question.
  2008-11-17  9:42     ` Justin Piszcz
@ 2008-11-17 16:09       ` H. Peter Anvin
  0 siblings, 0 replies; 6+ messages in thread
From: H. Peter Anvin @ 2008-11-17 16:09 UTC (permalink / raw)
  To: Justin Piszcz; +Cc: Linux-Raid

Justin Piszcz wrote:
> 
> Yeah, one core was pegged at 100% for the resync, and another (md_raidX)
> process was at ~35% on the second core.  The third and fourth cores were
> unused, but as soon as I started the resync it definitely slowed the system
> down quite a bit and hurt interactivity.  As I mentioned before, *without*
> specifying a rebuild speed lower than the system's maximum I/O, the system
> is unusable until the rebuild completes.  Perhaps it's a combination of
> using all of one core for the resync where RAID-5 may not?  I recall that
> with RAID-5 I did not have to lower the maximum rebuild speed (KiB/s).
> 

RAID-5 recovery can use the normal accelerated functions, whereas RAID-6
recovery can't (only a handful of CPUs have the operations needed to
accelerate dual-disk recovery, and even for those it is not implemented
at this point).
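
For reference, the kernel prints a RAID-6 benchmark at boot (or module load)
showing which syndrome-generation routine it picked; note this covers parity
generation only, not the dual-failure recovery path described above:

   dmesg | grep -i raid6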

	-hpa

