* RAID-6 question.
From: Justin Piszcz @ 2008-11-10 11:56 UTC
To: Linux-Raid
I ran a check on a RAID6 array and my entire machine was timing out ssh
connections, etc., until the check was just about finished. I never
experienced this with RAID5. Any comments?
$ cat /sys/block/md3/md/sync_speed_min
1000 (system)
$ cat /sys/block/md3/md/sync_speed_max
200000 (system)
md3 : active raid6 sdl1[9] sdk1[8] sdj1[7] sdi1[6] sdh1[5] sdg1[4] sdf1[3] sde1[2] sdd1[1] sdc1[0]
2344263680 blocks level 6, 1024k chunk, algorithm 2 [10/10] [UUUUUUUUUU]
[===================>.] resync = 96.5% (283046144/293032960) finish=2.3min speed=71092K/sec
# dd if=/dev/zero of=disk bs=1M   # write to the entire raid6 device
# Then run a 'check' (via /sys) on all arrays that support parity.  I do
# this on a regular basis with RAID1 and RAID5 and have never seen a
# slowdown like the one I experienced with RAID6: /app/jp-mystuff/bin/check_mdraid.sh
Mon Nov 10 06:00:54 EST 2008: Parity check(s) running, sleeping 10 minutes...
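For reference, a minimal sketch of what a periodic parity-check script such as
the check_mdraid.sh mentioned above might look like; the actual script is not
shown in this thread, so the array selection, log message, and sleep interval
below are assumptions:

#!/bin/sh
# Hypothetical sketch: kick off a 'check' on every md array that exposes
# sync_action (i.e. levels with redundancy), then poll /proc/mdstat until
# all checks have finished.
for md in /sys/block/md*/md; do
    [ -w "$md/sync_action" ] && echo check > "$md/sync_action"
done
while grep -q 'check =' /proc/mdstat; do
    echo "$(date): Parity check(s) running, sleeping 10 minutes..."
    sleep 600
done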
* Re: RAID-6 question.
From: Justin Piszcz @ 2008-11-10 16:02 UTC
To: Linux-Raid
On Mon, 10 Nov 2008, Justin Piszcz wrote:
> I ran a check on a RAID6 array and my entire machine was timing out ssh
> connections, etc., until the check was just about finished. I never
> experienced this with RAID5. Any comments?
> [...]
During the RAID6 resync (recovering from 2 failed disks):
1. Manually failed 2 drives.
2. Added one drive; it started rebuilding and processes seemed OK.
3. Added the second drive during the rebuild of the first drive.
4. The exact commands run are shown below:
501 mdadm /dev/md3 --fail /dev/sdg1
502 mdadm /dev/md3 -r /dev/sdg1
503 mdadm /dev/md3 -a /dev/sdg1
504 mdadm /dev/md3 --fail /dev/sdh1
507 mdadm -D /dev/md3
508 mdadm /dev/md3 -r /dev/sdh1
517 mdadm -D /dev/md3
518 mdadm /dev/md3 -a /dev/sdh1
522 mdadm -D /dev/md3
During this rebuild, this is what the process stats look like:
--------------------------------------------------------------
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
19108 root 15 -5 0 0 0 R 100 0.0 296:36.31 md3_raid5
25676 root 15 -5 0 0 0 D 41 0.0 4:13.48 md3_resync
It also appears to 'starve' my md/root (RAID1) such that regular processes
go into D-state. This does not appear to happen under a RAID5 resync.
---------------------------------------------------------------------------
root 18954 1.3 0.0 0 0 ? D Nov09 12:34 [pdflush]
root 18246 0.0 0.0 5904 668 ? Ds Nov09 0:00 /sbin/syslogd -r
postfix 25761 0.0 0.0 43720 3128 ? D 10:43 0:00 cleanup -z -t unix -u -c
jpiszcz 25411 0.0 0.0 69020 6544 pts/35 Dl+ 10:32 0:00 alpine -i
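(For reference, a D-state listing like the one above can be produced with
something along the lines of ps aux | awk '$8 ~ /^D/'; column 8 of ps aux
output is the process state.)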
During this time, I could not ssh to the host:
md3 : active raid6 sdh1[10](S) sdg1[11] sdj1[7] sdl1[9] sdk1[8] sdi1[6] sdf1[3] sde1[2] sdd1[1] sdc1[0]
2344263680 blocks level 6, 1024k chunk, algorithm 2 [10/8] [UUUU__UUUU]
[=========>...........] recovery = 47.1% (138049268/293032960) finish=24.6min speed=104740K/sec
After I lowered the speed a little bit, the system came back:
# echo 90000 > /sys/block/md3/md/sync_speed_max
The minimum was at its (system) default:
# cat /sys/block/md3/md/sync_speed_min
1000 (system)
The sync_speed_max was also at its default until I changed it; once I lowered
the speed, the system was functional again. By default it is set quite high,
and this appeared to be the root cause of the problem.
Justin.
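For reference, a minimal sketch of the throttling described above; 90000 is
simply the value used in this thread, not a general recommendation, and
writing 'system' to restore the system-wide default is assumed from the md
sysfs documentation:

# Cap the rebuild rate on md3 so the rest of the system stays responsive:
echo 90000 > /sys/block/md3/md/sync_speed_max
# Once the rebuild has finished, return to the system-wide default:
echo system > /sys/block/md3/md/sync_speed_max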
* Re: RAID-6 question.
From: H. Peter Anvin @ 2008-11-17 0:29 UTC
To: Justin Piszcz; +Cc: Linux-Raid
Justin Piszcz wrote:
>
> The sync_speed_max was also at its default until I changed it; once I
> lowered the speed, the system was functional again. By default it is set
> quite high, and this appeared to be the root cause of the problem.
>
You probably ran out of CPU. 2-disk RAID-6 recovery is very CPU intensive.
-hpa
* Re: RAID-6 question.
From: NeilBrown @ 2008-11-17 1:01 UTC
To: H. Peter Anvin; +Cc: Justin Piszcz, Linux-Raid
On Mon, November 17, 2008 11:29 am, H. Peter Anvin wrote:
> Justin Piszcz wrote:
>>
>> The sync_speed_max was also at its default until I changed it; once I
>> lowered the speed, the system was functional again. By default it is set
>> quite high, and this appeared to be the root cause of the problem.
>>
>
> You probably ran out of CPU. 2-disk RAID-6 recovery is very CPU
> intensive.
So maybe we need to back off the resync when the CPU is busy??
I wonder how you measure "am I taking CPU from something else important"??
Maybe we just set the scheduling priority quite low. Maybe
set_user_nice(current, 15);
near the top of md_do_sync().
??
NeilBrown
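A hypothetical userspace experiment along the same lines, not suggested
anywhere in this thread: renice the md kernel threads for the affected array
while the rebuild runs. The thread names are taken from the top output earlier
in the thread, and whether this actually helps depends on where the CPU time
is going:

# md3_raid5 and md3_resync run at nice -5 by default; try lowering their
# priority for the duration of the rebuild.
renice -n 15 -p $(pgrep -x md3_raid5) $(pgrep -x md3_resync)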
* Re: RAID-6 question.
From: Justin Piszcz @ 2008-11-17 9:42 UTC
To: H. Peter Anvin; +Cc: Linux-Raid
On Sun, 16 Nov 2008, H. Peter Anvin wrote:
> Justin Piszcz wrote:
>>
>> The sync_speed_max was also at its default until I changed it; once I
>> lowered the speed, the system was functional again. By default it is set
>> quite high, and this appeared to be the root cause of the problem.
>>
>
> You probably ran out of CPU. 2-disk RAID-6 recovery is very CPU intensive.
>
> -hpa
>
Yeah, one core was pegged at 100% for the resync, and another process
(md_raidX) was at ~35% on the second core. The third and fourth cores were
unused, but as soon as I started the resync it slowed the system down quite
a bit and hurt interactivity. As I mentioned before, *without* specifying a
rebuild speed lower than the system's maximum I/O, the system is unusable
until the rebuild completes. Perhaps it's a combination of the resync using
all of one core, where RAID5 may not? I recall that with RAID5 I did not
have to specify a lower limit for the maximum rebuild speed (KiB/s).
Justin.
* Re: RAID-6 question.
From: H. Peter Anvin @ 2008-11-17 16:09 UTC
To: Justin Piszcz; +Cc: Linux-Raid
Justin Piszcz wrote:
>
> Yeah, one core was pegged at 100% for the resync, and another process
> (md_raidX) was at ~35% on the second core. [...] Perhaps it's a
> combination of the resync using all of one core, where RAID5 may not? I
> recall that with RAID5 I did not have to specify a lower limit for the
> maximum rebuild speed (KiB/s).
>
RAID-5 recovery can use the normal accelerated functions, whereas RAID-6
recovery can't (only a handful of CPUs have the operations needed to
accelerate dual-disk recovery, and even for those it is not implemented
at this point.)
-hpa
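For context, a sketch of why dual-disk data recovery is so much heavier than
the RAID-5 case, following the standard RAID-6 P/Q syndrome math rather than
anything stated in this thread; arithmetic is in GF(2^8), '+' is XOR, and g
is the field generator:

  P = D_0 + D_1 + ... + D_{n-1}
  Q = g^0 D_0 + g^1 D_1 + ... + g^{n-1} D_{n-1}

With data disks x and y lost, let P_xy and Q_xy be the syndromes computed
over the surviving data disks (i.e. with D_x = D_y = 0). Then:

  D_x = A (P + P_xy) + B (Q + Q_xy)
  D_y = (P + P_xy) + D_x

  where A = g^{y-x} / (g^{y-x} + 1) and B = g^{-x} / (g^{y-x} + 1).

Recovering two data disks therefore needs Galois-field multiplications for
every byte, whereas RAID-5 recovery (and single-data-disk RAID-6 recovery)
is pure XOR, which the kernel's existing accelerated xor routines already
cover.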