* RAID5 rebuild question
From: Christopher Smith @ 2005-07-03 6:20 UTC
To: linux-raid
While waiting for a rather large RAID5 array to build, I noticed the
following output from iostat -k 1:
Linux 2.6.11-1.1369_FC4smp (justinstalled.syd.nighthawkrad.net)    04/07/05
avg-cpu: %user %nice %sys %iowait %idle
1.10 0.00 5.24 2.45 91.21
Device: tps kB_read/s kB_wrtn/s kB_read kB_wrtn
hda 7.79 58.17 46.26 82741 65802
sda 86.70 8221.20 391.64 11693016 557032
sdb 81.11 8221.16 15.06 11692952 21416
sdc 80.85 8221.18 14.16 11692980 20136
sdd 80.93 8221.20 15.06 11693016 21416
sde 81.01 8221.20 15.37 11693016 21864
sdf 80.79 8221.20 14.16 11693016 20136
sdg 80.91 8221.20 14.52 11693016 20648
sdh 79.67 8221.16 6.91 11692952 9832
sdi 78.95 8221.20 0.03 11693016 40
sdj 79.04 8221.20 0.03 11693016 40
sdk 79.48 8221.20 0.03 11693016 40
sdl 93.28 0.33 8269.91 472 11762288
md0 1.60 0.00 102.28 0 145472
avg-cpu: %user %nice %sys %iowait %idle
0.49 0.00 7.35 0.00 92.16
Device: tps kB_read/s kB_wrtn/s kB_read kB_wrtn
hda 0.00 0.00 0.00 0 0
sda 100.99 9417.82 0.00 9512 0
sdb 101.98 9417.82 0.00 9512 0
sdc 100.00 9417.82 0.00 9512 0
sdd 98.02 9417.82 0.00 9512 0
sde 96.04 9417.82 0.00 9512 0
sdf 96.04 9417.82 0.00 9512 0
sdg 96.04 9417.82 0.00 9512 0
sdh 96.04 9417.82 0.00 9512 0
sdi 99.01 9417.82 0.00 9512 0
sdj 100.00 9417.82 0.00 9512 0
sdk 99.01 9417.82 0.00 9512 0
sdl 109.90 0.00 9504.95 0 9600
md0 0.00 0.00 0.00 0 0
avg-cpu: %user %nice %sys %iowait %idle
0.00 0.00 5.53 0.00 94.47
Device: tps kB_read/s kB_wrtn/s kB_read kB_wrtn
hda 0.00 0.00 0.00 0 0
sda 102.02 9765.66 0.00 9668 0
sdb 108.08 9765.66 0.00 9668 0
sdc 108.08 9765.66 0.00 9668 0
sdd 108.08 9765.66 0.00 9668 0
sde 103.03 9765.66 0.00 9668 0
sdf 103.03 9765.66 0.00 9668 0
sdg 103.03 9765.66 0.00 9668 0
sdh 102.02 9765.66 0.00 9668 0
sdi 105.05 9765.66 0.00 9668 0
sdj 105.05 9765.66 0.00 9668 0
sdk 103.03 9765.66 0.00 9668 0
sdl 120.20 0.00 9696.97 0 9600
md0 0.00 0.00 0.00 0 0
avg-cpu: %user %nice %sys %iowait %idle
0.00 0.00 6.00 0.00 94.00
Device: tps kB_read/s kB_wrtn/s kB_read kB_wrtn
hda 0.00 0.00 0.00 0 0
sda 109.90 9500.99 0.00 9596 0
sdb 103.96 9500.99 0.00 9596 0
sdc 107.92 9500.99 0.00 9596 0
sdd 106.93 9500.99 0.00 9596 0
sde 104.95 9500.99 0.00 9596 0
sdf 102.97 9500.99 0.00 9596 0
sdg 104.95 9500.99 0.00 9596 0
sdh 102.97 9500.99 0.00 9596 0
sdi 101.98 9500.99 0.00 9596 0
sdj 101.98 9500.99 0.00 9596 0
sdk 101.98 9500.99 0.00 9596 0
sdl 154.46 0.00 9536.63 0 9632
md0 0.00 0.00 0.00 0 0
avg-cpu: %user %nice %sys %iowait %idle
0.00 0.00 5.50 0.00 94.50
Device: tps kB_read/s kB_wrtn/s kB_read kB_wrtn
hda 0.00 0.00 0.00 0 0
sda 100.99 9401.98 0.00 9496 0
sdb 100.00 9401.98 0.00 9496 0
sdc 98.02 9401.98 0.00 9496 0
sdd 100.00 9401.98 0.00 9496 0
sde 97.03 9401.98 0.00 9496 0
sdf 94.06 9401.98 0.00 9496 0
sdg 95.05 9401.98 0.00 9496 0
sdh 96.04 9401.98 0.00 9496 0
sdi 96.04 9401.98 0.00 9496 0
sdj 95.05 9401.98 0.00 9496 0
sdk 97.03 9401.98 0.00 9496 0
sdl 127.72 0.00 9600.00 0 9696
md0 0.00 0.00 0.00 0 0
avg-cpu: %user %nice %sys %iowait %idle
0.00 0.00 5.97 0.00 94.03
Device: tps kB_read/s kB_wrtn/s kB_read kB_wrtn
hda 2.00 0.00 32.00 0 32
sda 90.00 9676.00 0.00 9676 0
sdb 91.00 9676.00 0.00 9676 0
sdc 90.00 9676.00 0.00 9676 0
sdd 90.00 9676.00 0.00 9676 0
sde 90.00 9676.00 0.00 9676 0
sdf 89.00 9676.00 0.00 9676 0
sdg 89.00 9676.00 0.00 9676 0
sdh 89.00 9676.00 0.00 9676 0
sdi 89.00 9676.00 0.00 9676 0
sdj 89.00 9676.00 0.00 9676 0
sdk 89.00 9676.00 0.00 9676 0
sdl 124.00 0.00 9600.00 0 9600
md0 0.00 0.00 0.00 0 0
Devices sd[a-l] make up /dev/md0:
[root@justinstalled ~]# cat /proc/mdstat
Personalities : [raid5]
md0 : active raid5 sdl[12] sdk[10] sdj[9] sdi[8] sdh[7] sdg[6] sdf[5]
sde[4] sdd[3] sdc[2] sdb[1] sda[0]
1719198976 blocks level 5, 128k chunk, algorithm 2 [12/11]
[UUUUUUUUUUU_]
[>....................] recovery = 2.4% (3837952/156290816)
finish=256.7min speed=9895K/sec
unused devices: <none>
[root@justinstalled ~]#
Why are all the writes concentrated on a single drive? Shouldn't the
reads and writes be distributed evenly amongst all the drives? Or is
this just something unique to the rebuild phase?
CS
* RE: RAID5 rebuild question
From: Guy @ 2005-07-03 6:41 UTC
To: 'Christopher Smith', linux-raid
It looks like it is rebuilding to a spare or new disk.
If this is a new array, I would think that create would be writing to all
disks, but not sure.
I noticed the speed is about 10000K/sec per disk.
Maybe it can go faster; try this:
To see current limit:
cat /proc/sys/dev/raid/speed_limit_max
To set new limit:
echo 100000 > /proc/sys/dev/raid/speed_limit_max
for details:
man md
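For completeness, a sketch only (the sysctl spellings assume a stock kernel):
there is also a matching minimum limit, which md tries to sustain even when
the array is busy, and both knobs can be read and set through sysctl.
To see the minimum limit:
cat /proc/sys/dev/raid/speed_limit_min
To read and raise the maximum via sysctl:
sysctl dev.raid.speed_limit_max
sysctl -w dev.raid.speed_limit_max=100000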
Guy
* RE: RAID5 rebuild question
From: Neil Brown @ 2005-07-04 1:20 UTC
To: Guy; +Cc: 'Christopher Smith', linux-raid
On Sunday July 3, bugzilla@watkins-home.com wrote:
> It looks like it is rebuilding to a spare or new disk.
Yep.
> If this is a new array, I would think that create would be writing to all
> disks, but not sure.
Nope....
When creating a new raid5 array, we need to make sure the parity
blocks are all correct (obviously). There are several ways to do
this.
1/ write zeros to all drives. This would make the array unusable
until the clearing is complete, so isn't a good option.
2/ Read all the data blocks, compute the parity block, and then write
out the parity block. This works, but is not optimal. Remembering
that the parity block is on a different drive for each 'stripe',
think about what the read/write heads are doing.
The heads on the 'reading' drives will be somewhere ahead of the
heads on the 'writing' drive. Every time we step to a new stripe
and change which is the 'writing' head, the other reading heads
have to wait for the head that has just changed from 'writing' to
'reading' to catch up (finish writing, then start reading).
Waiting slows things down, so this is uniformly sub-optimal.
3/ read all data blocks and parity blocks, check the parity block to
see if it is correct, and only write out a new block if it wasn't.
This works quite well if most of the parity blocks are correct as
all heads are reading in parallel and are pretty-much synchronised.
This is how the raid5 'resync' process in md works. It happens
after an unclean shutdown if the array was active at crash-time.
However if most or even many of the parity blocks are wrong, this
process will be quite slow as the parity-block drive will have to
read-a-bunch, step-back, write-a-bunch. So it isn't good for
initially setting the parity.
4/ Assume that the parity blocks are all correct, but that one drive
is missing (i.e. the array is degraded). This is repaired by
reconstructing what should have been on the missing drive, onto a
spare. This involves reading all the 'good' drives in parallel,
calculating the missing block (whether data or parity) and writing
it to the 'spare' drive. The 'spare' will be written to a few (10s
or 100s of) blocks behind the blocks being read off the 'good'
drives, but each drive will run completely sequentially and so at
top speed.
On a new array where most of the parity blocks are probably bad, '4'
is clearly the best option. 'mdadm' makes sure this happens by creating
a raid5 array not with N good drives, but with N-1 good drives and one
spare. Reconstruction then happens and you should see exactly what
was reported: reads from all but the last drive, writes to that last
drive.
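To see this in practice -- a sketch only, reusing the device names from the
report above; exact option spellings may vary between mdadm versions --
create the array and then watch the reconstruction:
mdadm --create /dev/md0 --level=5 --chunk=128 --raid-devices=12 /dev/sd[a-l]
cat /proc/mdstat
/proc/mdstat should show one slot as '_' (the degraded member) plus a
'recovery' progress line, and iostat should show reads on eleven drives and
writes on the twelfth, exactly as reported.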
This should go in a FAQ. Is anyone actively maintaining an md/mdadm
FAQ at the moment, or should I start putting something together??
NeilBrown
* RE: RAID5 rebuild question
From: Guy @ 2005-07-04 3:41 UTC
To: 'Neil Brown'; +Cc: 'Christopher Smith', linux-raid
This is worth saving!!!!
I did want to create a list of frequent problems, and how to correct them,
but never made the time. I don't know of any FAQ pages. This mailing list
is it! :)
Guy
* Re: RAID5 rebuild question
From: David Greaves @ 2005-07-07 20:48 UTC
To: Neil Brown; +Cc: Guy, 'Christopher Smith', linux-raid
>This should go in a FAQ. Is anyone actively maintaining an md/mdadm
>FAQ at the moment, or should I start putting something together??
>
Can I suggest a wiki? Or an 'online multi-person editable document' :)
There are a few people here who could contribute and/or edit.
David