* RAID5 rebuild question
From: Christopher Smith @ 2005-07-03 6:20 UTC
To: linux-raid
While waiting for a rather large RAID5 array to build, I noticed the
following output from iostat -k 1:
Linux 2.6.11-1.1369_FC4smp (justinstalled.syd.nighthawkrad.net)    04/07/05
avg-cpu: %user %nice %sys %iowait %idle
1.10 0.00 5.24 2.45 91.21
Device: tps kB_read/s kB_wrtn/s kB_read kB_wrtn
hda 7.79 58.17 46.26 82741 65802
sda 86.70 8221.20 391.64 11693016 557032
sdb 81.11 8221.16 15.06 11692952 21416
sdc 80.85 8221.18 14.16 11692980 20136
sdd 80.93 8221.20 15.06 11693016 21416
sde 81.01 8221.20 15.37 11693016 21864
sdf 80.79 8221.20 14.16 11693016 20136
sdg 80.91 8221.20 14.52 11693016 20648
sdh 79.67 8221.16 6.91 11692952 9832
sdi 78.95 8221.20 0.03 11693016 40
sdj 79.04 8221.20 0.03 11693016 40
sdk 79.48 8221.20 0.03 11693016 40
sdl 93.28 0.33 8269.91 472 11762288
md0 1.60 0.00 102.28 0 145472
avg-cpu: %user %nice %sys %iowait %idle
0.49 0.00 7.35 0.00 92.16
Device: tps kB_read/s kB_wrtn/s kB_read kB_wrtn
hda 0.00 0.00 0.00 0 0
sda 100.99 9417.82 0.00 9512 0
sdb 101.98 9417.82 0.00 9512 0
sdc 100.00 9417.82 0.00 9512 0
sdd 98.02 9417.82 0.00 9512 0
sde 96.04 9417.82 0.00 9512 0
sdf 96.04 9417.82 0.00 9512 0
sdg 96.04 9417.82 0.00 9512 0
sdh 96.04 9417.82 0.00 9512 0
sdi 99.01 9417.82 0.00 9512 0
sdj 100.00 9417.82 0.00 9512 0
sdk 99.01 9417.82 0.00 9512 0
sdl 109.90 0.00 9504.95 0 9600
md0 0.00 0.00 0.00 0 0
avg-cpu: %user %nice %sys %iowait %idle
0.00 0.00 5.53 0.00 94.47
Device: tps kB_read/s kB_wrtn/s kB_read kB_wrtn
hda 0.00 0.00 0.00 0 0
sda 102.02 9765.66 0.00 9668 0
sdb 108.08 9765.66 0.00 9668 0
sdc 108.08 9765.66 0.00 9668 0
sdd 108.08 9765.66 0.00 9668 0
sde 103.03 9765.66 0.00 9668 0
sdf 103.03 9765.66 0.00 9668 0
sdg 103.03 9765.66 0.00 9668 0
sdh 102.02 9765.66 0.00 9668 0
sdi 105.05 9765.66 0.00 9668 0
sdj 105.05 9765.66 0.00 9668 0
sdk 103.03 9765.66 0.00 9668 0
sdl 120.20 0.00 9696.97 0 9600
md0 0.00 0.00 0.00 0 0
avg-cpu: %user %nice %sys %iowait %idle
0.00 0.00 6.00 0.00 94.00
Device: tps kB_read/s kB_wrtn/s kB_read kB_wrtn
hda 0.00 0.00 0.00 0 0
sda 109.90 9500.99 0.00 9596 0
sdb 103.96 9500.99 0.00 9596 0
sdc 107.92 9500.99 0.00 9596 0
sdd 106.93 9500.99 0.00 9596 0
sde 104.95 9500.99 0.00 9596 0
sdf 102.97 9500.99 0.00 9596 0
sdg 104.95 9500.99 0.00 9596 0
sdh 102.97 9500.99 0.00 9596 0
sdi 101.98 9500.99 0.00 9596 0
sdj 101.98 9500.99 0.00 9596 0
sdk 101.98 9500.99 0.00 9596 0
sdl 154.46 0.00 9536.63 0 9632
md0 0.00 0.00 0.00 0 0
avg-cpu: %user %nice %sys %iowait %idle
0.00 0.00 5.50 0.00 94.50
Device: tps kB_read/s kB_wrtn/s kB_read kB_wrtn
hda 0.00 0.00 0.00 0 0
sda 100.99 9401.98 0.00 9496 0
sdb 100.00 9401.98 0.00 9496 0
sdc 98.02 9401.98 0.00 9496 0
sdd 100.00 9401.98 0.00 9496 0
sde 97.03 9401.98 0.00 9496 0
sdf 94.06 9401.98 0.00 9496 0
sdg 95.05 9401.98 0.00 9496 0
sdh 96.04 9401.98 0.00 9496 0
sdi 96.04 9401.98 0.00 9496 0
sdj 95.05 9401.98 0.00 9496 0
sdk 97.03 9401.98 0.00 9496 0
sdl 127.72 0.00 9600.00 0 9696
md0 0.00 0.00 0.00 0 0
avg-cpu: %user %nice %sys %iowait %idle
0.00 0.00 5.97 0.00 94.03
Device: tps kB_read/s kB_wrtn/s kB_read kB_wrtn
hda 2.00 0.00 32.00 0 32
sda 90.00 9676.00 0.00 9676 0
sdb 91.00 9676.00 0.00 9676 0
sdc 90.00 9676.00 0.00 9676 0
sdd 90.00 9676.00 0.00 9676 0
sde 90.00 9676.00 0.00 9676 0
sdf 89.00 9676.00 0.00 9676 0
sdg 89.00 9676.00 0.00 9676 0
sdh 89.00 9676.00 0.00 9676 0
sdi 89.00 9676.00 0.00 9676 0
sdj 89.00 9676.00 0.00 9676 0
sdk 89.00 9676.00 0.00 9676 0
sdl 124.00 0.00 9600.00 0 9600
md0 0.00 0.00 0.00 0 0
Devices sd[a-l] make up /dev/md0:
[root@justinstalled ~]# cat /proc/mdstat
Personalities : [raid5]
md0 : active raid5 sdl[12] sdk[10] sdj[9] sdi[8] sdh[7] sdg[6] sdf[5]
sde[4] sdd[3] sdc[2] sdb[1] sda[0]
1719198976 blocks level 5, 128k chunk, algorithm 2 [12/11]
[UUUUUUUUUUU_]
[>....................] recovery = 2.4% (3837952/156290816)
finish=256.7min speed=9895K/sec
unused devices: <none>
[root@justinstalled ~]#
Why are all the writes concentrated on a single drive? Shouldn't the
reads and writes be distributed evenly amongst all the drives? Or is
this just something unique to the rebuild phase?
CS
* RE: RAID5 rebuild question
From: Guy @ 2005-07-03 6:41 UTC
To: 'Christopher Smith', linux-raid
It looks like it is rebuilding to a spare or new disk.
If this is a new array, I would think that create would be writing to all
disks, but not sure.
I noticed the speed is about 10000K/sec per disk.
Maybe it can go faster; try this:
To see current limit:
cat /proc/sys/dev/raid/speed_limit_max
To set new limit:
echo 100000 > /proc/sys/dev/raid/speed_limit_max
for details:
man md
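For completeness, a sketch only (the sysctl spellings assume a stock kernel):
there is also a matching minimum limit, which md tries to sustain even when
the array is busy, and both knobs can be read and set through sysctl.
To see the minimum limit:
cat /proc/sys/dev/raid/speed_limit_min
To read and raise the maximum via sysctl:
sysctl dev.raid.speed_limit_max
sysctl -w dev.raid.speed_limit_max=100000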
Guy
* RE: RAID5 rebuild question
From: Neil Brown @ 2005-07-04 1:20 UTC
To: Guy; +Cc: 'Christopher Smith', linux-raid
On Sunday July 3, bugzilla@watkins-home.com wrote:
> It looks like it is rebuilding to a spare or new disk.
Yep.
> If this is a new array, I would think that create would be writing to all
> disks, but not sure.
Nope....
When creating a new raid5 array, we need to make sure the parity
blocks are all correct (obviously). There are several ways to do
this.
1/ write zeros to all drives. This would make the array unusable
until the clearing is complete, so isn't a good option.
2/ Read all the data blocks, compute the parity block, and then write
out the parity block. This works, but is not optimal. Remembering
that the parity block is on a different drive for each 'stripe',
think about what the read/write heads are doing.
The heads on the 'reading' drives will be somewhere ahead of the
heads on the 'writing' drive. Every time we step to a new stripe
and change which is the 'writing' head, the other reading heads
have to wait for the head that has just changed from 'writing' to
'reading' to catch up (finish writing, then start reading).
Waiting slows things down, so this is uniformly sub-optimal.
3/ read all data blocks and parity blocks, check the parity block to
see if it is correct, and only write out a new block if it wasn't.
This works quite well if most of the parity blocks are correct as
all heads are reading in parallel and are pretty-much synchronised.
This is how the raid5 'resync' process in md works. It happens
after an unclean shutdown if the array was active at crash-time.
However if most or even many of the parity blocks are wrong, this
process will be quite slow as the parity-block drive will have to
read-a-bunch, step-back, write-a-bunch. So it isn't good for
initially setting the parity.
4/ Assume that the parity blocks are all correct, but that one drive
is missing (i.e. the array is degraded). This is repaired by
reconstructing what should have been on the missing drive, onto a
spare. This involves reading all the 'good' drives in parallel,
calculating the missing block (whether data or parity) and writing
it to the 'spare' drive. The 'spare' will be written to a few (10s
or 100s of) blocks behind the blocks being read off the 'good'
drives, but each drive will run completely sequentially and so at
top speed.
On a new array where most of the parity blocks are probably bad, '4'
is clearly the best option. 'mdadm' makes sure this happens by creating
a raid5 array not with N good drives, but with N-1 good drives and one
spare. Reconstruction then happens and you should see exactly what
was reported: reads from all but the last drive, writes to that last
drive.
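To see this in practice -- a sketch only, reusing the device names from the
report above; exact option spellings may vary between mdadm versions --
create the array and then watch the reconstruction:
mdadm --create /dev/md0 --level=5 --chunk=128 --raid-devices=12 /dev/sd[a-l]
cat /proc/mdstat
/proc/mdstat should show one slot as '_' (the degraded member) plus a
'recovery' progress line, and iostat should show reads on eleven drives and
writes on the twelfth, exactly as reported.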
This should go in a FAQ. Is anyone actively maintaining an md/mdadm
FAQ at the moment, or should I start putting something together??
NeilBrown
* RE: RAID5 rebuild question
From: Guy @ 2005-07-04 3:41 UTC
To: 'Neil Brown'; +Cc: 'Christopher Smith', linux-raid
This is worth saving!!!!
I did want to create a list of frequent problems, and how to correct them,
but never made the time. I don't know of any FAQ pages. This mailing list
is it! :)
Guy
* Re: RAID5 rebuild question
From: David Greaves @ 2005-07-07 20:48 UTC
To: Neil Brown; +Cc: Guy, 'Christopher Smith', linux-raid
>This should go in a FAQ. Is anyone actively maintaining an md/mdadm
>FAQ at the moment, or should I start putting something together??
>
Can I suggest a wiki? Or an 'online multi-person editable document' :)
There are a few people here who could contribute and/or edit.
David