Roadmap for md/raid ???

linux-raid.vger.kernel.org archive mirror
 help / color / mirror / Atom feed

* Roadmap for md/raid ???
@ 2008-12-19  4:10 Neil Brown
  2008-12-19 15:44 ` Chris Worley
                   ` (3 more replies)
  0 siblings, 4 replies; 23+ messages in thread
From: Neil Brown @ 2008-12-19  4:10 UTC (permalink / raw)
  To: linux-raid

Not really a roadmap, more a few tourist attractions that you might
see on the way if you stick around (and if I stick around)...

Comments welcome.

NeilBrown

- Bad block list
  The idea here is to maintain and store on each device a list of
  blocks that are known to be 'bad'.  This effectively allows us to
  fail a single block rather than a whole device when we get a media
  write error.  Of course if updating the bad-block-list gives an
  error we then have to fail the device.

  We would also record a bad block if we get a read error on a degraded
  array.  This would e.g. allow recovery for a degraded raid1 where the
  sole remaining device has a bad block.

  An array could have multiple errors on different devices and just
  those stripes would be considered to be "degraded".  As long a no
  single stripe had too many bad blocks, the data would still be safe.
  Naturally as soon as you get one bad block, the array becomes
  susceptible to data loss on a single device failure, so it wouldn't
  be advisable to run with non-empty badblock lists for an extended
  length of time,  However it might provide breathing space until
  drive replacement can be achieved.

- hot-device-replace
  This is probably the most asked for feature of late.  It would allow
  a device to be 'recovered' while the original was still in service. 
  So instead of failing out a device and adding a spare, you can add
  the spare, build the data onto it, then fail out the device.

  This meshes well with the bad block list.  When we find a bad block,
  we start a hot-replace onto a spare (if one exists).  If sleeping
  bad blocks are discovered during the hot-replace process, we don't
  lose the data unless we find two bad blocks in the same stripe.
  And then we just lose data in that stripe.

  Recording in the metadata that a hot-replace was happening might be
  a little tricky, so it could be that if you reboot in the middle,
  you would have to restart from the beginning.  Similarly there would
  be no 'intent' bitmap involved for this resync.

  Each personality would have to implement much of this independently,
  effectively providing a mini raid1 implementation.  It would be very
  minimal without e.g. read balancing or write-behind etc.

  There would be no point implementing this in raid1.  Just
  raid456 and raid10.
  It could conceivably make sense for raid0 and linear, but that is
  very unlikely to be implemented.

- split-mirror
  This is really a function of mdadm rather than md.  It is already
  quite possible to break a mirror into two separate single-device
  arrays.  However it is a sufficiently common operation that it is
  probably making it very easy to do with mdadm.
  I'm thinking something like
      mdadm --create /dev/md/new --split /dev/md/old

  will create a new raid1 by taking one device off /dev/md/old (which
  must be a raid1) and making an array with exactly the right metadata
  and size.

- raid5->raid6 conversion.
   This is also a fairly commonly asked for feature.
   The first step would be to define a raid6 layout where the Q block
   was not rotated around the devices but was always on the last
   device.  Then we could change a raid5 to a singly-degraded raid6
   without moving any data.

   The next step would be to implement in-place restriping. 
   This involves 
      - freezing a section of the array (all IO blocks)
      - copying the data out to a safe backup
      - copying it back in with the new layout
      - updating the metadata to indicate that the restripe has
        progressed.
      - repeat.

   This would probably be quite slow but it would achieve the desired
   result. 

   Once we have in-place restriping we could change chunksize as
   well.

- raid5 reduce number of devices.
   We can currently restripe a raid5 (or 6) over a larger number of
   devices but not over a smaller number of devices.  That means you
   cannot undo an increase that you didn't want.

   It might be nice to allow this to happen at the same time as
   increasing --size (if the devices are big enough) to allow the
   array to be restriped without changing the available space.

- cluster raid1
   Allow a raid1 to be assembled on multiple hosts that share some
   drives, so a cluster filesystem (e.g. ocfs2) can be run over it.
   It requires co-ordination to handle failure events and
   resync/recovery.  Most of this would probably be done in userspace.

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: Roadmap for md/raid ???
  2008-12-19  4:10 Roadmap for md/raid ??? Neil Brown
@ 2008-12-19 15:44 ` Chris Worley
  2008-12-19 15:51   ` Justin Piszcz
  2008-12-30 18:12 ` Janek Kozicki
                   ` (2 subsequent siblings)
  3 siblings, 1 reply; 23+ messages in thread
From: Chris Worley @ 2008-12-19 15:44 UTC (permalink / raw)
  To: Neil Brown; +Cc: linux-raid

How about "parallelized parity calculation"... given SSD I/O
performance, parity calculations are now the performance bottleneck.
Most systems have plenty of CPU's to do parity calculations in
parallel.  Parity calculations are embarrassingly parallel (no
dependence between the domains in a domain distribution).

Chris
On Thu, Dec 18, 2008 at 9:10 PM, Neil Brown <neilb@suse.de> wrote:
>
>
> Not really a roadmap, more a few tourist attractions that you might
> see on the way if you stick around (and if I stick around)...
>
> Comments welcome.
>
> NeilBrown
>
>
> - Bad block list
>  The idea here is to maintain and store on each device a list of
>  blocks that are known to be 'bad'.  This effectively allows us to
>  fail a single block rather than a whole device when we get a media
>  write error.  Of course if updating the bad-block-list gives an
>  error we then have to fail the device.
>
>  We would also record a bad block if we get a read error on a degraded
>  array.  This would e.g. allow recovery for a degraded raid1 where the
>  sole remaining device has a bad block.
>
>  An array could have multiple errors on different devices and just
>  those stripes would be considered to be "degraded".  As long a no
>  single stripe had too many bad blocks, the data would still be safe.
>  Naturally as soon as you get one bad block, the array becomes
>  susceptible to data loss on a single device failure, so it wouldn't
>  be advisable to run with non-empty badblock lists for an extended
>  length of time,  However it might provide breathing space until
>  drive replacement can be achieved.
>
> - hot-device-replace
>  This is probably the most asked for feature of late.  It would allow
>  a device to be 'recovered' while the original was still in service.
>  So instead of failing out a device and adding a spare, you can add
>  the spare, build the data onto it, then fail out the device.
>
>  This meshes well with the bad block list.  When we find a bad block,
>  we start a hot-replace onto a spare (if one exists).  If sleeping
>  bad blocks are discovered during the hot-replace process, we don't
>  lose the data unless we find two bad blocks in the same stripe.
>  And then we just lose data in that stripe.
>
>  Recording in the metadata that a hot-replace was happening might be
>  a little tricky, so it could be that if you reboot in the middle,
>  you would have to restart from the beginning.  Similarly there would
>  be no 'intent' bitmap involved for this resync.
>
>  Each personality would have to implement much of this independently,
>  effectively providing a mini raid1 implementation.  It would be very
>  minimal without e.g. read balancing or write-behind etc.
>
>  There would be no point implementing this in raid1.  Just
>  raid456 and raid10.
>  It could conceivably make sense for raid0 and linear, but that is
>  very unlikely to be implemented.
>
> - split-mirror
>  This is really a function of mdadm rather than md.  It is already
>  quite possible to break a mirror into two separate single-device
>  arrays.  However it is a sufficiently common operation that it is
>  probably making it very easy to do with mdadm.
>  I'm thinking something like
>      mdadm --create /dev/md/new --split /dev/md/old
>
>  will create a new raid1 by taking one device off /dev/md/old (which
>  must be a raid1) and making an array with exactly the right metadata
>  and size.
>
> - raid5->raid6 conversion.
>   This is also a fairly commonly asked for feature.
>   The first step would be to define a raid6 layout where the Q block
>   was not rotated around the devices but was always on the last
>   device.  Then we could change a raid5 to a singly-degraded raid6
>   without moving any data.
>
>   The next step would be to implement in-place restriping.
>   This involves
>      - freezing a section of the array (all IO blocks)
>      - copying the data out to a safe backup
>      - copying it back in with the new layout
>      - updating the metadata to indicate that the restripe has
>        progressed.
>      - repeat.
>
>   This would probably be quite slow but it would achieve the desired
>   result.
>
>   Once we have in-place restriping we could change chunksize as
>   well.
>
> - raid5 reduce number of devices.
>   We can currently restripe a raid5 (or 6) over a larger number of
>   devices but not over a smaller number of devices.  That means you
>   cannot undo an increase that you didn't want.
>
>   It might be nice to allow this to happen at the same time as
>   increasing --size (if the devices are big enough) to allow the
>   array to be restriped without changing the available space.
>
> - cluster raid1
>   Allow a raid1 to be assembled on multiple hosts that share some
>   drives, so a cluster filesystem (e.g. ocfs2) can be run over it.
>   It requires co-ordination to handle failure events and
>   resync/recovery.  Most of this would probably be done in userspace.
> --
> To unsubscribe from this list: send the line "unsubscribe linux-raid" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
>

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: Roadmap for md/raid ???
  2008-12-19 15:44 ` Chris Worley
@ 2008-12-19 15:51   ` Justin Piszcz
  2008-12-19 16:13     ` Bernd Schubert
  0 siblings, 1 reply; 23+ messages in thread
From: Justin Piszcz @ 2008-12-19 15:51 UTC (permalink / raw)
  To: Chris Worley; +Cc: Neil Brown, linux-raid

Or, before that, allow multiple arrays to rebuild on each core of the 
CPU(s), one per array.

Justin.

On Fri, 19 Dec 2008, Chris Worley wrote:

> How about "parallelized parity calculation"... given SSD I/O
> performance, parity calculations are now the performance bottleneck.
> Most systems have plenty of CPU's to do parity calculations in
> parallel.  Parity calculations are embarrassingly parallel (no
> dependence between the domains in a domain distribution).
>
> Chris
> On Thu, Dec 18, 2008 at 9:10 PM, Neil Brown <neilb@suse.de> wrote:
>>
>>
>> Not really a roadmap, more a few tourist attractions that you might
>> see on the way if you stick around (and if I stick around)...
>>
>> Comments welcome.
>>
>> NeilBrown
>>
>>
>> - Bad block list
>>  The idea here is to maintain and store on each device a list of
>>  blocks that are known to be 'bad'.  This effectively allows us to
>>  fail a single block rather than a whole device when we get a media
>>  write error.  Of course if updating the bad-block-list gives an
>>  error we then have to fail the device.
>>
>>  We would also record a bad block if we get a read error on a degraded
>>  array.  This would e.g. allow recovery for a degraded raid1 where the
>>  sole remaining device has a bad block.
>>
>>  An array could have multiple errors on different devices and just
>>  those stripes would be considered to be "degraded".  As long a no
>>  single stripe had too many bad blocks, the data would still be safe.
>>  Naturally as soon as you get one bad block, the array becomes
>>  susceptible to data loss on a single device failure, so it wouldn't
>>  be advisable to run with non-empty badblock lists for an extended
>>  length of time,  However it might provide breathing space until
>>  drive replacement can be achieved.
>>
>> - hot-device-replace
>>  This is probably the most asked for feature of late.  It would allow
>>  a device to be 'recovered' while the original was still in service.
>>  So instead of failing out a device and adding a spare, you can add
>>  the spare, build the data onto it, then fail out the device.
>>
>>  This meshes well with the bad block list.  When we find a bad block,
>>  we start a hot-replace onto a spare (if one exists).  If sleeping
>>  bad blocks are discovered during the hot-replace process, we don't
>>  lose the data unless we find two bad blocks in the same stripe.
>>  And then we just lose data in that stripe.
>>
>>  Recording in the metadata that a hot-replace was happening might be
>>  a little tricky, so it could be that if you reboot in the middle,
>>  you would have to restart from the beginning.  Similarly there would
>>  be no 'intent' bitmap involved for this resync.
>>
>>  Each personality would have to implement much of this independently,
>>  effectively providing a mini raid1 implementation.  It would be very
>>  minimal without e.g. read balancing or write-behind etc.
>>
>>  There would be no point implementing this in raid1.  Just
>>  raid456 and raid10.
>>  It could conceivably make sense for raid0 and linear, but that is
>>  very unlikely to be implemented.
>>
>> - split-mirror
>>  This is really a function of mdadm rather than md.  It is already
>>  quite possible to break a mirror into two separate single-device
>>  arrays.  However it is a sufficiently common operation that it is
>>  probably making it very easy to do with mdadm.
>>  I'm thinking something like
>>      mdadm --create /dev/md/new --split /dev/md/old
>>
>>  will create a new raid1 by taking one device off /dev/md/old (which
>>  must be a raid1) and making an array with exactly the right metadata
>>  and size.
>>
>> - raid5->raid6 conversion.
>>   This is also a fairly commonly asked for feature.
>>   The first step would be to define a raid6 layout where the Q block
>>   was not rotated around the devices but was always on the last
>>   device.  Then we could change a raid5 to a singly-degraded raid6
>>   without moving any data.
>>
>>   The next step would be to implement in-place restriping.
>>   This involves
>>      - freezing a section of the array (all IO blocks)
>>      - copying the data out to a safe backup
>>      - copying it back in with the new layout
>>      - updating the metadata to indicate that the restripe has
>>        progressed.
>>      - repeat.
>>
>>   This would probably be quite slow but it would achieve the desired
>>   result.
>>
>>   Once we have in-place restriping we could change chunksize as
>>   well.
>>
>> - raid5 reduce number of devices.
>>   We can currently restripe a raid5 (or 6) over a larger number of
>>   devices but not over a smaller number of devices.  That means you
>>   cannot undo an increase that you didn't want.
>>
>>   It might be nice to allow this to happen at the same time as
>>   increasing --size (if the devices are big enough) to allow the
>>   array to be restriped without changing the available space.
>>
>> - cluster raid1
>>   Allow a raid1 to be assembled on multiple hosts that share some
>>   drives, so a cluster filesystem (e.g. ocfs2) can be run over it.
>>   It requires co-ordination to handle failure events and
>>   resync/recovery.  Most of this would probably be done in userspace.
>> --
>> To unsubscribe from this list: send the line "unsubscribe linux-raid" in
>> the body of a message to majordomo@vger.kernel.org
>> More majordomo info at  http://vger.kernel.org/majordomo-info.html
>>
> --
> To unsubscribe from this list: send the line "unsubscribe linux-raid" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
>

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: Roadmap for md/raid ???
  2008-12-19 15:51   ` Justin Piszcz
@ 2008-12-19 16:13     ` Bernd Schubert
  0 siblings, 0 replies; 23+ messages in thread
From: Bernd Schubert @ 2008-12-19 16:13 UTC (permalink / raw)
  To: Justin Piszcz; +Cc: Chris Worley, Neil Brown, linux-raid

But multiple rebuilds are already supported. If you have multiple arrays on 
drive partitions and the CPU is the limit, you may want to 
set /sys/block/mdX/md/sync_force_parallel to one.

Cheers,
Bernd

On Friday 19 December 2008 16:51:24 Justin Piszcz wrote:
> Or, before that, allow multiple arrays to rebuild on each core of the
> CPU(s), one per array.
>
> Justin.
>
> On Fri, 19 Dec 2008, Chris Worley wrote:
> > How about "parallelized parity calculation"... given SSD I/O
> > performance, parity calculations are now the performance bottleneck.
> > Most systems have plenty of CPU's to do parity calculations in
> > parallel.  Parity calculations are embarrassingly parallel (no
> > dependence between the domains in a domain distribution).
> >
> > Chris
> >
> > On Thu, Dec 18, 2008 at 9:10 PM, Neil Brown <neilb@suse.de> wrote:
> >> Not really a roadmap, more a few tourist attractions that you might
> >> see on the way if you stick around (and if I stick around)...
> >>
> >> Comments welcome.
> >>
> >> NeilBrown
> >>
> >>
> >> - Bad block list
> >>  The idea here is to maintain and store on each device a list of
> >>  blocks that are known to be 'bad'.  This effectively allows us to
> >>  fail a single block rather than a whole device when we get a media
> >>  write error.  Of course if updating the bad-block-list gives an
> >>  error we then have to fail the device.
> >>
> >>  We would also record a bad block if we get a read error on a degraded
> >>  array.  This would e.g. allow recovery for a degraded raid1 where the
> >>  sole remaining device has a bad block.
> >>
> >>  An array could have multiple errors on different devices and just
> >>  those stripes would be considered to be "degraded".  As long a no
> >>  single stripe had too many bad blocks, the data would still be safe.
> >>  Naturally as soon as you get one bad block, the array becomes
> >>  susceptible to data loss on a single device failure, so it wouldn't
> >>  be advisable to run with non-empty badblock lists for an extended
> >>  length of time,  However it might provide breathing space until
> >>  drive replacement can be achieved.
> >>
> >> - hot-device-replace
> >>  This is probably the most asked for feature of late.  It would allow
> >>  a device to be 'recovered' while the original was still in service.
> >>  So instead of failing out a device and adding a spare, you can add
> >>  the spare, build the data onto it, then fail out the device.
> >>
> >>  This meshes well with the bad block list.  When we find a bad block,
> >>  we start a hot-replace onto a spare (if one exists).  If sleeping
> >>  bad blocks are discovered during the hot-replace process, we don't
> >>  lose the data unless we find two bad blocks in the same stripe.
> >>  And then we just lose data in that stripe.
> >>
> >>  Recording in the metadata that a hot-replace was happening might be
> >>  a little tricky, so it could be that if you reboot in the middle,
> >>  you would have to restart from the beginning.  Similarly there would
> >>  be no 'intent' bitmap involved for this resync.
> >>
> >>  Each personality would have to implement much of this independently,
> >>  effectively providing a mini raid1 implementation.  It would be very
> >>  minimal without e.g. read balancing or write-behind etc.
> >>
> >>  There would be no point implementing this in raid1.  Just
> >>  raid456 and raid10.
> >>  It could conceivably make sense for raid0 and linear, but that is
> >>  very unlikely to be implemented.
> >>
> >> - split-mirror
> >>  This is really a function of mdadm rather than md.  It is already
> >>  quite possible to break a mirror into two separate single-device
> >>  arrays.  However it is a sufficiently common operation that it is
> >>  probably making it very easy to do with mdadm.
> >>  I'm thinking something like
> >>      mdadm --create /dev/md/new --split /dev/md/old
> >>
> >>  will create a new raid1 by taking one device off /dev/md/old (which
> >>  must be a raid1) and making an array with exactly the right metadata
> >>  and size.
> >>
> >> - raid5->raid6 conversion.
> >>   This is also a fairly commonly asked for feature.
> >>   The first step would be to define a raid6 layout where the Q block
> >>   was not rotated around the devices but was always on the last
> >>   device.  Then we could change a raid5 to a singly-degraded raid6
> >>   without moving any data.
> >>
> >>   The next step would be to implement in-place restriping.
> >>   This involves
> >>      - freezing a section of the array (all IO blocks)
> >>      - copying the data out to a safe backup
> >>      - copying it back in with the new layout
> >>      - updating the metadata to indicate that the restripe has
> >>        progressed.
> >>      - repeat.
> >>
> >>   This would probably be quite slow but it would achieve the desired
> >>   result.
> >>
> >>   Once we have in-place restriping we could change chunksize as
> >>   well.
> >>
> >> - raid5 reduce number of devices.
> >>   We can currently restripe a raid5 (or 6) over a larger number of
> >>   devices but not over a smaller number of devices.  That means you
> >>   cannot undo an increase that you didn't want.
> >>
> >>   It might be nice to allow this to happen at the same time as
> >>   increasing --size (if the devices are big enough) to allow the
> >>   array to be restriped without changing the available space.
> >>
> >> - cluster raid1
> >>   Allow a raid1 to be assembled on multiple hosts that share some
> >>   drives, so a cluster filesystem (e.g. ocfs2) can be run over it.
> >>   It requires co-ordination to handle failure events and
> >>   resync/recovery.  Most of this would probably be done in userspace.
> >> --
> >> To unsubscribe from this list: send the line "unsubscribe linux-raid" in
> >> the body of a message to majordomo@vger.kernel.org
> >> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> >
> > --
> > To unsubscribe from this list: send the line "unsubscribe linux-raid" in
> > the body of a message to majordomo@vger.kernel.org
> > More majordomo info at  http://vger.kernel.org/majordomo-info.html
>
> --
> To unsubscribe from this list: send the line "unsubscribe linux-raid" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html



-- 
Bernd Schubert
Q-Leap Networks GmbH

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: Roadmap for md/raid ???
  2008-12-19  4:10 Roadmap for md/raid ??? Neil Brown
  2008-12-19 15:44 ` Chris Worley
@ 2008-12-30 18:12 ` Janek Kozicki
  2008-12-30 18:15   ` Janek Kozicki
  2009-01-19  0:54   ` Neil Brown
  2009-01-11 18:14 ` Piergiorgio Sartor
  2009-01-14 20:43 ` Bill Davidsen
  3 siblings, 2 replies; 23+ messages in thread
From: Janek Kozicki @ 2008-12-30 18:12 UTC (permalink / raw)
  To: linux-raid

Neil Brown said:     (by the date of Fri, 19 Dec 2008 15:10:49 +1100)

> - raid5->raid6 conversion.
>    This is also a fairly commonly asked for feature.

I can be one of the first ones to test that :)

In my raid5 home backup server I had two failures in one month.
First a SATA cable of one drive stopped working, and few weeks 
later another drive died (hdc: DMA timeout error - ad infinitum,
and system stopped responding - trying to read that drive;
also smartctl was reporting errors constantly).

BTW: I thought that smartctl and mdadm were cooperating together to
find out about damaged drives, but apparently it's not the case.

Now I feel so uneasy now with this raid5, that I'm going to switch into
raid6 ASAP.

-- 
Janek Kozicki                                                         |

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: Roadmap for md/raid ???
  2008-12-30 18:12 ` Janek Kozicki
@ 2008-12-30 18:15   ` Janek Kozicki
  2009-01-19  0:54   ` Neil Brown
  1 sibling, 0 replies; 23+ messages in thread
From: Janek Kozicki @ 2008-12-30 18:15 UTC (permalink / raw)
  To: linux-raid

Janek Kozicki said:     (by the date of Tue, 30 Dec 2008 19:12:50 +0100)

> BTW: I thought that smartctl and mdadm were cooperating together to
> find out about damaged drives, but apparently it's not the case.

I meant smartmontools daemon, not smartctl itself.

-- 
Janek Kozicki                                                         |

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: Roadmap for md/raid ???
  2008-12-30 18:12 ` Janek Kozicki
  2008-12-30 18:15   ` Janek Kozicki
@ 2009-01-19  0:54   ` Neil Brown
  2009-01-19 12:25     ` Keld Jørn Simonsen
  1 sibling, 1 reply; 23+ messages in thread
From: Neil Brown @ 2009-01-19  0:54 UTC (permalink / raw)
  To: Janek Kozicki; +Cc: linux-raid

On Tuesday December 30, janek_listy@wp.pl wrote:
> Neil Brown said:     (by the date of Fri, 19 Dec 2008 15:10:49 +1100)
> 
> > - raid5->raid6 conversion.
> >    This is also a fairly commonly asked for feature.
> 
> I can be one of the first ones to test that :)
> 

How brave do you feel?
If you clone a recent kernel then
  git pull git://neil.brown.name/md md-scratch

and compile that, you can convert a raid5 to a raid6.
Or maybe you will corrupt your data....
It worked OK for me, but I make no promises.

You cannot use mdadm.  You need to

  echo raid6 > /sys/block/mdX/md/level

The raid6 layout it uses is not the standard layout in that the
Q block is always stored on the last device rather than being
rotated around.
I plan to add support for restriping the array to get the standard
layout.  That will be a slow process and may not be worth the
effort in all cases.

There are issues with the code that make it not ready for
upstream yet.  But it is worth experimenting with if
anyone is interested...

NeilBrown

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: Roadmap for md/raid ???
  2009-01-19  0:54   ` Neil Brown
@ 2009-01-19 12:25     ` Keld Jørn Simonsen
  2009-01-19 19:03       ` thomas62186218
  0 siblings, 1 reply; 23+ messages in thread
From: Keld Jørn Simonsen @ 2009-01-19 12:25 UTC (permalink / raw)
  To: Neil Brown; +Cc: Janek Kozicki, linux-raid

I would like to see some enhancements to  the raid code:

1. removal of need to set the readahead to say 32 MiB for certain 
   raid types to get the desired performance

2. getting resync and other performance up for raid10, possibly
   using bigger buffers.

3. growing raid10.

I wonder where this is placed in the roadmap. Neil?

best regards
keld

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: Roadmap for md/raid ???
  2009-01-19 12:25     ` Keld Jørn Simonsen
@ 2009-01-19 19:03       ` thomas62186218
  2009-01-19 20:00         ` Jon Nelson
  0 siblings, 1 reply; 23+ messages in thread
From: thomas62186218 @ 2009-01-19 19:03 UTC (permalink / raw)
  To: keld, neilb; +Cc: janek_listy, linux-raid

I agree with Keld on the need for RAID 10 expansion and increasing RAID 
10 performance. Those are the hottest issues I see right now.

-Thomas


-----Original Message-----
From: Keld Jørn Simonsen <keld@dkuug.dk>
To: Neil Brown <neilb@suse.de>
Cc: Janek Kozicki <janek_listy@wp.pl>; linux-raid@vger.kernel.org
Sent: Mon, 19 Jan 2009 4:25 am
Subject: Re: Roadmap for md/raid ???










I would like to see some enhancements to  the raid code:

1. removal of need to set the readahead to say 32 MiB for certain
   raid types to get the desired performance

2. getting resync and other performance up for raid10, possibly
   using bigger buffers.

3. growing raid10.

I wonder where this is placed in the roadmap. Neil?

best regards
keld
--
To unsubscribe from this list: send the line "unsubscribe linux-raid" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html





--
To unsubscribe from this list: send the line "unsubscribe linux-raid" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: Roadmap for md/raid ???
  2009-01-19 19:03       ` thomas62186218
@ 2009-01-19 20:00         ` Jon Nelson
  2009-01-19 20:18           ` Greg Freemyer
  0 siblings, 1 reply; 23+ messages in thread
From: Jon Nelson @ 2009-01-19 20:00 UTC (permalink / raw)
  To: thomas62186218; +Cc: keld, neilb, janek_listy, linux-raid

I guess what I'd like to see more than anything else is not raid5 or
raid6 but raidN where N can be specified at the start and grown. While
I'm not a fan of ZFS's rulebreaking one thing it does (or claims to
do) strikes me as "the future" - the ability to specify X protection
bits and to increase or decrease X as needs see fit. It also strikes
me that there are several ways to do this but fundamentally it boils
down to "we don't trust our drives anymore and there we wish to
protect our data against their failure". Given 5 or 10 or 50 drives
how does one really protect their data effectively and allow their
data pool to grow without large quantities of hoop-jumping?

I'd really like to see a re-thinking of data protection (parity or
data duplication) at the block layer - it need not be RAID as we know
it but IMO something has to be done - rapidly do we near the
precipice!

On Mon, Jan 19, 2009 at 1:03 PM,  <thomas62186218@aol.com> wrote:
> I agree with Keld on the need for RAID 10 expansion and increasing RAID 10
> performance. Those are the hottest issues I see right now.
>
> -Thomas
>
>
> -----Original Message-----
> From: Keld Jørn Simonsen <keld@dkuug.dk>
> To: Neil Brown <neilb@suse.de>
> Cc: Janek Kozicki <janek_listy@wp.pl>; linux-raid@vger.kernel.org
> Sent: Mon, 19 Jan 2009 4:25 am
> Subject: Re: Roadmap for md/raid ???
>
>
>
>
>
>
>
>
>
>
> I would like to see some enhancements to  the raid code:
>
> 1. removal of need to set the readahead to say 32 MiB for certain
>  raid types to get the desired performance
>
> 2. getting resync and other performance up for raid10, possibly
>  using bigger buffers.
>
> 3. growing raid10.
>
> I wonder where this is placed in the roadmap. Neil?
>
> best regards
> keld
> --
> To unsubscribe from this list: send the line "unsubscribe linux-raid" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
>
>
>
>
>
> --
> To unsubscribe from this list: send the line "unsubscribe linux-raid" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
>

-- 
Jon
--
To unsubscribe from this list: send the line "unsubscribe linux-raid" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: Roadmap for md/raid ???
  2009-01-19 20:00         ` Jon Nelson
@ 2009-01-19 20:18           ` Greg Freemyer
  2009-01-19 20:30             ` Jon Nelson
  0 siblings, 1 reply; 23+ messages in thread
From: Greg Freemyer @ 2009-01-19 20:18 UTC (permalink / raw)
  To: Jon Nelson; +Cc: thomas62186218, keld, neilb, janek_listy, linux-raid

On Mon, Jan 19, 2009 at 3:00 PM, Jon Nelson
<jnelson-linux-raid@jamponi.net> wrote:
> I guess what I'd like to see more than anything else is not raid5 or
> raid6 but raidN where N can be specified at the start and grown. While
> I'm not a fan of ZFS's rulebreaking one thing it does (or claims to
> do) strikes me as "the future" - the ability to specify X protection
> bits and to increase or decrease X as needs see fit. It also strikes
> me that there are several ways to do this but fundamentally it boils
> down to "we don't trust our drives anymore and there we wish to
> protect our data against their failure". Given 5 or 10 or 50 drives
> how does one really protect their data effectively and allow their
> data pool to grow without large quantities of hoop-jumping?
>
> I'd really like to see a re-thinking of data protection (parity or
> data duplication) at the block layer - it need not be RAID as we know
> it but IMO something has to be done - rapidly do we near the
> precipice!

If I understand what your saying, HP started supporting "Raid
Equivalent" protection in some of their storage arrays years ago.

You simply put a bunch of disk drives in the unit, then tell you want
a 50GB logical volume with Raid 5 equivalent protection, etc.

I think it might for example allocate 10 GB from each of 6 disk
drives.  (ie. 5 for data + 1 for parity).

Then you ask for 30 GB with raid 10 equivalent protection and it might
allocate 10GB more from those same 6 drives.

Then you go back and increase the size of th 50GB raid5 and it would
allocate more space on the drives as required, but always ensuring the
data was protected at least at raid5 levels.

I sort of thought of it as an integrated LVM and Raid manager.

I suspect putting that together is a pretty large amount of effort.

Greg
-- 
Greg Freemyer
Litigation Triage Solutions Specialist
http://www.linkedin.com/in/gregfreemyer
First 99 Days Litigation White Paper -
http://www.norcrossgroup.com/forms/whitepapers/99%20Days%20whitepaper.pdf

The Norcross Group
The Intersection of Evidence & Technology
http://www.norcrossgroup.com

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: Roadmap for md/raid ???
  2009-01-19 20:18           ` Greg Freemyer
@ 2009-01-19 20:30             ` Jon Nelson
  0 siblings, 0 replies; 23+ messages in thread
From: Jon Nelson @ 2009-01-19 20:30 UTC (permalink / raw)
  To: Greg Freemyer; +Cc: thomas62186218, keld, neilb, janek_listy, linux-raid

It needn't be all that difficult. Consider the following:

you have a block layer driver which performs the following duties:

it manages a pool of N block devices
it exposes a portion of that storage as a single block device
the unexposed portion goes towards parity (in the vein of providing at
least N bits of parity protection)
the level of parity protection is applied to each block of size X. For
example a 64K block might consist of 48K of data and 16K of parity.
on top of that, the data+parity could be mirrored on N other devices,
anywhere, so long as it is mirrored on at least M other devices.

I'm just kinda thinking out loud. The duplicate block stuff could just
as easily be an md1 or md10 layer, so a new raid level might only have
to handle the parity stuff.


On Mon, Jan 19, 2009 at 2:18 PM, Greg Freemyer <greg.freemyer@gmail.com> wrote:
> On Mon, Jan 19, 2009 at 3:00 PM, Jon Nelson
> <jnelson-linux-raid@jamponi.net> wrote:
>> I guess what I'd like to see more than anything else is not raid5 or
>> raid6 but raidN where N can be specified at the start and grown. While
>> I'm not a fan of ZFS's rulebreaking one thing it does (or claims to
>> do) strikes me as "the future" - the ability to specify X protection
>> bits and to increase or decrease X as needs see fit. It also strikes
>> me that there are several ways to do this but fundamentally it boils
>> down to "we don't trust our drives anymore and there we wish to
>> protect our data against their failure". Given 5 or 10 or 50 drives
>> how does one really protect their data effectively and allow their
>> data pool to grow without large quantities of hoop-jumping?
>>
>> I'd really like to see a re-thinking of data protection (parity or
>> data duplication) at the block layer - it need not be RAID as we know
>> it but IMO something has to be done - rapidly do we near the
>> precipice!
>
> If I understand what your saying, HP started supporting "Raid
> Equivalent" protection in some of their storage arrays years ago.
>
> You simply put a bunch of disk drives in the unit, then tell you want
> a 50GB logical volume with Raid 5 equivalent protection, etc.
>
> I think it might for example allocate 10 GB from each of 6 disk
> drives.  (ie. 5 for data + 1 for parity).
>
> Then you ask for 30 GB with raid 10 equivalent protection and it might
> allocate 10GB more from those same 6 drives.
>
> Then you go back and increase the size of th 50GB raid5 and it would
> allocate more space on the drives as required, but always ensuring the
> data was protected at least at raid5 levels.
>
> I sort of thought of it as an integrated LVM and Raid manager.
>
> I suspect putting that together is a pretty large amount of effort.
>
> Greg
> --
> Greg Freemyer
> Litigation Triage Solutions Specialist
> http://www.linkedin.com/in/gregfreemyer
> First 99 Days Litigation White Paper -
> http://www.norcrossgroup.com/forms/whitepapers/99%20Days%20whitepaper.pdf
>
> The Norcross Group
> The Intersection of Evidence & Technology
> http://www.norcrossgroup.com
>



-- 
Jon

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: Roadmap for md/raid ???
  2008-12-19  4:10 Roadmap for md/raid ??? Neil Brown
  2008-12-19 15:44 ` Chris Worley
  2008-12-30 18:12 ` Janek Kozicki
@ 2009-01-11 18:14 ` Piergiorgio Sartor
  2009-01-19  1:40   ` Neil Brown
  2009-01-14 20:43 ` Bill Davidsen
  3 siblings, 1 reply; 23+ messages in thread
From: Piergiorgio Sartor @ 2009-01-11 18:14 UTC (permalink / raw)
  To: linux-raid

Hi,

something else you can add to your "todo list",
or not, as you like.

RAID-5/6 with heterogeneous devices.
The idea would be to build a RAID-5/6 with devices
of different size, but using, whenever possible,
the complete available space.
Example: let's say we have 6 HDs, 3x 500GB and 3x 1TB.
The idea is to have a single md device, in RAID-5,
where the first part of the device (500GB range) uses
the whole 6 HDs, while the second part uses only 3.
Specifically, the first part will have 500GB x5
and the second 500GB x2.
This is already doable by partitioning the HDs properly
before and creating 2 md devices.
Nevertheless, given the "grow" possibility, in both
directions (disk number, disk size), the RAID only
solution would be interesting.

RAID-5/6 pre-caching.
Maybe this one is already there, in any case the
idea would be to try to cache a complete stripe
set, in case of RAID-5/6, in order to avoid sub
sequent reads in case of write.
This could be a user switchable parameter, for
example when the user knows there will be a lot
of read-modify-write to the array.

Hope this helps.

bye,

pg

On Fri, Dec 19, 2008 at 03:10:49PM +1100, Neil Brown wrote:
> 
> 
> Not really a roadmap, more a few tourist attractions that you might
> see on the way if you stick around (and if I stick around)...
> 
> Comments welcome.
> 
> NeilBrown
> 
> 
> - Bad block list
>   The idea here is to maintain and store on each device a list of
>   blocks that are known to be 'bad'.  This effectively allows us to
>   fail a single block rather than a whole device when we get a media
>   write error.  Of course if updating the bad-block-list gives an
>   error we then have to fail the device.
> 
>   We would also record a bad block if we get a read error on a degraded
>   array.  This would e.g. allow recovery for a degraded raid1 where the
>   sole remaining device has a bad block.
> 
>   An array could have multiple errors on different devices and just
>   those stripes would be considered to be "degraded".  As long a no
>   single stripe had too many bad blocks, the data would still be safe.
>   Naturally as soon as you get one bad block, the array becomes
>   susceptible to data loss on a single device failure, so it wouldn't
>   be advisable to run with non-empty badblock lists for an extended
>   length of time,  However it might provide breathing space until
>   drive replacement can be achieved.
> 
> - hot-device-replace
>   This is probably the most asked for feature of late.  It would allow
>   a device to be 'recovered' while the original was still in service. 
>   So instead of failing out a device and adding a spare, you can add
>   the spare, build the data onto it, then fail out the device.
> 
>   This meshes well with the bad block list.  When we find a bad block,
>   we start a hot-replace onto a spare (if one exists).  If sleeping
>   bad blocks are discovered during the hot-replace process, we don't
>   lose the data unless we find two bad blocks in the same stripe.
>   And then we just lose data in that stripe.
> 
>   Recording in the metadata that a hot-replace was happening might be
>   a little tricky, so it could be that if you reboot in the middle,
>   you would have to restart from the beginning.  Similarly there would
>   be no 'intent' bitmap involved for this resync.
> 
>   Each personality would have to implement much of this independently,
>   effectively providing a mini raid1 implementation.  It would be very
>   minimal without e.g. read balancing or write-behind etc.
> 
>   There would be no point implementing this in raid1.  Just
>   raid456 and raid10.
>   It could conceivably make sense for raid0 and linear, but that is
>   very unlikely to be implemented.
> 
> - split-mirror
>   This is really a function of mdadm rather than md.  It is already
>   quite possible to break a mirror into two separate single-device
>   arrays.  However it is a sufficiently common operation that it is
>   probably making it very easy to do with mdadm.
>   I'm thinking something like
>       mdadm --create /dev/md/new --split /dev/md/old
> 
>   will create a new raid1 by taking one device off /dev/md/old (which
>   must be a raid1) and making an array with exactly the right metadata
>   and size.
> 
> - raid5->raid6 conversion.
>    This is also a fairly commonly asked for feature.
>    The first step would be to define a raid6 layout where the Q block
>    was not rotated around the devices but was always on the last
>    device.  Then we could change a raid5 to a singly-degraded raid6
>    without moving any data.
> 
>    The next step would be to implement in-place restriping. 
>    This involves 
>       - freezing a section of the array (all IO blocks)
>       - copying the data out to a safe backup
>       - copying it back in with the new layout
>       - updating the metadata to indicate that the restripe has
>         progressed.
>       - repeat.
> 
>    This would probably be quite slow but it would achieve the desired
>    result. 
> 
>    Once we have in-place restriping we could change chunksize as
>    well.
> 
> - raid5 reduce number of devices.
>    We can currently restripe a raid5 (or 6) over a larger number of
>    devices but not over a smaller number of devices.  That means you
>    cannot undo an increase that you didn't want.
> 
>    It might be nice to allow this to happen at the same time as
>    increasing --size (if the devices are big enough) to allow the
>    array to be restriped without changing the available space.
> 
> - cluster raid1
>    Allow a raid1 to be assembled on multiple hosts that share some
>    drives, so a cluster filesystem (e.g. ocfs2) can be run over it.
>    It requires co-ordination to handle failure events and
>    resync/recovery.  Most of this would probably be done in userspace.
> --
> To unsubscribe from this list: send the line "unsubscribe linux-raid" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html

-- 

piergiorgio

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: Roadmap for md/raid ???
  2009-01-11 18:14 ` Piergiorgio Sartor
@ 2009-01-19  1:40   ` Neil Brown
  2009-01-19 18:19     ` Piergiorgio Sartor
  0 siblings, 1 reply; 23+ messages in thread
From: Neil Brown @ 2009-01-19  1:40 UTC (permalink / raw)
  To: Piergiorgio Sartor; +Cc: linux-raid

On Sunday January 11, piergiorgio.sartor@nexgo.de wrote:
> Hi,
> 
> something else you can add to your "todo list",
> or not, as you like.
> 
> RAID-5/6 with heterogeneous devices.
> The idea would be to build a RAID-5/6 with devices
> of different size, but using, whenever possible,
> the complete available space.
> Example: let's say we have 6 HDs, 3x 500GB and 3x 1TB.
> The idea is to have a single md device, in RAID-5,
> where the first part of the device (500GB range) uses
> the whole 6 HDs, while the second part uses only 3.
> Specifically, the first part will have 500GB x5
> and the second 500GB x2.
> This is already doable by partitioning the HDs properly
> before and creating 2 md devices.
> Nevertheless, given the "grow" possibility, in both
> directions (disk number, disk size), the RAID only
> solution would be interesting.

I've thought about this occasionally but don't think much of the idea.
It seems nice until you think about what happens when devices fail
and you need to integrate hot spares.
Clearly any spare will need to be as big as the largest device.
When that get integrated in place of a small device, you will be
wasting space on it, and then someone will want to be able to
grow the array to use that extra space, which would be rather
messy.

I think it is best to assume that all devices are the same size.
Trying to support anything else in a useful way would just add
complexity with little value.

> 
> RAID-5/6 pre-caching.  Maybe this one is already there, in any case
> the idea would be to try to cache a complete stripe set, in case of
> RAID-5/6, in order to avoid sub sequent reads in case of write.
> This could be a user switchable parameter, for example when the user
> knows there will be a lot of read-modify-write to the array.

We already do some degree of caching. This primarily exists so that
we can be working on multiple stripes at once.  If there are multiple
write accesses to a stripe while it stays in cache you could save some
reads, but I suspect that most times stripes fall out of cache before
they are used again.  That cache can be made bigger, but I'm not sure
it would help a lot..

Thanks for the ideas.

NeilBrown

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: Roadmap for md/raid ???
  2009-01-19  1:40   ` Neil Brown
@ 2009-01-19 18:19     ` Piergiorgio Sartor
  2009-01-19 18:26       ` Peter Rabbitson
  2009-01-19 21:08       ` Keld Jørn Simonsen
  0 siblings, 2 replies; 23+ messages in thread
From: Piergiorgio Sartor @ 2009-01-19 18:19 UTC (permalink / raw)
  To: linux-raid

Hi,

> > RAID-5/6 with heterogeneous devices.
[...]
> I've thought about this occasionally but don't think much of the idea.
> It seems nice until you think about what happens when devices fail
> and you need to integrate hot spares.
> Clearly any spare will need to be as big as the largest device.
> When that get integrated in place of a small device, you will be
> wasting space on it, and then someone will want to be able to
> grow the array to use that extra space, which would be rather
> messy.
> 
> I think it is best to assume that all devices are the same size.
> Trying to support anything else in a useful way would just add
> complexity with little value.

I see your point and I also agree that complexity
might be too much.

Nevertheless I disagree about the "wasting space",
since I can see a scenario where there is even
more wasting.

Let's assume we have a RAID-5 with 7 disks.
Each time an HD fails, we will replace it with a
larger one, since this will be cheaper.
Once we replaced 3 HDs, with could use the unused
space in RAID-5 again.
Unfortunately, with the current features, we will
have to wait to fail all the 7 disks.
So, with 6 HDs replaced with larger ones, we will
have a lot of wasted space.
Of course, unless we did a clever partitioning at
very the beginning.

But again, I do agree the complexity might not
pay off the advantages, maybe better the "clever
partitioning".

Thanks,

bye,

-- 

piergiorgio

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: Roadmap for md/raid ???
  2009-01-19 18:19     ` Piergiorgio Sartor
@ 2009-01-19 18:26       ` Peter Rabbitson
  2009-01-19 18:41         ` Piergiorgio Sartor
  2009-01-19 21:08       ` Keld Jørn Simonsen
  1 sibling, 1 reply; 23+ messages in thread
From: Peter Rabbitson @ 2009-01-19 18:26 UTC (permalink / raw)
  To: Piergiorgio Sartor; +Cc: linux-raid

Piergiorgio Sartor wrote:
> Hi,
> 
>>> RAID-5/6 with heterogeneous devices.
> [...]
>> I've thought about this occasionally but don't think much of the idea.
>> It seems nice until you think about what happens when devices fail
>> and you need to integrate hot spares.
>> Clearly any spare will need to be as big as the largest device.
>> When that get integrated in place of a small device, you will be
>> wasting space on it, and then someone will want to be able to
>> grow the array to use that extra space, which would be rather
>> messy.
>>
>> I think it is best to assume that all devices are the same size.
>> Trying to support anything else in a useful way would just add
>> complexity with little value.
> 
> I see your point and I also agree that complexity
> might be too much.
> 
> Nevertheless I disagree about the "wasting space",
> since I can see a scenario where there is even
> more wasting.
> 
> Let's assume we have a RAID-5 with 7 disks.
> Each time an HD fails, we will replace it with a
> larger one, since this will be cheaper.
> Once we replaced 3 HDs, with could use the unused
> space in RAID-5 again.
> Unfortunately, with the current features, we will
> have to wait to fail all the 7 disks.
> So, with 6 HDs replaced with larger ones, we will
> have a lot of wasted space.
> Of course, unless we did a clever partitioning at
> very the beginning.
> 

Or alternatively you use LVM from the start. Then you only need to
"cleverly partition" the new drives, use the first partition in the old
RAID, use the second partition for a new smaller raid, join the new raid
to the VG of the old raid and voilla - no wasted space.

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: Roadmap for md/raid ???
  2009-01-19 18:26       ` Peter Rabbitson
@ 2009-01-19 18:41         ` Piergiorgio Sartor
  0 siblings, 0 replies; 23+ messages in thread
From: Piergiorgio Sartor @ 2009-01-19 18:41 UTC (permalink / raw)
  To: linux-raid

Hi,

> Or alternatively you use LVM from the start. Then you only need to
> "cleverly partition" the new drives, use the first partition in the old
> RAID, use the second partition for a new smaller raid, join the new raid
> to the VG of the old raid and voilla - no wasted space.

I think that's exactly what I'm doing right now
with my test array.

Basically I've 7 disks, all, but two, different
in size.
I created partitions, of equal size between disks,
than I configured 4 RAID-6 devices (with 7, 6, 5, 4
partitions respectively) and put everything together
with LVM.

This works fine, it is possible to add HDs, or change
the old ones with bigger new ones.

One issue is that, with time, the larger HDs will
have many partitions, this led to use GPT table
(max 128 partitions) instead of DOS.
Metadata needs to be 1.x, of course.

The second point is the "annoyance" of creating
those partitions on the new HDs and adding them
to the proper md devices.

Of course, nothing particularly difficult, nevertheless
it would be nice to have everything integrated into
the md driver and transparent to the user.

Just an idea of improvement, nothing more.

Side note: if anybody has suggestions on how to script
the complete procedure of adding/replacing an HD in
such configuration, I'll appreciate it a lot!

Thanks,

bye,

-- 

piergiorgio

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: Roadmap for md/raid ???
  2009-01-19 18:19     ` Piergiorgio Sartor
  2009-01-19 18:26       ` Peter Rabbitson
@ 2009-01-19 21:08       ` Keld Jørn Simonsen
  1 sibling, 0 replies; 23+ messages in thread
From: Keld Jørn Simonsen @ 2009-01-19 21:08 UTC (permalink / raw)
  To: Piergiorgio Sartor; +Cc: linux-raid

On Mon, Jan 19, 2009 at 07:19:19PM +0100, Piergiorgio Sartor wrote:
> Hi,
> 
> Let's assume we have a RAID-5 with 7 disks.
> Each time an HD fails, we will replace it with a
> larger one, since this will be cheaper.
> Once we replaced 3 HDs, with could use the unused
> space in RAID-5 again.
> Unfortunately, with the current features, we will
> have to wait to fail all the 7 disks.

Nah, when you have 2 new disks, you can put the excessive space into a
new raid, which should be able to grow, with further new disks to arrive. 
But in principle you are right, it would be nice to use dirfferent
age/size disks in a clever way.

Best regards
Keld

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: Roadmap for md/raid ???
  2008-12-19  4:10 Roadmap for md/raid ??? Neil Brown
                   ` (2 preceding siblings ...)
  2009-01-11 18:14 ` Piergiorgio Sartor
@ 2009-01-14 20:43 ` Bill Davidsen
  2009-01-19  2:05   ` Neil Brown
  3 siblings, 1 reply; 23+ messages in thread
From: Bill Davidsen @ 2009-01-14 20:43 UTC (permalink / raw)
  To: Neil Brown; +Cc: linux-raid

Neil Brown wrote:
> Not really a roadmap, more a few tourist attractions that you might
> see on the way if you stick around (and if I stick around)...
>
>   
Thanks for sharing, although that last comment is a little worrisome.

> Comments welcome.
>
>   
Here's one, is this in some sense a prioritized list? If so I might 
comment on the order, and I'm sure others would feel even more strongly 
than I. ;-)

> NeilBrown
>
>
> - Bad block list
>   The idea here is to maintain and store on each device a list of
>   blocks that are known to be 'bad'.  This effectively allows us to
>   fail a single block rather than a whole device when we get a media
>   write error.  Of course if updating the bad-block-list gives an
>   error we then have to fail the device.
>
>   
In terms of improving reliability this sounds good, and of course it's a 
required step toward doing data relocation in md instead of depending on 
the drive to do relocation. That's a comment, not a request or even 
suggestion, but in some cases it could open possibilities.

>   We would also record a bad block if we get a read error on a degraded
>   array.  This would e.g. allow recovery for a degraded raid1 where the
>   sole remaining device has a bad block.
>
>   An array could have multiple errors on different devices and just
>   those stripes would be considered to be "degraded".  As long a no
>   single stripe had too many bad blocks, the data would still be safe.
>   Naturally as soon as you get one bad block, the array becomes
>   susceptible to data loss on a single device failure, so it wouldn't
>   be advisable to run with non-empty badblock lists for an extended
>   length of time,  However it might provide breathing space until
>   drive replacement can be achieved.
>
> - hot-device-replace
>   This is probably the most asked for feature of late.  It would allow
>   a device to be 'recovered' while the original was still in service. 
>   So instead of failing out a device and adding a spare, you can add
>   the spare, build the data onto it, then fail out the device.
>
>   This meshes well with the bad block list.  When we find a bad block,
>   we start a hot-replace onto a spare (if one exists).  If sleeping
>   bad blocks are discovered during the hot-replace process, we don't
>   lose the data unless we find two bad blocks in the same stripe.
>   And then we just lose data in that stripe.
>
>   
This certainly is a solution to some growth issues, currently this can 
pretty well be done manually from a rescue boot, but not by the average 
user.

>   Recording in the metadata that a hot-replace was happening might be
>   a little tricky, so it could be that if you reboot in the middle,
>   you would have to restart from the beginning.  Similarly there would
>   be no 'intent' bitmap involved for this resync.
>
>   Each personality would have to implement much of this independently,
>   effectively providing a mini raid1 implementation.  It would be very
>   minimal without e.g. read balancing or write-behind etc.
>
>   There would be no point implementing this in raid1.  Just
>   raid456 and raid10.
>   It could conceivably make sense for raid0 and linear, but that is
>   very unlikely to be implemented.
>
> - split-mirror
>   This is really a function of mdadm rather than md.  It is already
>   quite possible to break a mirror into two separate single-device
>   arrays.  However it is a sufficiently common operation that it is
>   probably making it very easy to do with mdadm.
>   I'm thinking something like
>       mdadm --create /dev/md/new --split /dev/md/old
>
>   will create a new raid1 by taking one device off /dev/md/old (which
>   must be a raid1) and making an array with exactly the right metadata
>   and size.
>
> - raid5->raid6 conversion.
>    This is also a fairly commonly asked for feature.
>    The first step would be to define a raid6 layout where the Q block
>    was not rotated around the devices but was always on the last
>    device.  Then we could change a raid5 to a singly-degraded raid6
>    without moving any data.
>
>    The next step would be to implement in-place restriping. 
>    This involves 
>       - freezing a section of the array (all IO blocks)
>       - copying the data out to a safe backup
>       - copying it back in with the new layout
>       - updating the metadata to indicate that the restripe has
>         progressed.
>       - repeat.
>
>   
It would seem very safe, something like
 1 - call the chunk on the new drive the available space
 2 - determine what needs to be in the available space
 3 - if data, copy the data chunk to the available chunk, mark the old 
location avail, repeat step 2
 4 - Q goes in the available chunk, calculate it and the stripe is done
I don't see the move to a safe backup if you move one chunk at a time 
until you are ready for Q, unless there are moves I'm missing. You 
always have a free space to move one chunk, when all data is in the 
right place and the P value is in place (does it move?), then Q is 
calculated and saved. In other words, no out of stripe storage needed.

>    This would probably be quite slow but it would achieve the desired
>    result. 
>
>   
It would depend on how many moves were needed, I guess, but slow seems 
likely.
>    Once we have in-place restriping we could change chunksize as
>    well.
>
> - raid5 reduce number of devices.
>    We can currently restripe a raid5 (or 6) over a larger number of
>    devices but not over a smaller number of devices.  That means you
>    cannot undo an increase that you didn't want.
>
>   
The more common case might be that the prices are in free fall, and 
drives a few months old are obsolete and should be replaces with fewer 
and larger drives to save power and boost reliability.

>    It might be nice to allow this to happen at the same time as
>    increasing --size (if the devices are big enough) to allow the
>    array to be restriped without changing the available space.
>
> - cluster raid1
>    Allow a raid1 to be assembled on multiple hosts that share some
>    drives, so a cluster filesystem (e.g. ocfs2) can be run over it.
>    It requires co-ordination to handle failure events and
>    resync/recovery.  Most of this would probably be done in userspace.
> --
> To unsubscribe from this list: send the line "unsubscribe linux-raid" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
>
>   


-- 
Bill Davidsen <davidsen@tmr.com>
  "Woe unto the statesman who makes war without a reason that will still
  be valid when the war is over..." Otto von Bismark 



^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: Roadmap for md/raid ???
  2009-01-14 20:43 ` Bill Davidsen
@ 2009-01-19  2:05   ` Neil Brown
       [not found]     ` <49740C81.2030502@tmr.com>
  0 siblings, 1 reply; 23+ messages in thread
From: Neil Brown @ 2009-01-19  2:05 UTC (permalink / raw)
  To: Bill Davidsen; +Cc: linux-raid

On Wednesday January 14, davidsen@tmr.com wrote:
> Neil Brown wrote:
> > Not really a roadmap, more a few tourist attractions that you might
> > see on the way if you stick around (and if I stick around)...
> >
> >   
> Thanks for sharing, although that last comment is a little worrisome.

No one in irreplaceable... but I'm not planning on going anywhere
just now :-)

> 
> > Comments welcome.
> >
> >   
> Here's one, is this in some sense a prioritized list? If so I might 
> comment on the order, and I'm sure others would feel even more strongly 
> than I. ;-)

No, the list isn't prioritised.  I'll do things as the mood takes me.

> > - raid5->raid6 conversion.
> >    This is also a fairly commonly asked for feature.
> >    The first step would be to define a raid6 layout where the Q block
> >    was not rotated around the devices but was always on the last
> >    device.  Then we could change a raid5 to a singly-degraded raid6
> >    without moving any data.
> >
> >    The next step would be to implement in-place restriping. 
> >    This involves 
> >       - freezing a section of the array (all IO blocks)
> >       - copying the data out to a safe backup
> >       - copying it back in with the new layout
> >       - updating the metadata to indicate that the restripe has
> >         progressed.
> >       - repeat.
> >
> >   
> It would seem very safe, something like
>  1 - call the chunk on the new drive the available space
>  2 - determine what needs to be in the available space
>  3 - if data, copy the data chunk to the available chunk, mark the old 
> location avail, repeat step 2
>  4 - Q goes in the available chunk, calculate it and the stripe is done
> I don't see the move to a safe backup if you move one chunk at a time 
> until you are ready for Q, unless there are moves I'm missing. You 
> always have a free space to move one chunk, when all data is in the 
> right place and the P value is in place (does it move?), then Q is 
> calculated and saved. In other words, no out of stripe storage needed.

The difficulty is "How do you handle a crash?"
If you crash and restart in the middle of a reshape, you need to know
where all the data is.
To follow your scheme, you would need to update the metadata for
every block that is moved.

The value of having a large avail space is that you only update the
metadata only every time the space fills up.



Thanks for your thoughts.

NeilBrown

^ permalink raw reply	[flat|nested] 23+ messages in thread

[parent not found: <49740C81.2030502@tmr.com>]

* Re: Roadmap for md/raid ???
       [not found]     ` <49740C81.2030502@tmr.com>
@ 2009-01-19 22:32       ` Neil Brown
  2009-01-21 17:04         ` Bill Davidsen
  0 siblings, 1 reply; 23+ messages in thread
From: Neil Brown @ 2009-01-19 22:32 UTC (permalink / raw)
  To: Bill Davidsen; +Cc: linux-raid

On Monday January 19, davidsen@tmr.com wrote:
> Neil Brown wrote:
> >
> > The difficulty is "How do you handle a crash?"
> > If you crash and restart in the middle of a reshape, you need to know
> > where all the data is.
> > To follow your scheme, you would need to update the metadata for
> > every block that is moved.
> >
> >   
> That would seem to be an issue doing it in a large task as well, if I 
> understand what you want to do. After the data have been written to the 
> spare space I believe you are going to write it back. If you update the 
> metadata before the write and something happens you have old metadata 
> but data in another format. Conversely, if you update the metadata 
> first, and something happens, you have the metadata describing a change 
> which didn't happen. So You would have to have a metadata reflecting the 
> data not being in the array, but actually somewhere else, or you would 
> have to keep the spare space inside the array, where room might not be 
> available.

I "freeze" a section of the array so that any writes block,
I copy that data to somewhere safe, and mark that copy as being a
valid copy of the moved data.
Then I copy it back in the new layout and update the array metadata to
show that the change-over point between old-layout and new-layout has
changed.  Then I invalidate the copy.
Then repeat.

When restarting an array, mdadm checks the copy (which could be in a
file on a different device) and if it is valid, it is copied on to
the array and the metadata is updated.

NeilBrown

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: Roadmap for md/raid ???
  2009-01-19 22:32       ` Neil Brown
@ 2009-01-21 17:04         ` Bill Davidsen
  0 siblings, 0 replies; 23+ messages in thread
From: Bill Davidsen @ 2009-01-21 17:04 UTC (permalink / raw)
  To: Neil Brown; +Cc: linux-raid

Neil Brown wrote:
> On Monday January 19, davidsen@tmr.com wrote:
>   
>> Neil Brown wrote:
>>     
>>> The difficulty is "How do you handle a crash?"
>>> If you crash and restart in the middle of a reshape, you need to know
>>> where all the data is.
>>> To follow your scheme, you would need to update the metadata for
>>> every block that is moved.
>>>
>>>   
>>>       
>> That would seem to be an issue doing it in a large task as well, if I 
>> understand what you want to do. After the data have been written to the 
>> spare space I believe you are going to write it back. If you update the 
>> metadata before the write and something happens you have old metadata 
>> but data in another format. Conversely, if you update the metadata 
>> first, and something happens, you have the metadata describing a change 
>> which didn't happen. So You would have to have a metadata reflecting the 
>> data not being in the array, but actually somewhere else, or you would 
>> have to keep the spare space inside the array, where room might not be 
>> available.
>>     
>
> I "freeze" a section of the array so that any writes block,
> I copy that data to somewhere safe, and mark that copy as being a
> valid copy of the moved data.
> Then I copy it back in the new layout and update the array metadata to
> show that the change-over point between old-layout and new-layout has
> changed.  Then I invalidate the copy.
> Then repeat.
>
> When restarting an array, mdadm checks the copy (which could be in a
> file on a different device) and if it is valid, it is copied on to
> the array and the metadata is updated.
>   

Sounds good...

-- 
Bill Davidsen <davidsen@tmr.com>
  "Woe unto the statesman who makes war without a reason that will still
  be valid when the war is over..." Otto von Bismark 



^ permalink raw reply	[flat|nested] 23+ messages in thread

* Aw: Roadmap for md/raid ???
@ 2008-12-19  9:01 piergiorgio.sartor
  2008-12-19 17:01 ` Dan Williams
  0 siblings, 1 reply; 23+ messages in thread
From: piergiorgio.sartor @ 2008-12-19  9:01 UTC (permalink / raw)
  To: neilb, linux-raid

Hi again,

you forgot the "clever RAID-6 check" :-)

Since we are at it, I would like to propose a "concept",
the idea itself is not completely clear in my mind, but
maybe you could find your way in it (or not).

Since it seems bitmaps are so beloved, I would like to
propose the possibility to provide the RAID with an
external bitmap.
For what? You'll ask...

To make the array a bit more filesystem aware.

Let's assume we have an md device which is 20%
full. Of course the md itself does not know this,
only the FS knows it.

In case of HD failure and resync, the md will start
the operations from the beginning to the end.
On the other hand,  it would be wise to sync first the
used blocks (20% in the example) and later the others,
if necessary at all.
Because the array will get faster to a "safe state".

In order to do this, some filesystem tool could
provide a "priority map", telling the array which
blocks should be synchronized first and which last.
Probably there is more, like set the md read-only,
not sure about that.

So, from the md point of view, there will be a file,
with some bitmap information, and it will start
a resync accordingly to that bitmap.

How does it sound? Does it make sense at all?

Hope this could give some ideas to improve the
already wonderful md subsystems!

For the road map you proposed, I've no major
comments, just my personal priorities would be
raid5->raid6 conversion and hot-device-replace.

Thanks,

bye,

-- 

pg

----- Original Nachricht ----
Von:     Neil Brown <neilb@suse.de>
An:      linux-raid@vger.kernel.org
Datum:   19.12.2008 05:10
Betreff: Roadmap for md/raid ???

> 
> 
> Not really a roadmap, more a few tourist attractions that you might
> see on the way if you stick around (and if I stick around)...
> 
> Comments welcome.
> 
> NeilBrown
> 
> 
> - Bad block list
>   The idea here is to maintain and store on each device a list of
>   blocks that are known to be 'bad'.  This effectively allows us to
>   fail a single block rather than a whole device when we get a media
>   write error.  Of course if updating the bad-block-list gives an
>   error we then have to fail the device.
> 
>   We would also record a bad block if we get a read error on a degraded
>   array.  This would e.g. allow recovery for a degraded raid1 where the
>   sole remaining device has a bad block.
> 
>   An array could have multiple errors on different devices and just
>   those stripes would be considered to be "degraded".  As long a no
>   single stripe had too many bad blocks, the data would still be safe.
>   Naturally as soon as you get one bad block, the array becomes
>   susceptible to data loss on a single device failure, so it wouldn't
>   be advisable to run with non-empty badblock lists for an extended
>   length of time,  However it might provide breathing space until
>   drive replacement can be achieved.
> 
> - hot-device-replace
>   This is probably the most asked for feature of late.  It would allow
>   a device to be 'recovered' while the original was still in service. 
>   So instead of failing out a device and adding a spare, you can add
>   the spare, build the data onto it, then fail out the device.
> 
>   This meshes well with the bad block list.  When we find a bad block,
>   we start a hot-replace onto a spare (if one exists).  If sleeping
>   bad blocks are discovered during the hot-replace process, we don't
>   lose the data unless we find two bad blocks in the same stripe.
>   And then we just lose data in that stripe.
> 
>   Recording in the metadata that a hot-replace was happening might be
>   a little tricky, so it could be that if you reboot in the middle,
>   you would have to restart from the beginning.  Similarly there would
>   be no 'intent' bitmap involved for this resync.
> 
>   Each personality would have to implement much of this independently,
>   effectively providing a mini raid1 implementation.  It would be very
>   minimal without e.g. read balancing or write-behind etc.
> 
>   There would be no point implementing this in raid1.  Just
>   raid456 and raid10.
>   It could conceivably make sense for raid0 and linear, but that is
>   very unlikely to be implemented.
> 
> - split-mirror
>   This is really a function of mdadm rather than md.  It is already
>   quite possible to break a mirror into two separate single-device
>   arrays.  However it is a sufficiently common operation that it is
>   probably making it very easy to do with mdadm.
>   I'm thinking something like
>       mdadm --create /dev/md/new --split /dev/md/old
> 
>   will create a new raid1 by taking one device off /dev/md/old (which
>   must be a raid1) and making an array with exactly the right metadata
>   and size.
> 
> - raid5->raid6 conversion.
>    This is also a fairly commonly asked for feature.
>    The first step would be to define a raid6 layout where the Q block
>    was not rotated around the devices but was always on the last
>    device.  Then we could change a raid5 to a singly-degraded raid6
>    without moving any data.
> 
>    The next step would be to implement in-place restriping. 
>    This involves 
>       - freezing a section of the array (all IO blocks)
>       - copying the data out to a safe backup
>       - copying it back in with the new layout
>       - updating the metadata to indicate that the restripe has
>         progressed.
>       - repeat.
> 
>    This would probably be quite slow but it would achieve the desired
>    result. 
> 
>    Once we have in-place restriping we could change chunksize as
>    well.
> 
> - raid5 reduce number of devices.
>    We can currently restripe a raid5 (or 6) over a larger number of
>    devices but not over a smaller number of devices.  That means you
>    cannot undo an increase that you didn't want.
> 
>    It might be nice to allow this to happen at the same time as
>    increasing --size (if the devices are big enough) to allow the
>    array to be restriped without changing the available space.
> 
> - cluster raid1
>    Allow a raid1 to be assembled on multiple hosts that share some
>    drives, so a cluster filesystem (e.g. ocfs2) can be run over it.
>    It requires co-ordination to handle failure events and
>    resync/recovery.  Most of this would probably be done in userspace.
> --
> To unsubscribe from this list: send the line "unsubscribe linux-raid" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> 

Jetzt komfortabel bei Arcor-Digital TV einsteigen: Mehr Happy Ends, mehr Herzschmerz, mehr Fernsehen! Erleben Sie 50 digitale TV Programme und optional 60 Pay TV Sender, einen elektronischen Programmführer mit Movie Star Bewertungen von TV Movie. Außerdem, aktuelle Filmhits und spannende Dokus in der Arcor-Videothek. Infos unter www.arcor.de/tv
--
To unsubscribe from this list: send the line "unsubscribe linux-raid" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: Roadmap for md/raid ???
  2008-12-19  9:01 Aw: " piergiorgio.sartor
@ 2008-12-19 17:01 ` Dan Williams
  0 siblings, 0 replies; 23+ messages in thread
From: Dan Williams @ 2008-12-19 17:01 UTC (permalink / raw)
  To: piergiorgio.sartor; +Cc: neilb, linux-raid

On Fri, Dec 19, 2008 at 2:01 AM,  <piergiorgio.sartor@nexgo.de> wrote:
> In case of HD failure and resync, the md will start
> the operations from the beginning to the end.
> On the other hand,  it would be wise to sync first the
> used blocks (20% in the example) and later the others,
> if necessary at all.
> Because the array will get faster to a "safe state".
>

One idea/question is whether we can take advantage of the trim/discard
commands added for SSD support to allow filesystems to tell MD which
parts of the array it cares about.  So for a lightly populated array
we can try to treat free space as a second class citizen.

--
Dan

^ permalink raw reply	[flat|nested] 23+ messages in thread

end of thread, other threads:[~2009-01-21 17:04 UTC | newest]

Thread overview: 23+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2008-12-19  4:10 Roadmap for md/raid ??? Neil Brown
2008-12-19 15:44 ` Chris Worley
2008-12-19 15:51   ` Justin Piszcz
2008-12-19 16:13     ` Bernd Schubert
2008-12-30 18:12 ` Janek Kozicki
2008-12-30 18:15   ` Janek Kozicki
2009-01-19  0:54   ` Neil Brown
2009-01-19 12:25     ` Keld Jørn Simonsen
2009-01-19 19:03       ` thomas62186218
2009-01-19 20:00         ` Jon Nelson
2009-01-19 20:18           ` Greg Freemyer
2009-01-19 20:30             ` Jon Nelson
2009-01-11 18:14 ` Piergiorgio Sartor
2009-01-19  1:40   ` Neil Brown
2009-01-19 18:19     ` Piergiorgio Sartor
2009-01-19 18:26       ` Peter Rabbitson
2009-01-19 18:41         ` Piergiorgio Sartor
2009-01-19 21:08       ` Keld Jørn Simonsen
2009-01-14 20:43 ` Bill Davidsen
2009-01-19  2:05   ` Neil Brown
     [not found]     ` <49740C81.2030502@tmr.com>
2009-01-19 22:32       ` Neil Brown
2009-01-21 17:04         ` Bill Davidsen
  -- strict thread matches above, loose matches on Subject: below --
2008-12-19  9:01 Aw: " piergiorgio.sartor
2008-12-19 17:01 ` Dan Williams

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).