* Auto Rebuild on hot-plug
@ 2010-03-25 0:35 Neil Brown
2010-03-25 2:47 ` Michael Evans
` (5 more replies)
0 siblings, 6 replies; 33+ messages in thread
From: Neil Brown @ 2010-03-25 0:35 UTC (permalink / raw)
To: Doug Ledford, Dan Williams, Labun, Marcin,
"Hawrylewicz Czarnowski, Przemyslaw" <przemyslaw.hawrylewicz.c>
Cc: linux-raid, Bill Davidsen
Greetings.
I find myself in the middle of two separate off-list conversations on the
same topic, and it has reached the point where I think the conversations
really need to be united and brought on-list.
So here is my current understanding and thoughts.
The topic is about making rebuild after a failure easier. It strikes me as
particularly relevant after the link Bill Davidsen recently forwarded to the
list:
http://blogs.techrepublic.com.com/opensource/?p=1368
The most significant thing I got from this was a complaint in the comments
that managing md raid was too complex and hence error-prone.
I see the issue as breaking down into two parts.
1/ When a device is hot plugged into the system, is md allowed to use it as
a spare for recovery?
2/ If md has a spare device, what set of arrays can it be used in if needed?
A typical hot plug event will need to address both of these questions in
turn before recovery actually starts.
Part 1.
A newly hotplugged device may have metadata for RAID (0.90, 1.x, IMSM, DDF,
other vendor metadata) or LVM or a filesystem. It might have a partition
table which could be subordinate to or super-ordinate to other metadata.
(i.e. RAID in partitions, or partitions in RAID). The metadata may or may
not be stale. It may or may not match - either strongly or weakly -
metadata on devices in currently active arrays.
A newly hotplugged device also has a "path" which we can see
in /dev/disk/by-path. This is somehow indicative of a physical location.
This path may be the same as the path of a device which was recently
removed. It might be one of a set of paths which make up a "RAID chassis".
It might be one of a set of paths on which we happen to find other RAID
arrays.
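(As an aside, the path of a freshly plugged device can be inspected by
hand; the commands and the path name below are only illustrative:

  # list the physical-location symlinks and the block devices they point to
  ls -l /dev/disk/by-path/
  # resolve one path to its current kernel name, e.g. /dev/sdc
  readlink -f /dev/disk/by-path/pci-0000:00:1f.2-scsi-2:0:0:0

)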
Somehow, from all of that information, we need to decide if md can use the
device without asking, or possibly with a simple yes/no question, and we
need to decide what to actually do with the device.
Options for what to do with the device include:
- write an MBR and partition table, then do something as below with
each partition
- include the device (or partition) in an array that it was previously
part of, but from which it was removed
- include the device or partition as a spare in a native-metadata array.
- add the device as a spare to a vendor-metadata array
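The second and third options above correspond roughly to manual steps like
the following (a hedged sketch only; /dev/sda is assumed to be a surviving
member and /dev/sdb the new device):

  # copy the partition table from a surviving device to the new one
  sfdisk -d /dev/sda | sfdisk /dev/sdb
  # put a partition back into an array it was previously part of
  mdadm /dev/md0 --re-add /dev/sdb1
  # or add it as a fresh spare
  mdadm /dev/md1 --add /dev/sdb2

The question is how much of this can be decided and run automatically.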
Part 2.
If we have a spare device and a degraded array we need to know if it is OK
to add the device as a hot-spare to that array.
Currently this is handled (for native metadata) by 'mdadm --monitor' and
the spare-group tag in mdadm.conf.
For vendor metadata, if the spare is already in the container then mdmon
should handle the spare assignment, but if the spare is in a different
container, 'mdadm --monitor' should move it to the right container, but
doesn't yet.
The "spare-group" functionality works but isn't necessarily the easiest
way to express the configuration desires. People are likely to want to
specify how far a global spare can migrate using physical address: path.
So for example you might specify a group of paths with wildcards with the
implication that all arrays which contain disks from this group of paths
are automatically in the same spare-group.
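For reference, the existing mechanism looks like this in mdadm.conf (the
UUIDs are placeholders); "mdadm --monitor" will only move a spare between
arrays that carry the same spare-group name:

  ARRAY /dev/md0 UUID=aaaaaaaa:bbbbbbbb:cccccccc:dddddddd spare-group=left
  ARRAY /dev/md1 UUID=eeeeeeee:ffffffff:00000000:11111111 spare-group=left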
Configuration and State
I think it is clear that configuration for this should go in mdadm.conf.
This would at least cover identifying groups of devices by path and saying
what is allowed to be done to those devices.
It is possible that some configuration could be determined by inspecting
the hardware directly. e.g. the IMSM code currently looks for an Option
ROM which confirms that the right Intel controller is present and so the
system can boot from the IMSM device. It is possible that other
information could be gained this way, so that the mdadm.conf configuration
would not need to identify paths but could instead identify some
platform-specific concept.
The configuration would have to say what is permitted for hot-plugged
devices: nothing, re-add, claim-bare-only, claim-any-unrecognised
The configuration would also describe mobility of spares across
different device sets.
This would add a new line type to mdadm.conf. e.g.
DOMAIN or CHASSIS or DEDICATED or something else.
The line would identify
some devices by path or platform
a metadata type that is expected here
what hotplug is allowed to do
a spare-group that applies to all arrays which use devices from this
group/domain/chassis/thing
source for MBR? template for partitioning? or would this always
be copied from some other device in the set if hotplug= allowed
partitioning?
State required would include
- where devices have been recently removed from, and what they were in
use for
- which arrays are currently using which device sets, though that can
be determined dynamically from inspecting active arrays.
- ?? copies of the partition tables of any devices that are in use, so that
if one is removed and a new device added, the partition table can be
replaced.
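That last item could be as simple as keeping partition-table dumps
somewhere persistent; a rough sketch, with the directory name purely
illustrative:

  # save the partition table of each array member
  for d in /dev/sda /dev/sdb; do
      sfdisk -d $d > /var/lib/mdadm/$(basename $d).parttab
  done
  # later, restore it onto a replacement plugged into the same slot
  sfdisk /dev/sdc < /var/lib/mdadm/sda.parttab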
Usability
The idea of being able to pull out a device and plug in a replacement and
have it all "just work" is a good one. However I don't want to be too
dependent on state that might have been saved from the old device.
I would like to also be able to point to a new device which didn't exist
before and say "use this". mdadm would use the path information to decide
which container or set of drives was most appropriate, extract the
MBR/partitioning from one of those, impose it on the new device, and include
the device or partitions in the appropriate array.
For RAID over partitions, this assumes a fairly regular configuration: all
devices partitioned the same way, and each array built out of a set of
aligned partitions (e.g. /dev/sd[bcde]2 ).
One of the strengths of md is that you don't have to use such a restricted
configuration, but I think it would be very hard to reliably "do the right
thing" with an irregular set up (e.g. a raid1 over a 1TB device and two 500GB
devices in a raid0).
So I think we should firmly limit the range of configurations for which
auto-magic stuff is done. Vendor metadata is already fairly strongly
defined. We just add a device to the vendor container and let it worry
about the detail. For native metadata we need to draw a firm line.
I think that line should be "all devices partitioned the same" but I
am open to discussion.
If we have "mdadm --use-this-device-however" without needing to know
anything about pre-existing state, then a hot-remove would just need to
record that the device was used by arrays X and Y. Then on hot plug we could
- do nothing
- do something if metadata on device allows
- do use-this-device-however if there was a recent hot-remove of the device
- always do use-this-device-however
depending on configuration.
Implementation
I think we all agree that migrating spares between containers is best done
by "mdadm --monitor". It needs to be enhanced to intuit spare-group names
from "DOMAIN" declarations, and to move spares between vendor containers.
For hot-plug and hot-unplug I prefer to use udev triggers. Plug runs
mdadm --incremental /dev/whatever
which would be extended to do other clever things if allowed.
Unplug would run
mdadm --force-remove /dev/whatever
which finds any arrays containing the device (or partitions?) and
fail/removes them and records the fact with a timestamp.
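A hedged sketch of what such udev rules might look like (the rule file name
is arbitrary, and --force-remove is only proposed here, it does not exist
yet):

  # /etc/udev/rules.d/65-md-hotplug.rules (illustrative)
  ACTION=="add",    SUBSYSTEM=="block", RUN+="/sbin/mdadm --incremental $env{DEVNAME}"
  ACTION=="remove", SUBSYSTEM=="block", RUN+="/sbin/mdadm --force-remove $env{DEVNAME}"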
However if someone has a convincing reason to build this functionality
into "mdadm --monitor" instead, using libudev, I am willing to listen.
Probably the most important first step is to determine a configuration
syntax and be sure it is broad enough to cover all needs.
I'm thinking:
DOMAIN path=glob-pattern metadata=type hotplug=mode spare-group=name
I explicitly have "path=" in case we find there is a need to identify
devices some other way - maybe by controller vendor:device or some other
content-based approach.
The spare-group name is inherited by any array with devices in this
domain, as long as that doesn't result in it having two different
spare-group names.
I'm not sure if "metadata=" is really needed. If all the arrays that use
these devices have the same metadata, it would be redundant to list it here.
If they use different metadata ... then what?
I guess two different DOMAIN lines could identify the same devices, list
different metadata types, and give them different spare-group
names. However you cannot support hotplug of bare devices into both ...
It is possible for multiple DOMAIN lines to identify the same device,
e.g. by having more or less specific patterns. In this case the spare-group
names are ignored if they conflict, and the hotplug mode used is the most
permissive.
hotplug modes are:
none - ignore any hotplugged device
incr - normal incremental assembly (the default). If the device has
metadata that matches an array, try to add it to the array
replace - If the above fails and a device was recently removed from this
same path, add this device to the same array(s) that the old device
was part of
include - If the above fails and the device has no recognisable metadata,
add it to any array/container that uses devices in this domain,
partitioning first if necessary.
force - as above but ignore any pre-existing metadata
I'm not sure that all those are needed, or are the best names. Names like
ignore, reattach, rebuild, rebuild_spare
have also been suggested.
It might be useful to have a 'partition=type' flag to specify MBR or GPT ??
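To make the proposal concrete, a pair of DOMAIN lines under this (not yet
implemented) syntax might look like the following; paths and names are
purely illustrative:

  DOMAIN path=pci-0000:00:1f.2-scsi-* metadata=1.x hotplug=incr spare-group=chassis0
  DOMAIN path=pci-0000:00:1f.2-scsi-4 metadata=1.x hotplug=include spare-group=chassis0 partition=mbr

The first line covers a whole set of hot-swap bays and only allows normal
incremental assembly; the second, more specific line additionally lets a
bare device in one particular bay be partitioned and claimed.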
There, I think that just about covers everything relevant from the various
conversations.
Please feel free to disagree, suggest new use cases, or explain why this
would not work or would not be ideal.
There was a suggestion that more state needs to be stored to support
auto-rebuild (details of each device, so they can be recovered exactly after a
device is pulled and a replacement added). I'm not convinced of this but am
happy to hear more explanations.
Thanks,
NeilBrown
^ permalink raw reply [flat|nested] 33+ messages in thread
* Re: Auto Rebuild on hot-plug
2010-03-25 0:35 Auto Rebuild on hot-plug Neil Brown
@ 2010-03-25 2:47 ` Michael Evans
2010-03-31 1:18 ` Neil Brown
2010-03-25 8:01 ` Luca Berra
` (4 subsequent siblings)
5 siblings, 1 reply; 33+ messages in thread
From: Michael Evans @ 2010-03-25 2:47 UTC (permalink / raw)
To: Neil Brown
Cc: Doug Ledford, Dan Williams, Labun, Marcin,
Hawrylewicz Czarnowski, Przemyslaw, Ciechanowski, Ed, linux-raid,
Bill Davidsen
On Wed, Mar 24, 2010 at 5:35 PM, Neil Brown <neilb@suse.de> wrote:
> [...]
My feeling on the entire subject matter is that this is /not/ an easy
decision. Computers are rarely correct when they guess at what an
administrator wants, and attempting to implement the functionality
within mdadm is prone to many limitations or to re-inventing the wheel.
If mdadm / mdmon is part of the process at all, I think it should be
used to fork an executable (script or otherwise) which invokes
the administrative actions that have been pre-determined.
I believe that the default action should be to do /nothing/. That is
the only safe thing to do. If an administrative framework is desired,
that seems to fall under a larger project goal which is likely better
covered by programs more aware of the overall system state. This
route also allows for a range of scalability.
It may be sufficient in an initramfs context to either spawn a shell
or even just wait in a recovery console after the mdadm invocation
returns failure. It might also be desired to use a very simple
reaction which assumes any spare of sufficient size which is added
should be allocated to the largest or closest comparable area based on
pre-determined preferences.
At the same time, I could see the value in mapping actual physical
locations to an array, remembering any missing or failed device
layouts, and re-creating the same layouts on the new device. However
those actions are a little above the level at which mdadm should be operating.
With both of those viewpoints I see the following solution.
The most specific action match is followed.
Action-matches should be restrictable by path wildcard, simple size
comparisons, AND metadata state.
As a final deciding factor action-matches should also have an optional
priority value, so that when all else matches one rule out of a set
will be known to run first.
The result of matching an action, once again, should be an external
program or shell to allow for maximum flexibility.
I am not at all opposed to adding good default choices for those
actions in either binary or shell script form.
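As a sketch of the idea (the hook name and interface are invented here,
not an existing mdadm feature), a matched action could simply hand the
device over to an administrator-chosen script:

  #!/bin/sh
  # /etc/mdadm/hotplug-action.sh DEVICE  -- hypothetical hook script
  # All policy lives here, not in mdadm; this default only logs the event.
  DEV="$1"
  logger -t md-hotplug "device $DEV appeared; no automatic action configured"
  # An admin wanting more could add e.g.:  mdadm --incremental "$DEV"

mdadm (or a udev rule) would then only be responsible for deciding which
such script, if any, to run.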
^ permalink raw reply [flat|nested] 33+ messages in thread
* Re: Auto Rebuild on hot-plug
2010-03-25 0:35 Auto Rebuild on hot-plug Neil Brown
2010-03-25 2:47 ` Michael Evans
@ 2010-03-25 8:01 ` Luca Berra
2010-03-31 1:26 ` Neil Brown
2010-03-25 14:10 ` John Robinson
` (3 subsequent siblings)
5 siblings, 1 reply; 33+ messages in thread
From: Luca Berra @ 2010-03-25 8:01 UTC (permalink / raw)
To: linux-raid
On Thu, Mar 25, 2010 at 11:35:43AM +1100, Neil Brown wrote:
>
> http://blogs.techrepublic.com.com/opensource/?p=1368
>
> The most significant thing I got from this was a complain in the comments
> that managing md raid was too complex and hence error-prone.
Well, I would not be upset by J. Random Jerk complaining in blog
comments; as soon as you make it one click, you will find another one
who complains because it is not his favourite colour :P
> I see the issue as breaking down in to two parts.
> 1/ When a device is hot plugged into the system, is md allowed to use it as
> a spare for recovery?
> 2/ If md has a spare device, what set of arrays can it be used in if needed.
>
> A typical hot plug event will need to address both of these questions in
> turn before recovery actually starts.
>
> Part 1.
>
> A newly hotplugged device may have metadata for RAID (0.90, 1.x, IMSM, DDF,
> other vendor metadata) or LVM or a filesystem. It might have a partition
> table which could be subordinate to or super-ordinate to other metadata.
> (i.e. RAID in partitions, or partitions in RAID). The metadata may or may
> not be stale. It may or may not match - either strongly or weakly -
> metadata on devices in currently active arrays.
Also, the newly hotplugged device may have _data_ on it.
> Some how from all of that information we need to decide if md can use the
> device without asking, or possibly with a simple yes/no question, and we
> need to decide what to actually do with the device.
How does the yes/no question part work?
> Options for what to do with the device include:
> - write an MBR and partition table, then do something as below with
> each partition
> - include the device (or partition) in an array that it was previously
> part of, but from which it was removed
> - include the device or partition as a spare in a native-metadata array.
> - add the device as a spare to a vendor-metadata array
I really feel there is much room for causing disasters with such an
approach.
The main difference from a hw raid controller is that the hw raid
controller _requires_ full control of the individual disks.
MD does not. Trying to do things automatically without full control is
very dangerous.
This may be different when using DDF or IMSM, since they usually work
on whole drives attached to a raid-like controller (even if one
of the strengths of md is being able to activate those arrays even
without the original controller).
If you want to be user-friendly just add a simple script
/usr/bin/md-replace-drive
It will take as input either an md array or a working drive as source,
and the new drive as target.
In the first case it has to examine the components of the source md and
determine whether they are partitions or whole devices (via sysfs); if they
are partitions, find the whole drives and ensure they are partitioned in
the same way.
In the second case it will examine the source drive for partitions and all
md arrays it is part of. It will ensure that those arrays have a failed device,
check the size of the components and match them to the new drive (no
sense replacing a 1TB drive with a 750GB one),
ask the user for confirmation in big understandable letters, then
replicate any MBR and partition table and include the device (or all
newly created partitions) in the relevant md device.
An improvement would be not needing the user to specify a source in the most
simple of cases, by checking for all arrays with a failed device.
We could also make /usr/bin/md-create-spare ...
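Something along these lines, perhaps; a rough sketch only (the script name
is the hypothetical one above), and a real version would need far more
validation:

  #!/bin/sh
  # md-replace-drive SRC_DISK NEW_DISK  -- hypothetical helper
  SRC="$1"; NEW="$2"
  # refuse to replace with a smaller drive (sizes in 512-byte sectors)
  if [ "$(blockdev --getsz "$NEW")" -lt "$(blockdev --getsz "$SRC")" ]; then
      echo "$NEW is smaller than $SRC, aborting" >&2
      exit 1
  fi
  echo "This will copy the partition table of $SRC onto $NEW and add the"
  echo "new partitions to the md arrays that $SRC is a member of."
  printf 'Type YES to continue: '
  read answer
  [ "$answer" = "YES" ] || exit 1
  # replicate the partition table onto the new drive
  sfdisk -d "$SRC" | sfdisk "$NEW"
  # for each source partition, find the md array it belongs to and add
  # the corresponding new partition to that array
  for part in "$SRC"[0-9]*; do
      num=${part#"$SRC"}
      md=$(awk -v p="$(basename "$part")" '$0 ~ p"\\[" {print "/dev/"$1}' /proc/mdstat)
      [ -n "$md" ] && mdadm "$md" --add "${NEW}${num}"
  done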
> Part 2.
>
makes sense
--
Luca Berra -- bluca@comedia.it
Communication Media & Services S.r.l.
/"\
\ / ASCII RIBBON CAMPAIGN
X AGAINST HTML MAIL
/ \
^ permalink raw reply [flat|nested] 33+ messages in thread
* Re: Auto Rebuild on hot-plug
2010-03-25 0:35 Auto Rebuild on hot-plug Neil Brown
2010-03-25 2:47 ` Michael Evans
2010-03-25 8:01 ` Luca Berra
@ 2010-03-25 14:10 ` John Robinson
2010-03-31 1:30 ` Neil Brown
2010-03-25 15:04 ` Labun, Marcin
` (2 subsequent siblings)
5 siblings, 1 reply; 33+ messages in thread
From: John Robinson @ 2010-03-25 14:10 UTC (permalink / raw)
To: Neil Brown
Cc: Doug Ledford, Dan Williams, Labun, Marcin,
Hawrylewicz Czarnowski, Przemyslaw, Ciechanowski, Ed, linux-raid,
Bill Davidsen
On 25/03/2010 00:35, Neil Brown wrote:
> Greetings.
> I find myself in the middle of two separate off-list conversations on the
> same topic and it has reached the point where I think the conversations
> really need to be unite and brought on-list.
>
> So here is my current understanding and thoughts.
>
> The topic is about making rebuild after a failure easier. It strikes me as
> particularly relevant after the link Bill Davidsen recently forwards to the
> list:
>
> http://blogs.techrepublic.com.com/opensource/?p=1368
>
> The most significant thing I got from this was a complain in the comments
> that managing md raid was too complex and hence error-prone.
>
> I see the issue as breaking down in to two parts.
> 1/ When a device is hot plugged into the system, is md allowed to use it as
> a spare for recovery?
> 2/ If md has a spare device, what set of arrays can it be used in if needed.
>
> A typical hot plug event will need to address both of these questions in
> turn before recovery actually starts.
>
> Part 1.
>
> A newly hotplugged device may have metadata for RAID (0.90, 1.x, IMSM, DDF,
> other vendor metadata) or LVM or a filesystem. It might have a partition
> table which could be subordinate to or super-ordinate to other metadata.
> (i.e. RAID in partitions, or partitions in RAID). The metadata may or may
> not be stale. It may or may not match - either strongly or weakly -
> metadata on devices in currently active arrays.
Or indeed it may have no metadata at all - it may be a fresh disc. I
didn't see that you stated this specifically at any point, though it was
there by implication, so I will: you're going to have to pick up hotplug
events for bare drives, which presumably means you'll also get events
for CD-ROM drives, USB sticks, printers with media card slots in them etc.
> A newly hotplugged device also has a "path" which we can see
> in /dev/disk/by-path. This is somehow indicative of a physical location.
> This path may be the same as the path of a device which was recently
> removed. It might be one of a set of paths which make up a "RAID chassis".
> It might be one of a set of paths one which we happen to find other RAID
> arrays.
Indeed, I would like to be able to declare any
/dev/disk/by-path/pci-0000:00:1f.2-scsi-[0-4] to be suitable candidates
for hot-plugging, because those are the 5 motherboard SATA ports I've
hooked into my hot-swap chassis.
As an aside, I just tried yanking and replugging one of my drives, on
CentOS 5.4, and it successfully went away and came back again, but
wasn't automatically re-added, even though the metadata etc was all there.
> Some how from all of that information we need to decide if md can use the
> device without asking, or possibly with a simple yes/no question, and we
> need to decide what to actually do with the device.
>
> Options for what to do with the device include:
> - write an MBR and partition table, then do something as below with
> each partition
Definitely want this for bare drives. In my case I'd like the MBR and
first 62 sectors copied from one of the live drives, or a copy saved for
the purpose, so the disc can be bootable.
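(Something like the following would do that by hand today; a sketch only,
and destructive if pointed at the wrong device:

  # copy the MBR plus the next 62 sectors (boot loader area, including the
  # partition table) from a live drive to the replacement
  dd if=/dev/sda of=/dev/sdb bs=512 count=63
  # make the kernel re-read the new partition table
  blockdev --rereadpt /dev/sdb

)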
My concern is that this is surely outwith the regular scope of
mdadm/mdmon, as is handling bare drives/CD-ROMs/USB sticks. Do we need
another mdadm companion rather than an addition?
> - include the device (or partition) in an array that it was previously
> part of, but from which it was removed
Definitely, just so I can pull a drive and plug it in again and point
and say ooh, everything's up and running again, to demonstrate how cool
Linux md is. I imagine some distros' udev/hotplug rules do this already,
almost by default where they assemble arrays incrementally.
> - include the device or partition as a spare in a native-metadata array.
I think in my situation I'd quite like the first partition, type fd
metadata 0.90 RAID-1 mounted as /boot, added as an active mirror not a
spare, again so that if this new drive appears as sda at the next power
cycle, the system will boot.
The second partition, a RAID-5 with LVM on it, could be added as a
spare, because it would then automatically be rebuilt onto if the array
was degraded.
> Part 2.
[...]
I'm afraid I have nothing to add here, it all sounds good.
Cheers,
John.
^ permalink raw reply [flat|nested] 33+ messages in thread
* RE: Auto Rebuild on hot-plug
2010-03-25 0:35 Auto Rebuild on hot-plug Neil Brown
` (2 preceding siblings ...)
2010-03-25 14:10 ` John Robinson
@ 2010-03-25 15:04 ` Labun, Marcin
2010-03-27 0:37 ` Dan Williams
2010-03-26 6:41 ` linbloke
2010-03-26 7:52 ` Majed B.
5 siblings, 1 reply; 33+ messages in thread
From: Labun, Marcin @ 2010-03-25 15:04 UTC (permalink / raw)
To: Neil Brown, Doug Ledford, Williams, Dan J,
"Hawrylewicz Czarnowski, Przemyslaw" <przemyslaw.hawrylewicz.czarnowski@
Cc: linux-raid@vger.kernel.org, Bill Davidsen
Thank you for bringing up this design.
>
> Probably the most important first step is to determine a configuration
> syntax and be sure it is broad enough to cover all needs.
>
> I'm thinking:
> DOMAIN path=glob-pattern metadata=type hotplug=mode spare-group=name
>
> I explicitly have "path=" in case we find there is a need to identify
> devices some other way - maybe by control vendor:device or some other
> content-based approach
> The spare-group name is inherited by any array with devices in this
> domain as long as that doesn't result it in having two different
> spare-group names.
> I'm not sure if "metadata=" is really needed. If all the arrays that
> use
> these devices have the same metadata, it would be redundant to list it
> here.
I think that the metadata keyword can be used to identify the scope of devices to which the DOMAIN line applies.
For instance we could have:
DOMAIN path=glob-pattern metadata=imsm hotplug=mode1 spare-group=name1
DOMAIN path=glob-pattern metadata=0.90 hotplug=mode2 spare-group=name2
Keywords:
Path, metadata and spare-group shall define to which arrays the hotplug definition (or other definition of action) applies. The user could define any subset of them.
For instance, to define that all imsm arrays shall use hotplug mode2, the user would define:
DOMAIN metadata=imsm hotplug=mode2
In the above example the user need not define spare-group in his/her configuration file for each array.
I also assume that each metadata handler can additionally set its own rules for accepting a spare into the container. Rules can be derived from platform dependencies or metadata. Notice that the user can disable platform-specific constraints by defining the IMSM_NO_PLATFORM environment variable.
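For example (illustrative only), the platform check can be bypassed for
testing when creating an IMSM container:

  IMSM_NO_PLATFORM=1 mdadm --create /dev/md/imsm0 --metadata=imsm \
      --raid-devices=2 /dev/sdb /dev/sdc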
> If they use different metadata ... then what?
> I guess two different DOMAIN lines could identify the same devices and
> list different metadata types and given them different spare-group
> names. However you cannot support hotplug of bare devices into both ...
>
> If it possible for multiple DOMAIN lines to identify the same device,
> e.g. by having more or less specific patterns. In this case the spare-
> group
> names are ignored if they conflict, and the hotplug mode used is the
> most
> permissive.
Maybe use the most specific match?
>
> hotplug modes are:
> none - ignore any hotplugged device
> incr - normal incremental assembly (the default). If the device has
> metadata that matches an array, try to add it to the array
> replace - If above fails and a device was recently removed from this
> same path, add this device to the same array(s) that the old
> devices
> was part of
> include - If the above fails and the device has not recognisable
> metadata
> add it to any array/container that uses devices in this domain,
> partitioning first if necessary.
> force - as above but ignore any pre-existing metadata
>
>
> I'm not sure that all those are needed, or are the best names. Names
> like
> ignore, reattach, rebuild, rebuild_spare
> have also been suggested.
Please consider:
spare_add - add any spare device that matches the metadata container/volume (in the case of native metadata, regardless of array state), so that later such a spare can be used in the rebuild process.
Can we assume for all external metadata that spares added to any container can potentially be moved between all containers of the same metadata?
I expect that this could be the default behavior if no spare groups are defined for some metadata.
Moreover, each metadata handler could impose built-in rules on spare assignment to a specific container.
Thanks,
Marcin Labun
^ permalink raw reply [flat|nested] 33+ messages in thread
* Re: Auto Rebuild on hot-plug
2010-03-25 0:35 Auto Rebuild on hot-plug Neil Brown
` (3 preceding siblings ...)
2010-03-25 15:04 ` Labun, Marcin
@ 2010-03-26 6:41 ` linbloke
2010-03-31 1:35 ` Neil Brown
2010-03-26 7:52 ` Majed B.
5 siblings, 1 reply; 33+ messages in thread
From: linbloke @ 2010-03-26 6:41 UTC (permalink / raw)
To: Neil Brown; +Cc: linux-raid
Neil Brown wrote:
> [...]
Hi Neil,
I look forward to being able to update my mdadm.conf with the paths to
devices that are important to my RAID, so that if a fault were to develop
on an array I could fail and remove the faulty
device, insert a blank device of sufficient size into the defined path,
and have the RAID restore automatically. If the disk is not blank or is too
small, provide a useful error message (insert disk of larger capacity, delete
partitions, zero superblocks) and exit. I think you do an amazing job,
and it worries me that you and the other contributors to mdadm could
spend your valuable time trying to solve problems about how to cater for
every metadata and partition type etc. when a simple blank device is easy to
achieve and could then "Auto Rebuild on hot-plug".
Perhaps as we nominate a spare disk, we could nominate a spare path. I'm
certainly no expert and my use case is simple (raid 1's and 10's), but it
seems to me a lot of complexity can be avoided for the sake of a blank disk.
Cheers,
Josh
^ permalink raw reply [flat|nested] 33+ messages in thread
* Re: Auto Rebuild on hot-plug
2010-03-25 0:35 Auto Rebuild on hot-plug Neil Brown
` (4 preceding siblings ...)
2010-03-26 6:41 ` linbloke
@ 2010-03-26 7:52 ` Majed B.
2010-03-31 1:42 ` Neil Brown
5 siblings, 1 reply; 33+ messages in thread
From: Majed B. @ 2010-03-26 7:52 UTC (permalink / raw)
To: Neil Brown
Cc: Doug Ledford, Dan Williams, Labun, Marcin,
Hawrylewicz Czarnowski, Przemyslaw, Ciechanowski, Ed, linux-raid,
Bill Davidsen
Why not treat this similarly to how hardware RAID manages disks & spares?
Disk has no metadata -> new -> use as spare.
Disk has metadata -> array exists -> add to array.
Disk has metadata -> array doesn't exist (disk came from another
system) -> sit idle & wait for an admin to do the work.
As to identifying disks and knowing which disks were removed and put back into
an array, there's the metadata & there's the disk's serial number,
which can be obtained using hdparm. I also think that all disks now
include a World Wide Number (WWN), which is more suitable for use in
this case than a disk's serial number.
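For instance (illustrative commands):

  # read the drive's serial number
  hdparm -I /dev/sdb | grep 'Serial Number'
  # WWN-based persistent names, where udev provides them
  ls -l /dev/disk/by-id/wwn-*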
Some people rant because they see things only from their own
perspective and assume that there's no case or scenario but their own.
So don't pay too much attention :p
Here's a scenario: what if I had an existing RAID1 array of 3 disks? I
bought a new disk and I wanted to make a new array in the system. So I
add the new disk, and I want to use one of the RAID1 array disks in
this new array.
Being lazy, instead of failing the disk then removing it using the
console, I just removed it from the port then added it again. I
certainly don't want mdadm to start resyncing, forcing me to wait!
As you can see in this scenario, it includes the situation where an
admin is a lazy bum who is going to use the command line anyway to
make the new array but didn't bother to properly remove the disk he
wanted. And there's the case of the newly added disk.
Why assume things & guess when an admin should know what to do?
I certainly don't want to risk my arrays in mdadm guessing for me. And
keep one thing in mind: How often do people interact with storage
systems?
If I configure mdadm today, the next time I may want to add or replace a
disk could be a year later. I certainly would have forgotten whatever
configuration was there! And depending on the situation I have, I
certainly wouldn't want mdadm to guess.
On Thu, Mar 25, 2010 at 3:35 AM, Neil Brown <neilb@suse.de> wrote:
>
> Greetings.
> I find myself in the middle of two separate off-list conversations on the
> same topic and it has reached the point where I think the conversations
> really need to be unite and brought on-list.
>
> So here is my current understanding and thoughts.
>
> The topic is about making rebuild after a failure easier. It strikes me as
> particularly relevant after the link Bill Davidsen recently forwards to the
> list:
>
> http://blogs.techrepublic.com.com/opensource/?p=1368
>
> The most significant thing I got from this was a complain in the comments
> that managing md raid was too complex and hence error-prone.
>
> I see the issue as breaking down in to two parts.
> 1/ When a device is hot plugged into the system, is md allowed to use it as
> a spare for recovery?
> 2/ If md has a spare device, what set of arrays can it be used in if needed.
>
> A typical hot plug event will need to address both of these questions in
> turn before recovery actually starts.
>
> Part 1.
>
> A newly hotplugged device may have metadata for RAID (0.90, 1.x, IMSM, DDF,
> other vendor metadata) or LVM or a filesystem. It might have a partition
> table which could be subordinate to or super-ordinate to other metadata.
> (i.e. RAID in partitions, or partitions in RAID). The metadata may or may
> not be stale. It may or may not match - either strongly or weakly -
> metadata on devices in currently active arrays.
>
> A newly hotplugged device also has a "path" which we can see
> in /dev/disk/by-path. This is somehow indicative of a physical location.
> This path may be the same as the path of a device which was recently
> removed. It might be one of a set of paths which make up a "RAID chassis".
> It might be one of a set of paths one which we happen to find other RAID
> arrays.
>
> Some how from all of that information we need to decide if md can use the
> device without asking, or possibly with a simple yes/no question, and we
> need to decide what to actually do with the device.
>
> Options for what to do with the device include:
> - write an MBR and partition table, then do something as below with
> each partition
> - include the device (or partition) in an array that it was previously
> part of, but from which it was removed
> - include the device or partition as a spare in a native-metadata array.
> - add the device as a spare to a vendor-metadata array
>
> Part 2.
>
> If we have a spare device and a degraded array we need to know if it is OK
> to add the device as a hot-spare to that array.
> Currently this is handled (for native metadata) by 'mdadm --monitor' and
> the spare-groups tag in mdadm.conf.
> For vendor metadata, if the spare is already in the container then mdmon
> should handle the spare assignment, but if the spare is in a different
> container, 'mdadm --monitor' should move it to the right container, but
> doesn't yet.
>
> The "spare-group" functionality works but isn't necessarily the easiest
> way to express the configuration desires. People are likely to want to
> specify how far a global spare can migrate using physical address: path.
>
> So for example you might specify a group of paths with wildcards with the
> implication that all arrays which contain disks from this group of paths
> are automatically in the same spare-group.
>
>
> Configuration and State
>
> I think it is clear that configuration for this should go in mdadm.conf.
> This would at least cover identifying groups of device by path and ways
> what is allowed to be done to those devices.
> It is possible that some configuration could be determined by inspecting
> the hardware directly. e.g. the IMSM code currently looks for an Option
> ROM show confirms that the right Intel controller is present and so the
> system can boot from the IMSM device. It is possible that other
> information could be gained this way so that the mdadm.conf configuration
> would not need to identify paths but alternately identify some
> platform-specific concept.
>
> The configuration would have to say what is permitted for hot-plugged
> devices: nothing, re-add, claim-bare-only, claim-any-unrecognised
> The configuration would also describe mobility of spares across
> different device sets.
>
> This would add a new line type to mdadm.conf. e.g.
> DOMAIN or CHASSIS or DEDICATED or something else.
> The line would identify
> some devices by path or platform
> a metadata type that is expected here
> what hotplug is allows to do
> a spare-group that applies to all array which use devices from this
> group/domain/chassis/thing
> source for MBR? template for partitioning? or would this always
> be copied from some other device in the set if hotplug= allowed
> partitioning?
>
> State required would include
> - where devices have been recently removed from, and what they were in
> use for
> - which arrays are currently using which device sets, though that can
> be determined dynamically from inspecting active arrays.
> - ?? partition tables off any devices that are in use so if they are
> removed and an new device added the partition table can be
> replaced.
>
> Usability
>
> The idea of being able to pull out a device and plug in a replacement and
> have it all "just work" is a good one. However I don't want to be too
> dependent on state that might have been saved from the old device.
> I would like to also be able to point to a new device which didn't exist
> before and say "use this". mdadm would use the path information to decide
> which contain or set of drives was most appropriate, extract
> MBR/partitioning from one of those, impose it on the new device and include
> the device or partitions in the appropriate array.
>
> For RAID over partitions, this assumes a fairly regular configuration: all
> devices partitioned the same way, and each array build out of a set of
> aligned partitions (e.g. /dev/sd[bcde]2 ).
> One of the strength of md is that you don't have to use such a restricted
> configuration, but I think it would be very hard to reliably "to the right
> thing" with an irregular set up (e.g. a raid1 over a 1T device and 2 500GB
> devices in a raid0).
>
> So I think we should firmly limit the range of configurations for which
> auto-magic stuff is done. Vendor metadata is already fairly strongly
> defined. We just add a device to the vendor container and let it worry
> about the detail. For native metadata we need to draw a firm line.
> I think that line should be "all devices partitioned the same" but I
> am open to discussion.
>
> If we have "mdadm --use-this-device-however" without needing to know
> anything about pre-existing state, then a hot-remove would just need to
> record that the device was used by arrays X and Y. Then on hot plug we could
> - do nothing
> - do something if metadata on device allows
> - do use-this-device-however if there was a recent hot-remove of the device
> - always do use-this-device-however
> depending on configuration.
>
> Implementation
>
> I think we all agree that migrating spares between containers is best done
> by "mdadm --monitor". It needs to be enhanced to intuit spare-group names
> from "DOMAIN" declarations, and to move spares between vendor containers.
>
> For hot-plug and hot-unplug I prefer to use udev triggers. Plug runs
> mdadm --incremental /dev/whatever
> which would be extended to do other clever things if allowed.
> Unplug would run
> mdadm --force-remove /dev/whatever
> which finds any arrays containing the device (or its partitions?),
> fails/removes them, and records the fact with a timestamp.
>
> However if someone has a convincing reason to build this functionality
> into "mdadm --monitor" instead, using libudev, I am willing to listen.
>
> Probably the most important first step is to determine a configuration
> syntax and be sure it is broad enough to cover all needs.
>
> I'm thinking:
> DOMAIN path=glob-pattern metadata=type hotplug=mode spare-group=name
>
> I explicitly have "path=" in case we find there is a need to identify
> devices some other way - maybe by control vendor:device or some other
> content-based approach
> The spare-group name is inherited by any array with devices in this
> domain as long as that doesn't result it in having two different
> spare-group names.
> I'm not sure if "metadata=" is really needed. If all the arrays that use
> these devices have the same metadata, it would be redundant to list it here.
> If they use different metadata ... then what?
> I guess two different DOMAIN lines could identify the same devices and
> list different metadata types and give them different spare-group
> names. However you cannot support hotplug of bare devices into both ...
>
> It is possible for multiple DOMAIN lines to identify the same device,
> e.g. by having more or less specific patterns. In this case the spare-group
> names are ignored if they conflict, and the hotplug mode used is the most
> permissive.
>
> hotplug modes are:
> none - ignore any hotplugged device
> incr - normal incremental assembly (the default). If the device has
> metadata that matches an array, try to add it to the array
> replace - If the above fails and a device was recently removed from this
> same path, add this device to the same array(s) that the old device
> was part of
> include - If the above fails and the device has no recognisable metadata,
> add it to any array/container that uses devices in this domain,
> partitioning first if necessary.
> force - as above but ignore any pre-existing metadata
>
>
> I'm not sure that all those are needed, or are the best names. Names like
> ignore, reattach, rebuild, rebuild_spare
> have also been suggested.
>
> It might be useful to have a 'partition=type' flag to specify MBR or GPT ??
>
>
> There, I think that just about covers everything relevant from the various
> conversations.
> Please feel free to disagree or suggest new use cases or explain why this
> would not work or would not be ideal.
> There was a suggestion that more state needed to be stored to support
> auto-rebuild (details of each device so they can be recovered exactly after a
> device is pulled and a replacement added). I'm not convinced of this but am
> happy to hear more explanations.
>
> Thanks,
> NeilBrown
> --
> To unsubscribe from this list: send the line "unsubscribe linux-raid" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at http://vger.kernel.org/majordomo-info.html
>
--
Majed B.
--
To unsubscribe from this list: send the line "unsubscribe linux-raid" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
^ permalink raw reply [flat|nested] 33+ messages in thread
* Re: Auto Rebuild on hot-plug
2010-03-25 15:04 ` Labun, Marcin
@ 2010-03-27 0:37 ` Dan Williams
2010-03-29 18:10 ` Doug Ledford
0 siblings, 1 reply; 33+ messages in thread
From: Dan Williams @ 2010-03-27 0:37 UTC (permalink / raw)
To: Labun, Marcin
Cc: Neil Brown, Doug Ledford, Hawrylewicz Czarnowski, Przemyslaw,
Ciechanowski, Ed, linux-raid@vger.kernel.org, Bill Davidsen
On Thu, Mar 25, 2010 at 8:04 AM, Labun, Marcin <Marcin.Labun@intel.com> wrote:
> I think that the metadata keyword can be used to identify the scope of devices to which the DOMAIN line applies.
> For instance we could have:
> DOMAIN path=glob-pattern metadata=imsm hotplug=mode1 spare-group=name1
> DOMAIN path=glob-pattern metadata=0.90 hotplug=mode2 spare-group=name2
>
> Keywords:
> Path, metadata and spare-group shall define to which arrays the hotplug definition (or other definition of action) applies. The user could define any subset of them.
> For instance, to define that all imsm arrays shall use hotplug mode2 the user would define:
> DOMAIN metadata=imsm hotplug=mode2
>
> In the above example the user need not define spare-group in his/her configuration file for each array.
>
> I also assume that each metadata handler can additionally set its own rules for accepting a spare into the container. Rules can be derived from platform dependencies or metadata. Notice that the user can disable platform-specific constraints by defining the IMSM_NO_PLATFORM environment variable.
>
For the 'platform' case we could automate some decisions, but I think
I would rather extend the --detail-platform option to dump the
recommended/compatible DOMAIN entries for the platform, perhaps via
the --brief modifier. This mirrors what can be done with --examine
--brief to generate an initial configuration file that can be modified
to taste.
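For comparison, a rough sketch of the two flows; the first command exists
today, while DOMAIN output from the second is only the proposal here:
  # today: capture ARRAY lines as a starting configuration
  mdadm --examine --scan --brief >> /etc/mdadm.conf
  # proposed: capture recommended DOMAIN lines for the platform
  mdadm --detail-platform --brief >> /etc/mdadm.conf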
>>
>> hotplug modes are:
>> none - ignore any hotplugged device
>> incr - normal incremental assembly (the default). If the device has
>> metadata that matches an array, try to add it to the array
>> replace - If above fails and a device was recently removed from this
>> same path, add this device to the same array(s) that the old
>> devices
>> was part of
>> include - If the above fails and the device has not recognisable
>> metadata
>> add it to any array/container that uses devices in this domain,
>> partitioning first if necessary.
>> force - as above but ignore any pre-existing metadata
>>
>>
>> I'm not sure that all those are needed, or are the best names. Names
>> like
>> ignore, reattach, rebuild, rebuild_spare
>> have also been suggested.
>
> Please consider:
> spare_add - add any spare device that matches the metadata container (or volume, in the case of native metadata) regardless of array state, so that such a spare can later be used in the rebuild process.
This is the same as 'incr' above. If the device has metadata and
hotplug is enabled, auto-incorporate the device.
> Can we assume for all external metadata that spares added to any container can potentially be moved between all containers with the same metadata?
Yes, that can be the default action, and the spare-group keyword can
be specified to override.
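As an illustration of such an override (syntax and paths hypothetical),
two DOMAIN lines could pin imsm spares to separate pools rather than
letting them roam between all imsm containers:
  DOMAIN path=pci-0000:00:1f.2-* metadata=imsm spare-group=onboard
  DOMAIN path=pci-0000:03:00.0-* metadata=imsm spare-group=external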
--
Dan
--
To unsubscribe from this list: send the line "unsubscribe linux-raid" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
^ permalink raw reply [flat|nested] 33+ messages in thread
* Re: Auto Rebuild on hot-plug
2010-03-27 0:37 ` Dan Williams
@ 2010-03-29 18:10 ` Doug Ledford
2010-03-29 18:36 ` John Robinson
2010-03-29 21:36 ` Dan Williams
0 siblings, 2 replies; 33+ messages in thread
From: Doug Ledford @ 2010-03-29 18:10 UTC (permalink / raw)
To: Dan Williams
Cc: Labun, Marcin, Neil Brown, Hawrylewicz Czarnowski, Przemyslaw,
Ciechanowski, Ed, linux-raid@vger.kernel.org, Bill Davidsen
[-- Attachment #1: Type: text/plain, Size: 9042 bytes --]
On 03/26/2010 08:37 PM, Dan Williams wrote:
> On Thu, Mar 25, 2010 at 8:04 AM, Labun, Marcin <Marcin.Labun@intel.com> wrote:
>> I think that metadata keyword can be used to identify scope of devices to which the DOMAIN line applies.
>> For instance we could have:
>> DOMAIN path=glob-pattern metadata=imsm hotplug=mode1 spare-group=name1
>> DOMAIN path=glob-pattern metadata=0.90 hotplug=mode2 spare-group=name2
>>
>> Keywords:
>> Path, metadata and spare-group shall define to which arrays the hotplug definition (or other definition of action) applies. User could define any subset of it.
>> For instance to define that all imsm arrays shall use hotplug mode2 user shall define:
>> DOMAIN metadata=imsm hotplug=mode2
>>
>> In above example user need not define spare-group in his/her configuration file for each array.
>>
>> I also assume that each metadata handler can additionally sets its own rules of accepting the spare in the container. Rules can be derived from platform dependencies or metadata. Notice that user can disable platform specific constrains by defining IMSM_NO_PLATFORM environment variable.
>>
>
> For the 'platform' case we could automate some decisions, but I think
> I would rather extend the --detail-platform option to dump the
> recommended/compatible DOMAIN entries for the platform, perhaps via
> the --brief modifier. This mirrors what can be done with --examine
> --brief to generate an initial configuration file that can be modified
> to taste.
So, a few things that I think can be said about the DOMAIN line type
(I'm assuming for now that this is what we'll use, mainly because I'm
implementing it right now):
There is an assumed, default DOMAIN line that is the equivalent of:
DOMAIN path=* metadata=* action=incremental spare-group=<none>
This is what you get simply by normal udev incremental assembly rules
(notice I used action instead of hotplug; action makes more sense to me
as all the words we use to define the hotplug mode are in fact actions to
take on hotplug). We will treat this as a given. Anything else
requires an explicit DOMAIN line in mdadm.conf.
The second thing I'm having a hard time with is the spare-group. To be
honest, if I follow what I think I should, and make it a hard
requirement that any action other than none and incremental must use a
non-global path glob (aka, path= MUST be present and can not be *), then
spare-group loses all meaning. I say this because if a disk matches
the path glob it is already in a specific spare group (the one that this
DOMAIN represents), and likewise if arrays are on disks in this DOMAIN, then
they are automatically part of the same spare-group. In other words, I
think spare-group becomes entirely redundant once we have a DOMAIN keyword.
I'm also having a hard time justifying the existence of the metadata
keyword. The reason is that the metadata is already determined for us
by the path glob. Specifically, if we assume that an array's members
can not cross domain boundaries (a reasonable requirement in my opinion,
we can't make an array where we can guarantee to the user that hot
plugging a replacement disk will do what they expect if some of the
array's members are inside the domain and some are outside the domain),
then we should only ever need the metadata keyword if we are mixing
metadata types within this domain. Well, we can always narrow down the
domain if we are doing something like the first three sata disks on an
Intel Matrix RAID controller as imsm and the last three as jbod with
version 1.x metadata by putting the first half in one domain and the
second half in another. And this would be the right thing to do versus
trying to cover both in one domain. That means that only if we ever
mixed imsm/ddf and md native raid types on a single disk would we be
unable to narrow down the domain properly, and I'm not sure we care to
support this. So, that leaves us back to not really needing the
metadata keyword as the disks present in the path spec glob should be
uniform in the metadata type and we should be able to simply use the
right metadata from that.
>>> hotplug modes are:
>>> none - ignore any hotplugged device
>>> incr - normal incremental assembly (the default). If the device has
>>> metadata that matches an array, try to add it to the array
>>> replace - If above fails and a device was recently removed from this
>>> same path, add this device to the same array(s) that the old
>>> devices
>>> was part of
>>> include - If the above fails and the device has not recognisable
>>> metadata
>>> add it to any array/container that uses devices in this domain,
>>> partitioning first if necessary.
>>> force - as above but ignore any pre-existing metadata
>>>
>>>
>>> I'm not sure that all those are needed, or are the best names. Names
>>> like
>>> ignore, reattach, rebuild, rebuild_spare
>>> have also been suggested.
>>
>> Please consider:
>> spare_add - add any spare device that matches the metadata container/volume in case of native metadata regardless of array state, so later such a spare can be used in rebuild process.
>
> This is the same as 'incr' above. If the device has metadata and
> hotplug is enabled, auto-incorporate the device.
So my preferred and suggested words for the action item are as follows
(Note: there are two classes of actions, things we do when presented
with a disk and we have a degraded array, and things we do when
presented with a disk and all arrays in the domain are fully up to date,
which implies this is a new disk in the domain and not replacing a
faulty disk in the domain, which implies the domain wasn't previously
full up...it might be worth having two keywords in the DOMAIN line to
separate these two items, but I'm going to argue a bit later that we
really don't care about the second option and so maybe not):
none
incremental - what we have now, and the default
readd - if incremental didn't work but the device is supposed to be part
of the array, then attempt mdadm's --re-add option; this would
allow a sysadmin to unplug and replug a device from an array if it got
kicked for some reason, and the system would attempt to reinsert it into
the array with minimal rebuild, but it would not attempt to use any
device that was hot plugged that didn't previously belong to the array
safe_use - if the new drive is currently bare and we have a degraded
array, assume this drive is intended to repair the degraded array and
use the device
force_use - as above but don't require the drive be empty
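As a sketch only (the path glob is made up and the syntax is the one
proposed above), a conservative policy would look something like:
  DOMAIN path=pci-0000:00:1f.2-scsi-[0-5]:0:0:0 action=readd
or, for a hot-swap bay where bare replacement drives are expected:
  DOMAIN path=pci-0000:00:1f.2-scsi-[0-5]:0:0:0 action=safe_use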
All of the above actions are related to domains that are degraded. But
what to do if the array isn't degraded? We could add the device as a
spare, but if the array isn't degraded, adding a new hot spare doesn't
really *do* anything. No rebuild will start, nothing immediate happens,
it just goes in and sits there. And now that we have all these fancy
grow options, it's not entirely clear that a user would want that
anyway. So, I would argue that if the array isn't degraded, then there
is no sense of emergency in our actions, and there exist multiple
options for what to do with the device, some include being a hot spare
while others include using the device to grow the array, and the
possibilities and answers to what to do here are not at all clear. Even
if the user had previously configured us to treat the device as a spare,
they may change their mind and want to grow things. Given that there's
no immediate need to do anything as there aren't any degraded arrays, I
say let the user do whatever they want and don't try to do anything
automatically as it seems likely to me that the user's wants in this
area are likely to change from time to time based on circumstances and
having them update the config file prior to inserting the device is more
clunky than just telling them to do whatever they want themselves after
inserting the device.
>> Can we assume for all external metadata that spares added any container can be potentially moved between all container the same metadata?
>
> Yes, that can be the default action, and the spare-group keyword can
> be specified to override.
Or as I mentioned earlier, two domains with different path globs get
you this without having to use the spare-group keyword. For instance,
you can put the sata ports on one domain path and the sas ports on
another domain path, as the BIOS won't allow containers to cross that
boundary, and that is sufficient to make us handle hot plugged drives
properly when both are in use. I really don't see the use of the
spare-group keyword; the path glob should be sufficient.
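For instance (paths purely illustrative), something like:
  DOMAIN path=pci-0000:00:1f.2-scsi-* action=incremental
  DOMAIN path=pci-0000:05:00.0-sas-* action=incremental
would already keep the onboard sata ports and the sas controller in
separate spare pools without any spare-group keyword.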
--
Doug Ledford <dledford@redhat.com>
GPG KeyID: CFBFF194
http://people.redhat.com/dledford
Infiniband specific RPMs available at
http://people.redhat.com/dledford/Infiniband
[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 198 bytes --]
^ permalink raw reply [flat|nested] 33+ messages in thread
* Re: Auto Rebuild on hot-plug
2010-03-29 18:10 ` Doug Ledford
@ 2010-03-29 18:36 ` John Robinson
2010-03-29 18:57 ` Doug Ledford
2010-03-29 21:36 ` Dan Williams
1 sibling, 1 reply; 33+ messages in thread
From: John Robinson @ 2010-03-29 18:36 UTC (permalink / raw)
To: Doug Ledford
Cc: Dan Williams, Labun, Marcin, Neil Brown,
Hawrylewicz Czarnowski, Przemyslaw, Ciechanowski, Ed,
linux-raid@vger.kernel.org, Bill Davidsen
On 29/03/2010 19:10, Doug Ledford wrote:
[...]
> I'm also having a hard time justifying the existence of the metadata
> keyword.
[...]
> So, that leaves us back to not really needing the
> metadata keyword as the disks present in the path spec glob should be
> uniform in the metadata type and we should be able to simply use the
> right metadata from that.
I think I agree; in my limited scenario I might want to use 0.90
metadata on my sdX1 to make my /boot, but 1.x on my other partitions,
and it'll be whole discs that match my path spec so one metadata type
wouldn't apply uniformly.
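(For context, that sort of mixed layout is just a per-array metadata
choice at create time, roughly - device names are examples only:
  mdadm --create /dev/md0 --metadata=0.90 --level=1 \
        --raid-devices=2 /dev/sda1 /dev/sdb1          # /boot
  mdadm --create /dev/md1 --metadata=1.2 --level=5 \
        --raid-devices=3 /dev/sd[abc]2                # everything else
so a whole-disc path glob spans partitions carrying different metadata.)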
[...]
> All of the above actions are related to domains that are degraded. But
> what to do if the array isn't degraded? We could add the device as a
> spare, but if the array isn't degraded, adding a new hot spare doesn't
> really *do* anything. No rebuild will start, nothing immediate happens,
> it just goes in and sits there. And now that we have all these fancy
> grow options, it's not entirely clear that a user would want that
> anyway. So, I would argue that if the array isn't degraded, then there
> is no sense of emergency in our actions, and there exists multiple
> options for what to do with the device, some include being a hot spare
> while others include using the device to grow the array, and the
> possibilities and answers to what to do here are not at all clear. Even
> if the user had previously configured us to treat the device as a spare,
> they may change their mind and want to grow things. Given that there's
> no immediate need to do anything as there aren't any degraded arrays, I
> say let the user do whatever they want and don't try to do anything
> automatically as it seems likely to me that the user's wants in this
> area are likely to change from time to time based on circumstances and
> having them update the config file prior to inserting the device is more
> klunky than just telling them to do whatever they want themselves after
> inserting the device.
[...]
Yes, but do create the partition(s), boot sector, etc and set up the
spare(s). The user installed the system with anaconda or whatever, and
may not know the incantations to partition his new disc or install a
boot loader, so if he's managed to configure a mdadm.conf which says the
spare slots in his RAID chassis should belong to mdadm, prepare them for
him. Then all he needs to do is issue whatever grow command.
I think the exception to this is /boot on RAID-1, where I would prefer
to be able to have the system automatically add the new partition as an
active mirror instead of a hot spare, in case this new drive is what we
have to boot off next time.
I suppose there might be circumstances where you want to do something
else, like Netgear do on their ReadyNAS, but while it might be nice to
be able to configure that sort of automatic growing and reshaping, it
doesn't belong in the default config.
Cheers,
John.
^ permalink raw reply [flat|nested] 33+ messages in thread
* Re: Auto Rebuild on hot-plug
2010-03-29 18:36 ` John Robinson
@ 2010-03-29 18:57 ` Doug Ledford
2010-03-29 22:36 ` John Robinson
0 siblings, 1 reply; 33+ messages in thread
From: Doug Ledford @ 2010-03-29 18:57 UTC (permalink / raw)
To: John Robinson
Cc: Dan Williams, Labun, Marcin, Neil Brown,
Hawrylewicz Czarnowski, Przemyslaw, Ciechanowski, Ed,
linux-raid@vger.kernel.org, Bill Davidsen
[-- Attachment #1: Type: text/plain, Size: 3566 bytes --]
On 03/29/2010 02:36 PM, John Robinson wrote:
> On 29/03/2010 19:10, Doug Ledford wrote:
> [...]
>> All of the above actions are related to domains that are degraded. But
>> what to do if the array isn't degraded? We could add the device as a
>> spare, but if the array isn't degraded, adding a new hot spare doesn't
>> really *do* anything. No rebuild will start, nothing immediate happens,
>> it just goes in and sits there. And now that we have all these fancy
>> grow options, it's not entirely clear that a user would want that
>> anyway. So, I would argue that if the array isn't degraded, then there
>> is no sense of emergency in our actions, and there exists multiple
>> options for what to do with the device, some include being a hot spare
>> while others include using the device to grow the array, and the
>> possibilities and answers to what to do here are not at all clear. Even
>> if the user had previously configured us to treat the device as a spare,
>> they may change their mind and want to grow things. Given that there's
>> no immediate need to do anything as there aren't any degraded arrays, I
>> say let the user do whatever they want and don't try to do anything
>> automatically as it seems likely to me that the user's wants in this
>> area are likely to change from time to time based on circumstances and
>> having them update the config file prior to inserting the device is more
>> klunky than just telling them to do whatever they want themselves after
>> inserting the device.
> [...]
>
> Yes, but do create the partition(s), boot sector, etc and set up the
> spare(s).
Really, we should never have to do this in the situation I listed: aka
no degraded arrays exist. This implies that if you had a raid1 /boot
array, it's still intact. So partitioning and setting up boot
loaders doesn't make sense as the new disk isn't going in to replace
anything. You *might* want to add it to the raid1 /boot, but we don't
know that so doing things automatically doesn't make sense.
> The user installed the system with anaconda or whatever, and
> may not know the incantations to partition his new disc or install a
> boot loader, so if he's managed to configure a mdadm.conf which says the
> spare slots in his RAID chassis should belong to mdadm, prepare them for
> him. Then all he needs to do is issue whatever grow command.
>
> I think the exception to this is /boot on RAID-1, where I would prefer
> to be able to have the system automatically add the new partition as an
> active mirror instead of a hot spare, in case this new drive is what we
> have to boot off next time.
Again, I'm drawing a distinction here between a degraded array and a
non-degraded array. If the current array isn't degraded, then we won't
be booting off the new drive next time unless the user goes into the
BIOS and sets the new drive as the active boot device. And if the user
is going to do that, then they ought to be able to set up their new boot
mirror member themselves.
> I suppose there might be circumstances where you want to do something
> else, like Netgear do on their ReadyNAS, but while it might be nice to
> be able to configure that sort of automatic growing and reshaping, it
> doesn't belong in the default config.
>
> Cheers,
>
> John.
--
Doug Ledford <dledford@redhat.com>
GPG KeyID: CFBFF194
http://people.redhat.com/dledford
Infiniband specific RPMs available at
http://people.redhat.com/dledford/Infiniband
[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 198 bytes --]
^ permalink raw reply [flat|nested] 33+ messages in thread
* Re: Auto Rebuild on hot-plug
2010-03-29 18:10 ` Doug Ledford
2010-03-29 18:36 ` John Robinson
@ 2010-03-29 21:36 ` Dan Williams
2010-03-29 23:30 ` Doug Ledford
1 sibling, 1 reply; 33+ messages in thread
From: Dan Williams @ 2010-03-29 21:36 UTC (permalink / raw)
To: Doug Ledford
Cc: Labun, Marcin, Neil Brown, Hawrylewicz Czarnowski, Przemyslaw,
Ciechanowski, Ed, linux-raid@vger.kernel.org, Bill Davidsen
On Mon, Mar 29, 2010 at 11:10 AM, Doug Ledford <dledford@redhat.com> wrote:
> The second thing I'm having a hard time with is the spare-group. To be
> honest, if I follow what I think I should, and make it a hard
> requirement that any action other than none and incremental must use a
> non-global path glob (aka, path= MUST be present and can not be *), then
> spare-group looses all meaning. I say this because if a disk matches
> the path glob is it in a specific spare group already (the one that this
> DOMAIN represents) and ditto if arrays are on disks in this DOMAIN, then
> they are automatically part of the same spare-group. In other words, I
> think spare-group becomes entirely redundant once we have a DOMAIN keyword.
I agree once you have a DOMAIN you implicitly have a spare-group. So
DOMAIN would supersede the existing spare-group identifier in the
ARRAY line and cause mdadm --monitor to auto-migrate spares between
0.90 and 1.x metadata arrays in the same DOMAIN. For the imsm case
the expectation is that spares migrate between containers regardless
of the DOMAIN line as that is what the implementation expects.
However this is where we get into questions of DOMAIN conflicting with
'platform' expectations: under what conditions, if any, should DOMAIN
be allowed to conflict with/override the platform constraint? Currently
there is an environment variable IMSM_NO_PLATFORM; do we also need a
configuration option?
> I'm also having a hard time justifying the existence of the metadata
> keyword. The reason is that the metadata is already determined for us
> by the path glob. Specifically, if we assume that an array's members
> can not cross domain boundaries (a reasonable requirement in my opinion,
> we can't make an array where we can guarantee to the user that hot
> plugging a replacement disk will do what they expect if some of the
> array's members are inside the domain and some are outside the domain),
> then we should only ever need the metadata keyword if we are mixing
> metadata types within this domain. Well, we can always narrow down the
> domain if we are doing something like the first three sata disks on an
> Intel Matrix RAID controller as imsm and the last three as jbod with
> version 1.x metadata by putting the first half in one domain and the
> second half in another. And this would be the right thing to do versus
> trying to cover both in one domain. That means that only if we ever
> mixed imsm/ddf and md native raid types on a single disk would we be
> unable to narrow down the domain properly, and I'm not sure we care to
> support this. So, that leaves us back to not really needing the
> metadata keyword as the disks present in the path spec glob should be
> uniform in the metadata type and we should be able to simply use the
> right metadata from that.
...but this assumes we already have an array assembled in the domain
before the first hot plug event. The 'metadata' keyword would be
helpful at assembly time for ensuring only arrays of a certain type
are brought up in the domain.
We also need some consideration for reporting and enforcing 'platform'
boundaries if the user requests it. By default mdadm will block
attempts to create/assemble configurations that the option-rom does
not support (e.g. a disk attached to a third-party controller). For the
hotplug case, if the DOMAIN is configured incorrectly, I can see cases
where a user would like to specify "enforce platform constraints even
if my domain says otherwise", and the inverse "yes, I know the
option-rom does not support this configuration, but I know what I am
doing".
So I see a couple options:
1/ path=platform: auto-determine/enforce the domain(s) for all
platform raid controllers in the system
2/ Allow the user to manually enter a DOMAIN that is compatible with but
different from the default platform constraints, like your example above
of three ahci ports for imsm RAID with the remainder reserved for 1.x arrays
3/ Allow the user to turn off platform constraints and define 'exotic'
domains (mixed controller configurations).
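In configuration terms (all of this syntax is hypothetical, paths
invented) options 1 and 2 might look like:
  DOMAIN path=platform metadata=imsm action=incremental
  DOMAIN path=pci-0000:00:1f.2-scsi-[0-2]* metadata=imsm action=incremental
while option 3 would additionally need some explicit override of the
platform check, along the lines of the existing IMSM_NO_PLATFORM
environment variable.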
--
Dan
--
To unsubscribe from this list: send the line "unsubscribe linux-raid" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
^ permalink raw reply [flat|nested] 33+ messages in thread
* Re: Auto Rebuild on hot-plug
2010-03-29 18:57 ` Doug Ledford
@ 2010-03-29 22:36 ` John Robinson
2010-03-29 22:41 ` Dan Williams
2010-03-29 23:35 ` Doug Ledford
0 siblings, 2 replies; 33+ messages in thread
From: John Robinson @ 2010-03-29 22:36 UTC (permalink / raw)
To: Doug Ledford
Cc: Dan Williams, Labun, Marcin, Neil Brown,
Hawrylewicz Czarnowski, Przemyslaw, Ciechanowski, Ed,
linux-raid@vger.kernel.org, Bill Davidsen
On 29/03/2010 19:57, Doug Ledford wrote:
> On 03/29/2010 02:36 PM, John Robinson wrote:
[...]
>> Yes, but do create the partition(s), boot sector, etc and set up the
>> spare(s).
>
> Really, we should never have to do this in the situation I listed: aka
> no degraded arrays exist. This implies that if you had a raid1 /boot
> array, that it's still intact. So partitioning and setting up boot
> loaders doesn't make sense as the new disk isn't going in to replace
> anything. You *might* want to add it to the raid1 /boot, but we don't
> know that so doing things automatically doesn't make sense.
Actually I've just recently had the scenario where it would have made
perfect sense. I hooked up SATA ports [0-4] to the RAID
chassis and put 3 drives in the first 3 slots. Actually it turned out
I'd wired it up R-L not L-R so if I'd added a new drive in one of the
two right-hand slots it would have turned up as sda on the next boot.
OK, to some extent that's me being stupid, but at the same time I
correctly hooked up the first 5 SATA ports to the hot-swap chassis and
would want them considered the same group etc.
Cheers,
John.
^ permalink raw reply [flat|nested] 33+ messages in thread
* Re: Auto Rebuild on hot-plug
2010-03-29 22:36 ` John Robinson
@ 2010-03-29 22:41 ` Dan Williams
2010-03-29 22:46 ` John Robinson
2010-03-29 23:35 ` Doug Ledford
1 sibling, 1 reply; 33+ messages in thread
From: Dan Williams @ 2010-03-29 22:41 UTC (permalink / raw)
To: John Robinson
Cc: Doug Ledford, Labun, Marcin, Neil Brown,
Hawrylewicz Czarnowski, Przemyslaw, Ciechanowski, Ed,
linux-raid@vger.kernel.org, Bill Davidsen
On Mon, Mar 29, 2010 at 3:36 PM, John Robinson
<john.robinson@anonymous.org.uk> wrote:
> On 29/03/2010 19:57, Doug Ledford wrote:
>>
>> On 03/29/2010 02:36 PM, John Robinson wrote:
>
> [...]
>>>
>>> Yes, but do create the partition(s), boot sector, etc and set up the
>>> spare(s).
>>
>> Really, we should never have to do this in the situation I listed: aka
>> no degraded arrays exist. This implies that if you had a raid1 /boot
>> array, that it's still intact. So partitioning and setting up boot
>> loaders doesn't make sense as the new disk isn't going in to replace
>> anything. You *might* want to add it to the raid1 /boot, but we don't
>> know that so doing things automatically doesn't make sense.
>
> Actually I've just recently had the scenario where it would have made
> perfect sense. I hooked up the RAID chassis SATA[0-4] ports to the RAID
> chassis and put 3 drives in the first 3 slots. Actually it turned out I'd
> wired it up R-L not L-R so if I'd added a new drive in one of the two
> right-hand slots it would have turned up as sda on the next boot. OK, to
> some extent that's me being stupid, but at the same time I correctly hooked
> up the first 5 SATA ports to the hot-swap chassis and would want them
> considered the same group etc.
This kind of situation is where an option-rom comes in handy, i.e. the
platform firmware knows to boot from a defined raid volume. However,
it comes with quirky constraints like not supporting > 2-drive raid1.
But I see your point that it would be nice to at least have the option
to auto-grow raid1 boot arrays.
--
Dan
--
To unsubscribe from this list: send the line "unsubscribe linux-raid" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
^ permalink raw reply [flat|nested] 33+ messages in thread
* Re: Auto Rebuild on hot-plug
2010-03-29 22:41 ` Dan Williams
@ 2010-03-29 22:46 ` John Robinson
0 siblings, 0 replies; 33+ messages in thread
From: John Robinson @ 2010-03-29 22:46 UTC (permalink / raw)
To: Dan Williams
Cc: Doug Ledford, Labun, Marcin, Neil Brown,
Hawrylewicz Czarnowski, Przemyslaw, Ciechanowski, Ed,
linux-raid@vger.kernel.org, Bill Davidsen
On 29/03/2010 23:41, Dan Williams wrote:
> On Mon, Mar 29, 2010 at 3:36 PM, John Robinson
> <john.robinson@anonymous.org.uk> wrote:
[...]
>> OK, to
>> some extent that's me being stupid, but at the same time I correctly hooked
>> up the first 5 SATA ports to the hot-swap chassis and would want them
>> considered the same group etc.
>
> This kind of situation is where an option-rom comes in handy i.e. the
> platform firmware knows to boot from a defined raid volume. However,
> it comes with quirky constraints like not supporting > 2-drive raid1.
> But I see your point that it would be nice to at least have the option
> auto-grow raid1 boot arrays.
As it happens this was on an Intel-chipset board with ICH10-R and option
ROM, and I would have used IMSM if RHEL/CentOS had supported it at the
time, so I'm following IMSM support developments closely.
Cheers,
John.
^ permalink raw reply [flat|nested] 33+ messages in thread
* Re: Auto Rebuild on hot-plug
2010-03-29 21:36 ` Dan Williams
@ 2010-03-29 23:30 ` Doug Ledford
2010-03-30 0:46 ` Dan Williams
0 siblings, 1 reply; 33+ messages in thread
From: Doug Ledford @ 2010-03-29 23:30 UTC (permalink / raw)
To: Dan Williams
Cc: Labun, Marcin, Neil Brown, Hawrylewicz Czarnowski, Przemyslaw,
Ciechanowski, Ed, linux-raid@vger.kernel.org, Bill Davidsen
[-- Attachment #1: Type: text/plain, Size: 8372 bytes --]
On 03/29/2010 05:36 PM, Dan Williams wrote:
> On Mon, Mar 29, 2010 at 11:10 AM, Doug Ledford <dledford@redhat.com> wrote:
>> The second thing I'm having a hard time with is the spare-group. To be
>> honest, if I follow what I think I should, and make it a hard
>> requirement that any action other than none and incremental must use a
>> non-global path glob (aka, path= MUST be present and can not be *), then
>> spare-group looses all meaning. I say this because if a disk matches
>> the path glob is it in a specific spare group already (the one that this
>> DOMAIN represents) and ditto if arrays are on disks in this DOMAIN, then
>> they are automatically part of the same spare-group. In other words, I
>> think spare-group becomes entirely redundant once we have a DOMAIN keyword.
>
> I agree once you have a DOMAIN you implicitly have a spare-group. So
> DOMAIN would supersede the existing spare-group identifier in the
> ARRAY line and cause mdadm --monitor to auto-migrate spares between
> 0.90 and 1.x metadata arrays in the same DOMAIN. For the imsm case
> the expectation is that spares migrate between containers regardless
> of the DOMAIN line as that is what the implementation expects.
Give me some clearer explanation here because I think you and I are
using terms differently and so I want to make sure I have things right.
My understanding of imsm raid containers is that all the drives that
belong to a single option rom, as long as they aren't listed as jbod in
the option rom setup, belong to the same container. That container is
then split up into various chunks and that's where you get logical
volumes. I know there are odd rules for logical volumes inside a
container, but I think those are mostly irrelevant to this discussion.
So, when I think of a domain for imsm, I think of all the sata ports or
sas ports under a single option rom. From that perspective, spares can
*not* move between domains as a spare on a sas port can't be added to a
sata option rom container array. I was under the impression that if you
had, say, a 6 port sata controller option rom, you couldn't have the
first three ports be one container and the next three ports be another
container. Is that impression wrong? If so, that would explain our
confusion over domains.
However, that just means (to me anyway) that I would treat all of the
sata ports as one domain with multiple container arrays in that domain
just like we can have multiple native md arrays in a domain. If a disk
dies and we hot plug a new one, then mdadm would look for the degraded
container present in the domain and add the spare to it. It would then
be up to mdmon to determine what logical volumes are currently degraded
and slice up the new drive to work as spares for those degraded logical
volumes. Does this sound correct to you, and can mdmon do that already
or will this need to be added?
> However this is where we get into questions of DOMAIN conflicting with
> 'platform' expectations, under what conditions, if any, should DOMAIN
> be allowed to conflict/override the platform constraint? Currently
> there is an environment variable IMSM_NO_PLATFORM, do we also need a
> configuration op
I'm not sure I would ever allow breaking valid platform limitations. I
think if you want to break platform limitations, then you need to use
native md raid arrays and not imsm/ddf. It seems to me that if you
allow the creation of an imsm/ddf array that the BIOS can't work with
then you've potentially opened an entire can of worms we don't want to
open about expectations that the BIOS will be able to work with things
but can't. If you force native arrays as the only type that can break
platform limitations, then you are at least perfectly clear with the
user that the BIOS can't do what the user wants.
>> I'm also having a hard time justifying the existence of the metadata
>> keyword. The reason is that the metadata is already determined for us
>> by the path glob. Specifically, if we assume that an array's members
>> can not cross domain boundaries (a reasonable requirement in my opinion,
>> we can't make an array where we can guarantee to the user that hot
>> plugging a replacement disk will do what they expect if some of the
>> array's members are inside the domain and some are outside the domain),
>> then we should only ever need the metadata keyword if we are mixing
>> metadata types within this domain. Well, we can always narrow down the
>> domain if we are doing something like the first three sata disks on an
>> Intel Matrix RAID controller as imsm and the last three as jbod with
>> version 1.x metadata by putting the first half in one domain and the
>> second half in another. And this would be the right thing to do versus
>> trying to cover both in one domain. That means that only if we ever
>> mixed imsm/ddf and md native raid types on a single disk would we be
>> unable to narrow down the domain properly, and I'm not sure we care to
>> support this. So, that leaves us back to not really needing the
>> metadata keyword as the disks present in the path spec glob should be
>> uniform in the metadata type and we should be able to simply use the
>> right metadata from that.
>
> ...but this assumes we already have an array assembled in the domain
> before the first hot plug event. The 'metadata' keyword would be
> helpful at assembly time for ensuring only arrays of a certain type
> are brought up in the domain.
OK, I can see this. Especially if someone is not using ARRAY lines and
instead has enabled the AUTO keyword to just auto assemble arrays. If
we had a hard requirement that all arrays are listed in the file then we
could deduce the metadata of a domain from the arrays present in it, but
we don't.
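(For reference, the AUTO line already lets a configuration restrict
auto-assembly by metadata type without listing every array, e.g.
  AUTO +imsm +1.x -all
which assembles imsm and 1.x arrays automatically and ignores anything
else.)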
> We also need some consideration for reporting and enforcing 'platform'
> boundaries if the user requests it. By default mdadm will block
> attempts to create/assemble configurations that the option-rom does
> not support (i.e. disk attached to third-party controller). For the
> hotplug case if the DOMAIN is configured incorrectly I can see cases
> where a user would like to specify "enforce platform constraints even
> if my domain says otherwise", and the inverse "yes, I know the
> option-rom does not support this configuration, but I know what I am
> doing".
I can think of a perfect example of when I would want to break platform
rules here. I have a machine that's imsm capable with motherboard sata
ports, but if a drive went out I wouldn't want to open up the case, put
a new drive in, and cable it all up with the machine live. On the other
hand, that same machine has an external 4 drive hot plug chassis
attached and I could put a drive into it, add it to the imsm array, and
have everything rebuild before ever shutting the machine down. But, the
expectation here is that things wouldn't work unless I moved that drive
out of the external chassis and into the machine proper before
rebooting; otherwise the BIOS will consider the array degraded. So
while this is a perfectly valid scenario, I don't think it's one that we
should be catering to in any automated actions. Quite simply, I think
our support for automated actions should be limited to what we *know* is
right, and that we'll get right, and not try to be esoteric lest we end
up screwing the pooch so to speak. At least not for initial
implementations.
> So I see a couple options:
> 1/ path=platform: auto-determine/enforce the domain(s) for all
> platform raid controllers in the system
I think for imsm/ddf metadata, this should be automatic.
> 2/ Allow the user to manually enter a DOMAIN that is compatible but
> different than the default platform constraints like your 3-ahci ports
> for imsm-RAID remainder reserved for 1.x arrays example above
I agree. More restrictive than platform is OK.
> 3/ Allow the user to turn off platform constraints and define 'exotic'
> domains (mixed controller configurations).
Only for native metadata formats IMO.
--
Doug Ledford <dledford@redhat.com>
GPG KeyID: CFBFF194
http://people.redhat.com/dledford
Infiniband specific RPMs available at
http://people.redhat.com/dledford/Infiniband
[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 198 bytes --]
^ permalink raw reply [flat|nested] 33+ messages in thread
* Re: Auto Rebuild on hot-plug
2010-03-29 22:36 ` John Robinson
2010-03-29 22:41 ` Dan Williams
@ 2010-03-29 23:35 ` Doug Ledford
2010-03-30 12:10 ` John Robinson
1 sibling, 1 reply; 33+ messages in thread
From: Doug Ledford @ 2010-03-29 23:35 UTC (permalink / raw)
To: John Robinson
Cc: Dan Williams, Labun, Marcin, Neil Brown,
Hawrylewicz Czarnowski, Przemyslaw, Ciechanowski, Ed,
linux-raid@vger.kernel.org, Bill Davidsen
[-- Attachment #1: Type: text/plain, Size: 2267 bytes --]
On 03/29/2010 06:36 PM, John Robinson wrote:
> On 29/03/2010 19:57, Doug Ledford wrote:
>> On 03/29/2010 02:36 PM, John Robinson wrote:
> [...]
>>> Yes, but do create the partition(s), boot sector, etc and set up the
>>> spare(s).
>>
>> Really, we should never have to do this in the situation I listed: aka
>> no degraded arrays exist. This implies that if you had a raid1 /boot
>> array, that it's still intact. So partitioning and setting up boot
>> loaders doesn't make sense as the new disk isn't going in to replace
>> anything. You *might* want to add it to the raid1 /boot, but we don't
>> know that so doing things automatically doesn't make sense.
>
> Actually I've just recently had the scenario where it would have made
> perfect sense. I hooked up the RAID chassis SATA[0-4] ports to the RAID
> chassis and put 3 drives in the first 3 slots. Actually it turned out
> I'd wired it up R-L not L-R so if I'd added a new drive in one of the
> two right-hand slots it would have turned up as sda on the next boot.
Yes, but how do you want to fix that situation? Would you want to make
the new drives be new boot drives, or would you prefer to shut down,
move all the previous drives over two slots, and then put the new drive
into the fourth slot that you previously thought was the second slot? I
understand your situation, but were I in that position I'd just shuffle
my drives to correct my original mistake and go on with things; I
wouldn't make the new drives be boot drives. So I'm still not sure I
see the point of making a new drive that isn't replacing an existing
drive automatically get set up for boot duty.
> OK, to some extent that's me being stupid, but at the same time I
> correctly hooked up the first 5 SATA ports to the hot-swap chassis and
> would want them considered the same group etc.
I understand wanting them in the same group, but unless something is
degraded, just being in the same group doesn't tell us if you want to
keep it as a spare or use it to grow things.
--
Doug Ledford <dledford@redhat.com>
GPG KeyID: CFBFF194
http://people.redhat.com/dledford
Infiniband specific RPMs available at
http://people.redhat.com/dledford/Infiniband
[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 198 bytes --]
^ permalink raw reply [flat|nested] 33+ messages in thread
* Re: Auto Rebuild on hot-plug
2010-03-29 23:30 ` Doug Ledford
@ 2010-03-30 0:46 ` Dan Williams
2010-03-30 15:23 ` Doug Ledford
0 siblings, 1 reply; 33+ messages in thread
From: Dan Williams @ 2010-03-30 0:46 UTC (permalink / raw)
To: Doug Ledford
Cc: Labun, Marcin, Neil Brown, Hawrylewicz Czarnowski, Przemyslaw,
Ciechanowski, Ed, linux-raid@vger.kernel.org, Bill Davidsen
On Mon, Mar 29, 2010 at 4:30 PM, Doug Ledford <dledford@redhat.com> wrote:
> On 03/29/2010 05:36 PM, Dan Williams wrote:
>> I agree once you have a DOMAIN you implicitly have a spare-group. So
>> DOMAIN would supersede the existing spare-group identifier in the
>> ARRAY line and cause mdadm --monitor to auto-migrate spares between
>> 0.90 and 1.x metadata arrays in the same DOMAIN. For the imsm case
>> the expectation is that spares migrate between containers regardless
>> of the DOMAIN line as that is what the implementation expects.
>
> Give me some clearer explanation here because I think you and I are
> using terms differently and so I want to make sure I have things right.
> My understanding of imsm raid containers is that all the drives that
> belong to a single option rom, as long as they aren't listed as jbod in
> the option rom setup, belong to the same container.
I think the disconnect in the imsm case is that the container to
DOMAIN relationship is N:1, not 1:1. The mdadm notion of an
imsm-container correlates directly with a 'family' in the imsm
metadata. The rules of a family are:
1/ All family members must be a member of all defined volumes. For
example with a 4-drive container you could not simultaneously have a
4-drive (sd[abcd]) raid10 and a 2-drive (sd[ab]) raid1 volume because
any volume would need to incorporate all 4 disks. Also, per the rules
if you create two raid1 volumes sd[ab] and sd[cd] those would show up
as two containers.
2/ A spare drive does not belong to any particular family
('family_number' is undefined for a spare). The Windows driver will
automatically use a spare to fix any degraded family in the system.
In the mdadm/mdmon case, since we break families into containers, we
need a mechanism to migrate spare devices between containers because
they are equally valid hot-spare candidates for any imsm container in
the system.
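To put rule 2 in mdadm terms, the view is roughly this (device names
are examples only):
  # a container holds the imsm metadata for one family
  mdadm --create /dev/md/imsm0 --metadata=imsm --raid-devices=4 /dev/sd[bcde]
  # volumes are created inside the container
  mdadm --create /dev/md/vol0 --level=5 --raid-devices=4 /dev/md/imsm0
  # a spare is added to the container, not to a volume
  mdadm --add /dev/md/imsm0 /dev/sdf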
> That container is
> then split up into various chunks and that's where you get logical
> volumes. I know there are odd rules for logical volumes inside a
> container, but I think those are mostly irrelevant to this discussion.
> So, when I think of a domain for imsm, I think of all the sata ports or
> sas ports under a single option rom. From that perspective, spares can
> *not* move between domains as a spare on a sas port can't be added to a
> sata option rom container array. I was under the impression that if you
> had, say, a 6 port sata controller option rom, you couldn't have the
> first three ports be one container and the next three ports be another
> container. Is that impression wrong?
Yes, we can have exactly this situation.
This begs the question, why not change the definition of an imsm
container to incorporate anything with imsm metadata? This definitely
would make spare management easier. This was an early design decision
and had the nice side effect that it lined up naturally with the
failure and rebuild boundaries of a family. I could give it more
thought, but right now I believe there is a lot riding on this 1:1
container-to-family relationship, and I would rather not go there.
> However, that just means (to me anyway) that I would treat all of the
> sata ports as one domain with multiple container arrays in that domain
> just like we can have multiple native md arrays in a domain. If a disk
> dies and we hot plug a new one, then mdadm would look for the degraded
> container present in the domain and add the spare to it. It would then
> be up to mdmon to determine what logical volumes are currently degraded
> and slice up the new drive to work as spares for those degraded logical
> volumes. Does this sound correct to you, and can mdmon do that already
> or will this need to be added?
This sounds correct, and no, mdmon cannot do this today. The current
discussions we (Marcin and I) had with Neil offlist were about extending
mdadm --monitor to handle spare migration for containers, since it
already handles spare migration for native md arrays. It will need
some mdmon coordination since mdmon is the only agent that can
disambiguate a spare from a stale device at any given point in time.
>> However this is where we get into questions of DOMAIN conflicting with
>> 'platform' expectations, under what conditions, if any, should DOMAIN
>> be allowed to conflict/override the platform constraint? Currently
>> there is an environment variable IMSM_NO_PLATFORM, do we also need a
>> configuration op
>
> I'm not sure I would ever allow breaking valid platform limitations. I
> think if you want to break platform limitations, then you need to use
> native md raid arrays and not imsm/ddf. It seems to me that if you
> allow the creation of an imsm/ddf array that the BIOS can't work with
> then you've potentially opened an entire can of worms we don't want to
> open about expectations that the BIOS will be able to work with things
> but can't. If you force native arrays as the only type that can break
> platform limitations, then you are at least perfectly clear with the
> user that the BIOS can't do what the user wants.
Agreed.
--
Dan
--
To unsubscribe from this list: send the line "unsubscribe linux-raid" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
^ permalink raw reply [flat|nested] 33+ messages in thread
* Re: Auto Rebuild on hot-plug
2010-03-29 23:35 ` Doug Ledford
@ 2010-03-30 12:10 ` John Robinson
2010-03-30 15:53 ` Doug Ledford
0 siblings, 1 reply; 33+ messages in thread
From: John Robinson @ 2010-03-30 12:10 UTC (permalink / raw)
To: Doug Ledford
Cc: Dan Williams, Labun, Marcin, Neil Brown,
Hawrylewicz Czarnowski, Przemyslaw, Ciechanowski, Ed,
linux-raid@vger.kernel.org, Bill Davidsen
On 30/03/2010 00:35, Doug Ledford wrote:
> On 03/29/2010 06:36 PM, John Robinson wrote:
>> On 29/03/2010 19:57, Doug Ledford wrote:
>>> On 03/29/2010 02:36 PM, John Robinson wrote:
>> [...]
>>>> Yes, but do create the partition(s), boot sector, etc and set up the
>>>> spare(s).
>>> Really, we should never have to do this in the situation I listed: aka
>>> no degraded arrays exist. This implies that if you had a raid1 /boot
>>> array, that it's still intact. So partitioning and setting up boot
>>> loaders doesn't make sense as the new disk isn't going in to replace
>>> anything. You *might* want to add it to the raid1 /boot, but we don't
>>> know that so doing things automatically doesn't make sense.
>> Actually I've just recently had the scenario where it would have made
>> perfect sense. I hooked up the RAID chassis SATA[0-4] ports to the RAID
>> chassis and put 3 drives in the first 3 slots. Actually it turned out
>> I'd wired it up R-L not L-R so if I'd added a new drive in one of the
>> two right-hand slots it would have turned up as sda on the next boot.
>
> Yes, but how do you want to fix that situation? Would you want to make
> the new drives be new boot drives, or would you prefer to shut down,
> move all the previous drives over two slots, and then put the new drive
> into the fourth slot that you previously thought was the second slot? I
> understand your situation, but were I in that position I'd just shuffle
> my drives to correct my original mistake and go on with things, I
> wouldn't make the new drives be boot drives. So I'm still not sure I
> see the point to making a new drive that isn't replacing an existing
> drive automatically get set up for boot duty.
I wouldn't want to take the server down to shuffle the drives or cables.
But my point really is that if I have decided that I would want all the
drives in my chassis to have identical partition tables and carry an
active mirror of an array - in my example /boot - I would like to be
able to configure the hotplug arrangement to make it so, rather than
leaving me to have to manually regenerate the partition table, install
grub, add the spare and perhaps even grow the array.
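(The manual incantations being avoided there are roughly, assuming MBR
discs and made-up device names:
  sfdisk -d /dev/sda | sfdisk /dev/sdf     # copy the partition table
  grub-install /dev/sdf                    # reinstall the boot loader
  mdadm /dev/md0 --add /dev/sdf1           # add the new partition
  mdadm --grow /dev/md0 --raid-devices=3   # make it an active mirror
which is exactly the sort of thing I'd rather have done for me.)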
Of course this is a per-installation policy decision of what to do when
an extra drive is added to a non-degraded array; I'm certainly not
suggesting this should be the default action, though I think it would be
nice if it were possible to configure an action in this case.
>> OK, to some extent that's me being stupid, but at the same time I
>> correctly hooked up the first 5 SATA ports to the hot-swap chassis and
>> would want them considered the same group etc.
>
> I understand wanting them in the same group, but unless something is
> degraded, just being in the same group doesn't tell us if you want to
> keep it as a spare or use it to grow things.
I quite agree. All I'm getting at is that I'd like to be able to say
something in my mdadm.conf or wherever to say what I'd like done. This
might mean that I end up with something like the following:
DOMAIN path=pci-0000:00:1f.2-scsi-[0-4]:0:0:0 action=include
DOMAIN path=pci-0000:00:1f.2-scsi-[0-4]:0:0:0-part1 action=grow
DOMAIN path=pci-0000:00:1f.2-scsi-[0-4]:0:0:0-part2 action=replace
DOMAIN path=pci-0000:00:1f.2-scsi-[0-4]:0:0:0-part3 action=include
The first line gets the partition table and grub boot code regenerated
even when nothing's degraded. This in turn may trigger the other lines.
In the second line my action=grow means fix up my /boot if it's degraded
and both --add and --grow so it gets mirrored onto a fresh disc. The
third lines says fix up my swap array if it's degraded, but leave alone
otherwise. The fourth line says fix up my data array if it's degraded,
and add as a spare if it's a fresh disc. This last lets me decide later
what (if any) kind of --grow I want to do - make it larger or reshape
from RAID-5 to RAID-6.
But as you say, the default should be
DOMAIN path=* action=incremental
and the installer (automated or human) probably wants to edit that to
include at least
DOMAIN path=something action=replace
to take advantage of this auto-rebuild on hot-plug feature.
Sorry if I'm being long-winded, but hopefully you can see how I'd like
to be able to configure things. In the first instance, though, just
getting as far as the replace option would be great.
Cheers,
John.
^ permalink raw reply [flat|nested] 33+ messages in thread
* Re: Auto Rebuild on hot-plug
2010-03-30 0:46 ` Dan Williams
@ 2010-03-30 15:23 ` Doug Ledford
2010-03-30 17:47 ` Labun, Marcin
` (2 more replies)
0 siblings, 3 replies; 33+ messages in thread
From: Doug Ledford @ 2010-03-30 15:23 UTC (permalink / raw)
To: Dan Williams
Cc: Labun, Marcin, Neil Brown, Hawrylewicz Czarnowski, Przemyslaw,
Ciechanowski, Ed, linux-raid@vger.kernel.org, Bill Davidsen
On 03/29/2010 08:46 PM, Dan Williams wrote:
> I think the disconnect in the imsm case is that the container to
> DOMAIN relationship is N:1, not 1:1. The mdadm notion of an
> imsm-container correlates directly with a 'family' in the imsm
> metadata. The rules of a family are:
>
> 1/ All family members must be a member of all defined volumes. For
> example with a 4-drive container you could not simultaneously have a
> 4-drive (sd[abcd]) raid10 and a 2-drive (sd[ab]) raid1 volume because
> any volume would need to incorporate all 4 disks. Also, per the rules
> if you create two raid1 volumes sd[ab] and sd[cd] those would show up
> as two containers.
>
> 2/ A spare drive does not belong to any particular family
> ('family_number' is undefined for a spare). The Windows driver will
> automatically use a spare to fix any degraded family in the system.
> In the mdadm/mdmon case since we break families into containers we
> need a mechanism to migrate spare devices between containers because
> they are equally valid hot spare candidate for any imsm container in
> the system.
This explains the weird behavior I got when trying to create arrays on
my IMSM box via the BIOS. Thanks for the clear explanation of family
delineation.
> This begs the question, why not change the definition of an imsm
> container to incorporate anything with imsm metadata? This definitely
> would make spare management easier. This was an early design decision
> and had the nice side effect that it lined up naturally with the
> failure and rebuild boundaries of a family. I could give it more
> thought, but right now I believe there is a lot riding on this 1:1
> container-to-family relationship, and I would rather not go there.
I'm fine with the container being family based and not domain based. I
just didn't realize that distinction existed. It's all cleared up now ;-)
>> However, that just means (to me anyway) that I would treat all of the
>> sata ports as one domain with multiple container arrays in that domain
>> just like we can have multiple native md arrays in a domain. If a disk
>> dies and we hot plug a new one, then mdadm would look for the degraded
>> container present in the domain and add the spare to it. It would then
>> be up to mdmon to determine what logical volumes are currently degraded
>> and slice up the new drive to work as spares for those degraded logical
>> volumes. Does this sound correct to you, and can mdmon do that already
>> or will this need to be added?
>
> This sounds correct, and no mdmon cannot do this today. The current
> discussions we (Marcin and I) had with Neil offlist was extending
> mdadm --monitor to handle spare migration for containers since it
> already handles spare migration for native md arrays. It will need
> some mdmon coordination since mdmon is the only agent that can
> disambiguate a spare from a stale device at any given point in time.
So we'll need to coordinate on this aspect of things then. I'll keep
you updated as I get started implementing this if you want to think
about how you would like to handle this interaction between mdadm/mdmon.
As far as I can tell, we've reached a fairly decent consensus on things.
But, just to be clear, I'll reiterate that consensus here:
Add a new linetype: DOMAIN with options path= (must be specified at
least once for any domain action other than none and incremental and
must be something other than a global match for any action other than
none and incremental) and metadata= (specifies the metadata type
possible for this domain as one of imsm/ddf/md, and where for imsm or
ddf types, we will verify that the path portions of the domain do not
violate possible platform limitations) and action= (where action is
none, incremental, readd, safe_use, force_use where action is specific
to a hotplug when a degraded array in the domain exists and can possibly
have slightly different meanings depending on whether the path specifies
a whole disk device or specific partitions on a range of devices, and
where there is the possibility of adding more options or a new option
name for the case of adding a hotplug drive to a domain where no arrays
are degraded, in which case issues such as boot sectors, partition
tables, hot spare versus grow, etc. must be addressed).
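For concreteness, a configuration using the proposed keywords might look
something like the following (the path glob is only an example, and none of
this syntax exists in mdadm today - it is exactly what we are proposing to
add):
  DOMAIN path=pci-0000:00:1f.2-scsi-[0-3]:0:0:0 metadata=imsm action=safe_use
  DOMAIN path=* action=incremental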
Modify udev rules files to cover the following scenarios (it's
unfortunate that we have to split things up like this, but in order to
deal with either bare drives or drives that have things like lvm data
and we are using force_use, we must trigger on *all* drive hotplug
events, we must trigger early, and we must override other subsystem's
possible hotplug actions, otherwise the force_use option will be a noop):
1) plugging in a device that already has md raid metadata present
a) if the device has metadata corresponding to one of our arrays,
attempt to do normal incremental add
b) if the device has metadata corresponding to one of our arrays, and
the normal add failed and the options readd, safe_use, or force_use are
present in the mdadm.conf file, reattempt to add using readd
c) if the device has metadata corresponding to one of our arrays, and
the readd failed, and the options safe_use or force_use are present,
then do a regular add of the device to the array (possibly with doing a
preemptive zero-superblock on the device we are adding). This should
never fail. (A rough sketch of this a/b/c chain appears just after the list.)
d) if the device has metadata that does not correspond to any array
in the system, and there is a degraded array, and the option force_use
is present, then quite possibly repartition the device to make the
partitions match the degraded devices, zero any superblocks, and add the
device to the arrays. BIG FAT WARNING: the force_use option will cause
you to lose data if you plug in an array disk from another machine while
this machine has degraded arrays.
2) plugging in a device that doesn't already have md raid metadata
present but is part of an md domain
a) if the device is bare and the option safe_use is present and we
have degraded arrays, partition the device (if needed) and then add
partitions to degraded arrays
b) if the device is not bare, and the option force_use is present and
we have degraded arrays, (re)partition the device (if needed) and then
add partitions to degraded arrays. BIG FAT WARNING: if you enable this
mode, and you hotplug say an LVM volume into your domain when you have a
degraded array, kiss your LVM volume goodbye.
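A very rough shell sketch of the 1a/1b/1c fallback chain (DEV is the
hot-plugged device, /dev/md0 stands in for whatever array its metadata points
at, and the checks for which action= options are configured are omitted - in
reality all of this would live inside mdadm -I):
  DEV=$1
  mdadm --incremental "$DEV" && exit 0      # 1a: normal incremental add
  mdadm /dev/md0 --re-add "$DEV" && exit 0  # 1b: only if readd/safe_use/force_use
  mdadm --zero-superblock "$DEV"            # 1c: only if safe_use/force_use -
  mdadm /dev/md0 --add "$DEV"               #     forget the stale metadata and add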
Modify udev rules files to deal with device removal. Specifically, we
need to watch for removal of devices that are part of raid arrays and if
they weren't failed when they were removed, fail them, and then remove
them from the array. This is necessary for readd to work. It also
releases our hold on the scsi device so it can be fully released and the
new device can be added back using the same device name.
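A sketch of the sort of udev rule this implies, assuming an mdadm mode that
can fail-and-remove a member by device name from a udev context (written
here as -If; treat the exact interface as an assumption, it is part of the
work being described):
  ACTION=="remove", SUBSYSTEM=="block", ENV{ID_FS_TYPE}=="linux_raid_member", \
    RUN+="/sbin/mdadm -If $name"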
Modify mdadm -I mode to read the mdadm.conf file for the DOMAIN lines on
hotplug events and then modify the -I behavior to suit the situation.
The majority of the hotplug changes mentioned above will actually be
implemented as part of mdadm -I, we will simply add a few rules to call
mdadm -I in a few new situations, then allow mdadm -I (which has
unlimited smarts, whereas udev rules get very convoluted very quickly
if you try to make them smart) to actually make the decisions and do the
right thing. This means that effectively, we might just end up calling
mdadm -I on every disk hot plug event whether there is md metadata or
not, but only doing special things when the various conditions above are
met.
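In udev terms that is a catch-all rule on the order of the following (a
sketch; the real rules file would be more careful about partitions versus
whole disks and about skipping removable media):
  ACTION=="add", SUBSYSTEM=="block", KERNEL=="sd*", \
    RUN+="/sbin/mdadm --incremental $env{DEVNAME}"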
Modify mdadm and the spare-group concept of ARRAY lines to coordinate
spare-group assignments and DOMAIN assignments. We need to know what to
do in the event of a conflict between the two. My guess is that this is
unlikely, but in the end, I think we need to phase out spare-group
entirely in favor of domain. Since we can't have a conflict without
someone adding domain lines to the config file, I would suggest that the
domain assignments override spare-group assignments and we complain
about the conflict. That way, even though the user obviously intended
something specific with spare-group, he also must have intended
something specific with domain assignments, and as the domain keyword is
the newest and latest thing, honor the latest wishes and warn about it
in case they misentered something.
Modify mdadm/mdmon to enable spare migration between imsm containers in
a domain. Retain mdadm ability to move hot spares between native
arrays, but make it based on domain now instead of spare-group, and in
the config settings if someone has spare-group assignments and no domain
assignments, then create internal domain entries that mimic the
spare-group layout so that we can modify the core spare movement code to
only be concerned with domain entries.
I think that covers it. Do we have a consensus on the general work?
Your thoughts Neil?
--
Doug Ledford <dledford@redhat.com>
GPG KeyID: CFBFF194
http://people.redhat.com/dledford
Infiniband specific RPMs available at
http://people.redhat.com/dledford/Infiniband
* Re: Auto Rebuild on hot-plug
2010-03-30 12:10 ` John Robinson
@ 2010-03-30 15:53 ` Doug Ledford
2010-04-02 11:01 ` John Robinson
0 siblings, 1 reply; 33+ messages in thread
From: Doug Ledford @ 2010-03-30 15:53 UTC (permalink / raw)
To: John Robinson
Cc: Dan Williams, Labun, Marcin, Neil Brown,
Hawrylewicz Czarnowski, Przemyslaw, Ciechanowski, Ed,
linux-raid@vger.kernel.org, Bill Davidsen
On 03/30/2010 08:10 AM, John Robinson wrote:
> On 30/03/2010 00:35, Doug Ledford wrote:
>> On 03/29/2010 06:36 PM, John Robinson wrote:
>>> On 29/03/2010 19:57, Doug Ledford wrote:
>>>> On 03/29/2010 02:36 PM, John Robinson wrote:
>>> [...]
>>>>> Yes, but do create the partition(s), boot sector, etc and set up the
>>>>> spare(s).
>>>> Really, we should never have to do this in the situation I listed: aka
>>>> no degraded arrays exist. This implies that if you had a raid1 /boot
>>>> array, that it's still intact. So partitioning and setting up boot
>>>> loaders doesn't make sense as the new disk isn't going in to replace
>>>> anything. You *might* want to add it to the raid1 /boot, but we don't
>>>> know that so doing things automatically doesn't make sense.
>>> Actually I've just recently had the scenario where it would have made
>>> perfect sense. I hooked up the SATA[0-4] ports to the RAID
>>> chassis and put 3 drives in the first 3 slots. Actually it turned out
>>> I'd wired it up R-L not L-R so if I'd added a new drive in one of the
>>> two right-hand slots it would have turned up as sda on the next boot.
>>
>> Yes, but how do you want to fix that situation? Would you want to make
>> the new drives be new boot drives, or would you prefer to shut down,
>> move all the previous drives over two slots, and then put the new drive
>> into the fourth slot that you previously thought was the second slot? I
>> understand your situation, but were I in that position I'd just shuffle
>> my drives to correct my original mistake and go on with things, I
>> wouldn't make the new drives be boot drives. So I'm still not sure I
>> see the point to making a new drive that isn't replacing an existing
>> drive automatically get set up for boot duty.
>
> I wouldn't want to take the server down to shuffle the drives or cables.
> But my point really is that if I have decided that I would want all the
> drives in my chassis to have identical partition tables and carry an
> active mirror of an array - in my example /boot - I would like to be
> able to configure the hotplug arrangement to make it so, rather than
> leaving me to have to manually regenerate the partition table, install
> grub, add the spare and perhaps even grow the array.
I can (sorta) understand this. I personally never create any more /boot
partitions than the number of drives I can lose from my / array + 1.
So, if I have a raid5 / array, I do 2 /boot partitions. Anything more is
a waste since if you lose both of those boot drives, you also have too
few drives for the / array. But, if you want any given drive bootable,
that's possible. However...see below for an issue relating to this.
> Of course this is a per-installation policy decision of what to do when
> an extra drive is added to a non-degraded array, I'm certainly not
> suggesting this should be the default action, though I think it would be
> nice if it were possible to configure an action in this case.
>
>>> OK, to some extent that's me being stupid, but at the same time I
>>> correctly hooked up the first 5 SATA ports to the hot-swap chassis and
>>> would want them considered the same group etc.
>>
>> I understand wanting them in the same group, but unless something is
>> degraded, just being in the same group doesn't tell us if you want to
>> keep it as a spare or use it to grow things.
>
> I quite agree. All I'm getting at is that I'd like to be able to say
> something in my mdadm.conf or wherever to say what I'd like done. This
> might mean that I end up with something like the following:
> DOMAIN path=pci-0000:00:1f.2-scsi-[0-4]:0:0:0 action=include
> DOMAIN path=pci-0000:00:1f.2-scsi-[0-4]:0:0:0-part1 action=grow
> DOMAIN path=pci-0000:00:1f.2-scsi-[0-4]:0:0:0-part2 action=replace
> DOMAIN path=pci-0000:00:1f.2-scsi-[0-4]:0:0:0-part3 action=include
This I'm not so sure about. I can try to make this a reality, but the
issue here is that when you are allowed to specify things on a partition
by partition basis, it becomes very easy to create conflicting commands.
For example, let's say you have part1 action=grow, but for the bare disk
you have action=incremental. And let's assume you plug in a bare disk.
In order to honor the part1 action=grow, we would have to partition the
disk, which is in conflict with the bare disk action of incremental
since that implies we would only use preexisting md raid partitions. I
could *easily* see the feature of allowing per partition actions causing
the overall code complexity to double or more. You know, I'd rather
provide a simple grub script that automatically set up all raid1 members
as boot devices any time it was run than try to handle this
automatically ;-) Maybe I should add that to the mdadm package on
x86/x86_64, something like mdadm-grub-install or similar.
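Such a script could be almost trivial - a sketch, assuming grub legacy and a
/boot mirror at /dev/md0; the parsing of --detail output is deliberately
naive:
  #!/bin/sh
  BOOT_MD=${1:-/dev/md0}
  # install grub on the disk behind every active member of the /boot mirror
  for part in $(mdadm --detail "$BOOT_MD" | awk '/active sync/ {print $NF}'); do
      disk=${part%%[0-9]*}      # /dev/sda1 -> /dev/sda
      grub-install "$disk"
  done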
> The first line gets the partition table and grub boot code regenerated
> even when nothing's degraded. This in turn may trigger the other lines.
> In the second line my action=grow means fix up my /boot if it's degraded
> and both --add and --grow so it gets mirrored onto a fresh disc. The
> third line says fix up my swap array if it's degraded, but leave it alone
> otherwise. The fourth line says fix up my data array if it's degraded,
> and add as a spare if it's a fresh disc. This last lets me decide later
> what (if any) kind of --grow I want to do - make it larger or reshape
> from RAID-5 to RAID-6.
As pointed out above, some of these are conflicting commands in that
they tell us to modify the disk in one place, and leave it alone in
another. The basic assumption you are making here is that we will
always be able to duplicate the partition table because all drives in a
domain will have the same partition table. And that's not always the case.
> But as you say, the default should be
> DOMAIN path=* action=incremental
>
> and the installer (automated or human) probably wants to edit that to
> include at least
> DOMAIN path=something action=replace
> to take advantage of this auto-rebuild on hot-plug feature.
>
> Sorry if I'm being long-winded, but hopefully you can see how I'd like
> to be able to configure things. In the first instance, though, just
> getting as far as the replace option would be great.
I see where you are going, I'm a little worried about getting there ;-)
--
Doug Ledford <dledford@redhat.com>
GPG KeyID: CFBFF194
http://people.redhat.com/dledford
Infiniband specific RPMs available at
http://people.redhat.com/dledford/Infiniband
* RE: Auto Rebuild on hot-plug
2010-03-30 15:23 ` Doug Ledford
@ 2010-03-30 17:47 ` Labun, Marcin
2010-03-30 23:47 ` Dan Williams
2010-03-30 23:36 ` Dan Williams
2010-03-31 4:53 ` Neil Brown
2 siblings, 1 reply; 33+ messages in thread
From: Labun, Marcin @ 2010-03-30 17:47 UTC (permalink / raw)
To: Doug Ledford, Williams, Dan J
Cc: Neil Brown, Hawrylewicz Czarnowski, Przemyslaw, Ciechanowski, Ed,
linux-raid@vger.kernel.org, Bill Davidsen
> As far as I can tell, we've reached a fairly decent consensus on things.
> But, just to be clear, I'll reiterate that consensus here:
>
> Add a new linetype: DOMAIN with options path= (must be specified at
> least once for any domain action other than none and incremental and
> must be something other than a global match for any action other than
> none and incremental) and metadata= (specifies the metadata type
> possible for this domain as one of imsm/ddf/md, and where for imsm or
> ddf types, we will verify that the path portions of the domain do not
> violate possible platform limitations) and action= (where action is
> none, incremental, readd, safe_use, force_use where action is specific
> to a hotplug when a degraded array in the domain exists and can possibly
> have slightly different meanings depending on whether the path specifies
> a whole disk device or specific partitions on a range of devices, and
> where there is the possibility of adding more options or a new option
> name for the case of adding a hotplug drive to a domain where no arrays
> are degraded, in which case issues such as boot sectors, partition
> tables, hot spare versus grow, etc. must be addressed).
I understand that there are the following defaults:
- Platform/metadata limitations create default domains
- Metadata handlers deliver default actions
The equivalent configuration line for imsm is:
DOMAIN path="any" metadata=imsm action=none
A user could additionally split default domains using spare groups and the path keyword.
For instance, for imsm the default domain area is the platform controller.
If any metadata is served by multiple controllers, each of them creates its own domain.
If we allow "any" for the path keyword, a user could simply override the metadata defaults for all his controllers with:
DOMAIN path="any" metadata=imsm action=safe_use
>
> Modify udev rules files to cover the following scenarios (it's
> unfortunate that we have to split things up like this, but in order to
> deal with either bare drives or drives that have things like lvm data
> and we are using force_use, we must trigger on *all* drive hotplug
> events, we must trigger early, and we must override other subsystem's
> possible hotplug actions, otherwise the force_use option will be a noop):
>
> 1) plugging in a device that already has md raid metadata present
> a) if the device has metadata corresponding to one of our arrays,
> attempt to do normal incremental add
> b) if the device has metadata corresponding to one of our arrays, and
> the normal add failed and the options readd, safe_use, or force_use are
> present in the mdadm.conf file, reattempt to add using readd
> c) if the device has metadata corresponding to one of our arrays, and
> the readd failed, and the options safe_use or force_use are present,
> then do a regular add of the device to the array (possibly with doing a
> preemptive zero-superblock on the device we are adding). This should
> never fail.
> d) if the device has metadata that does not correspond to any array
> in the system, and there is a degraded array, and the option force_use
> is present, then quite possibly repartition the device to make the
> partitions match the degraded devices, zero any superblocks, and add the
> device to the arrays. BIG FAT WARNING: the force_use option will cause
> you to lose data if you plug in an array disk from another machine while
> this machine has degraded arrays.
>
> 2) plugging in a device that doesn't already have md raid metadata
> present but is part of an md domain
> a) if the device is bare and the option safe_use is present and we
> have degraded arrays, partition the device (if needed) and then add
> partitions to degraded arrays
> b) if the device is not bare, and the option force_use is present and
> we have degraded arrays, (re)partition the device (if needed) and then
> add partitions to degraded arrays. BIG FAT WARNING: if you enable this
> mode, and you hotplug say an LVM volume into your domain when you have a
> degraded array, kiss your LVM volume goodbye.
>
> Modify udev rules files to deal with device removal. Specifically, we
> need to watch for removal of devices that are part of raid arrays and if
> they weren't failed when they were removed, fail them, and then remove
> them from the array. This is necessary for readd to work. It also
> releases our hold on the scsi device so it can be fully released and the
> new device can be added back using the same device name.
I think the implementation can be something like this:
We shall set a cookie to store the path of the disk which is removed from the md device. Later, if a new device is plugged into that port, it can be used for rebuild.
We shall set a timer for when cookies expire. I propose to clean them on start-up (mdadm --monitor can be a candidate; the default action shall be cookie clean-up).
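A trivial sketch of the cookie idea (the directory, the one-hour timeout and
the way ID_PATH reaches the script are all assumptions, not an agreed
interface):
  # on a udev "remove" event for a raid member:
  mkdir -p /var/run/mdadm/cookies
  touch "/var/run/mdadm/cookies/$ID_PATH"
  # on a later "add" event, treat the port as recently vacated only if a
  # cookie younger than an hour exists:
  find /var/run/mdadm/cookies -name "$ID_PATH" -mmin -60 | grep -q . \
      && echo "port $ID_PATH recently lost a raid member"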
[ .. ]
> Modify mdadm and the spare-group concept of ARRAY lines to coordinate
> spare-group assignments and DOMAIN assignments. We need to know what to
> do in the event of a conflict between the two. My guess is that this is
> unlikely, but in the end, I think we need to phase out spare-group
> entirely in favor of domain. Since we can't have a conflict without
> someone adding domain lines to the config file, I would suggest that the
> domain assignments override spare-group assignments and we complain
> about the conflict. That way, even though the user obviously intended
> something specific with spare-group, he also must have intended
> something specific with domain assignments, and as the domain keyword is
> the newest and latest thing, honor the latest wishes and warn about it
> in case they misentered something.
>
> Modify mdadm/mdmon to enable spare migration between imsm containers in
> a domain. Retain mdadm ability to move hot spares between native
> arrays, but make it based on domain now instead of spare-group, and in
> the config settings if someone has spare-group assignments and no domain
> assignments, then create internal domain entries that mimic the
> spare-group layout so that we can modify the core spare movement code to
> only be concerned with domain entries.
Enable spare disk sharing between containers if they belong to the same domain and do not have conflicting spare-group assignments. This will allow spare sharing by default.
Additionally, we can consult the metadata handlers before moving spares between containers. We can do that by adding another metadata handler function which tests metadata and controller dependencies (I can imagine that a user could define metadata-stored domains of spare sharing; controller (OROM) dependent constraints shall be handled in this function, too).
Thanks,
Marcin Labun
* Re: Auto Rebuild on hot-plug
2010-03-30 15:23 ` Doug Ledford
2010-03-30 17:47 ` Labun, Marcin
@ 2010-03-30 23:36 ` Dan Williams
2010-03-31 4:53 ` Neil Brown
2 siblings, 0 replies; 33+ messages in thread
From: Dan Williams @ 2010-03-30 23:36 UTC (permalink / raw)
To: Doug Ledford
Cc: Labun, Marcin, Neil Brown, Hawrylewicz Czarnowski, Przemyslaw,
Ciechanowski, Ed, linux-raid@vger.kernel.org, Bill Davidsen
On Tue, Mar 30, 2010 at 8:23 AM, Doug Ledford <dledford@redhat.com> wrote:
> On 03/29/2010 08:46 PM, Dan Williams wrote:
>> This begs the question, why not change the definition of an imsm
>> container to incorporate anything with imsm metadata? This definitely
>> would make spare management easier. This was an early design decision
>> and had the nice side effect that it lined up naturally with the
>> failure and rebuild boundaries of a family. I could give it more
>> thought, but right now I believe there is a lot riding on this 1:1
>> container-to-family relationship, and I would rather not go there.
>
> I'm fine with the container being family based and not domain based. I
> just didn't realize that distinction existed. It's all cleared up now ;-)
>
Great.
>>> However, that just means (to me anyway) that I would treat all of the
>>> sata ports as one domain with multiple container arrays in that domain
>>> just like we can have multiple native md arrays in a domain. If a disk
>>> dies and we hot plug a new one, then mdadm would look for the degraded
>>> container present in the domain and add the spare to it. It would then
>>> be up to mdmon to determine what logical volumes are currently degraded
>>> and slice up the new drive to work as spares for those degraded logical
>>> volumes. Does this sound correct to you, and can mdmon do that already
>>> or will this need to be added?
>>
>> This sounds correct, and no mdmon cannot do this today. The current
>> discussions we (Marcin and I) had with Neil offlist was extending
>> mdadm --monitor to handle spare migration for containers since it
>> already handles spare migration for native md arrays. It will need
>> some mdmon coordination since mdmon is the only agent that can
>> disambiguate a spare from a stale device at any given point in time.
>
> So we'll need to coordinate on this aspect of things then. I'll keep
> you updated as I get started implementing this if you want to think
> about how you would like to handle this interaction between mdadm/mdmon.
Ok, that sounds like a good split; we'll keep you posted as well.
> As far as I can tell, we've reached a fairly decent consensus on things.
> But, just to be clear, I'll reiterate that consensus here:
>
> Add a new linetype: DOMAIN with options path= (must be specified at
> least once for any domain action other than none and incremental and
> must be something other than a global match for any action other than
> none and incremental) and metadata= (specifies the metadata type
> possible for this domain as one of imsm/ddf/md
Why not 0.90 and 1.x instead of 'md'? These match the 'name'
attribute of struct superswitch.
> and where for imsm or
> ddf types, we will verify that the path portions of the domain do not
> violate possible platform limitations) and action= (where action is
> none, incremental, readd, safe_use, force_use where action is specific
> to a hotplug when a degraded array in the domain exists and can possibly
> have slightly different meanings depending on whether the path specifies
> a whole disk device or specific partitions on a range of devices
I have been thinking that the path= option specifies controller paths,
not disk devices. Something like "pci-0000:00:1f.2-scsi-[0-3]*" to
pick the first 4 ahci ports. This also purposefully excludes virtual
devices dm/md. I think we want to limit this functionality to
physical controller ports... or were you looking to incorporate
support for any block device?
> and
> where there is the possibility of adding more options or a new option
> name for the case of adding a hotplug drive to a domain where no arrays
> are degraded, in which case issues such as boot sectors, partition
> tables, hot spare versus grow, etc. must be addressed).
>
> Modify udev rules files to cover the following scenarios (it's
> unfortunate that we have to split things up like this, but in order to
> deal with either bare drives or drives that have things like lvm data
> and we are using force_use, we must trigger on *all* drive hotplug
> events, we must trigger early, and we must override other subsystem's
> possible hotplug actions, otherwise the force_use option will be a noop):
Can't we limit the scope to the hotplug events we care about by
filtering the udev scripts based on the current contents of the
configuration file? We already need a step in the process that
verifies if the configuration heeds the platform constraints. So,
something like mdadm --activate-domains that validates the
configuration, generates the necessary udev scripts and enables
hotplug.
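So for a line like "DOMAIN path=pci-0000:00:1f.2-scsi-[0-3]* action=force_use"
the generated rules file might contain little more than (again a sketch,
leaning on udev's ID_PATH property):
  ACTION=="add", SUBSYSTEM=="block", ENV{DEVTYPE}=="disk", \
    ENV{ID_PATH}=="pci-0000:00:1f.2-scsi-[0-3]*", \
    RUN+="/sbin/mdadm --incremental $env{DEVNAME}"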
>
> 1) plugging in a device that already has md raid metadata present
> a) if the device has metadata corresponding to one of our arrays,
> attempt to do normal incremental add
> b) if the device has metadata corresponding to one of our arrays, and
> the normal add failed and the options readd, safe_use, or force_use are
> present in the mdadm.conf file, reattempt to add using readd
> c) if the device has metadata corresponding to one of our arrays, and
> the readd failed, and the options safe_use or force_use are present,
> then do a regular add of the device to the array (possibly with doing a
> preemptive zero-superblock on the device we are adding). This should
> never fail.
Yes, but this also reminds me about the multiple superblock case. It
should usually only happen to people that experiment with different
metadata types, but we should catch and probably ignore drives that
have ambiguous/multiple superblocks.
> d) if the device has metadata that does not correspond to any array
> in the system, and there is a degraded array, and the option force_use
> is present, then quite possibly repartition the device to make the
> partitions match the degraded devices, zero any superblocks, and add the
> device to the arrays. BIG FAT WARNING: the force_use option will cause
> you to lose data if you plug in an array disk from another machine while
> this machine has degraded arrays.
Let's also limit this to ports that were recently (as specified by a
timeout= option to the DOMAIN) unplugged. This limits the potential
damage.
>
> 2) plugging in a device that doesn't already have md raid metadata
> present but is part of an md domain
> a) if the device is bare and the option safe_use is present and we
> have degraded arrays, partition the device (if needed) and then add
> partitions to degraded arrays
> b) if the device is not bare, and the option force_use is present and
> we have degraded arrays, (re)partition the device (if needed) and then
> add partitions to degraded arrays. BIG FAT WARNING: if you enable this
> mode, and you hotplug say an LVM volume into your domain when you have a
> degraded array, kiss your LVM volume goodbye.
>
> Modify udev rules files to deal with device removal. Specifically, we
> need to watch for removal of devices that are part of raid arrays and if
> they weren't failed when they were removed, fail them, and then remove
> them from the array. This is necessary for readd to work. It also
> releases our hold on the scsi device so it can be fully released and the
> new device can be added back using the same device name.
>
Nod.
> Modify mdadm -I mode to read the mdadm.conf file for the DOMAIN lines on
> hotplug events and then modify the -I behavior to suit the situation.
> The majority of the hotplug changes mentioned above will actually be
> implemented as part of mdadm -I, we will simply add a few rules to call
> mdadm -I in a few new situations, then allow mdadm -I (which has
> unlimited smarts, whereas udev rules get very convoluted very quickly
> if you try to make them smart) to actually make the decisions and do the
> right thing. This means that effectively, we might just end up calling
> mdadm -I on every disk hot plug event whether there is md metadata or
> not, but only doing special things when the various conditions above are
> met.
Modulo the ability to have a global enable / disable for domains via
something like --activate-domains
>
> Modify mdadm and the spare-group concept of ARRAY lines to coordinate
> spare-group assignments and DOMAIN assignments. We need to know what to
> do in the event of a conflict between the two. My guess is that this is
> unlikely, but in the end, I think we need to phase out spare-group
> entirely in favor of domain. Since we can't have a conflict without
> someone adding domain lines to the config file, I would suggest that the
> domain assignments override spare-group assignments and we complain
> about the conflict. That way, even though the user obviously intended
> something specific with spare-group, he also must have intended
> something specific with domain assignments, and as the domain keyword is
> the newest and latest thing, honor the latest wishes and warn about it
> in case they misentered something.
Sounds reasonable.
>
> Modify mdadm/mdmon to enable spare migration between imsm containers in
> a domain. Retain mdadm ability to move hot spares between native
> arrays, but make it based on domain now instead of spare-group, and in
> the config settings if someone has spare-group assignments and no domain
> assignments, then create internal domain entries that mimic the
> spare-group layout so that we can modify the core spare movement code to
> only be concerned with domain entries.
>
> I think that covers it. Do we have a consensus on the general work?
I think we have a consensus. The wrinkle that comes to mind is the
case we talked about before where some ahci ports have been reserved
for jbod support in the DOMAIN configuration. If the user plugs an
imsm-metadata disk into a "jbod port" and reboots, the option-rom will
assemble the array across the DOMAIN boundary. You would need to put
explicit "passthrough" metadata on the disk to get the option-rom to
ignore it, but then you couldn't put another metadata type on that
disk. So maybe we can't support the subset case and need to either force
the platform's full expectation of the domain boundaries, or honor the
DOMAIN line and let the user figure out/remember why this one raid
member slot does not respond to hotplug events.
Thanks for the detailed write up.
--
Dan
* Re: Auto Rebuild on hot-plug
2010-03-30 17:47 ` Labun, Marcin
@ 2010-03-30 23:47 ` Dan Williams
0 siblings, 0 replies; 33+ messages in thread
From: Dan Williams @ 2010-03-30 23:47 UTC (permalink / raw)
To: Labun, Marcin
Cc: Doug Ledford, Neil Brown, Hawrylewicz Czarnowski, Przemyslaw,
Ciechanowski, Ed, linux-raid@vger.kernel.org, Bill Davidsen
On Tue, Mar 30, 2010 at 10:47 AM, Labun, Marcin <Marcin.Labun@intel.com> wrote:
> I understand that there are the following defaults:
> - Platform/metadata limitations create default domains
> - Metadata handlers deliver default actions
> The equivalent configuration line for imsm is:
> DOMAIN path="any" metadata=imsm action=none
I would expect path=<ahci ports> when metadata=imsm
> A user could additionally split default domains using spare groups and the path keyword.
> For instance, for imsm the default domain area is the platform controller.
> If any metadata is served by multiple controllers, each of them creates its own domain.
A single DOMAIN can span several controllers, but only if that does
not violate the 'platform' constraints for that metadata type (which
are always enforced).
> Enable spare disk sharing between containers if they belong to the same domain and do not have conflicting spare-group assignments. This will allow spare sharing by default.
Yes, spare sharing by default within the domain and as Doug said
ignore any conflicts with the spare-group identifier i.e. DOMAIN
overrides/supersedes spare-group.
> Additionally, we can consult the metadata handlers before moving spares between containers. We can do that by adding another metadata handler function which tests metadata and controller dependencies (I can imagine that a user could define metadata-stored domains of spare sharing; controller (OROM) dependent constraints shall be handled in this function, too).
This really is just a variation of load_super() performed on a
container with an extra disk added to report whether the device is
spare, failed, or otherwise out of sync.
In the imsm case this is load_super_imsm_all() with another disk
(outside of the current container list) to compare against.
--
Dan
* Re: Auto Rebuild on hot-plug
2010-03-25 2:47 ` Michael Evans
@ 2010-03-31 1:18 ` Neil Brown
2010-03-31 2:46 ` Michael Evans
0 siblings, 1 reply; 33+ messages in thread
From: Neil Brown @ 2010-03-31 1:18 UTC (permalink / raw)
To: Michael Evans
Cc: Doug Ledford, Dan Williams, Labun, Marcin,
Hawrylewicz Czarnowski, Przemyslaw, Ciechanowski, Ed, linux-raid,
Bill Davidsen
On Wed, 24 Mar 2010 19:47:59 -0700
Michael Evans <mjevans1983@gmail.com> wrote:
> I believe that the default action should be to do /nothing/. That is
> the only safe thing to do. If an administrative framework is desired
> that seems to fall under a larger project goal which is likely better
> covered by programs more aware of the overall system state. This
> route also allows for a range of scalability.
I agree that /nothing/ should be the default action for a device with
unrecognised content.
If the content of the device is recognised, it is OK to have a default which
does what the content implies - i.e. build the device into an array.
But maybe that is what you meant.
I think there is useful stuff that can be done entirely inside mdadm but it
is worth thinking about where to draw the line. I'm not convinced that mdadm
should "know" about partition tables and MBRs. Possible the task of copying
those is best placed in a script.
Thanks,
NeilBrown
* Re: Auto Rebuild on hot-plug
2010-03-25 8:01 ` Luca Berra
@ 2010-03-31 1:26 ` Neil Brown
2010-03-31 6:10 ` Luca Berra
0 siblings, 1 reply; 33+ messages in thread
From: Neil Brown @ 2010-03-31 1:26 UTC (permalink / raw)
To: Luca Berra; +Cc: linux-raid
On Thu, 25 Mar 2010 09:01:08 +0100
Luca Berra <bluca@comedia.it> wrote:
> On Thu, Mar 25, 2010 at 11:35:43AM +1100, Neil Brown wrote:
> >
> > http://blogs.techrepublic.com.com/opensource/?p=1368
> >
> > The most significant thing I got from this was a complain in the comments
> > that managing md raid was too complex and hence error-prone.
>
> well, i would not be upset by j. random jerk complaining in a blog
> comments, as soon as you make it one click you will find another one
> that complains because it is not is favorite colour :P
We can learn something from any opinion that differs from our own.
It is clear to me that mdadm requires a certain level of understanding
to be used effectively and safely.
I don't think that can be entirely addressed in mdadm: there is a place for a
higher-level framework that encodes policies and gives advice. But there is
still room to improve mdadm to make it more powerful, more informative, and
more forgiving.
>
> > I see the issue as breaking down in to two parts.
> > 1/ When a device is hot plugged into the system, is md allowed to use it as
> > a spare for recovery?
> > 2/ If md has a spare device, what set of arrays can it be used in if needed.
> >
> > A typical hot plug event will need to address both of these questions in
> > turn before recovery actually starts.
> >
> > Part 1.
> >
> > A newly hotplugged device may have metadata for RAID (0.90, 1.x, IMSM, DDF,
> > other vendor metadata) or LVM or a filesystem. It might have a partition
> > table which could be subordinate to or super-ordinate to other metadata.
> > (i.e. RAID in partitions, or partitions in RAID). The metadata may or may
> > not be stale. It may or may not match - either strongly or weakly -
> > metadata on devices in currently active arrays.
> also the newly hotplugged device may have _data_ on it.
>
You mean completely raw data, no partitions, no filesystem structure etc?
Yes, that is possible. People who are likely to handle devices like that
would choose more conservative configurations.
> > Some how from all of that information we need to decide if md can use the
> > device without asking, or possibly with a simple yes/no question, and we
> > need to decide what to actually do with the device.
> how does the yes/no question part work?
I imagine an Email to the admin "Hey boss, I just noticed you plugged in a
drive that looks like it used to be part of some array. We need a spare on
this other array and the new device is big enough. Shall I huh huh huh? Go
on let me..."
Then the admin can choose to run the command "make it so", or not.
> we can also make /usr/bin/md-create-spare ...
Yes, there is a place for something like that certainly.
NeilBrown
* Re: Auto Rebuild on hot-plug
2010-03-25 14:10 ` John Robinson
@ 2010-03-31 1:30 ` Neil Brown
0 siblings, 0 replies; 33+ messages in thread
From: Neil Brown @ 2010-03-31 1:30 UTC (permalink / raw)
To: John Robinson
Cc: Doug Ledford, Dan Williams, Labun, Marcin,
Hawrylewicz Czarnowski, Przemyslaw, Ciechanowski, Ed, linux-raid,
Bill Davidsen
On Thu, 25 Mar 2010 14:10:05 +0000
John Robinson <john.robinson@anonymous.org.uk> wrote:
> > Part 1.
> >
> > A newly hotplugged device may have metadata for RAID (0.90, 1.x, IMSM, DDF,
> > other vendor metadata) or LVM or a filesystem. It might have a partition
> > table which could be subordinate to or super-ordinate to other metadata.
> > (i.e. RAID in partitions, or partitions in RAID). The metadata may or may
> > not be stale. It may or may not match - either strongly or weakly -
> > metadata on devices in currently active arrays.
>
> Or indeed it may have no metadata at all - it may be a fresh disc. I
> didn't see that you stated this specifically at any point, though it was
> there by implication, so I will: you're going to have to pick up hotplug
> events for bare drives, which presumably means you'll also get events
> for CD-ROM drives, USB sticks, printers with media card slots in them etc.
Correct. We would expect the "domain path=" matching to say that those
should only be used if they already have recognisable metadata on them.
To make use of a device with no metadata already present, it would need to
appear at a path for which auto-rebuild was explicitly enabled.
>
> > A newly hotplugged device also has a "path" which we can see
> > in /dev/disk/by-path. This is somehow indicative of a physical location.
> > This path may be the same as the path of a device which was recently
> > removed. It might be one of a set of paths which make up a "RAID chassis".
> > It might be one of a set of paths one which we happen to find other RAID
> > arrays.
>
> Indeed, I would like to be able to declare any
> /dev/disk/by-path/pci-0000:00:1f.2-scsi-[0-4] to be suitable candidates
> for hot-plugging, because those are the 5 motherboard SATA ports I've
> hooked into my hot-swap chassis.
>
> As an aside, I just tried yanking and replugging one of my drives, on
> CentOS 5.4, and it successfully went away and came back again, but
> wasn't automatically re-added, even though the metadata etc was all there.
No. That is because we have not yet implemented anything that has been
described in this document...
Thanks,
NeilBrown
* Re: Auto Rebuild on hot-plug
2010-03-26 6:41 ` linbloke
@ 2010-03-31 1:35 ` Neil Brown
0 siblings, 0 replies; 33+ messages in thread
From: Neil Brown @ 2010-03-31 1:35 UTC (permalink / raw)
To: linbloke; +Cc: linux-raid
On Fri, 26 Mar 2010 17:41:02 +1100
linbloke <linbloke@fastmail.fm> wrote:
> Hi Neil,
>
> I look forward to being able to update my mdadm.conf with the paths to
> devices that are important to my RAID so that if a fault were to develop
> on an array, then I'd be really happy to fail and remove the faulty
> device, insert a blank device of sufficient size into the defined path
> and have the RAID auto restore. If the disk is not blank or too small,
> provide a useful error message (insert disk of larger capacity, delete
> partitions, zero superblocks) and exit. I think you do an amazing job
> and it worries me that you and the other contributors to mdadm could
> spend your valuable time trying to solve problems about how to cater for
> every metadata, partition type etc when a simple blank device is easy to
> achieve and could then "Auto Rebuild on hot-plug".
:-)
On the one hand, we should always look beyond the immediate problem we are
trying to solve in order to see the big picture and make sure the solution we
choose doesn't cut us off from solving other more general problems when they
arrive.
On the other hand, we don't want to expand the scope so much that we end up
biting off more than we can chew.
A general design with a specific implementation is probably a good target....
Thanks,
NeilBrown
>
> Perhaps as we nominate a spare disk, we could nominate a spare path. I'm
> certainly no expert and my use case is simple (raid 1's and 10's) but it
> seems to me a lot of complexity can be avoided for the sake of a blank disk.
>
> Cheers,
> Josh
>
>
>
* Re: Auto Rebuild on hot-plug
2010-03-26 7:52 ` Majed B.
@ 2010-03-31 1:42 ` Neil Brown
0 siblings, 0 replies; 33+ messages in thread
From: Neil Brown @ 2010-03-31 1:42 UTC (permalink / raw)
To: Majed B.
Cc: Doug Ledford, Dan Williams, Labun, Marcin,
Hawrylewicz Czarnowski, Przemyslaw, Ciechanowski, Ed, linux-raid,
Bill Davidsen
On Fri, 26 Mar 2010 10:52:07 +0300
"Majed B." <majedb@gmail.com> wrote:
> Why not treat this similar to how hardware RAID manages disks & spares?
> Disk has no metadata -> new -> use as spare.
> Disk has metadata -> array exists -> add to array.
> Disk has metadata -> array doesn't exist (disk came from another
> system) -> sit idle & wait for an admin to do the work.
That would certainly be a sensible configuration to support.
>
> As to identify disks and know which disks were removed and put back to
> an array, there's the metadata & there's the disk's serial number
> which can be obtained using hdparm. I also think that all disks now
> include a World Wide Name (WWN) which is more suitable for use in
> this case than a disk's serial number.
>
> Some people rant because they see things only from their own
> perspective and assume that there's no case or scenario but their own.
> So don't pay too much attention :p
>
> Here's a scenario: What if I had an existing RAID1 array of 3 disks. I
> bought a new disk and I wanted to make a new array in the system. So I
> add the new disk, and I want to use one of the RAID1 array disks in
> this new array.
>
> Being lazy, instead of failing the disk then removing it using the
> console, I just removed it from the port then added it again. I
> certainly don't want mdadm to start resyncing, forcing me to wait!
Lazy people often do cause themselves more work in the long run; there is
nothing much we can do about that. Take it up with Murphy.
>
> As you can see in this scenario, it includes the situation where an
> admin is a lazy bum who is going to use the command line anyway to
> make the new array but didn't bother to properly remove the disk he
> wanted. And there's the case of the newly added disk.
>
> Why assume things & guess when an admin should know what to do?
> I certainly don't want to risk my arrays in mdadm guessing for me. And
> keep one thing in mind: How often do people interact with storage
> systems?
>
> If I configure mdadm today, the next I may want to add or replace a
> disk would be a year later. I certainly would have forgotten whatever
> configuration was there! And depending on the situation I have, I
> certainly wouldn't want mdadm to guess.
That is a point worth considering. Where possible we should discourage
configurations that would be 'surprising'.
Unfortunately a thing that is surprising to one person in one situation may
be completely obvious to someone else in a different situation.
Thanks,
NeilBrown
* Re: Auto Rebuild on hot-plug
2010-03-31 1:18 ` Neil Brown
@ 2010-03-31 2:46 ` Michael Evans
0 siblings, 0 replies; 33+ messages in thread
From: Michael Evans @ 2010-03-31 2:46 UTC (permalink / raw)
To: Neil Brown
Cc: Doug Ledford, Dan Williams, Labun, Marcin,
Hawrylewicz Czarnowski, Przemyslaw, Ciechanowski, Ed, linux-raid,
Bill Davidsen
On Tue, Mar 30, 2010 at 6:18 PM, Neil Brown <neilb@suse.de> wrote:
> On Wed, 24 Mar 2010 19:47:59 -0700
> Michael Evans <mjevans1983@gmail.com> wrote:
>
>
>
>> I believe that the default action should be to do /nothing/. That is
>> the only safe thing to do. If an administrative framework is desired
>> that seems to fall under a larger project goal which is likely better
>> covered by programs more aware of the overall system state. This
>> route also allows for a range of scalability.
>
> I agree that /nothing/ should be the default action for a device with
> unrecognised content.
> If the content of the device is recognised, it is OK to have a default which
> does what the content implies - i.e. build the device into an array.
> But maybe that is what you meant.
>
> I think there is useful stuff that can be done entirely inside mdadm but it
> is worth thinking about where to draw the line. I'm not convinced that mdadm
> should "know" about partition tables and MBRs. Possible the task of copying
> those is best placed in a script.
>
> Thanks,
> NeilBrown
>
My larger context was looking at non-recognized devices; assembling
pre-marked containers is fine, with the provision that basic
safety checks validate that outcome: is the uuid correct, does the
home-host match the current array, is the update count valid (or else
add it as a prior stale member that should be marked as a hot spare)?
For anything else mdadm might be better off taking the approach that
an administratively selected set of actions should be performed. If
the task is JUST doing stuff that mdadm would already be invoked to do
anyway then it is tolerable for those reactions to be configurable
within the .conf file, though I fear the syntax may be uglier than
assuming there's also at least a basic /bin/sh that could interpret a
set of more standard commands. It would also provide a good example
to extend into custom scripts.
Another advantage of using a shell script instead is that
administrators can hack in whatever tricks they want. If they have a
partition tool or method they like they can script it and get the
results they want. More complicated tricks could also be performed,
such as first preparing the disc for cryptographic storage by filling
it with random data, or performing SMART checks, or any other
operation of their choice.
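For example, a site hook along these lines (the hook name and the calling
convention are hypothetical; the commands themselves are stock
smartmontools/coreutils/mdadm):
  #!/bin/sh
  DEV=$1
  smartctl -H "$DEV" || exit 1       # refuse a disk that fails its health check
  dd if=/dev/urandom of="$DEV" bs=1M # pre-fill before use under encryption
                                     # (dd stops with an error at end of device)
  mdadm --incremental "$DEV"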
Alternatively, if an administrator or device maker needs something
different they could produce a binary to run instead.
* Re: Auto Rebuild on hot-plug
2010-03-30 15:23 ` Doug Ledford
2010-03-30 17:47 ` Labun, Marcin
2010-03-30 23:36 ` Dan Williams
@ 2010-03-31 4:53 ` Neil Brown
2 siblings, 0 replies; 33+ messages in thread
From: Neil Brown @ 2010-03-31 4:53 UTC (permalink / raw)
To: Doug Ledford
Cc: Dan Williams, Labun, Marcin, Hawrylewicz Czarnowski, Przemyslaw,
Ciechanowski, Ed, linux-raid@vger.kernel.org, Bill Davidsen
On Tue, 30 Mar 2010 11:23:08 -0400
Doug Ledford <dledford@redhat.com> wrote:
> As far as I can tell, we've reached a fairly decent consensus on things.
> But, just to be clear, I'll reiterate that consensus here:
>
> Add a new linetype: DOMAIN with options path= (must be specified at
> least once for any domain action other than none and incremental and
> must be something other than a global match for any action other than
> none and incremental) and metadata= (specifies the metadata type
> possible for this domain as one of imsm/ddf/md, and where for imsm or
> ddf types, we will verify that the path portions of the domain do not
> violate possible platform limitations) and action= (where action is
> none, incremental, readd, safe_use, force_use where action is specific
> to a hotplug when a degraded array in the domain exists and can possibly
> have slightly different meanings depending on whether the path specifies
> a whole disk device or specific partitions on a range of devices, and
> where there is the possibility of adding more options or a new option
> name for the case of adding a hotplug drive to a domain where no arrays
> are degraded, in which case issues such as boot sectors, partition
> tables, hot spare versus grow, etc. must be addressed).
>
> Modify udev rules files to cover the following scenarios (it's
> unfortunate that we have to split things up like this, but in order to
> deal with either bare drives or drives that have things like lvm data
> and we are using force_use, we must trigger on *all* drive hotplug
> events, we must trigger early, and we must override other subsystem's
> possible hotplug actions, otherwise the force_use option will be a noop):
>
> 1) plugging in a device that already has md raid metadata present
> a) if the device has metadata corresponding to one of our arrays,
> attempt to do normal incremental add
> b) if the device has metadata corresponding to one of our arrays, and
> the normal add failed and the options readd, safe_use, or force_use are
> present in the mdadm.conf file, reattempt to add using readd
> c) if the device has metadata corresponding to one of our arrays, and
> the readd failed, and the options safe_use or force_use are present,
> then do a regular add of the device to the array (possibly with doing a
> preemptive zero-superblock on the device we are adding). This should
> never fail.
> d) if the device has metadata that does not correspond to any array
> in the system, and there is a degraded array, and the option force_use
> is present, then quite possibly repartition the device to make the
> partitions match the degraded devices, zero any superblocks, and add the
> device to the arrays. BIG FAT WARNING: the force_use option will cause
> you to lose data if you plug in an array disk from another machine while
> this machine has degraded arrays.
>
> 2) plugging in a device that doesn't already have md raid metadata
> present but is part of an md domain
> a) if the device is bare and the option safe_use is present and we
> have degraded arrays, partition the device (if needed) and then add
> partitions to degraded arrays
> b) if the device is not bare, and the option force_use is present and
> we have degraded arrays, (re)partition the device (if needed) and then
> add partitions to degraded arrays. BIG FAT WARNING: if you enable this
> mode, and you hotplug say an LVM volume into your domain when you have a
> degraded array, kiss your LVM volume goodbye.
>
> Modify udev rules files to deal with device removal. Specifically, we
> need to watch for removal of devices that are part of raid arrays and if
> they weren't failed when they were removed, fail them, and then remove
> them from the array. This is necessary for readd to work. It also
> releases our hold on the scsi device so it can be fully released and the
> new device can be added back using the same device name.
>
> Modify mdadm -I mode to read the mdadm.conf file for the DOMAIN lines on
> hotplug events and then modify the -I behavior to suit the situation.
> The majority of the hotplug changes mentioned above will actually be
> implemented as part of mdadm -I, we will simply add a few rules to call
> mdadm -I in a few new situations, then allow mdadm -I (which has
> unlimited smarts, whereas udev rules get very convoluted very quickly
> if you try to make them smart) to actually make the decisions and do the
> right thing. This means that effectively, we might just end up calling
> mdadm -I on every disk hot plug event whether there is md metadata or
> not, but only doing special things when the various conditions above are
> met.
>
> Modify mdadm and the spare-group concept of ARRAY lines to coordinate
> spare-group assignments and DOMAIN assignments. We need to know what to
> do in the event of a conflict between the two. My guess is that this is
> unlikely, but in the end, I think we need to phase out spare-group
> entirely in favor of domain. Since we can't have a conflict without
> someone adding domain lines to the config file, I would suggest that the
> domain assignments override spare-group assignments and we complain
> about the conflict. That way, even though the user obviously intended
> something specific with spare-group, he also must have intended
> something specific with domain assignments, and as the domain keyword is
> the newer of the two, honor the latest wishes and warn about it
> in case they mis-entered something.
>
> Modify mdadm/mdmon to enable spare migration between imsm containers in
> a domain. Retain mdadm's ability to move hot spares between native
> arrays, but base it on domain now instead of spare-group; if the config
> has spare-group assignments and no domain assignments, create internal
> domain entries that mimic the spare-group layout, so that the core spare
> movement code only needs to be concerned with domain entries.
>
> I think that covers it. Do we have a consensus on the general work?
> Your thoughts Neil?
>
Thoughts ... yes ... all over the place. I won't try to group them, just a
random list:
"bare devices"
To make sure we are on the same page, we should have a definition for this.
How about "every byte in the first megabyte and last megabyte of the device
is the same (e.g. 0x00 or 0x5a of 0xff) ??
We would want a program (mdadm option?) to be able to make a device into a
bare device.
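As a sketch of what such a tool would boil down to (assuming 0x00 is the
chosen fill byte and /dev/sdX is a placeholder):

  DEV=/dev/sdX
  SECTORS=$(blockdev --getsz "$DEV")              # size in 512-byte sectors
  # zero the first and last megabyte so the device meets the definition above
  dd if=/dev/zero of="$DEV" bs=512 count=2048
  dd if=/dev/zero of="$DEV" bs=512 count=2048 seek=$(( SECTORS - 2048 ))

Checking for bareness would be the mirror image: read the same two regions
and verify they contain a single repeated byte.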
Dan's "--activate-domains" option which creates a targeted udev rules file for
"force_use" - I first I though "yuck, no", but then it grew on me. I think
I quite like the idea now. We can put a rules file in /dev/.udev/rules.d/
which targets just the path that we want to over-ride.
I can see two approaches:
1/ create the file during boot with something like "mdadm --activate-domains"
2/ create a file whenever a device in an md-array is hot-removed which
targets just that path and claims it immediately for md.
Removing these after a timeout would be needed.
The second feels elegant but could be racy. The first is probably the
better approach.
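For the boot-time variant, the generated file need not be big; something like
(file name, path and match value purely illustrative - /dev/.udev/rules.d/
being udev's runtime rules directory at the moment):

  # /dev/.udev/rules.d/01-md-domain.rules, written by "mdadm --activate-domains"
  # Claim devices appearing on this one path for md before anything else gets
  # a chance to act on them (suppressing other subsystems' rules is glossed
  # over here).
  ACTION=="add", SUBSYSTEM=="block", ENV{ID_PATH}=="pci-0000:00:1f.2-scsi-3:0:0:0", RUN+="/sbin/mdadm -I $env{DEVNAME}"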
Your idea of only performing an action if there is a degraded array doesn't
seem quite right.
If I have a set of ports dedicated to raid and I plug in a bare device,
I want it to become a hot-spare whether there are degraded arrays that
will use it immediately or not.
You say that making it a hot spare doesn't actually "do" anything, but it
does: it makes it available for recovery.
If a device fails and then I plug in a spare, I want it to recover - so do you.
If I plug in a spare and then a device fails, I want it to recover, but it
seems you don't. I cannot reconcile that difference.
Yes, the admin might want to grow the array, but that is fine: the spare
is ready to be used for growth, or to replace a failure, or whatever is
needed.
Native metadata: on partitions or on whole device.
We need to make sure we understand the distinction between these two
scenarios.
If a whole-device array is present, we probably include the device in that
array, writing metadata if necessary. Maybe we also copy everything between
the start of the device and the data_offset in case something useful was
placed there.
If partitions are present then we probably want to call out to a script
which is given the name of the new device and the names of the other
devices present in the domain. A supplied script would copy the partition
table across if the existing devices all shared the same layout, and make
sure the boot block was copied as well.
If this created new partitions, that would be a new set of hot-plug
events(?).
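A rough sketch of what such a supplied script could do (MBR assumed, device
names are placeholders, and the script name is made up):

  # copy-layout NEWDEV REFDEV - replicate partitioning from a healthy member
  NEW=$1; REF=$2
  sfdisk -d "$REF" | sfdisk "$NEW"       # copy the partition table
  dd if="$REF" of="$NEW" bs=446 count=1  # copy the MBR boot code (before the table)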
I think this is a situation where we at first want to only support very
simple configurations where all devices are partitioned the same and all
arrays are across a set of aligned partitions (/dev/sd?2), but allow
script writers to do whatever they like.
Maybe the 'action' word could be a script (did you suggest that already?)
if it doesn't match a builtin action.
This relates to the issue John raised about domains that identify
partitions and tie actions to partitions. You seem to have a difficulty
with that which I don't understand yet.
If the whole-device action results in partitions being created then maybe
it ends there. The partitions will appear via hot-plug and be acted on
accordingly.
Each partition might inherit an action from the whole-device, or might have
its own explicit action.
Here is a related question: do we want the spare migration functionality
to be able to re-partition a device? I think not. I think we need to
assume that groups of related devices have the same partitioning, at least
in the first implementation.
Multiple metadatas in a domain
I think this should be supported as far as it makes sense.
I'm not sure migrating spares between different metadata types makes a lot of
sense, at least as a default. Does it?
When a bare device is plugged into a domain we need to know what sort of
metadata to use on it (imsm, ddf, 0.90, 1.x, partitioned ?). That would
be one use for having a metadata= tag on a domain line.
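i.e. something along the lines of (syntax illustrative, since the DOMAIN
keyword is still being designed):

  DOMAIN path=pci-0000:00:1f.2-scsi-[0-4]:0:0:0 metadata=imsm action=include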
Overlapping / nested domains
Does this make sense? Should it be allowed?
Your suggestion of a top-level wildcard domain suggests that domains
can be nested, and that feels like the right thing to do, though there
aren't very many cases where you would want to specify different values at
different levels of the nesting. Maybe just two levels? Or at least three for
global / devices / partitions (if we allow domains to contain partitions).
But can domains overlap without nesting? If they did you would need a strict
override policy: which action overrides which.
I cannot think of a use-case for this and think we should probably
explicitly disallow it.
Patterns
Are domain patterns globs or regexps or prefixes or some combination?
Prefix-only meets a lot of needs, but a prefix for a whole device would also
match its partitions.
Globs are quite good with fixed-size fields, which is largely what we have
with paths, but you cannot express multi-digit numerical ranges (e.g. 08..12).
Regex is more than I want, I think.
You can do numerical ranges with multiple domains:
DOMAIN path=foo-0[89]-bar action=thing
DOMAIN path=foo-1[012]-bar action=thing
That would say something about the concept of 'domain boundaries', as there
should be no sense that there is a boundary between these two.
Which leads me to...
spare-group
I don't think I agree with your approach to spare-groups.
You seem to be tying them back to domains. I think I want domains to
(optionally) provide spare-group tags.
A spare-group (currently) is a label attached to some arrays (containers, not
members, when that makes a difference) which by implication is attached to
every device in the array. Sometimes these are whole devices, sometimes
partitions.
A spare device tagged for a particular spare-group can be moved to any
array tagged with the same spare-group.
I see domains as (optionally) adding spare-group tags to a set of devices
and, by implication, any array which contains those devices.
If a domain implicitly defined a new spare-group, that would reduce your
flexibility for defining domains using globs, as noted above.
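To illustrate the tagging idea (syntax purely illustrative), a line like

  DOMAIN path=pci-0000:00:1f.2-scsi-* spare-group=left-shelf

would tag every device on those paths, and by implication every array
containing one of them, with the spare-group "left-shelf".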
So a device can receive a spare-group from multiple sources. I'm not sure
how they interact. It certainly could work for a device to be in multiple
spare-groups.
Domains don't implicitly define a spare-group (though a 'platform' domain
might, I guess). Though Dan's idea of encoding 'platform' requirements
explicitly by having mdadm generate them for inclusion in mdadm.conf might
work.
What has this to do with domain boundaries? I don't think such a concept
should exist. The domain line associates a bunch of tags with a bunch of
devices but does not imply a line between those inside and those outside.
Where such a line is needed, it comes from devices sharing a particular
tag, or not.
So the set of all devices that have a particular spare-group tag form a set
for the purposes of spare migration and devices cannot leave or join that
set.
I'm not sure how the spare-group for a domain translates to partitions on
devices in that domain. Maybe they get -partN appended. Maybe
you need a
DOMAIN path=foo*-partX spare-group=bar
to assign spare-groups to partitions.
I think I would support both.
That's all I can think of for now.
In summary: I think there is lots of agreement, but there are still a few
details that need to be ironed out.
Thanks,
NeilBrown
* Re: Auto Rebuild on hot-plug
2010-03-31 1:26 ` Neil Brown
@ 2010-03-31 6:10 ` Luca Berra
0 siblings, 0 replies; 33+ messages in thread
From: Luca Berra @ 2010-03-31 6:10 UTC (permalink / raw)
To: Neil Brown; +Cc: linux-raid
On Wed, Mar 31, 2010 at 12:26:22PM +1100, Neil Brown wrote:
>On Thu, 25 Mar 2010 09:01:08 +0100
>Luca Berra <bluca@comedia.it> wrote:
>
>> On Thu, Mar 25, 2010 at 11:35:43AM +1100, Neil Brown wrote:
>> >
>> > http://blogs.techrepublic.com.com/opensource/?p=1368
>> >
>> > The most significant thing I got from this was a complain in the comments
>> > that managing md raid was too complex and hence error-prone.
>>
>> well, i would not be upset by j. random jerk complaining in a blog's
>> comments; as soon as you make it one click you will find another one
>> that complains because it is not his favourite colour :P
>
>We can learn something from any opinion that differs from our own.
yes, i realize my comment was rude, sorry for that, but that comment on
the blog was imho just a troll.
>It is clear to me that using mdadm requires a certain level of understanding
>to be used effectively and safely.
>I don't think that can be entirely addressed in mdadm: there is a place for a
>higher level framework that encodes policies and gives advice. But there is
I fully agree.
>still room to improve mdadm to make it more powerful, more informative, and
>more forgiving.
until it can read mail ;)
>>
>> > I see the issue as breaking down in to two parts.
>> > 1/ When a device is hot plugged into the system, is md allowed to use it as
>> > a spare for recovery?
>> > 2/ If md has a spare device, what set of arrays can it be used in if needed.
>> >
>> > A typical hot plug event will need to address both of these questions in
>> > turn before recovery actually starts.
>> >
>> > Part 1.
>> >
>> > A newly hotplugged device may have metadata for RAID (0.90, 1.x, IMSM, DDF,
>> > other vendor metadata) or LVM or a filesystem. It might have a partition
>> > table which could be subordinate to or super-ordinate to other metadata.
>> > (i.e. RAID in partitions, or partitions in RAID). The metadata may or may
>> > not be stale. It may or may not match - either strongly or weakly -
>> > metadata on devices in currently active arrays.
>> also the newly hotplugged device may have _data_ on it.
>>
>
>You mean completely raw data, no partitions, no filesystem structure etc?
>Yes, that is possible. People who are likely to handle devices like that
>would choose more conservative configurations.
I can think of two scenarios:
1) an encrypted device (without a LUKS header)
2) a device where the metadata is corrupted, and we plugged it in a
hurry to attempt data recovery (oh, we were in a hurry and forgot about
the mdadm policy)
What I am scared of is distributions thinking it would be cool and
making a not-strictly-conservative policy the default.
>> > Some how from all of that information we need to decide if md can use the
>> > device without asking, or possibly with a simple yes/no question, and we
>> > need to decide what to actually do with the device.
>> how does the yes/no question part work?
>
>I imagine an Email to the admin "Hey boss, I just noticed you plugged in a
>drive that looks like it used to be part of some array. We need a spare on
>this other array and the new device is big enough. Shall I huh huh huh? Go
>on let me..."
>
>Then the admin can choose to run the command "make it so", or not.
ah. ok, i thought you meant something real-time.
thinking about it, maybe something could be done using dbus...
>> we can also make /usr/bin/md-create-spare ...
>
>Yes, there is a place for something like that certainly.
>
>NeilBrown
--
Luca Berra -- bluca@comedia.it
Communication Media & Services S.r.l.
/"\
\ / ASCII RIBBON CAMPAIGN
X AGAINST HTML MAIL
/ \
* Re: Auto Rebuild on hot-plug
2010-03-30 15:53 ` Doug Ledford
@ 2010-04-02 11:01 ` John Robinson
0 siblings, 0 replies; 33+ messages in thread
From: John Robinson @ 2010-04-02 11:01 UTC (permalink / raw)
To: Doug Ledford
Cc: Dan Williams, Labun, Marcin, Neil Brown,
Hawrylewicz Czarnowski, Przemyslaw, Ciechanowski, Ed,
linux-raid@vger.kernel.org, Bill Davidsen
On 30/03/2010 16:53, Doug Ledford wrote:
> On 03/30/2010 08:10 AM, John Robinson wrote:
[...]
>> I wouldn't want to take the server down to shuffle the drives or cables.
>> But my point really is that if I have decided that I would want all the
>> drives in my chassis to have identical partition tables and carry an
>> active mirror of an array - in my example /boot - I would like to be
>> able to configure the hotplug arrangement to make it so, rather than
>> leaving me to have to manually regenerate the partition table, install
>> grub, add the spare and perhaps even grow the array.
>
> I can (sorta) understand this. I personally never create any more /boot
> partitions than the number of drives I can lose from my / array + 1.
> So, if I have a raid5 / array, I do 2 /boot partitions. Anything more is
> a waste since if you lose both of those boot drives, you also have too
> few drives for the / array.
A very fair point. But it's not really all that wasteful - I've had to
use the first 100MB from at least two drives, meaning that space would
effectively go to waste on the others. And 100MB out of 1TB isn't an
awfully big waste anyway.
[...]
>> This might mean that I end up something like the following:
>> DOMAIN path=pci-0000:00:1f.2-scsi-[0-4]:0:0:0 action=include
>> DOMAIN path=pci-0000:00:1f.2-scsi-[0-4]:0:0:0-part1 action=grow
>> DOMAIN path=pci-0000:00:1f.2-scsi-[0-4]:0:0:0-part2 action=replace
>> DOMAIN path=pci-0000:00:1f.2-scsi-[0-4]:0:0:0-part3 action=include
>
> This I'm not so sure about. I can try to make this a reality, but the
> issue here is that when you are allowed to specify things on a partition
> by partition basis, it becomes very easy to create conflicting commands.
> For example, let's say you have part1 action=grow, but for the bare disk
> you have action=incremental. And let's assume you plug in a bare disk.
> In order to honor the part1 action=grow, we would have to partition the
> disk, which is in conflict with the bare disk action of incremental
> since that implies we would only use preexisting md raid partitions.
Yes, but in that case I've given specific instructions about what to do
with bare drives. It'd be a bad configuration, and you might warn about
it, but you couldn't honour the grow. Bear in mind, the two domain lines
here don't overlap. If they did you'd have more of a quandary, or at least
you should shout louder about it. I don't think you should be writing
partition tables unless I've told you to - which I would have done in
the following more general case:
DOMAIN path=pci-0000:00:1f.2-scsi-[0-4]* action=replace
> I
> could *easily* see the feature of allowing per partition actions causing
> the overall code complexity to double or more.
I'm not sure why, since you probably ought to be doing some fairly
rigorous checking of the configuration anyway to make sure domains and
actions don't overlap or conflict.
> You know, I'd rather
> provide a simple grub script that automatically sets up all raid1 members
> as boot devices any time it is run than try to handle this
> automatically ;-) Maybe I should add that to the mdadm package on
> x86/x86_64, something like mdadm-grub-install or similar.
That would be fine too, as long as there's some way of calling it from
the hotplug environment.
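Presumably something along these lines would do (just a sketch: it assumes
grub legacy, that /dev/md0 is the raid1 /boot array, and simple sdXN names):

  # install the boot loader on every current member of the /boot raid1
  for part in $(mdadm --detail /dev/md0 | awk '/active sync/ {print $NF}'); do
      disk=${part%%[0-9]*}       # crude: /dev/sda1 -> /dev/sda
      grub-install "$disk"
  done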
> As pointed out above, some of these are conflicting commands in that
> they tell us to modify the disk in one place, and leave it alone in
> another.
If the paths overlapped I'd agree, but they didn't, and I made sure the
whole-drive action was sufficient to make sure the partition actions
could be carried out. I agree though that there's plenty of scope for
people writing duff configurations like the one you suggested, but I
think there'll be scope for that whatever you do - even if it's
foolproof, it may not be damn-fool-proof.
> The basic assumption you are making here is that we will
> always be able to duplicate the partition table because all drives in a
> domain will have the same partition table. And that's not always the case.
It might be a reasonable restriction for a first implementation, though.
If not, you're going to have to store copies of the partition tables,
boot areas, etc somewhere else so that when the drives they were on are
hot-swapped, you can write the correct stuff back.
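Stashing that away while the drives are healthy could be as simple as
(a sketch; the paths and the choice of sfdisk are assumptions):

  # save each member's partition table so it can be restored onto a replacement
  mkdir -p /var/lib/md-layouts
  for d in /dev/sd[abcd]; do
      sfdisk -d "$d" > /var/lib/md-layouts/$(basename "$d").dump
  done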
[...]
>> Sorry if I'm being long-winded, but hopefully you can see how I'd like
>> to be able to configure things. In the first instance, though, just
>> getting as far as the replace option would be great.
>
> I see where you are going; I'm a little worried about getting there ;-)
I don't blame you. Isn't it just typical of a user who doesn't
understand the work involved to demand the sky and the stars? Anyway
thank you very much for taking the time to consider my thoughts.
Cheers,
John.