* More Hot Unplug/Plug work
@ 2010-04-27 16:45 Doug Ledford
2010-04-27 19:41 ` Christian Gatzemeier
` (3 more replies)
0 siblings, 4 replies; 23+ messages in thread
From: Doug Ledford @ 2010-04-27 16:45 UTC (permalink / raw)
To: Linux RAID Mailing List, Neil Brown, Labun, Marcin, Dan Williams
So I pulled down Neil's git repo and started working from his hotunplug
branch, which was his version of my hotunplug patch. I had to make a
couple of minor fixes to get it working, and then simply continued on
from there. I have a branch in my git repo that tracks his hotunplug
branch and is also called hotunplug; that's where my current work is.
What I've done since then:
1) I've implemented a new config file line type: DOMAIN
a) Each DOMAIN line must have at least one valid path= entry, and may
have more than one. path= entries are file globs and must match
something in /dev/disk/by-path.
b) Each DOMAIN line must have one and only one action= entry. Valid
actions are: ignore, incremental, spare, grow, partition.
In addition, an action may be prefixed with force- to indicate that
we should skip certain safety checks and use the device even if it
isn't clean.
c) Each DOMAIN line may have a metadata entry, and may have a
spare-group entry.
d) For the partition action, a DOMAIN line must also have a program=
and a table= entry. Currently, the program= entry must be an item
from a list of known partition programs (I'm working on getting
sfdisk up and running, but for arches other than x86 other
methods would be needed, and I'm planning on adding a method
that allows us to call out to a user-supplied script/program
instead of a known internal method). The table= entry points to
a file containing a method-specific table that describes the
required partition layout. As mentioned in previous mails, we
only support identical partition tables at this point; that
may never change. An example is shown below.
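To make that concrete, a pair of DOMAIN lines using this syntax could
look something like the following (the by-path globs cover four ports
on a single controller on my test box; the contents of the table file
are hypothetical):

  DOMAIN path=pci-0000:00:1f.2-scsi-[2345]:0:0:0 action=partition table=/etc/mdadm.table program=sfdisk
  DOMAIN path=pci-0000:00:1f.2-scsi-[2345]:0:0:0-part? action=spare

That is: partition any bare disk that shows up on those ports using
the layout in /etc/mdadm.table, then add the resulting partitions as
spares. For the sfdisk method the table file would just hold
sfdisk-style input, e.g. a single whole-disk partition of type fd
(again, purely illustrative):

  ,,fd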
2) Created a new udev rules file that gets installed as
05-md-early.rules. This rule file, combined with our existing rules
file, is a key element of how this domain support works. In particular,
udev rules allow us to separate out devices that already have some sort
of raid superblock from devices that don't. We then add a new flag to
our incremental mode to indicate that a device currently does not belong
to us, and we perform a series of checks to see if it should, and if so,
we "grab" it (I would have preferred a better name, but the short
options for better names were already taken). When called with the
"grab" flag, we follow a different code path where we check the domain
of the device against our DOMAIN entries and if we have a match, we
perform the specified action. There will need to be some additional
work to catch certain corner cases, such as when the action is
force-partition and the inserted disk already has a raid superblock
on the bare drive; we will currently miss that situation and not
grab the device. So this is a work in progress and not yet complete.
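To give a feel for how the udev split works, the logic is roughly the
following (this is only a sketch, not the actual 05-md-early.rules
from the branch; the ID_FS_TYPE values, the assumption that blkid data
has already been imported, and the "--grab" spelling of the new
incremental flag are all illustrative):

  SUBSYSTEM!="block", GOTO="md_grab_end"
  ACTION!="add", GOTO="md_grab_end"
  # devices that already carry a raid superblock take the normal
  # incremental path via the existing md rules file
  ENV{ID_FS_TYPE}=="linux_raid_member|isw_raid_member|ddf_raid_member", GOTO="md_grab_end"
  # bare devices: offer them to incremental mode, which checks them
  # against the DOMAIN entries and grabs them if appropriate
  RUN+="/sbin/mdadm -I --grab $env{DEVNAME}"
  LABEL="md_grab_end"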
3) Added IncrementalNew, IncrementalNewPart, and IncrementalNewDisk to
Incremental.c. IncrementalNew is called any time incremental mode is
passed the "grab" flag. In IncrementalNew we merely figure out whether
the device matches a domain entry; if it does, we check whether that
domain entry is a partition entry. If it is, we pass the device off to
IncrementalNewDisk; if not, we pass it off to IncrementalNewPart
(hmmm...renaming of these functions might be in order...it made sense
to me because I was thinking "this is what we do on a bare disk, this
is what we do on a partition", but calling NewDisk on partition
domains and NewPart on everything else does seem backwards).
These functions are partial stubs at the moment. They do some sanity
checks, but not any real work. However, they aren't intended to do a
lot of work themselves. They are intended to figure out if they should
do work, then simply invoke Managedevs with the 'a' flag to do the
actual work. And if the method is grow, then we will call Managedevs
with the 'a' disposition, then call Grow with the right options to do
the right thing. The point is that all the code necessary to
automatically use a device already exists; we just have to invoke it
automatically instead of requiring a user to invoke it from the command
line. I'm very big on reusing that existing code and not trying to
duplicate it here.
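In pseudo-code, the dispatch described above amounts to something like
this (an illustrative sketch only -- the argument lists, struct fields,
and helper names are placeholders, not the actual code from the
hotunplug branch):

  #include <stddef.h>

  enum domain_action {
          act_ignore, act_incremental, act_spare, act_grow, act_partition
  };

  struct domain_ent {
          enum domain_action action;
          /* path globs, metadata, spare-group, ... elided */
  };

  /* assumed to exist elsewhere; declarations shown for completeness */
  struct domain_ent *domain_for_device(const char *devname);
  int IncrementalNewDisk(const char *devname, struct domain_ent *dom);
  int IncrementalNewPart(const char *devname, struct domain_ent *dom);

  int IncrementalNew(const char *devname)
  {
          struct domain_ent *dom = domain_for_device(devname);

          if (!dom)
                  return 1;       /* not in any DOMAIN -- leave it alone */

          if (dom->action == act_partition)
                  /* bare-disk domain: partition it per table=/program=,
                   * then the new partitions come back through udev */
                  return IncrementalNewDisk(devname, dom);

          /* spare/grow/etc.: sanity check, then reuse the existing
           * Managedevs/Grow code with the 'a' disposition */
          return IncrementalNewPart(devname, dom);
  }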
So, that's where things stand right now. I'm going to keep working on
this as it's incomplete and doesn't actually do any work at the moment
(it's all sanity checks, config file parsing, and infrastructure; the
actual actions are not yet implemented), but I wanted to get out what I
have currently for people to see. You can check it out here:
git://git.fedorapeople.org/~dledford/mdadm.git hotunplug
Comments welcome.
--
Doug Ledford <dledford@redhat.com>
GPG KeyID: CFBFF194
http://people.redhat.com/dledford
Infiniband specific RPMs available at
http://people.redhat.com/dledford/Infiniband
* Re: More Hot Unplug/Plug work
@ 2010-04-27 19:41 ` Christian Gatzemeier
From: Christian Gatzemeier @ 2010-04-27 19:41 UTC (permalink / raw)
To: linux-raid
Doug Ledford <dledford <at> redhat.com> writes:
> we "grab" it (I would have preferred a better name, but the short
> options for better names were already taken).
Ah well, I don't know what may be taken, a quick rundown of thoughts
though:
incremental: prep, enlist, use, new <device>
* RE: More Hot Unplug/Plug work 2010-04-27 16:45 More Hot Unplug/Plug work Doug Ledford 2010-04-27 19:41 ` Christian Gatzemeier @ 2010-04-28 16:08 ` Labun, Marcin 2010-04-28 17:47 ` Doug Ledford 2010-04-29 20:32 ` Dan Williams 2010-04-29 21:22 ` Dan Williams 3 siblings, 1 reply; 23+ messages in thread From: Labun, Marcin @ 2010-04-28 16:08 UTC (permalink / raw) To: Doug Ledford, Neil Brown Cc: Linux RAID Mailing List, Williams, Dan J, Ciechanowski, Ed, Hawrylewicz Czarnowski, Przemyslaw > -----Original Message----- > From: Doug Ledford [mailto:dledford@redhat.com] > Sent: Tuesday, April 27, 2010 6:45 PM > To: Linux RAID Mailing List; Neil Brown; Labun, Marcin; Williams, Dan J > Subject: More Hot Unplug/Plug work > > So I pulled down Neil's git repo and started working from his hotunplug > branch, which was his version of my hotunplug patch. I had to do a > couple minor fixes to it to make it work. I then simply continued on > from there. I have a branch in my git repo that tracks his hotunplug > branch and is also called hotunplug. That's where my current work is > at. > > What I've done since then: > > 1) I've implemented a new config file line type: DOMAIN > a) Each DOMAIN line must have at least one valid path= entry, but > may > have more than one path= entry. path= entries are file globs and > must match something in /dev/disk/by-path DOMAIN is defined per container or raid volume for native metadata. Each DOMAIN can have more than one path, so actually list of path define if a given disk belongs to domain or not. Do you plan to allow for the same path to be assigned to different containers (so path is shared between domains)? If so the domains will have some or all paths overlapped, and some containers will share some paths. Going further, thus causes that a new disk can be potentially grabbed by more than one container (because of shared path). For example: DOMAIN1: path=a path=b path=c DOMAIN2: path=a path=d DOMAIN3: path=d path=c In this example disks from path c can appear in DOMAIN 1 and DOMAIN 3, but not in DOMAIN 2. So, in case of Monitor, sharing a spare device will be per path basis. The same for new disks in hot-plug feature. In your repo domain_ent is a struct that contains domain paths. The function arrays_in_domain returns a list of mdstat entries that are in the same domain as the constituent device name. (so it requires devname and domain as input parameter). In which case two containers will share the same DOMAIN? It seems that this function shall return a list of mdstat entries that share a path to which a devname device belongs. So, a given new device is tried to be grabbed by a list of a containers (or native volumes). Can you send a config file example? Marcin Labun ^ permalink raw reply [flat|nested] 23+ messages in thread
* Re: More Hot Unplug/Plug work 2010-04-28 16:08 ` Labun, Marcin @ 2010-04-28 17:47 ` Doug Ledford 2010-04-28 18:34 ` Labun, Marcin 2010-04-28 20:59 ` Luca Berra 0 siblings, 2 replies; 23+ messages in thread From: Doug Ledford @ 2010-04-28 17:47 UTC (permalink / raw) To: Labun, Marcin Cc: Neil Brown, Linux RAID Mailing List, Williams, Dan J, Ciechanowski, Ed, Hawrylewicz Czarnowski, Przemyslaw [-- Attachment #1: Type: text/plain, Size: 5107 bytes --] On 04/28/2010 12:08 PM, Labun, Marcin wrote: > > >> -----Original Message----- >> From: Doug Ledford [mailto:dledford@redhat.com] >> Sent: Tuesday, April 27, 2010 6:45 PM >> To: Linux RAID Mailing List; Neil Brown; Labun, Marcin; Williams, Dan J >> Subject: More Hot Unplug/Plug work >> >> So I pulled down Neil's git repo and started working from his hotunplug >> branch, which was his version of my hotunplug patch. I had to do a >> couple minor fixes to it to make it work. I then simply continued on >> from there. I have a branch in my git repo that tracks his hotunplug >> branch and is also called hotunplug. That's where my current work is >> at. >> >> What I've done since then: >> >> 1) I've implemented a new config file line type: DOMAIN >> a) Each DOMAIN line must have at least one valid path= entry, but >> may >> have more than one path= entry. path= entries are file globs and >> must match something in /dev/disk/by-path > > DOMAIN is defined per container or raid volume for native metadata. No, a DOMAIN can encompass more than a single volume, array, or container. > Each DOMAIN can have more than one path, so actually list of path define if a given disk belongs to domain or not. Correct. > Do you plan to allow for the same path to be assigned to different containers (so path is shared between domains)? I had planned that a single DOMAIN can encompass multiple containers. So I didn't planned on it a single path being in multiple DOMAINs, but I did plan that a single domain could allow a device to be placed in multiple different containers based upon need. I don't have checks in place to make sure the same path isn't listed in more than one domain, although that would be a next step. > If so the domains will have some or all paths overlapped, and some containers will share some paths. > Going further, thus causes that a new disk can be potentially grabbed by more than one container (because of shared path). > For example: > DOMAIN1: path=a path=b path=c > DOMAIN2: path=a path=d > DOMAIN3: path=d path=c > In this example disks from path c can appear in DOMAIN 1 and DOMAIN 3, but not in DOMAIN 2. What exactly is the use case for overlapping paths in different domains? I'm happy to rework the code to support it if there's a valid use case, but so far my design goal has been to have a path only appear in one domain, and to then perform the appropriate action based upon that domain. So if more than one container array was present in a single DOMAIN entry (lets assume that the domain entry path encompassed all 6 sata ports on a motherboard and therefore covered the entire platform capability of the imsm motherboard bios), then we would add the new drive as a spare to one of the imsm arrays. It's not currently deterministic which one we would add it to, but that would change as the code matures and we would search for a degraded array that we could add it to. Only if there are no degraded arrays would we add it as a spare to one of the arrays (non-deterministic which one). 
If we add it as a spare to one of the arrays, then monitor mode can move that spare around as needed later based upon the spare-group settings. Currently, there is no correlation between spare-group and DOMAIN entries, but that might change. > So, in case of Monitor, sharing a spare device will be per path basis. Currently, monitor mode still uses spare-group for controlling what arrays can share spares. It does not yet check any DOMAIN information. > The same for new disks in hot-plug feature. > > > In your repo domain_ent is a struct that contains domain paths. > The function arrays_in_domain returns a list of mdstat entries that are in the same domain as the constituent device name. > (so it requires devname and domain as input parameter). > In which case two containers will share the same DOMAIN? You get the list of containers, not just one. See above about searching the list for a degraded container and adding to it before a healthy container. > It seems that this function shall return a list of mdstat entries that share a path to which a devname device belongs. > So, a given new device is tried to be grabbed by a list of a containers (or native volumes). Yes. There can be more than one array/container that this device might go to. > Can you send a config file example? The first two entries are good, the third is a known bad line that I just leave in there to make sure I don't partition the wrong thing. DOMAIN path=pci-0000:00:1f.2-scsi-[2345]:0:0:0 action=partition table=/etc/mdadm.table program=sfdisk DOMAIN path=pci-0000:00:1f.2-scsi-[2345]:0:0:0-part? action=spare DOMAIN path=pci-0000:00:1f.2-scsi-[2345]:0:0:0* path=pci-0000:00:1f.2-scsi-[2345]:0:0:0-part* action=partition -- Doug Ledford <dledford@redhat.com> GPG KeyID: CFBFF194 http://people.redhat.com/dledford Infiniband specific RPMs available at http://people.redhat.com/dledford/Infiniband [-- Attachment #2: OpenPGP digital signature --] [-- Type: application/pgp-signature, Size: 198 bytes --] ^ permalink raw reply [flat|nested] 23+ messages in thread
* RE: More Hot Unplug/Plug work 2010-04-28 17:47 ` Doug Ledford @ 2010-04-28 18:34 ` Labun, Marcin 2010-04-28 21:05 ` Doug Ledford 2010-04-28 20:59 ` Luca Berra 1 sibling, 1 reply; 23+ messages in thread From: Labun, Marcin @ 2010-04-28 18:34 UTC (permalink / raw) To: Doug Ledford Cc: Neil Brown, Linux RAID Mailing List, Williams, Dan J, Ciechanowski, Ed, Hawrylewicz Czarnowski, Przemyslaw > > Going further, thus causes that a new disk can be potentially grabbed > by more than one container (because of shared path). > > For example: > > DOMAIN1: path=a path=b path=c > > DOMAIN2: path=a path=d > > DOMAIN3: path=d path=c > > In this example disks from path c can appear in DOMAIN 1 and DOMAIN > 3, but not in DOMAIN 2. > > What exactly is the use case for overlapping paths in different > domains? OK, makes sense. But if they are overlapped, will the config functions assign path are requested by configuration file or treat it as misconfiguration? So, do you plan to make changes similar to incremental in assembly to serve DOMAIN? Should an array be split (not assembled) if a domain paths are dividing array into two separate DOMAIN? > I'm happy to rework the code to support it if there's a valid use > case, but so far my design goal has been to have a path only appear in > one domain, and to then perform the appropriate action based upon that > domain. What is then the purpose of metadata keyword? My initial plan was to create a default configuration for a specific metadata, where user specifies actions but without paths letting metadata handler to use default ones. In your description, I can see that the path are required. > add it to. Only if there are no degraded arrays would we add it as a > spare to one of the arrays (non-deterministic which one). If we add it > as a spare to one of the arrays, then monitor mode can move that spare > around as needed later based upon the spare-group settings. Currently, > there is no correlation between spare-group and DOMAIN entries, but > that might change. A spare should go to any container controlled by mdmon, so any that contains redundant volumes. > > > So, in case of Monitor, sharing a spare device will be per path > basis. > > Currently, monitor mode still uses spare-group for controlling what > arrays can share spares. It does not yet check any DOMAIN information. Yes, and I am now adding support for domains in monitor and for spare-groups for external metadata. > > > The same for new disks in hot-plug feature. > > > > > > In your repo domain_ent is a struct that contains domain paths. > > The function arrays_in_domain returns a list of mdstat entries that > are in the same domain as the constituent device name. > > (so it requires devname and domain as input parameter). > > In which case two containers will share the same DOMAIN? > > You get the list of containers, not just one. See above about > searching the list for a degraded container and adding to it before a > healthy container. OK. ^ permalink raw reply [flat|nested] 23+ messages in thread
* Re: More Hot Unplug/Plug work 2010-04-28 18:34 ` Labun, Marcin @ 2010-04-28 21:05 ` Doug Ledford 2010-04-28 21:13 ` Dan Williams 2010-04-29 1:01 ` Neil Brown 0 siblings, 2 replies; 23+ messages in thread From: Doug Ledford @ 2010-04-28 21:05 UTC (permalink / raw) To: Labun, Marcin Cc: Neil Brown, Linux RAID Mailing List, Williams, Dan J, Ciechanowski, Ed, Hawrylewicz Czarnowski, Przemyslaw [-- Attachment #1: Type: text/plain, Size: 4899 bytes --] On 04/28/2010 02:34 PM, Labun, Marcin wrote: >>> Going further, thus causes that a new disk can be potentially grabbed >> by more than one container (because of shared path). >>> For example: >>> DOMAIN1: path=a path=b path=c >>> DOMAIN2: path=a path=d >>> DOMAIN3: path=d path=c >>> In this example disks from path c can appear in DOMAIN 1 and DOMAIN >> 3, but not in DOMAIN 2. >> >> What exactly is the use case for overlapping paths in different >> domains? > > OK, makes sense. > But if they are overlapped, will the config functions assign path are requested by configuration file > or treat it as misconfiguration? For now it merely means that the first match found is the only one that will ever get used. I'm not entirely sure how feasible it is to detect matching paths unless we are just talking about identical strings in the path= statement. But since the path= statement is passed to fnmatch(), which treats it as a file glob, it would be possible to construct two path statements that don't match but match the same set of files. I don't think we can reasonably detect this situation, so it may be that the answer is "the first match found is used" and have that be the official stance. > So, do you plan to make changes similar to incremental in assembly to serve DOMAIN? I had not planned on it, no. The reason being that assembly isn't used for hotplug. I guess I could see a use case for this though in that if you called mdadm -As then maybe we should consult the DOMAIN entries to see if there are free drives inside of a DOMAIN listed as spare or grow and whether or not we have any degraded arrays while assembling that could use the drives. Dunno if we want to do that though. However, I think I would prefer to get the incremental side of things working first, then go there. > Should an array be split (not assembled) if a domain paths are dividing array into two separate DOMAIN? I don't think so. Amongst other things, this would make it possible to render a machine unbootable if you had a type in a domain path. I think I would prefer to allow established arrays to assemble regardless of domain path entries. >> I'm happy to rework the code to support it if there's a valid use >> case, but so far my design goal has been to have a path only appear in >> one domain, and to then perform the appropriate action based upon that >> domain. > What is then the purpose of metadata keyword? Mainly as a hint that a given domain uses a specific type of metadata. > My initial plan was to create a default configuration for a specific metadata, where user specifies actions > but without paths letting metadata handler to use default ones. > In your description, I can see that the path are required. Yes. We already have a default action for all paths: incremental. This is the same as how things work today without any new support. And when you combine incremental with the AUTO keyword in mdadm.conf, you can control which devices are auto assembled on a metadata by metadata basis without the use of DOMAINs. 
The only purpose of a domain then is to specify an action other than incremental for devices plugged into a given domain. >> add it to. Only if there are no degraded arrays would we add it as a >> spare to one of the arrays (non-deterministic which one). If we add it >> as a spare to one of the arrays, then monitor mode can move that spare >> around as needed later based upon the spare-group settings. Currently, >> there is no correlation between spare-group and DOMAIN entries, but >> that might change. > > A spare should go to any container controlled by mdmon, so any that contains redundant volumes. Yep. >> >>> So, in case of Monitor, sharing a spare device will be per path >> basis. >> >> Currently, monitor mode still uses spare-group for controlling what >> arrays can share spares. It does not yet check any DOMAIN information. > > Yes, and I am now adding support for domains in monitor and for spare-groups for external metadata. Good to hear. >> >>> The same for new disks in hot-plug feature. >>> >>> >>> In your repo domain_ent is a struct that contains domain paths. >>> The function arrays_in_domain returns a list of mdstat entries that >> are in the same domain as the constituent device name. >>> (so it requires devname and domain as input parameter). >>> In which case two containers will share the same DOMAIN? >> >> You get the list of containers, not just one. See above about >> searching the list for a degraded container and adding to it before a >> healthy container. > OK. -- Doug Ledford <dledford@redhat.com> GPG KeyID: CFBFF194 http://people.redhat.com/dledford Infiniband specific RPMs available at http://people.redhat.com/dledford/Infiniband [-- Attachment #2: OpenPGP digital signature --] [-- Type: application/pgp-signature, Size: 198 bytes --] ^ permalink raw reply [flat|nested] 23+ messages in thread
* Re: More Hot Unplug/Plug work 2010-04-28 21:05 ` Doug Ledford @ 2010-04-28 21:13 ` Dan Williams 2010-04-30 13:38 ` Doug Ledford 2010-04-29 1:01 ` Neil Brown 1 sibling, 1 reply; 23+ messages in thread From: Dan Williams @ 2010-04-28 21:13 UTC (permalink / raw) To: Doug Ledford Cc: Labun, Marcin, Neil Brown, Linux RAID Mailing List, Ciechanowski, Ed, Hawrylewicz Czarnowski, Przemyslaw Doug Ledford wrote: > On 04/28/2010 02:34 PM, Labun, Marcin wrote: >> Should an array be split (not assembled) if a domain paths are dividing array into two separate DOMAIN? > > I don't think so. Amongst other things, this would make it possible to > render a machine unbootable if you had a type in a domain path. I think > I would prefer to allow established arrays to assemble regardless of > domain path entries. This is what I was calling the 'enforce=' policy in previous mails. Whether to block, warn, or ignore arrays that span a domain. I can see someone wanting to have something like enforce=platform to make sure we Linux tries to assemble an array that the option-rom can't put together. >>> I'm happy to rework the code to support it if there's a valid use >>> case, but so far my design goal has been to have a path only appear in >>> one domain, and to then perform the appropriate action based upon that >>> domain. >> What is then the purpose of metadata keyword? > > Mainly as a hint that a given domain uses a specific type of metadata. Yeah, to protect against cases where a stale disk is plugged into an unexpected port. -- Dan ^ permalink raw reply [flat|nested] 23+ messages in thread
* Re: More Hot Unplug/Plug work 2010-04-28 21:13 ` Dan Williams @ 2010-04-30 13:38 ` Doug Ledford 0 siblings, 0 replies; 23+ messages in thread From: Doug Ledford @ 2010-04-30 13:38 UTC (permalink / raw) To: Dan Williams Cc: Labun, Marcin, Neil Brown, Linux RAID Mailing List, Ciechanowski, Ed, Hawrylewicz Czarnowski, Przemyslaw [-- Attachment #1: Type: text/plain, Size: 1333 bytes --] On 04/28/2010 05:13 PM, Dan Williams wrote: > Doug Ledford wrote: >> On 04/28/2010 02:34 PM, Labun, Marcin wrote: >>> Should an array be split (not assembled) if a domain paths are >>> dividing array into two separate DOMAIN? >> >> I don't think so. Amongst other things, this would make it possible to >> render a machine unbootable if you had a type in a domain path. I think >> I would prefer to allow established arrays to assemble regardless of >> domain path entries. > > This is what I was calling the 'enforce=' policy in previous mails. > Whether to block, warn, or ignore arrays that span a domain. I can see > someone wanting to have something like enforce=platform to make sure we > Linux tries to assemble an array that the option-rom can't put together. I would suggest that the proper way to handle this is to warn on assembling an array that spans boundaries but proceed with the assembly (including incremental), warn and require a force flag on creating an array that spans boundaries, and warn and require the force flag to automatically use devices that span boundaries. -- Doug Ledford <dledford@redhat.com> GPG KeyID: CFBFF194 http://people.redhat.com/dledford Infiniband specific RPMs available at http://people.redhat.com/dledford/Infiniband [-- Attachment #2: OpenPGP digital signature --] [-- Type: application/pgp-signature, Size: 198 bytes --] ^ permalink raw reply [flat|nested] 23+ messages in thread
* Re: More Hot Unplug/Plug work 2010-04-28 21:05 ` Doug Ledford 2010-04-28 21:13 ` Dan Williams @ 2010-04-29 1:01 ` Neil Brown 2010-04-29 1:19 ` Dan Williams 2010-04-30 15:52 ` Doug Ledford 1 sibling, 2 replies; 23+ messages in thread From: Neil Brown @ 2010-04-29 1:01 UTC (permalink / raw) To: Doug Ledford Cc: Labun, Marcin, Linux RAID Mailing List, Williams, Dan J, Ciechanowski, Ed, Hawrylewicz Czarnowski, Przemyslaw On Wed, 28 Apr 2010 17:05:58 -0400 Doug Ledford <dledford@redhat.com> wrote: > On 04/28/2010 02:34 PM, Labun, Marcin wrote: > >>> Going further, thus causes that a new disk can be potentially grabbed > >> by more than one container (because of shared path). > >>> For example: > >>> DOMAIN1: path=a path=b path=c > >>> DOMAIN2: path=a path=d > >>> DOMAIN3: path=d path=c > >>> In this example disks from path c can appear in DOMAIN 1 and DOMAIN > >> 3, but not in DOMAIN 2. > >> > >> What exactly is the use case for overlapping paths in different > >> domains? > > > > OK, makes sense. > > But if they are overlapped, will the config functions assign path are requested by configuration file > > or treat it as misconfiguration? > > For now it merely means that the first match found is the only one that > will ever get used. I'm not entirely sure how feasible it is to detect > matching paths unless we are just talking about identical strings in the > path= statement. But since the path= statement is passed to fnmatch(), > which treats it as a file glob, it would be possible to construct two > path statements that don't match but match the same set of files. I > don't think we can reasonably detect this situation, so it may be that > the answer is "the first match found is used" and have that be the > official stance. I think we do need an "official stance" here. glob is good for lots of things, but it is hard to say "everything except". The best way to do that is to have a clear ordering with more general globs later in the order. path=abcd action=foo path=abc* action=bar path=* action=baz So the last line doesn't really mean "do baz on everything" but rather "do baz on everything else". You could impose ordering explicitly with a priority number or a "this domain takes precedence over that domain" tag, but I suspect simple ordering in the config file is easiest and so best. An important question to ask here though is whether people will want to generate the "domain" lines automatically and if so, how we can make it hard for people to get that wrong. Inserting a line in the middle of a file is probably more of a challenge than inserting a line with a specific priority or depends-on tag. So before we get too much further down this path, I think it would be good to have some concrete scenarios about how this functionality will actually be put into effect. I'd love to just expect people to always edit mdadm.conf to meet their specific needs, but experience shows that is naive - people will write scripts based on imperfect understanding, then share those scripts with others.... > > > So, do you plan to make changes similar to incremental in assembly to serve DOMAIN? > > I had not planned on it, no. The reason being that assembly isn't used > for hotplug. I guess I could see a use case for this though in that if > you called mdadm -As then maybe we should consult the DOMAIN entries to > see if there are free drives inside of a DOMAIN listed as spare or grow > and whether or not we have any degraded arrays while assembling that > could use the drives. Dunno if we want to do that though. 
However, I > think I would prefer to get the incremental side of things working > first, then go there. > > > Should an array be split (not assembled) if a domain paths are dividing array into two separate DOMAIN? > > I don't think so. Amongst other things, this would make it possible to > render a machine unbootable if you had a type in a domain path. I think > I would prefer to allow established arrays to assemble regardless of > domain path entries. > > >> I'm happy to rework the code to support it if there's a valid use > >> case, but so far my design goal has been to have a path only appear in > >> one domain, and to then perform the appropriate action based upon that > >> domain. > > What is then the purpose of metadata keyword? > > Mainly as a hint that a given domain uses a specific type of metadata. > > > My initial plan was to create a default configuration for a specific metadata, where user specifies actions > > but without paths letting metadata handler to use default ones. > > In your description, I can see that the path are required. > > Yes. We already have a default action for all paths: incremental. This > is the same as how things work today without any new support. And when > you combine incremental with the AUTO keyword in mdadm.conf, you can > control which devices are auto assembled on a metadata by metadata basis > without the use of DOMAINs. > The only purpose of a domain then is to > specify an action other than incremental for devices plugged into a > given domain. I like this statement. It is simple and to the point and seems to capture the key ideas. The question is: is it true? :-) It is suggested that 'domain' also involved in spare-groups and could be used to warn against, or disable, a 'create' or 'add' which violated policy. So maybe: The purpose of a domain is to guide: - 'incremental' by specifying actions for hot-plug devices other than the default - 'create' and 'add' by identifying configurations that breach policy - 'monitor' by providing an alternate way of specifying spare-groups It is a lot more wordy, but still seems useful. While 'incremental' would not benefit from overlapping domains (as each hotplugged device only wants one action), the other two might. Suppose I want to configure array A to use only a certain set of drives, and array B that can use any drive at all. Then if we disallow overlapping domains, there is no domain that describes the drives that B can be made from. Does that matter? Is it too hypothetical a situation? Here is another interesting question. Suppose I have two drive chassis, each connected to the host by a fibre. When I create arrays from all these drives, I want them to be balanced across the two chassis, both for performance reasons and for redundancy reasons. Is there any way we can tell mdadm about this, possible through 'domains'. This could be an issue when building a RAID10 (alternate across the chassis is best) or when finding a spare for a RAID1 (choosing from the 'other' chassis is best). I don't really want to solve this now, but I do want to be sure that our concept of 'domain' is big enough that we will be able to fit that sort of thing into it one day. Maybe a 'domain' is simply a mechanism to add tags to devices, and possibly by implication to arrays that contain those devices. The mechanism for resolving when multiple domains add conflicting tags to the one device would be dependant on the tag. Maybe first-wins. Maybe all are combined. 
So we add an 'action' tag for --incremental, and the first wins (maybe) We add a 'sparegroup' tag for --monitor We add some other tag for balancing (share=1/2, share=2/2 ???) I'm not sure how this fits with imposing platform constraints. As platform constraints are closely tied to metadata types, it might be OK to have a metadata-specific tags (imsm=???) and leave to details to the metadata handler??? Dan: help me understand these platform constraints: what is the most complex constraint that you can think of that you might want to impose? NeilBrown ^ permalink raw reply [flat|nested] 23+ messages in thread
* Re: More Hot Unplug/Plug work 2010-04-29 1:01 ` Neil Brown @ 2010-04-29 1:19 ` Dan Williams 2010-04-29 2:37 ` Neil Brown 2010-04-30 11:14 ` John Robinson 2010-04-30 15:52 ` Doug Ledford 1 sibling, 2 replies; 23+ messages in thread From: Dan Williams @ 2010-04-29 1:19 UTC (permalink / raw) To: Neil Brown Cc: Doug Ledford, Labun, Marcin, Linux RAID Mailing List, Ciechanowski, Ed, Hawrylewicz Czarnowski, Przemyslaw Neil Brown wrote: > I'm not sure how this fits with imposing platform constraints. > As platform constraints are closely tied to metadata types, it might be OK > to have a metadata-specific tags (imsm=???) and leave to details to the > metadata handler??? > > Dan: help me understand these platform constraints: what is the most complex > constraint that you can think of that you might want to impose? At this point we really only need one constraint: prevent controller spanning. If for example I take an existing imsm member off of ahci and reattach it via a usb-to-sata enclosure mdadm needs a policy to prevent associating that drive with anything on ahci. In a pinch this policy can be disabled, but you wouldn't want to rebuild across usb or any other controller because the option-rom only talks ahci and will mark the drive missing. So something like DOMAIN spanning=imsm, to tell mdadm to follow imsm rules for this domain. Where 'spanning' is policy tag?? -- Dan ^ permalink raw reply [flat|nested] 23+ messages in thread
* Re: More Hot Unplug/Plug work 2010-04-29 1:19 ` Dan Williams @ 2010-04-29 2:37 ` Neil Brown 2010-04-29 18:22 ` Labun, Marcin ` (2 more replies) 2010-04-30 11:14 ` John Robinson 1 sibling, 3 replies; 23+ messages in thread From: Neil Brown @ 2010-04-29 2:37 UTC (permalink / raw) To: Dan Williams Cc: Doug Ledford, Labun, Marcin, Linux RAID Mailing List, Ciechanowski, Ed, Hawrylewicz Czarnowski, Przemyslaw On Wed, 28 Apr 2010 18:19:45 -0700 Dan Williams <dan.j.williams@intel.com> wrote: > Neil Brown wrote: > > I'm not sure how this fits with imposing platform constraints. > > As platform constraints are closely tied to metadata types, it might be OK > > to have a metadata-specific tags (imsm=???) and leave to details to the > > metadata handler??? > > > > Dan: help me understand these platform constraints: what is the most complex > > constraint that you can think of that you might want to impose? > > At this point we really only need one constraint: prevent controller > spanning. If for example I take an existing imsm member off of ahci and > reattach it via a usb-to-sata enclosure mdadm needs a policy to prevent > associating that drive with anything on ahci. > > In a pinch this policy can be disabled, but you wouldn't want to rebuild > across usb or any other controller because the option-rom only talks > ahci and will mark the drive missing. > > So something like DOMAIN spanning=imsm, to tell mdadm to follow imsm > rules for this domain. Where 'spanning' is policy tag?? > Thanks. So we have two different ideas here. 1/ A given set of devices (paths) are all attached to the one controller. 2/ A given array is not allowed to span controllers The first statement is somewhat similar to a statement about sparegroups. It groups devices together is some way. The second is a policy statement, and is metadata specific to some extent. If I create a native-metadata array using the controller, then adding other devices from a different controller is a non-issue. It is only when an IMSM array is created that it is an issue (and then - only if the array is to be used for boot for for multi-boot). So the ARRAY line could have "exclusive-group=foo" where 'foo' is a group name similar to those used for 'spare-group=' But that isn't much fun for auto-detect and auto-assembly. Maybe we want to extend the 'auto' line. It gives policy on a per-metadata basis. Maybe: POLICY auto-assemble +1.x -all POLICY same-group imsm where 'same-group' means that all the devices in an array must be in the same spare-group. The 'domain' lines assign spare-groups to devices. Maybe "same-group" could be "same-$tag" so we could have different tags for different metadatas.... Is this working for anyone else, or have I lost the plot?? NeilBrown ^ permalink raw reply [flat|nested] 23+ messages in thread
* RE: More Hot Unplug/Plug work 2010-04-29 2:37 ` Neil Brown @ 2010-04-29 18:22 ` Labun, Marcin 2010-04-29 21:55 ` Dan Williams 2010-04-30 16:13 ` Doug Ledford 2 siblings, 0 replies; 23+ messages in thread From: Labun, Marcin @ 2010-04-29 18:22 UTC (permalink / raw) To: Neil Brown, Williams, Dan J Cc: Doug Ledford, Linux RAID Mailing List, Ciechanowski, Ed, Hawrylewicz Czarnowski, Przemyslaw > Maybe we want to extend the 'auto' line. It gives policy on a per- > metadata > basis. Maybe: > > POLICY auto-assemble +1.x -all > POLICY same-group imsm So, the policy would be global for all arrays of some metadata type. For sure, we want "controller spanning disable" to be default policy for imsm, so this line would be used rather in situation when user would like to change it to a non-default one. Where for instance native metadata would have default policy "spanning enabled". Another metadata internal domain example is a new field "pool id" that creates spare sharing domain and stores it in metadata. ^ permalink raw reply [flat|nested] 23+ messages in thread
* Re: More Hot Unplug/Plug work 2010-04-29 2:37 ` Neil Brown 2010-04-29 18:22 ` Labun, Marcin @ 2010-04-29 21:55 ` Dan Williams 2010-05-03 5:58 ` Neil Brown 2010-04-30 16:13 ` Doug Ledford 2 siblings, 1 reply; 23+ messages in thread From: Dan Williams @ 2010-04-29 21:55 UTC (permalink / raw) To: Neil Brown Cc: Doug Ledford, Labun, Marcin, Linux RAID Mailing List, Ciechanowski, Ed, Hawrylewicz Czarnowski, Przemyslaw On Wed, Apr 28, 2010 at 7:37 PM, Neil Brown <neilb@suse.de> wrote: >> So something like DOMAIN spanning=imsm, to tell mdadm to follow imsm >> rules for this domain. Where 'spanning' is policy tag?? >> > > Thanks. > > So we have two different ideas here. > > 1/ A given set of devices (paths) are all attached to the one controller. > 2/ A given array is not allowed to span controllers > > The first statement is somewhat similar to a statement about sparegroups. > It groups devices together is some way. > > The second is a policy statement, and is metadata specific to some extent. > If I create a native-metadata array using the controller, then adding other > devices from a different controller is a non-issue. It is only when an > IMSM array is created that it is an issue (and then - only if the array is to > be used for boot for for multi-boot). > > So the ARRAY line could have "exclusive-group=foo" where 'foo' is a group > name similar to those used for 'spare-group=' > But that isn't much fun for auto-detect and auto-assembly. > > Maybe we want to extend the 'auto' line. It gives policy on a per-metadata > basis. Maybe: > > POLICY auto-assemble +1.x -all > POLICY same-group imsm > > where 'same-group' means that all the devices in an array must be in the > same spare-group. The 'domain' lines assign spare-groups to devices. > > Maybe "same-group" could be "same-$tag" so we could have different tags > for different metadatas.... > > Is this working for anyone else, or have I lost the plot?? > I am not grokking the separate POLICY line, especially for defining the spare-migration border because that is already what DOMAIN is specifying. Here is what I think we need to allow for simple honoring of platform constraints but without needing to expose all the nuances of those constraints in config-file syntax... yet. 1/ Allow path= to take a metadata name this allows the handler to identify its known controller ports, alleviating the user from needing to track which ports are allowed, especially as it may change over time. If someone really wants to see which ports a metadata handler cares about we could have a DOMAIN line dumped by --detail-platform --brief -e imsm. However for simplicity I would rather just dump: DOMAIN path=imsm action=spare-same-port spare-migration=imsm 2/ I think we should always block configurations that cross domain boundaries. One can always append more path= lines to override this. 3/ The metadata handler may want to restrict/control where spares are placed in a domain. To enable interaction with CIM we are looking to add a storage-pool id to the metadata. The primary usage of this will be to essentially encode a spare-group number in the metadata. This seems to require a spare-migration= option to the DOMAIN line. By default it is 'all' but it can be set to a metadata-name to let the handler apply its internal migration policy. That should cover everything I would like to expose Comments? 
-- Dan -- To unsubscribe from this list: send the line "unsubscribe linux-raid" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html ^ permalink raw reply [flat|nested] 23+ messages in thread
* Re: More Hot Unplug/Plug work 2010-04-29 21:55 ` Dan Williams @ 2010-05-03 5:58 ` Neil Brown 2010-05-08 1:06 ` Dan Williams 0 siblings, 1 reply; 23+ messages in thread From: Neil Brown @ 2010-05-03 5:58 UTC (permalink / raw) To: Dan Williams Cc: Doug Ledford, Labun, Marcin, Linux RAID Mailing List, Ciechanowski, Ed, Hawrylewicz Czarnowski, Przemyslaw On Thu, 29 Apr 2010 14:55:23 -0700 Dan Williams <dan.j.williams@intel.com> wrote: > On Wed, Apr 28, 2010 at 7:37 PM, Neil Brown <neilb@suse.de> wrote: > >> So something like DOMAIN spanning=imsm, to tell mdadm to follow imsm > >> rules for this domain. Where 'spanning' is policy tag?? > >> > > > > Thanks. > > > > So we have two different ideas here. > > > > 1/ A given set of devices (paths) are all attached to the one controller. > > 2/ A given array is not allowed to span controllers > > > > The first statement is somewhat similar to a statement about sparegroups. > > It groups devices together is some way. > > > > The second is a policy statement, and is metadata specific to some extent. > > If I create a native-metadata array using the controller, then adding other > > devices from a different controller is a non-issue. It is only when an > > IMSM array is created that it is an issue (and then - only if the array is to > > be used for boot for for multi-boot). > > > > So the ARRAY line could have "exclusive-group=foo" where 'foo' is a group > > name similar to those used for 'spare-group=' > > But that isn't much fun for auto-detect and auto-assembly. > > > > Maybe we want to extend the 'auto' line. It gives policy on a per-metadata > > basis. Maybe: > > > > POLICY auto-assemble +1.x -all > > POLICY same-group imsm > > > > where 'same-group' means that all the devices in an array must be in the > > same spare-group. The 'domain' lines assign spare-groups to devices. > > > > Maybe "same-group" could be "same-$tag" so we could have different tags > > for different metadatas.... > > > > Is this working for anyone else, or have I lost the plot?? > > > > I am not grokking the separate POLICY line, especially for defining > the spare-migration border because that is already what DOMAIN is > specifying. Is it? This is what I'm not yet 100% convinced about. We seem to be saying: - A DOMAIN is a set of devices that are handled the same way for hotplug - A DOMAIN is a set of devices that define a boundary on spare migration and I'm not sure those sets are necessarily isomorphic - though I agree that they will often be the same. Does each DOMAIN line define a separate migration boundary so that devices cannot migrate 'across domains'?? If we were to require that, I would probably want multiple 'path=' words allowed for a single domain so we can create a union. > > Here is what I think we need to allow for simple honoring of platform > constraints but without needing to expose all the nuances of those > constraints in config-file syntax... yet. > > 1/ Allow path= to take a metadata name this allows the handler to > identify its known controller ports, alleviating the user from needing > to track which ports are allowed, especially as it may change over > time. If someone really wants to see which ports a metadata handler > cares about we could have a DOMAIN line dumped by --detail-platform > --brief -e imsm. However for simplicity I would rather just dump: > > DOMAIN path=imsm action=spare-same-port spare-migration=imsm > So "path=imsm" means "all devices which are attached to a controller which seems to understand IMSM natively". 
What if a system had two such controllers - one on-board and one on a plug-in card. This might not be possibly for IMSM but would be for DDF. I presume the default would be that the controllers are separate domains - would you agree? So the above DOMAIN line would potentially create multiple 'domains' at least for spare-migration. > 2/ I think we should always block configurations that cross domain > boundaries. One can always append more path= lines to override this. I think we all agree on this. Require --force to create an array, or add devices to an array, where that would cross an established spare-group... The details are still a bit vague for me but the principle is good. > > 3/ The metadata handler may want to restrict/control where spares are > placed in a domain. To enable interaction with CIM we are looking to > add a storage-pool id to the metadata. The primary usage of this will > be to essentially encode a spare-group number in the metadata. This > seems to require a spare-migration= option to the DOMAIN line. By > default it is 'all' but it can be set to a metadata-name to let the > handler apply its internal migration policy. I'm not following you. Are you talking about subsets of a domain? Subdomains? Do the storage-pools follow hardware port locations, or dynamic configuration of individual devices (hence being recorded in metadata). This is how I think spare migration should work: Spare migration is controlled entirely by the 'spare-group' attribute. A spare-group is an attribute of a device. A device may have multiple spare-group attributes (it might be in multiple groups). There are two ways a device can be assigned a spare-group. 1/ If an array is tagged with a spare-group= in mdadm.conf then any device in that array gets that spare-group attribute 2/ If a DOMAIN is tagged with a spare-group attribute then any device in that domain gets that spare-group attribute When mdadm --monitor needs to find a hot spare for an array or container which is degraded, it collects a list of spare-group attributes for all devices in the array, then finds any device (of suitable size) that has a spare-group attribute matching any of those. Possibly a weighting should prefer spare-groups that are more prevalent in the array, so that if you add a foreign device in an emergency, mdadm won't feel too free to add other foreign devices (but is still allowed to). You seem to be suggesting that the spare-group tag could also be specified by the metadata. I think I'm happy with that. A DOMAIN line without an explicit spare-group= tag might imply an implicit spare-group= tag where the spare-group name is some generated string that is unique to that DOMAIN line. So all devices in a DOMAIN line are effectively interchangeable, but it is easy to stretch the migration barrier around multiple domains by giving them all a matching spare-group tag. When you create an array, every pair of devices much share a spare-group, or else one of them must not be in an spare-group. Is that right? NeilBrown > > That should cover everything I would like to expose Comments? 
> > -- > Dan > -- > To unsubscribe from this list: send the line "unsubscribe linux-raid" in > the body of a message to majordomo@vger.kernel.org > More majordomo info at http://vger.kernel.org/majordomo-info.html -- To unsubscribe from this list: send the line "unsubscribe linux-raid" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html ^ permalink raw reply [flat|nested] 23+ messages in thread
* Re: More Hot Unplug/Plug work 2010-05-03 5:58 ` Neil Brown @ 2010-05-08 1:06 ` Dan Williams 0 siblings, 0 replies; 23+ messages in thread From: Dan Williams @ 2010-05-08 1:06 UTC (permalink / raw) To: Neil Brown Cc: Doug Ledford, Labun, Marcin, Linux RAID Mailing List, Ciechanowski, Ed, Hawrylewicz Czarnowski, Przemyslaw On Sun, May 2, 2010 at 10:58 PM, Neil Brown <neilb@suse.de> wrote: > On Thu, 29 Apr 2010 14:55:23 -0700 > Dan Williams <dan.j.williams@intel.com> wrote: >> I am not grokking the separate POLICY line, especially for defining >> the spare-migration border because that is already what DOMAIN is >> specifying. > > Is it? This is what I'm not yet 100% convinced about. > We seem to be saying: > - A DOMAIN is a set of devices that are handled the same way for > hotplug > - A DOMAIN is a set of devices that define a boundary on spare > migration The definition I have been carrying around is slightly more nuanced. The DOMAIN defines a maximal boundary, but there might be metadata specific modifiers that further restrict the possible actions. For example a DOMAIN path=ddf domain would handle all hotplug events on "ddf" ports the same way with the caveat that the ddf handler would know about controller spanning rules in the multi-controller case. Otherwise if you define path=<pci-device-path+partitions> then wysiwyg, i.e. no arrays assembling across these boundaries. > > and I'm not sure those sets are necessarily isomorphic - though I agree that > they will often be the same. > > Does each DOMAIN line define a separate migration boundary so that devices > cannot migrate 'across domains'?? > If we were to require that, I would probably want multiple 'path=' words > allowed for a single domain so we can create a union. Yes, we should do that regardless because it would be hard to write a glob that covers disparate controllers otherwise. >> >> Here is what I think we need to allow for simple honoring of platform >> constraints but without needing to expose all the nuances of those >> constraints in config-file syntax... yet. >> >> 1/ Allow path= to take a metadata name this allows the handler to >> identify its known controller ports, alleviating the user from needing >> to track which ports are allowed, especially as it may change over >> time. If someone really wants to see which ports a metadata handler >> cares about we could have a DOMAIN line dumped by --detail-platform >> --brief -e imsm. However for simplicity I would rather just dump: >> >> DOMAIN path=imsm action=spare-same-port spare-migration=imsm >> > > So "path=imsm" means "all devices which are attached to a controller which > seems to understand IMSM natively". > What if a system had two such controllers - one on-board and one on a plug-in > card. This might not be possibly for IMSM but would be for DDF. > I presume the default would be that the controllers are separate domains - > would you agree? The controllers may restrict spare migration but I would still see this as one ddf DOMAIN where the paths and spare migration constraints are internally determined by the handler, but the hotplug policy is global for the "ddf-DOMAIN". > So the above DOMAIN line would potentially create multiple > 'domains' at least for spare-migration. Yes. > >> 2/ I think we should always block configurations that cross domain >> boundaries. One can always append more path= lines to override this. > > I think we all agree on this. 
Require --force to create an array, or add > devices to an array, where that would cross an established spare-group... > The details are still a bit vague for me but the principle is good. > >> >> 3/ The metadata handler may want to restrict/control where spares are >> placed in a domain. To enable interaction with CIM we are looking to >> add a storage-pool id to the metadata. The primary usage of this will >> be to essentially encode a spare-group number in the metadata. This >> seems to require a spare-migration= option to the DOMAIN line. By >> default it is 'all' but it can be set to a metadata-name to let the >> handler apply its internal migration policy. > > I'm not following you. Are you talking about subsets of a domain? Subdomains? > Do the storage-pools follow hardware port locations, or dynamic configuration > of individual devices (hence being recorded in metadata). Dynamic configuration, but I would still call this the imsm-DOMAIN with metadata specific spare-migration-boundaries. > > This is how I think spare migration should work: > Spare migration is controlled entirely by the 'spare-group' attribute. > A spare-group is an attribute of a device. A device may have multiple > spare-group attributes (it might be in multiple groups). > There are two ways a device can be assigned a spare-group. > 1/ If an array is tagged with a spare-group= in mdadm.conf then any device > in that array gets that spare-group attribute > 2/ If a DOMAIN is tagged with a spare-group attribute then any device > in that domain gets that spare-group attribute > > When mdadm --monitor needs to find a hot spare for an array or container > which is degraded, it collects a list of spare-group attributes > for all devices in the array, then finds any device (of suitable size) > that has a spare-group attribute matching any of those. > Possibly a weighting should prefer spare-groups that are more prevalent in > the array, so that if you add a foreign device in an emergency, mdadm won't > feel too free to add other foreign devices (but is still allowed to). > > You seem to be suggesting that the spare-group tag could also be specified > by the metadata. I think I'm happy with that. Yeah, metadata implied spare-groups that sub-divide the domain. > > A DOMAIN line without an explicit spare-group= tag might imply an implicit > spare-group= tag where the spare-group name is some generated string that > is unique to that DOMAIN line. > So all devices in a DOMAIN line are effectively interchangeable, but it is > easy to stretch the migration barrier around multiple domains by giving > them all a matching spare-group tag. > > When you create an array, every pair of devices much share a spare-group, or > else one of them must not be in an spare-group. Is that right? ...once you allow for $metadata-DOMAINs I am having trouble conceptualizing the use case for allowing spares to migrate across the explicit union of path= boundaries? Unless you are trying to codify what the metadata handlers would be doing internally. In which case I would expect to replace a single spare-group= identifier with multiple mutually exclusive spare-path= lines to subdivide a DOMAIN into spare migration sub-domains with the same hot-plug policy. ...or am I still misunderstanding your spare-group= vs DOMAIN distinction? 
-- Dan -- To unsubscribe from this list: send the line "unsubscribe linux-raid" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html ^ permalink raw reply [flat|nested] 23+ messages in thread
* Re: More Hot Unplug/Plug work 2010-04-29 2:37 ` Neil Brown 2010-04-29 18:22 ` Labun, Marcin 2010-04-29 21:55 ` Dan Williams @ 2010-04-30 16:13 ` Doug Ledford 2 siblings, 0 replies; 23+ messages in thread From: Doug Ledford @ 2010-04-30 16:13 UTC (permalink / raw) To: Neil Brown Cc: Dan Williams, Labun, Marcin, Linux RAID Mailing List, Ciechanowski, Ed, Hawrylewicz Czarnowski, Przemyslaw [-- Attachment #1: Type: text/plain, Size: 2989 bytes --] On 04/28/2010 10:37 PM, Neil Brown wrote: > On Wed, 28 Apr 2010 18:19:45 -0700 > Dan Williams <dan.j.williams@intel.com> wrote: > >> Neil Brown wrote: >>> I'm not sure how this fits with imposing platform constraints. >>> As platform constraints are closely tied to metadata types, it might be OK >>> to have a metadata-specific tags (imsm=???) and leave to details to the >>> metadata handler??? >>> >>> Dan: help me understand these platform constraints: what is the most complex >>> constraint that you can think of that you might want to impose? >> >> At this point we really only need one constraint: prevent controller >> spanning. If for example I take an existing imsm member off of ahci and >> reattach it via a usb-to-sata enclosure mdadm needs a policy to prevent >> associating that drive with anything on ahci. >> >> In a pinch this policy can be disabled, but you wouldn't want to rebuild >> across usb or any other controller because the option-rom only talks >> ahci and will mark the drive missing. >> >> So something like DOMAIN spanning=imsm, to tell mdadm to follow imsm >> rules for this domain. Where 'spanning' is policy tag?? >> > > Thanks. > > So we have two different ideas here. > > 1/ A given set of devices (paths) are all attached to the one controller. > 2/ A given array is not allowed to span controllers > > The first statement is somewhat similar to a statement about sparegroups. > It groups devices together is some way. > > The second is a policy statement, and is metadata specific to some extent. > If I create a native-metadata array using the controller, then adding other > devices from a different controller is a non-issue. It is only when an > IMSM array is created that it is an issue (and then - only if the array is to > be used for boot for for multi-boot). > > So the ARRAY line could have "exclusive-group=foo" where 'foo' is a group > name similar to those used for 'spare-group=' > But that isn't much fun for auto-detect and auto-assembly. > > Maybe we want to extend the 'auto' line. It gives policy on a per-metadata > basis. Maybe: > > POLICY auto-assemble +1.x -all > POLICY same-group imsm > > where 'same-group' means that all the devices in an array must be in the > same spare-group. The 'domain' lines assign spare-groups to devices. > > Maybe "same-group" could be "same-$tag" so we could have different tags > for different metadatas.... > > Is this working for anyone else, or have I lost the plot?? > > NeilBrown I keep going back to the idea of just implement the no-spanning policy for imsm/ddf as the default with a force-override flag and don't bother putting it into the config anywhere, it simply is. -- Doug Ledford <dledford@redhat.com> GPG KeyID: CFBFF194 http://people.redhat.com/dledford Infiniband specific RPMs available at http://people.redhat.com/dledford/Infiniband [-- Attachment #2: OpenPGP digital signature --] [-- Type: application/pgp-signature, Size: 198 bytes --] ^ permalink raw reply [flat|nested] 23+ messages in thread
* Re: More Hot Unplug/Plug work 2010-04-29 1:19 ` Dan Williams 2010-04-29 2:37 ` Neil Brown @ 2010-04-30 11:14 ` John Robinson 1 sibling, 0 replies; 23+ messages in thread From: John Robinson @ 2010-04-30 11:14 UTC (permalink / raw) To: Dan Williams Cc: Neil Brown, Doug Ledford, Labun, Marcin, Linux RAID Mailing List, Ciechanowski, Ed, Hawrylewicz Czarnowski, Przemyslaw On 29/04/2010 02:19, Dan Williams wrote: > Neil Brown wrote: >> I'm not sure how this fits with imposing platform constraints. >> As platform constraints are closely tied to metadata types, it might >> be OK >> to have a metadata-specific tags (imsm=???) and leave to details to the >> metadata handler??? >> >> Dan: help me understand these platform constraints: what is the most >> complex >> constraint that you can think of that you might want to impose? > > At this point we really only need one constraint: prevent controller > spanning. If for example I take an existing imsm member off of ahci and > reattach it via a usb-to-sata enclosure mdadm needs a policy to prevent > associating that drive with anything on ahci. > > In a pinch this policy can be disabled, but you wouldn't want to rebuild > across usb or any other controller because the option-rom only talks > ahci and will mark the drive missing. > > So something like DOMAIN spanning=imsm, to tell mdadm to follow imsm > rules for this domain. Where 'spanning' is policy tag?? Why isn't DOMAIN path=pci-0000:00:1f.2-scsi-[012345]* enough? Can't arrays span multiple Intel/imsm controllers? And if I start off my array on one Intel/imsm controller, and I add another JBOD controller, shouldn't I be allowed to grow my array to span the two controllers without rebuilding the array with new metadata? I know I can't expect the option ROM to cope with or boot off this array but would the option ROM in some manner damage such an array? Cheers, John. ^ permalink raw reply [flat|nested] 23+ messages in thread
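For readers following along without such hardware: the path= globs are matched against the names under /dev/disk/by-path, so a per-controller glob like the one John quotes already keeps a USB-attached enclosure out of the domain; what it cannot express on its own is a create/add-time "never span controllers" rule. A hypothetical session (the exact names vary with kernel, udev version and topology):

$ ls /dev/disk/by-path/
pci-0000:00:1f.2-scsi-0:0:0:0         pci-0000:00:1f.2-scsi-1:0:0:0
pci-0000:00:1f.2-scsi-0:0:0:0-part1   pci-0000:00:1f.2-scsi-1:0:0:0-part1
pci-0000:04:00.0-usb-0:1.2:1.0-scsi-0:0:0:0
$ echo /dev/disk/by-path/pci-0000:00:1f.2-scsi-[012345]*
/dev/disk/by-path/pci-0000:00:1f.2-scsi-0:0:0:0 /dev/disk/by-path/pci-0000:00:1f.2-scsi-0:0:0:0-part1 /dev/disk/by-path/pci-0000:00:1f.2-scsi-1:0:0:0 /dev/disk/by-path/pci-0000:00:1f.2-scsi-1:0:0:0-part1
$ # the usb-to-sata enclosure never matches, so it never falls into this domain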
* Re: More Hot Unplug/Plug work 2010-04-29 1:01 ` Neil Brown 2010-04-29 1:19 ` Dan Williams @ 2010-04-30 15:52 ` Doug Ledford 1 sibling, 0 replies; 23+ messages in thread From: Doug Ledford @ 2010-04-30 15:52 UTC (permalink / raw) To: Neil Brown Cc: Labun, Marcin, Linux RAID Mailing List, Williams, Dan J, Ciechanowski, Ed, Hawrylewicz Czarnowski, Przemyslaw [-- Attachment #1: Type: text/plain, Size: 19716 bytes --] On 04/28/2010 09:01 PM, Neil Brown wrote: > On Wed, 28 Apr 2010 17:05:58 -0400 > Doug Ledford <dledford@redhat.com> wrote: > >> On 04/28/2010 02:34 PM, Labun, Marcin wrote: >>>>> Going further, thus causes that a new disk can be potentially grabbed >>>> by more than one container (because of shared path). >>>>> For example: >>>>> DOMAIN1: path=a path=b path=c >>>>> DOMAIN2: path=a path=d >>>>> DOMAIN3: path=d path=c >>>>> In this example disks from path c can appear in DOMAIN 1 and DOMAIN >>>> 3, but not in DOMAIN 2. >>>> >>>> What exactly is the use case for overlapping paths in different >>>> domains? >>> >>> OK, makes sense. >>> But if they are overlapped, will the config functions assign path are requested by configuration file >>> or treat it as misconfiguration? >> >> For now it merely means that the first match found is the only one that >> will ever get used. I'm not entirely sure how feasible it is to detect >> matching paths unless we are just talking about identical strings in the >> path= statement. But since the path= statement is passed to fnmatch(), >> which treats it as a file glob, it would be possible to construct two >> path statements that don't match but match the same set of files. I >> don't think we can reasonably detect this situation, so it may be that >> the answer is "the first match found is used" and have that be the >> official stance. > > I think we do need an "official stance" here. > glob is good for lots of things, but it is hard to say "everything except". > The best way to do that is to have a clear ordering with more general globs > later in the order. > path=abcd action=foo > path=abc* action=bar > path=* action=baz > > So the last line doesn't really mean "do baz on everything" but rather > "do baz on everything else". > > You could impose ordering explicitly with a priority number or a > "this domain takes precedence over that domain" tag, but I suspect > simple ordering in the config file is easiest and so best. > > An important question to ask here though is whether people will want to > generate the "domain" lines automatically and if so, how we can make it hard > for people to get that wrong. > Inserting a line in the middle of a file is probably more of a challenge than > inserting a line with a specific priority or depends-on tag. > > So before we get too much further down this path, I think it would be good to > have some concrete scenarios about how this functionality will actually be > put into effect. I'd love to just expect people to always edit mdadm.conf to > meet their specific needs, but experience shows that is naive - people will > write scripts based on imperfect understanding, then share those scripts > with others.... OK, so here's some scenarios that I've been working from that display how I envision this being used: 1) IMSM arrays: a single domain entry with a path= that specifies all (or some) ports on the controller in question and encompasses all the containers on that controller. 
The action would be spare or grow, and which container we add any new drives to would depend on the various containers' conditions and types.

2) IMSM arrays + native arrays: similar to above but split the ports between IMSM use and native use. No overlapping paths, some paths to one, some paths to the other.

3) native arrays: one or more domains, no overlapping paths, actions dependent on domain.

As an example of this type of setup, let me detail how I commonly configure my machines when I have multiple drives I want in multiple arrays. Let's assume a machine with 6 drives (like the one I'm using right now). I use a raid5 for my / partition, so I can tolerate at most 1 drive failure on my machine before it's unusable. So, my standard partitioning method is to create a 1 gig partition on sda and sdb and make those into a raid1 /boot partition. Then I do 1 gig on all remaining drives and make that into a raid5 swap partition. Then I do the remaining space on all drives as a raid5 root partition. I don't put any more than two drives in the raid1 /boot partition because if I ever lose two drives I can't use the machine anyway, so more than that in the /boot partition is a waste.

So, in my machine's case, here are the domain entries I would create:

DOMAIN path=blah[123456] action=force-partition table=/etc/mdadm.table program=sfdisk
DOMAIN path=blah[12]-part1 action=force-spare
DOMAIN path=blah[3456]-part1 action=force-grow
DOMAIN path=blah*-part2 action=force-grow

Assuming that blah in the above is the path to my PCI sata controller, the first entry would tell mdadm that if a bare disk is inserted into slots 1 through 6, then force the disk to have the correct partition table for my usage (Dan, I think this should clear up the confusion about the partition action you had in another email, but I'll address it there too...partition is really only for native array types, IMSM will never use it).

The second entry says that if it's sda or sdb and it's the first partition (so sda1 or sdb1), then force it to be added as a spare to any arrays in the domain. Because of how the arrays_in_domain function works, this will only ever match the raid1 /boot array, so we know for a fact that it will always get added to the raid1 /boot array. And because that array only exists on sda1 and sdb1 anyway, we know that if we ever plug a drive into either of those slots, then the array will already be degraded, and this spare will be used to bring the array back into good condition.

The third entry says that on the remaining ports the first partition is used to grow (if possible, or spare if the array is degraded) any existing array. This means that my raid5 swap partition will either get repaired or grown, depending on the situation. The final entry makes it so that the second partition on any disk inserted is used to grow (or spare, if degraded) the / partition.

One of the things that the current code relies upon is something that we talked about earlier. For native array types, we only allow identical partition tables. We don't try to do things like add /dev/sdd4 to an array comprised of entries such as /dev/sdc3. Finding a suitable partition when partition tables are not identical is beyond the initial version of this code. Because of this requirement, the arrays_in_domain function makes use of this to narrow down arrays that might match a domain based upon partition number. So if the current pathname includes part? in its path, the function only returns arrays with the same part in their path. 
That considerably eases the matching process. > >> >>> So, do you plan to make changes similar to incremental in assembly to serve DOMAIN? >> >> I had not planned on it, no. The reason being that assembly isn't used >> for hotplug. I guess I could see a use case for this though in that if >> you called mdadm -As then maybe we should consult the DOMAIN entries to >> see if there are free drives inside of a DOMAIN listed as spare or grow >> and whether or not we have any degraded arrays while assembling that >> could use the drives. Dunno if we want to do that though. However, I >> think I would prefer to get the incremental side of things working >> first, then go there. >> >>> Should an array be split (not assembled) if a domain paths are dividing array into two separate DOMAIN? >> >> I don't think so. Amongst other things, this would make it possible to >> render a machine unbootable if you had a type in a domain path. I think >> I would prefer to allow established arrays to assemble regardless of >> domain path entries. >> >>>> I'm happy to rework the code to support it if there's a valid use >>>> case, but so far my design goal has been to have a path only appear in >>>> one domain, and to then perform the appropriate action based upon that >>>> domain. >>> What is then the purpose of metadata keyword? >> >> Mainly as a hint that a given domain uses a specific type of metadata. I want to address this in a bit more detail. One of the conceptual problems I've been wrestling with in my mind if not on paper yet is the problem of telling a drive that is intended to be wiped out and reused from a drive that is part of your desired working set. Let's think about my above example for native arrays, where there are three arrays, a /boot, a swap, and a / array. Much of this talk has centered around "what do we do when we get a hotplug event for a drive and array <blah> is degraded". That's the easy case. The hard case is "what do we do if array <blah> is degraded and the user shuts the machine down, puts in a new-to-this-machine drive (possibly with existing md raid superblocks), and then boots the machine back up and expects us to do the right thing". For anyone that doesn't have true hotplug hardware, this is going to be the common case. If the drive is installed in the last place in the system and it's the last drive we detect, then we have a chance of doing the right thing. But if it's installed to replace /dev/sda, we are *screwed*. It will be the first drive we detect. And we won't know *what* to do with it. And if it has a superblock on it, we won't even know that it's not supposed to be here. We will happily attempt incremental assembly on this drive, possibly starting arrays that have never existed on this machine before. So, I'm actually finding the metadata keyword less useful than possibly adding a UUID keyword and allowing a domain to be restricted to one or more UUIDs. Then if we find an errant UUID in the domain, we know not to assemble it and in fact if the force-spare or force-grow keywords are present we know to wipe it out and use it for our own purposes. However, that doesn't solve the whole problem that if it's /dev/sda then we won't have any other arrays assembled yet, so the second thing we are going to have to do is defer our use of the drive until a later time. Specifically I'm thinking we might have to write a map entry for the drive into the mapfile, then when we run mdadm -IRs (because all distros do this after scsi_wait_scan has completed...right?) 
we can revisit what to do with the drive. The other option is to add the drive to the mapfile, then when mdadm --monitor mode is started have it process the drive because all of our arrays should be up and running by the time we start the monitor process. Those are the only two solutions I have to this issue at the moment. Thoughts welcome. >>> My initial plan was to create a default configuration for a specific metadata, where user specifies actions >>> but without paths letting metadata handler to use default ones. >>> In your description, I can see that the path are required. >> >> Yes. We already have a default action for all paths: incremental. This >> is the same as how things work today without any new support. And when >> you combine incremental with the AUTO keyword in mdadm.conf, you can >> control which devices are auto assembled on a metadata by metadata basis >> without the use of DOMAINs. > > >> The only purpose of a domain then is to >> specify an action other than incremental for devices plugged into a >> given domain. > > I like this statement. It is simple and to the point and seems to capture > the key ideas. > > The question is: is it true? :-) Well, for the initial implementation I would say it's true ;-) Certainly all the other things you bring up here make my brain hurt. > It is suggested that 'domain' also involved in spare-groups and could be used > to warn against, or disable, a 'create' or 'add' which violated policy. > > So maybe: > The purpose of a domain is to guide: > - 'incremental' by specifying actions for hot-plug devices other than the > default Yes. > - 'create' and 'add' by identifying configurations that breach policy We don't really need domains for this. The only things that have hard policy requirements are BIOS based arrays, and that's metadata/platform specific. We could simply test for and warn on create/add operations that violate platform capability without regard to domains. > - 'monitor' by providing an alternate way of specifying spare-groups Although this can be done, it need not be done. I'm still not entirely convinced of the value of the spare-group tag on domain lines. > It is a lot more wordy, but still seems useful. > > While 'incremental' would not benefit from overlapping domains (as each > hotplugged device only wants one action), the other two might. > > Suppose I want to configure array A to use only a certain set of drives, > and array B that can use any drive at all. Then if we disallow overlapping > domains, there is no domain that describes the drives that B can be made from. > > Does that matter? Is it too hypothetical a situation? Let's see if we can construct such a situation. Let's assume that we are talking about IMSM based arrays. Let's assume we have a SAS controller and we have more than 6 ports available (may or may not be possible, I don't know, but for the sake of argument we need it). Let's next assume we have a 3 disk raid5 on ports 0, 1, and 2. And let's assume we have a 3 disk raid5 on ports 4, 5, and 6. Let's then assume we only want the first raid5 to be allowed to use ports 0 through 4, and that the second raid5 is allowed to use ports 0 through 7. To create that config, we create the two following DOMAIN lines: DOMAIN path=blah[01234] action=grow DOMAIN path=blah[01234567] action=grow Now let's assume that we plug a disk into port 3. What happens? Currently, conf_get_domain() will return one, and only one, domain for a given device. 
And it doesn't search for best match (which would be very difficult to do as we use fnmatch to test the glob match, which means that really the path= statement is more or less opaque to us, we don't process it ourselves and don't evaluate it ourselves, we just pass it off to fnmatch and let it tell us if things matched), it just finds the first match and returns it. So, right now anyway, we will match the first domain and the first domain only. That means we will then return that domain, then later when we call arrays_in_domain we will pass in our device path plus our matched domain and as a result we will search mdstat and we will find both raid5 arrays in our requested domain (the current search returns any array with at least one member in the domain, maybe that should be any array where all members are in the domain). Now, at this point, if one or the other array is degraded, then what to do is obvious. However, if both arrays are degraded or neither array is degraded, then our choice is not obvious. I'm having a hard time coming up with a good answer to that issue. It's not clear which array we should grow if both are clean, nor which one takes priority if both are degraded. We would have to add a new tag, maybe priority=, to the ARRAY lines in order to make this decision obvious. Short of that, the proper course of action is probably to do nothing and let the user sort it out. Now let's assume that we plug a disk into port 7. We search and find the second domain. Then we call arrays_in_domain() and we get both raid5 arrays again because both of them have members in the domain. Regardless of anything else, it's clear that this situation did *not* do what you wanted. It did not specify that array 1 can only be on the first 5 ports, and it did not specify that array 2 can use all 8 ports. If we changed the second domain path to be blah[567] then it would work, but I don't think that this combination of domains and the resulting actions is all that clear to understand from a user's perspective. I think right now trying to do what you are suggesting is confusing from a domain line. Maybe we need to add something to array lines for this. Maybe the array line needs an allowed_path entry that could be used to limit what paths an array will accept devices from. But this then assumes we will create an array line for all arrays (or for ones where we want to limit their paths) and I'm not sure people will do (or want to do) that. So, while I can see a possible scenario that matches your hypothetical, I'm finding that the domain construct is a very clunky way to try and implement the constraints of your hypothetical. > Here is another interesting question. Suppose I have two drive chassis, each > connected to the host by a fibre. When I create arrays from all these drives, > I want them to be balanced across the two chassis, both for performance > reasons and for redundancy reasons. > Is there any way we can tell mdadm about this, possible through 'domains'. This is actually the first thing that makes me see the use of spare-group on a domain line. We could construct two different domains, one for each chassis, but with the same spare-group tag. This would imply that both domains are available as spares to the same arrays, but allows us to then add a policy to mdadm for how to select spares from domains. We could add a priority tag to the domain lines. 
If two domains share the same spare-group tag, and the domains have the same priority, then we could round-robin allocate from domains (what you are asking about), but if they have different priorities then we could allocate solely from the higher (or lower, implementation defined) priority domain until there is nothing left to allocate from it and then switch to the other domain. I could actually also see adding a write_mostly flag to an entire domain in case the chassis that domain represents is remote via wan. > This could be an issue when building a RAID10 (alternate across the chassis > is best) or when finding a spare for a RAID1 (choosing from the 'other' > chassis is best). > > I don't really want to solve this now, but I do want to be sure that our > concept of 'domain' is big enough that we will be able to fit that sort of > thing into it one day. > > Maybe a 'domain' is simply a mechanism to add tags to devices, and possibly > by implication to arrays that contain those devices. > The mechanism for resolving when multiple domains add conflicting tags to > the one device would be dependant on the tag. Maybe first-wins. Maybe > all are combined. > > So we add an 'action' tag for --incremental, and the first wins (maybe) > We add a 'sparegroup' tag for --monitor > We add some other tag for balancing (share=1/2, share=2/2 ???) > > I'm not sure how this fits with imposing platform constraints. > As platform constraints are closely tied to metadata types, it might be OK > to have a metadata-specific tags (imsm=???) and leave to details to the > metadata handler??? I'm more and more of the mind that we need to leave platform constraints out of the domain issue and instead just implement proper platform constraint checks and overrides in the various parts of mdadm that need it regardless of domains. > Dan: help me understand these platform constraints: what is the most complex > constraint that you can think of that you might want to impose? > > NeilBrown -- Doug Ledford <dledford@redhat.com> GPG KeyID: CFBFF194 http://people.redhat.com/dledford Infiniband specific RPMs available at http://people.redhat.com/dledford/Infiniband [-- Attachment #2: OpenPGP digital signature --] [-- Type: application/pgp-signature, Size: 198 bytes --] ^ permalink raw reply [flat|nested] 23+ messages in thread
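The table= file in the force-partition line earlier in this message is deliberately opaque to mdadm; its format belongs to the chosen handler. Assuming the sfdisk handler ends up feeding it to the classic sfdisk input parser with megabyte units (an assumption, since that handler is still being written), a table matching the layout described here, a 1 gig first partition and the rest of the disk as a second partition, both tagged for raid, could be as small as:

# /etc/mdadm.table -- hypothetical old-style sfdisk input, one line per
# primary partition: start,size,type.  Empty start means next free, empty
# size means rest of disk, "fd" is Linux raid autodetect.
,1024,fd
,,fd
;
;

Running "sfdisk -uM /dev/sdX < /etc/mdadm.table" by hand would apply such a table; the point of the handler is to validate it and apply it identically to every bare disk that lands in the domain.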
* Re: More Hot Unplug/Plug work
2010-04-28 17:47 ` Doug Ledford
2010-04-28 18:34 ` Labun, Marcin
@ 2010-04-28 20:59 ` Luca Berra
2010-04-28 21:16 ` Doug Ledford
1 sibling, 1 reply; 23+ messages in thread
From: Luca Berra @ 2010-04-28 20:59 UTC (permalink / raw)
To: Linux RAID Mailing List

On Wed, Apr 28, 2010 at 01:47:55PM -0400, Doug Ledford wrote:
>DOMAIN path=pci-0000:00:1f.2-scsi-[2345]:0:0:0 action=partition
> table=/etc/mdadm.table program=sfdisk

I admit I did not take the time to pull from your git, so tell me to
rtfc if needed.

It seems you are assuming program will take table as stdin.
Wouldn't it be better to use something like
action=initialize command="sfdisk %d < /etc/mdadm.table" ?
where command is invoked via a shell and %d is replaced with the device
node. (More escapes could also be useful, e.g. the sysfs node.)

Besides that, is there any provision to check that the device really is
empty before running the action?

Regards,
L.

--
Luca Berra -- bluca@comedia.it
Communication Media & Services S.r.l.

^ permalink raw reply	[flat|nested] 23+ messages in thread
* Re: More Hot Unplug/Plug work 2010-04-28 20:59 ` Luca Berra @ 2010-04-28 21:16 ` Doug Ledford 0 siblings, 0 replies; 23+ messages in thread From: Doug Ledford @ 2010-04-28 21:16 UTC (permalink / raw) To: Linux RAID Mailing List [-- Attachment #1: Type: text/plain, Size: 2496 bytes --] On 04/28/2010 04:59 PM, Luca Berra wrote: > On Wed, Apr 28, 2010 at 01:47:55PM -0400, Doug Ledford wrote: >> DOMAIN path=pci-0000:00:1f.2-scsi-[2345]:0:0:0 action=partition >> table=/etc/mdadm.table program=sfdisk > i admit i did not take the time to pull from your git, so tell me to > rtfc if needed. rtfc ;-) > it seems you are assuming program will take table as stdin. No, table is program specific. In this case, for sfdisk, it would be something taken as stdin. However, in the code, there is a specific handler for the sfdisk program type. That handler provides a validate routine to check the contents of the table= entry and make sure it's valid, a check routine to check the table on a given disk and see if it matches what it's supposed to be, and a write routine to update the disks table to what it should be. How it goes about doing these things is particular to the sfdisk handler. I do have plans to add a more generic simple script handler that would allow you to pass things in as you suggest, but I have not yet implemented it. And part of the reason is that I'm extremely leary of the security implications of allowing a text file to spell out a program to be called by a root invoked system daemon. I can see a million different ways to compromise a system when a daemon with raw disk access reads a command from a text file. > would not it be better to use somethink like > action=initialize command="sfdisk %d < /etc/mdadm.table" ? > where command is invoked via a shell and %d is replaced with the device > node. (more escapes could also be useful, e.g. the sysfs node) This is precisely what the sfdisk handler will be doing, only it won't be reading the command from the text file, it will have the knowledge of how to invoke sfdisk safely compiled into the program where compromise is much more difficult. > besides that is there any provisioning to check that the device really > is empty before running action? Yes. In the code that tries to take new disks, it requires either the force-partition option or that the device be declared clean, which per Neil's suggestion is that both the first 4k and last 4k of the device is comprised entirely of one of three patterns: 0x00, 0x5a, 0xff. -- Doug Ledford <dledford@redhat.com> GPG KeyID: CFBFF194 http://people.redhat.com/dledford Infiniband specific RPMs available at http://people.redhat.com/dledford/Infiniband [-- Attachment #2: OpenPGP digital signature --] [-- Type: application/pgp-signature, Size: 198 bytes --] ^ permalink raw reply [flat|nested] 23+ messages in thread
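The clean test Doug describes is easy to restate outside of mdadm. A small sketch follows, assuming a GNU userland (dd, od, blockdev) and reading the rule as "each of the first and last 4 KiB must be uniformly 0x00, 0x5a or 0xff"; mdadm performs this check internally in C, so this is only an illustration of the rule, not its implementation.

#!/bin/sh
# Report whether a block device looks "clean" per the rule described above.
dev=$1
sectors=$(blockdev --getsz "$dev")        # device size in 512-byte sectors

distinct_bytes() {                        # unique byte values in a 4 KiB block
    dd if="$dev" bs=512 count=8 skip="$1" 2>/dev/null \
        | od -An -v -tx1 | tr -s ' ' '\n' | grep . | sort -u
}

is_pattern() {                            # exactly one value, from the set
    case "$1" in 00|5a|ff) return 0 ;; *) return 1 ;; esac
}

head_bytes=$(distinct_bytes 0)
tail_bytes=$(distinct_bytes $((sectors - 8)))

if is_pattern "$head_bytes" && is_pattern "$tail_bytes"; then
    echo "$dev looks clean"
else
    echo "$dev is NOT clean" >&2
    exit 1
fi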
* Re: More Hot Unplug/Plug work 2010-04-27 16:45 More Hot Unplug/Plug work Doug Ledford 2010-04-27 19:41 ` Christian Gatzemeier 2010-04-28 16:08 ` Labun, Marcin @ 2010-04-29 20:32 ` Dan Williams 2010-04-29 21:22 ` Dan Williams 3 siblings, 0 replies; 23+ messages in thread From: Dan Williams @ 2010-04-29 20:32 UTC (permalink / raw) To: Doug Ledford; +Cc: Linux RAID Mailing List, Neil Brown, Labun, Marcin Doug Ledford wrote: > So, that's where things stand right now. I'm going to keep working on > this as it's incomplete and doesn't actually do any work at the moment > (it's all sanity checks, config file parsing, and infrastructure, the > actual actions are not yet implemented), but I wanted to get out what I > have currently for people to see. So, you can check it out here: > > git://git.fedorapeople.org/~dledford/mdadm.git hotunplug > > Comments welcome. > Quick friendly request... may I ask that you add subject lines to your commits? The git shortlog of your changes is a tad garish. ...now to actually review it. Regards, Dan ^ permalink raw reply [flat|nested] 23+ messages in thread
* Re: More Hot Unplug/Plug work 2010-04-27 16:45 More Hot Unplug/Plug work Doug Ledford ` (2 preceding siblings ...) 2010-04-29 20:32 ` Dan Williams @ 2010-04-29 21:22 ` Dan Williams 2010-04-30 16:26 ` Doug Ledford 3 siblings, 1 reply; 23+ messages in thread From: Dan Williams @ 2010-04-29 21:22 UTC (permalink / raw) To: Doug Ledford; +Cc: Linux RAID Mailing List, Neil Brown, Labun, Marcin On Tue, Apr 27, 2010 at 9:45 AM, Doug Ledford <dledford@redhat.com> wrote: > So I pulled down Neil's git repo and started working from his hotunplug > branch, which was his version of my hotunplug patch. I had to do a > couple minor fixes to it to make it work. I then simply continued on > from there. I have a branch in my git repo that tracks his hotunplug > branch and is also called hotunplug. That's where my current work is at. > > What I've done since then: > > 1) I've implemented a new config file line type: DOMAIN > a) Each DOMAIN line must have at least one valid path= entry, but may > have more than one path= entry. path= entries are file globs and > must match something in /dev/disk/by-path > b) Each DOMAIN line must have one and only one action= entry. Valid > action items are: ignore, incremental, spare, grow, partition. > In addition, a word me be prefixed with force- to indicate that > we should skip certain safety checks and use the device even if it > isn't clean. Just to clarify that we are on the same page with these actions: * incremental is the default action that "does the right thing" if the drive already has metadata. I assume we need checks here to reject disks with ambiguous (multiple valid metadata records) * spare: implies incremental, but if it is a 'bare' device write a spare record * grow: implies incremental but if it is a 'bare' device write a spare record, if there is a degraded array in the domain rebuild it otherwise grow an(y?) array in the domain * partition: if the device has a partition that matches the specified table then add the partitions incrementally A few comments: 1/ Does 'partition' need to be split to 'partition-spare' and 'partition-grow' to imply the action post partitioning? 2/ One of the safety checks for hot-inserting a spare is that it occurs on a port that was recently unplugged. Should that be a default policy or do we need a different flavor spare action like 'spare-same-port'. > c) Each DOMAIN line may have a metadata entry, and may have a > spare-group entry. What is the purpose of the spare group? I thought we were assuming that all DOMAIN members were automatically in the same spare group. Is this to augment the policy to allow spares to float between DOMAINs? Something like the following where the different domains allow spares to cross boundaries? DOMAIN path=A spare-group=B action=grow DOMAIN path=B spare-group=A action=spare > d) For the partition action, a DOMAIN line must have a program= and > a table= entry. Currently, the program= entry must be an item > out of a list of known partition programs (I'm working on getting > sfdisk up and running, but for arches other than x86, other > methods would be needed, and I'm planning on adding a method > that allows us to call out to a user supplied script/program > instead of a known internal method). The table= entry points to > a file that contains a method specific table indicating the > necessary partition layout. As mentioned in previous mails, we > only support identical partition tables at this point. That > may never change, who knows. 
> > 2) Created a new udev rules file that gets installed as > 05-md-early.rules. This rule file, combined with our existing rules > file, is a key element to how this domain support works. In particular, > udev rules allow us to separate out devices that already have some sort > of raid superblock from devices that don't. We then add a new flag to > our incremental mode to indicate that a device currently does not belong > to us, and we perform a series of checks to see if it should, and if so, > we "grab" it (I would have preferred a better name, but the short > options for better names were already taken). When called with the > "grab" flag, we follow a different code path where we check the domain > of the device against our DOMAIN entries and if we have a match, we > perform the specified action. There will need to be some additional > work to catch certain corner cases, such as the case where we have > force-partition and we insert a disk that currently has a raid > superblock on the bare drive. We will currently miss that situation and > not grab the device. So, this is a work in progress and not yet complete. > I notice this rules file grabs all events. Did you see, or disagree, with the suggestion to have a mdadm --activate-domains command to generate udev rules for the paths we care about? -- Dan -- To unsubscribe from this list: send the line "unsubscribe linux-raid" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html ^ permalink raw reply [flat|nested] 23+ messages in thread
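For reference while reading this exchange, the superblock/bare split that the quoted text describes can be pictured as a rule file along the following lines. This is a sketch, not the actual 05-md-early.rules from the branch; it assumes blkid has already populated ID_FS_TYPE for the device, and the --grab long option is a placeholder for whatever spelling the branch settles on.

# illustrative udev rules only, not the shipped 05-md-early.rules
SUBSYSTEM!="block", GOTO="md_early_end"
ACTION!="add", GOTO="md_early_end"
# devices that already carry raid metadata: ordinary incremental assembly
ENV{ID_FS_TYPE}=="linux_raid_member|isw_raid_member|ddf_raid_member", RUN+="/sbin/mdadm -I $env{DEVNAME}", GOTO="md_early_end"
# everything else is offered to the DOMAIN logic (the "grab" path)
RUN+="/sbin/mdadm -I --grab $env{DEVNAME}"
LABEL="md_early_end"

Dan's --activate-domains suggestion would instead generate the path matches into the rules themselves, so udev never invokes mdadm for devices outside any DOMAIN; Doug's reply below argues for keeping the rules generic and making the no-op case cheap.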
* Re: More Hot Unplug/Plug work 2010-04-29 21:22 ` Dan Williams @ 2010-04-30 16:26 ` Doug Ledford 0 siblings, 0 replies; 23+ messages in thread From: Doug Ledford @ 2010-04-30 16:26 UTC (permalink / raw) To: Dan Williams; +Cc: Linux RAID Mailing List, Neil Brown, Labun, Marcin [-- Attachment #1: Type: text/plain, Size: 8001 bytes --] On 04/29/2010 05:22 PM, Dan Williams wrote: > On Tue, Apr 27, 2010 at 9:45 AM, Doug Ledford <dledford@redhat.com> wrote: >> So I pulled down Neil's git repo and started working from his hotunplug >> branch, which was his version of my hotunplug patch. I had to do a >> couple minor fixes to it to make it work. I then simply continued on >> from there. I have a branch in my git repo that tracks his hotunplug >> branch and is also called hotunplug. That's where my current work is at. >> >> What I've done since then: >> >> 1) I've implemented a new config file line type: DOMAIN >> a) Each DOMAIN line must have at least one valid path= entry, but may >> have more than one path= entry. path= entries are file globs and >> must match something in /dev/disk/by-path >> b) Each DOMAIN line must have one and only one action= entry. Valid >> action items are: ignore, incremental, spare, grow, partition. >> In addition, a word me be prefixed with force- to indicate that >> we should skip certain safety checks and use the device even if it >> isn't clean. > > Just to clarify that we are on the same page with these actions: > * incremental is the default action that "does the right thing" if the > drive already has metadata. I assume we need checks here to reject > disks with ambiguous (multiple valid metadata records) > * spare: implies incremental, but if it is a 'bare' device write a spare record > * grow: implies incremental but if it is a 'bare' device write a spare > record, if there is a degraded array in the domain rebuild it > otherwise grow an(y?) array in the domain > * partition: if the device has a partition that matches the specified > table then add the partitions incrementally No, partition is an action, so a partition domain (which is limited to being a whole disk device) causes us to write out a partition table on the device. This is only useful for native array types, not for imsm arrays. > A few comments: > 1/ Does 'partition' need to be split to 'partition-spare' and > 'partition-grow' to imply the action post partitioning? No, because once you write the partition table out and cause the kernel to reread the partition table, you will get separate incremental events for the partitions themselves and they will match different domains (you would have one domain line for the partition domain and as many domain lines as you need for the actual partitions themselves). > 2/ One of the safety checks for hot-inserting a spare is that it > occurs on a port that was recently unplugged. Should that be a > default policy or do we need a different flavor spare action like > 'spare-same-port'. No, I canned this aspect. The more I thought about it the more I disliked it. I suppose it could be added in for paranoia's sake, but here's why I dropped it: 1) We don't know that the user will necessarily plug the new spare device into the same port. Maybe it was the port that went bad and not the drive and they are using a new port as a result. 2) We specifically talked about this setup acting like a hardware raid chassis and in that situation the hardware chassis grabs a new drive regardless of whether it goes into the same slot as an old drive. 
3) What happens if the technician removes the dead drive and then gets a page they must answer before inserting the new drive and we time things out. Then the technician is left wondering why the drive didn't get used like it should. 4) Maybe they have only one drive carrier and once they remove the old drive they must unmount it from the carrier and mount the new drive to the carrier before inserting the new drive and we time things out. 5) Maybe they are leaving the defunct drive in place and putting this drive into an empty slot and want it to be used for rebuild regardless. Really, the whole concept of a same-port action with a timeout is a nice way to cover our ass and not much more. But our asses are already covered by the fact that we require a clean drive or the use of the force- option on the action. So I just didn't see much real benefit or use for the same port stuff. >> c) Each DOMAIN line may have a metadata entry, and may have a >> spare-group entry. > > What is the purpose of the spare group? I thought we were assuming > that all DOMAIN members were automatically in the same spare group. > Is this to augment the policy to allow spares to float between > DOMAINs? Something like the following where the different domains > allow spares to cross boundaries? > DOMAIN path=A spare-group=B action=grow > DOMAIN path=B spare-group=A action=spare The above is possible, but also the use of different domains in the same spare group with different priorities as outlined in a previous mail would be useful too. >> d) For the partition action, a DOMAIN line must have a program= and >> a table= entry. Currently, the program= entry must be an item >> out of a list of known partition programs (I'm working on getting >> sfdisk up and running, but for arches other than x86, other >> methods would be needed, and I'm planning on adding a method >> that allows us to call out to a user supplied script/program >> instead of a known internal method). The table= entry points to >> a file that contains a method specific table indicating the >> necessary partition layout. As mentioned in previous mails, we >> only support identical partition tables at this point. That >> may never change, who knows. >> >> 2) Created a new udev rules file that gets installed as >> 05-md-early.rules. This rule file, combined with our existing rules >> file, is a key element to how this domain support works. In particular, >> udev rules allow us to separate out devices that already have some sort >> of raid superblock from devices that don't. We then add a new flag to >> our incremental mode to indicate that a device currently does not belong >> to us, and we perform a series of checks to see if it should, and if so, >> we "grab" it (I would have preferred a better name, but the short >> options for better names were already taken). When called with the >> "grab" flag, we follow a different code path where we check the domain >> of the device against our DOMAIN entries and if we have a match, we >> perform the specified action. There will need to be some additional >> work to catch certain corner cases, such as the case where we have >> force-partition and we insert a disk that currently has a raid >> superblock on the bare drive. We will currently miss that situation and >> not grab the device. So, this is a work in progress and not yet complete. >> > > I notice this rules file grabs all events. 
Did you see, or disagree, > with the suggestion to have a mdadm --activate-domains command to > generate udev rules for the paths we care about? I saw it, and did it this way for the same list of reasons I listed above in regards to same-port and timeouts. In addition, --activate-domains means that changes to the config file would not be immediately active, and that would likely violate the principle of least surprise. However, I am actively working on trying to make the checks we perform fast so that essentially the cost is a fork/exec of code most likely already in page cache and if there is nothing to do we want to exit quickly and with minimal touching of any physical media. Considering that udev already touches the physical media to populate the database for the device, our cost is incrementally negligible unless we pass all of our simple checks and end up needing to go to media. -- Doug Ledford <dledford@redhat.com> GPG KeyID: CFBFF194 http://people.redhat.com/dledford Infiniband specific RPMs available at http://people.redhat.com/dledford/Infiniband [-- Attachment #2: OpenPGP digital signature --] [-- Type: application/pgp-signature, Size: 198 bytes --] ^ permalink raw reply [flat|nested] 23+ messages in thread