* More Hot Unplug/Plug work
@ 2010-04-27 16:45 Doug Ledford
2010-04-27 19:41 ` Christian Gatzemeier
` (3 more replies)
0 siblings, 4 replies; 23+ messages in thread
From: Doug Ledford @ 2010-04-27 16:45 UTC (permalink / raw)
To: Linux RAID Mailing List, Neil Brown, Labun, Marcin, Dan Williams
So I pulled down Neil's git repo and started working from his hotunplug
branch, which was his version of my hotunplug patch. I had to make a
couple of minor fixes to get it working, and then simply continued on
from there. I have a branch in my git repo that tracks his hotunplug
branch and is also called hotunplug; that's where my current work is.
What I've done since then:
1) I've implemented a new config file line type: DOMAIN
a) Each DOMAIN line must have at least one valid path= entry, and may
have more than one. path= entries are file globs and must match
something in /dev/disk/by-path.
b) Each DOMAIN line must have one and only one action= entry. Valid
actions are: ignore, incremental, spare, grow, partition.
In addition, an action may be prefixed with force- to indicate that
we should skip certain safety checks and use the device even if it
isn't clean.
c) Each DOMAIN line may have a metadata entry, and may have a
spare-group entry.
d) For the partition action, a DOMAIN line must also have a program=
and a table= entry. Currently, the program= entry must be an item
from a list of known partition programs (I'm working on getting
sfdisk up and running, but for arches other than x86 other
methods would be needed, and I'm planning on adding a method
that allows us to call out to a user-supplied script/program
instead of a known internal method). The table= entry points to
a file containing a method-specific table that describes the
required partition layout. As mentioned in previous mails, we
only support identical partition tables at this point; that
may never change. An example is shown below.
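To make that concrete, a pair of DOMAIN lines using this syntax could
look something like the following (the by-path globs cover four ports
on a single controller on my test box; the contents of the table file
are hypothetical):

  DOMAIN path=pci-0000:00:1f.2-scsi-[2345]:0:0:0 action=partition table=/etc/mdadm.table program=sfdisk
  DOMAIN path=pci-0000:00:1f.2-scsi-[2345]:0:0:0-part? action=spare

That is: partition any bare disk that shows up on those ports using
the layout in /etc/mdadm.table, then add the resulting partitions as
spares. For the sfdisk method the table file would just hold
sfdisk-style input, e.g. a single whole-disk partition of type fd
(again, purely illustrative):

  ,,fd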
2) Created a new udev rules file that gets installed as
05-md-early.rules. This rule file, combined with our existing rules
file, is a key element of how this domain support works. In particular,
udev rules allow us to separate out devices that already have some sort
of raid superblock from devices that don't. We then add a new flag to
our incremental mode to indicate that a device currently does not belong
to us, and we perform a series of checks to see if it should, and if so,
we "grab" it (I would have preferred a better name, but the short
options for better names were already taken). When called with the
"grab" flag, we follow a different code path where we check the domain
of the device against our DOMAIN entries and if we have a match, we
perform the specified action. There will need to be some additional
work to catch certain corner cases, such as when the action is
force-partition and the inserted disk already has a raid superblock
on the bare drive; we will currently miss that situation and not
grab the device. So this is a work in progress and not yet complete.
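To give a feel for how the udev split works, the logic is roughly the
following (this is only a sketch, not the actual 05-md-early.rules
from the branch; the ID_FS_TYPE values, the assumption that blkid data
has already been imported, and the "--grab" spelling of the new
incremental flag are all illustrative):

  SUBSYSTEM!="block", GOTO="md_grab_end"
  ACTION!="add", GOTO="md_grab_end"
  # devices that already carry a raid superblock take the normal
  # incremental path via the existing md rules file
  ENV{ID_FS_TYPE}=="linux_raid_member|isw_raid_member|ddf_raid_member", GOTO="md_grab_end"
  # bare devices: offer them to incremental mode, which checks them
  # against the DOMAIN entries and grabs them if appropriate
  RUN+="/sbin/mdadm -I --grab $env{DEVNAME}"
  LABEL="md_grab_end"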
3) Added IncrementalNew, IncrementalNewPart, and IncrementalNewDisk to
Incremental.c. IncrementalNew is called any time incremental mode is
passed the "grab" flag. In IncrementalNew we merely figure out whether
the device matches a domain entry; if it does, we check whether that
domain entry is a partition entry. If it is, we pass the device off to
IncrementalNewDisk; if not, we pass it off to IncrementalNewPart
(hmmm...renaming of these functions might be in order...it made sense
to me because I was thinking "this is what we do on a bare disk, this
is what we do on a partition", but calling NewDisk on partition
domains and NewPart on everything else does seem backwards).
These functions are partial stubs at the moment. They do some sanity
checks, but not any real work. However, they aren't intended to do a
lot of work themselves. They are intended to figure out if they should
do work, then simply invoke Managedevs with the 'a' flag to do the
actual work. And if the method is grow, then we will call Managedevs
with the 'a' disposition, then call Grow with the right options to do
the right thing. The point is that all the code necessary to
automatically use a device already exists; we just have to invoke it
automatically instead of requiring a user to invoke it from the command
line. I'm very big on reusing that existing code and not trying to
duplicate it here.
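In pseudo-code, the dispatch described above amounts to something like
this (an illustrative sketch only -- the argument lists, struct fields,
and helper names are placeholders, not the actual code from the
hotunplug branch):

  #include <stddef.h>

  enum domain_action {
          act_ignore, act_incremental, act_spare, act_grow, act_partition
  };

  struct domain_ent {
          enum domain_action action;
          /* path globs, metadata, spare-group, ... elided */
  };

  /* assumed to exist elsewhere; declarations shown for completeness */
  struct domain_ent *domain_for_device(const char *devname);
  int IncrementalNewDisk(const char *devname, struct domain_ent *dom);
  int IncrementalNewPart(const char *devname, struct domain_ent *dom);

  int IncrementalNew(const char *devname)
  {
          struct domain_ent *dom = domain_for_device(devname);

          if (!dom)
                  return 1;       /* not in any DOMAIN -- leave it alone */

          if (dom->action == act_partition)
                  /* bare-disk domain: partition it per table=/program=,
                   * then the new partitions come back through udev */
                  return IncrementalNewDisk(devname, dom);

          /* spare/grow/etc.: sanity check, then reuse the existing
           * Managedevs/Grow code with the 'a' disposition */
          return IncrementalNewPart(devname, dom);
  }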
So, that's where things stand right now. I'm going to keep working on
this as it's incomplete and doesn't actually do any work at the moment
(it's all sanity checks, config file parsing, and infrastructure; the
actual actions are not yet implemented), but I wanted to get out what I
have currently for people to see. You can check it out here:
git://git.fedorapeople.org/~dledford/mdadm.git hotunplug
Comments welcome.
--
Doug Ledford <dledford@redhat.com>
GPG KeyID: CFBFF194
http://people.redhat.com/dledford
Infiniband specific RPMs available at
http://people.redhat.com/dledford/Infiniband
* Re: More Hot Unplug/Plug work
@ 2010-04-27 19:41 ` Christian Gatzemeier
From: Christian Gatzemeier @ 2010-04-27 19:41 UTC (permalink / raw)
To: linux-raid
Doug Ledford <dledford <at> redhat.com> writes:
> we "grab" it (I would have preferred a better name, but the short
> options for better names were already taken).
Ah well, I don't know what may be taken, a quick rundown of thoughts
though:
incremental: prep, enlist, use, new <device>
* RE: More Hot Unplug/Plug work 2010-04-27 16:45 More Hot Unplug/Plug work Doug Ledford 2010-04-27 19:41 ` Christian Gatzemeier @ 2010-04-28 16:08 ` Labun, Marcin 2010-04-28 17:47 ` Doug Ledford 2010-04-29 20:32 ` Dan Williams 2010-04-29 21:22 ` Dan Williams 3 siblings, 1 reply; 23+ messages in thread From: Labun, Marcin @ 2010-04-28 16:08 UTC (permalink / raw) To: Doug Ledford, Neil Brown Cc: Linux RAID Mailing List, Williams, Dan J, Ciechanowski, Ed, Hawrylewicz Czarnowski, Przemyslaw > -----Original Message----- > From: Doug Ledford [mailto:dledford@redhat.com] > Sent: Tuesday, April 27, 2010 6:45 PM > To: Linux RAID Mailing List; Neil Brown; Labun, Marcin; Williams, Dan J > Subject: More Hot Unplug/Plug work > > So I pulled down Neil's git repo and started working from his hotunplug > branch, which was his version of my hotunplug patch. I had to do a > couple minor fixes to it to make it work. I then simply continued on > from there. I have a branch in my git repo that tracks his hotunplug > branch and is also called hotunplug. That's where my current work is > at. > > What I've done since then: > > 1) I've implemented a new config file line type: DOMAIN > a) Each DOMAIN line must have at least one valid path= entry, but > may > have more than one path= entry. path= entries are file globs and > must match something in /dev/disk/by-path DOMAIN is defined per container or raid volume for native metadata. Each DOMAIN can have more than one path, so actually list of path define if a given disk belongs to domain or not. Do you plan to allow for the same path to be assigned to different containers (so path is shared between domains)? If so the domains will have some or all paths overlapped, and some containers will share some paths. Going further, thus causes that a new disk can be potentially grabbed by more than one container (because of shared path). For example: DOMAIN1: path=a path=b path=c DOMAIN2: path=a path=d DOMAIN3: path=d path=c In this example disks from path c can appear in DOMAIN 1 and DOMAIN 3, but not in DOMAIN 2. So, in case of Monitor, sharing a spare device will be per path basis. The same for new disks in hot-plug feature. In your repo domain_ent is a struct that contains domain paths. The function arrays_in_domain returns a list of mdstat entries that are in the same domain as the constituent device name. (so it requires devname and domain as input parameter). In which case two containers will share the same DOMAIN? It seems that this function shall return a list of mdstat entries that share a path to which a devname device belongs. So, a given new device is tried to be grabbed by a list of a containers (or native volumes). Can you send a config file example? Marcin Labun ^ permalink raw reply [flat|nested] 23+ messages in thread
* Re: More Hot Unplug/Plug work 2010-04-28 16:08 ` Labun, Marcin @ 2010-04-28 17:47 ` Doug Ledford 2010-04-28 18:34 ` Labun, Marcin 2010-04-28 20:59 ` Luca Berra 0 siblings, 2 replies; 23+ messages in thread From: Doug Ledford @ 2010-04-28 17:47 UTC (permalink / raw) To: Labun, Marcin Cc: Neil Brown, Linux RAID Mailing List, Williams, Dan J, Ciechanowski, Ed, Hawrylewicz Czarnowski, Przemyslaw [-- Attachment #1: Type: text/plain, Size: 5107 bytes --] On 04/28/2010 12:08 PM, Labun, Marcin wrote: > > >> -----Original Message----- >> From: Doug Ledford [mailto:dledford@redhat.com] >> Sent: Tuesday, April 27, 2010 6:45 PM >> To: Linux RAID Mailing List; Neil Brown; Labun, Marcin; Williams, Dan J >> Subject: More Hot Unplug/Plug work >> >> So I pulled down Neil's git repo and started working from his hotunplug >> branch, which was his version of my hotunplug patch. I had to do a >> couple minor fixes to it to make it work. I then simply continued on >> from there. I have a branch in my git repo that tracks his hotunplug >> branch and is also called hotunplug. That's where my current work is >> at. >> >> What I've done since then: >> >> 1) I've implemented a new config file line type: DOMAIN >> a) Each DOMAIN line must have at least one valid path= entry, but >> may >> have more than one path= entry. path= entries are file globs and >> must match something in /dev/disk/by-path > > DOMAIN is defined per container or raid volume for native metadata. No, a DOMAIN can encompass more than a single volume, array, or container. > Each DOMAIN can have more than one path, so actually list of path define if a given disk belongs to domain or not. Correct. > Do you plan to allow for the same path to be assigned to different containers (so path is shared between domains)? I had planned that a single DOMAIN can encompass multiple containers. So I didn't planned on it a single path being in multiple DOMAINs, but I did plan that a single domain could allow a device to be placed in multiple different containers based upon need. I don't have checks in place to make sure the same path isn't listed in more than one domain, although that would be a next step. > If so the domains will have some or all paths overlapped, and some containers will share some paths. > Going further, thus causes that a new disk can be potentially grabbed by more than one container (because of shared path). > For example: > DOMAIN1: path=a path=b path=c > DOMAIN2: path=a path=d > DOMAIN3: path=d path=c > In this example disks from path c can appear in DOMAIN 1 and DOMAIN 3, but not in DOMAIN 2. What exactly is the use case for overlapping paths in different domains? I'm happy to rework the code to support it if there's a valid use case, but so far my design goal has been to have a path only appear in one domain, and to then perform the appropriate action based upon that domain. So if more than one container array was present in a single DOMAIN entry (lets assume that the domain entry path encompassed all 6 sata ports on a motherboard and therefore covered the entire platform capability of the imsm motherboard bios), then we would add the new drive as a spare to one of the imsm arrays. It's not currently deterministic which one we would add it to, but that would change as the code matures and we would search for a degraded array that we could add it to. Only if there are no degraded arrays would we add it as a spare to one of the arrays (non-deterministic which one). 
If we add it as a spare to one of the arrays, then monitor mode can move that spare around as needed later based upon the spare-group settings. Currently, there is no correlation between spare-group and DOMAIN entries, but that might change. > So, in case of Monitor, sharing a spare device will be per path basis. Currently, monitor mode still uses spare-group for controlling what arrays can share spares. It does not yet check any DOMAIN information. > The same for new disks in hot-plug feature. > > > In your repo domain_ent is a struct that contains domain paths. > The function arrays_in_domain returns a list of mdstat entries that are in the same domain as the constituent device name. > (so it requires devname and domain as input parameter). > In which case two containers will share the same DOMAIN? You get the list of containers, not just one. See above about searching the list for a degraded container and adding to it before a healthy container. > It seems that this function shall return a list of mdstat entries that share a path to which a devname device belongs. > So, a given new device is tried to be grabbed by a list of a containers (or native volumes). Yes. There can be more than one array/container that this device might go to. > Can you send a config file example? The first two entries are good, the third is a known bad line that I just leave in there to make sure I don't partition the wrong thing. DOMAIN path=pci-0000:00:1f.2-scsi-[2345]:0:0:0 action=partition table=/etc/mdadm.table program=sfdisk DOMAIN path=pci-0000:00:1f.2-scsi-[2345]:0:0:0-part? action=spare DOMAIN path=pci-0000:00:1f.2-scsi-[2345]:0:0:0* path=pci-0000:00:1f.2-scsi-[2345]:0:0:0-part* action=partition -- Doug Ledford <dledford@redhat.com> GPG KeyID: CFBFF194 http://people.redhat.com/dledford Infiniband specific RPMs available at http://people.redhat.com/dledford/Infiniband [-- Attachment #2: OpenPGP digital signature --] [-- Type: application/pgp-signature, Size: 198 bytes --] ^ permalink raw reply [flat|nested] 23+ messages in thread
* RE: More Hot Unplug/Plug work 2010-04-28 17:47 ` Doug Ledford @ 2010-04-28 18:34 ` Labun, Marcin 2010-04-28 21:05 ` Doug Ledford 2010-04-28 20:59 ` Luca Berra 1 sibling, 1 reply; 23+ messages in thread From: Labun, Marcin @ 2010-04-28 18:34 UTC (permalink / raw) To: Doug Ledford Cc: Neil Brown, Linux RAID Mailing List, Williams, Dan J, Ciechanowski, Ed, Hawrylewicz Czarnowski, Przemyslaw > > Going further, thus causes that a new disk can be potentially grabbed > by more than one container (because of shared path). > > For example: > > DOMAIN1: path=a path=b path=c > > DOMAIN2: path=a path=d > > DOMAIN3: path=d path=c > > In this example disks from path c can appear in DOMAIN 1 and DOMAIN > 3, but not in DOMAIN 2. > > What exactly is the use case for overlapping paths in different > domains? OK, makes sense. But if they are overlapped, will the config functions assign path are requested by configuration file or treat it as misconfiguration? So, do you plan to make changes similar to incremental in assembly to serve DOMAIN? Should an array be split (not assembled) if a domain paths are dividing array into two separate DOMAIN? > I'm happy to rework the code to support it if there's a valid use > case, but so far my design goal has been to have a path only appear in > one domain, and to then perform the appropriate action based upon that > domain. What is then the purpose of metadata keyword? My initial plan was to create a default configuration for a specific metadata, where user specifies actions but without paths letting metadata handler to use default ones. In your description, I can see that the path are required. > add it to. Only if there are no degraded arrays would we add it as a > spare to one of the arrays (non-deterministic which one). If we add it > as a spare to one of the arrays, then monitor mode can move that spare > around as needed later based upon the spare-group settings. Currently, > there is no correlation between spare-group and DOMAIN entries, but > that might change. A spare should go to any container controlled by mdmon, so any that contains redundant volumes. > > > So, in case of Monitor, sharing a spare device will be per path > basis. > > Currently, monitor mode still uses spare-group for controlling what > arrays can share spares. It does not yet check any DOMAIN information. Yes, and I am now adding support for domains in monitor and for spare-groups for external metadata. > > > The same for new disks in hot-plug feature. > > > > > > In your repo domain_ent is a struct that contains domain paths. > > The function arrays_in_domain returns a list of mdstat entries that > are in the same domain as the constituent device name. > > (so it requires devname and domain as input parameter). > > In which case two containers will share the same DOMAIN? > > You get the list of containers, not just one. See above about > searching the list for a degraded container and adding to it before a > healthy container. OK. ^ permalink raw reply [flat|nested] 23+ messages in thread
* Re: More Hot Unplug/Plug work 2010-04-28 18:34 ` Labun, Marcin @ 2010-04-28 21:05 ` Doug Ledford 2010-04-28 21:13 ` Dan Williams 2010-04-29 1:01 ` Neil Brown 0 siblings, 2 replies; 23+ messages in thread From: Doug Ledford @ 2010-04-28 21:05 UTC (permalink / raw) To: Labun, Marcin Cc: Neil Brown, Linux RAID Mailing List, Williams, Dan J, Ciechanowski, Ed, Hawrylewicz Czarnowski, Przemyslaw [-- Attachment #1: Type: text/plain, Size: 4899 bytes --] On 04/28/2010 02:34 PM, Labun, Marcin wrote: >>> Going further, thus causes that a new disk can be potentially grabbed >> by more than one container (because of shared path). >>> For example: >>> DOMAIN1: path=a path=b path=c >>> DOMAIN2: path=a path=d >>> DOMAIN3: path=d path=c >>> In this example disks from path c can appear in DOMAIN 1 and DOMAIN >> 3, but not in DOMAIN 2. >> >> What exactly is the use case for overlapping paths in different >> domains? > > OK, makes sense. > But if they are overlapped, will the config functions assign path are requested by configuration file > or treat it as misconfiguration? For now it merely means that the first match found is the only one that will ever get used. I'm not entirely sure how feasible it is to detect matching paths unless we are just talking about identical strings in the path= statement. But since the path= statement is passed to fnmatch(), which treats it as a file glob, it would be possible to construct two path statements that don't match but match the same set of files. I don't think we can reasonably detect this situation, so it may be that the answer is "the first match found is used" and have that be the official stance. > So, do you plan to make changes similar to incremental in assembly to serve DOMAIN? I had not planned on it, no. The reason being that assembly isn't used for hotplug. I guess I could see a use case for this though in that if you called mdadm -As then maybe we should consult the DOMAIN entries to see if there are free drives inside of a DOMAIN listed as spare or grow and whether or not we have any degraded arrays while assembling that could use the drives. Dunno if we want to do that though. However, I think I would prefer to get the incremental side of things working first, then go there. > Should an array be split (not assembled) if a domain paths are dividing array into two separate DOMAIN? I don't think so. Amongst other things, this would make it possible to render a machine unbootable if you had a type in a domain path. I think I would prefer to allow established arrays to assemble regardless of domain path entries. >> I'm happy to rework the code to support it if there's a valid use >> case, but so far my design goal has been to have a path only appear in >> one domain, and to then perform the appropriate action based upon that >> domain. > What is then the purpose of metadata keyword? Mainly as a hint that a given domain uses a specific type of metadata. > My initial plan was to create a default configuration for a specific metadata, where user specifies actions > but without paths letting metadata handler to use default ones. > In your description, I can see that the path are required. Yes. We already have a default action for all paths: incremental. This is the same as how things work today without any new support. And when you combine incremental with the AUTO keyword in mdadm.conf, you can control which devices are auto assembled on a metadata by metadata basis without the use of DOMAINs. 
The only purpose of a domain then is to specify an action other than incremental for devices plugged into a given domain. >> add it to. Only if there are no degraded arrays would we add it as a >> spare to one of the arrays (non-deterministic which one). If we add it >> as a spare to one of the arrays, then monitor mode can move that spare >> around as needed later based upon the spare-group settings. Currently, >> there is no correlation between spare-group and DOMAIN entries, but >> that might change. > > A spare should go to any container controlled by mdmon, so any that contains redundant volumes. Yep. >> >>> So, in case of Monitor, sharing a spare device will be per path >> basis. >> >> Currently, monitor mode still uses spare-group for controlling what >> arrays can share spares. It does not yet check any DOMAIN information. > > Yes, and I am now adding support for domains in monitor and for spare-groups for external metadata. Good to hear. >> >>> The same for new disks in hot-plug feature. >>> >>> >>> In your repo domain_ent is a struct that contains domain paths. >>> The function arrays_in_domain returns a list of mdstat entries that >> are in the same domain as the constituent device name. >>> (so it requires devname and domain as input parameter). >>> In which case two containers will share the same DOMAIN? >> >> You get the list of containers, not just one. See above about >> searching the list for a degraded container and adding to it before a >> healthy container. > OK. -- Doug Ledford <dledford@redhat.com> GPG KeyID: CFBFF194 http://people.redhat.com/dledford Infiniband specific RPMs available at http://people.redhat.com/dledford/Infiniband [-- Attachment #2: OpenPGP digital signature --] [-- Type: application/pgp-signature, Size: 198 bytes --] ^ permalink raw reply [flat|nested] 23+ messages in thread
* Re: More Hot Unplug/Plug work 2010-04-28 21:05 ` Doug Ledford @ 2010-04-28 21:13 ` Dan Williams 2010-04-30 13:38 ` Doug Ledford 2010-04-29 1:01 ` Neil Brown 1 sibling, 1 reply; 23+ messages in thread From: Dan Williams @ 2010-04-28 21:13 UTC (permalink / raw) To: Doug Ledford Cc: Labun, Marcin, Neil Brown, Linux RAID Mailing List, Ciechanowski, Ed, Hawrylewicz Czarnowski, Przemyslaw Doug Ledford wrote: > On 04/28/2010 02:34 PM, Labun, Marcin wrote: >> Should an array be split (not assembled) if a domain paths are dividing array into two separate DOMAIN? > > I don't think so. Amongst other things, this would make it possible to > render a machine unbootable if you had a type in a domain path. I think > I would prefer to allow established arrays to assemble regardless of > domain path entries. This is what I was calling the 'enforce=' policy in previous mails. Whether to block, warn, or ignore arrays that span a domain. I can see someone wanting to have something like enforce=platform to make sure we Linux tries to assemble an array that the option-rom can't put together. >>> I'm happy to rework the code to support it if there's a valid use >>> case, but so far my design goal has been to have a path only appear in >>> one domain, and to then perform the appropriate action based upon that >>> domain. >> What is then the purpose of metadata keyword? > > Mainly as a hint that a given domain uses a specific type of metadata. Yeah, to protect against cases where a stale disk is plugged into an unexpected port. -- Dan ^ permalink raw reply [flat|nested] 23+ messages in thread
* Re: More Hot Unplug/Plug work 2010-04-28 21:13 ` Dan Williams @ 2010-04-30 13:38 ` Doug Ledford 0 siblings, 0 replies; 23+ messages in thread From: Doug Ledford @ 2010-04-30 13:38 UTC (permalink / raw) To: Dan Williams Cc: Labun, Marcin, Neil Brown, Linux RAID Mailing List, Ciechanowski, Ed, Hawrylewicz Czarnowski, Przemyslaw [-- Attachment #1: Type: text/plain, Size: 1333 bytes --] On 04/28/2010 05:13 PM, Dan Williams wrote: > Doug Ledford wrote: >> On 04/28/2010 02:34 PM, Labun, Marcin wrote: >>> Should an array be split (not assembled) if a domain paths are >>> dividing array into two separate DOMAIN? >> >> I don't think so. Amongst other things, this would make it possible to >> render a machine unbootable if you had a type in a domain path. I think >> I would prefer to allow established arrays to assemble regardless of >> domain path entries. > > This is what I was calling the 'enforce=' policy in previous mails. > Whether to block, warn, or ignore arrays that span a domain. I can see > someone wanting to have something like enforce=platform to make sure we > Linux tries to assemble an array that the option-rom can't put together. I would suggest that the proper way to handle this is to warn on assembling an array that spans boundaries but proceed with the assembly (including incremental), warn and require a force flag on creating an array that spans boundaries, and warn and require the force flag to automatically use devices that span boundaries. -- Doug Ledford <dledford@redhat.com> GPG KeyID: CFBFF194 http://people.redhat.com/dledford Infiniband specific RPMs available at http://people.redhat.com/dledford/Infiniband [-- Attachment #2: OpenPGP digital signature --] [-- Type: application/pgp-signature, Size: 198 bytes --] ^ permalink raw reply [flat|nested] 23+ messages in thread
* Re: More Hot Unplug/Plug work 2010-04-28 21:05 ` Doug Ledford 2010-04-28 21:13 ` Dan Williams @ 2010-04-29 1:01 ` Neil Brown 2010-04-29 1:19 ` Dan Williams 2010-04-30 15:52 ` Doug Ledford 1 sibling, 2 replies; 23+ messages in thread From: Neil Brown @ 2010-04-29 1:01 UTC (permalink / raw) To: Doug Ledford Cc: Labun, Marcin, Linux RAID Mailing List, Williams, Dan J, Ciechanowski, Ed, Hawrylewicz Czarnowski, Przemyslaw On Wed, 28 Apr 2010 17:05:58 -0400 Doug Ledford <dledford@redhat.com> wrote: > On 04/28/2010 02:34 PM, Labun, Marcin wrote: > >>> Going further, thus causes that a new disk can be potentially grabbed > >> by more than one container (because of shared path). > >>> For example: > >>> DOMAIN1: path=a path=b path=c > >>> DOMAIN2: path=a path=d > >>> DOMAIN3: path=d path=c > >>> In this example disks from path c can appear in DOMAIN 1 and DOMAIN > >> 3, but not in DOMAIN 2. > >> > >> What exactly is the use case for overlapping paths in different > >> domains? > > > > OK, makes sense. > > But if they are overlapped, will the config functions assign path are requested by configuration file > > or treat it as misconfiguration? > > For now it merely means that the first match found is the only one that > will ever get used. I'm not entirely sure how feasible it is to detect > matching paths unless we are just talking about identical strings in the > path= statement. But since the path= statement is passed to fnmatch(), > which treats it as a file glob, it would be possible to construct two > path statements that don't match but match the same set of files. I > don't think we can reasonably detect this situation, so it may be that > the answer is "the first match found is used" and have that be the > official stance. I think we do need an "official stance" here. glob is good for lots of things, but it is hard to say "everything except". The best way to do that is to have a clear ordering with more general globs later in the order. path=abcd action=foo path=abc* action=bar path=* action=baz So the last line doesn't really mean "do baz on everything" but rather "do baz on everything else". You could impose ordering explicitly with a priority number or a "this domain takes precedence over that domain" tag, but I suspect simple ordering in the config file is easiest and so best. An important question to ask here though is whether people will want to generate the "domain" lines automatically and if so, how we can make it hard for people to get that wrong. Inserting a line in the middle of a file is probably more of a challenge than inserting a line with a specific priority or depends-on tag. So before we get too much further down this path, I think it would be good to have some concrete scenarios about how this functionality will actually be put into effect. I'd love to just expect people to always edit mdadm.conf to meet their specific needs, but experience shows that is naive - people will write scripts based on imperfect understanding, then share those scripts with others.... > > > So, do you plan to make changes similar to incremental in assembly to serve DOMAIN? > > I had not planned on it, no. The reason being that assembly isn't used > for hotplug. I guess I could see a use case for this though in that if > you called mdadm -As then maybe we should consult the DOMAIN entries to > see if there are free drives inside of a DOMAIN listed as spare or grow > and whether or not we have any degraded arrays while assembling that > could use the drives. Dunno if we want to do that though. 
However, I > think I would prefer to get the incremental side of things working > first, then go there. > > > Should an array be split (not assembled) if a domain paths are dividing array into two separate DOMAIN? > > I don't think so. Amongst other things, this would make it possible to > render a machine unbootable if you had a type in a domain path. I think > I would prefer to allow established arrays to assemble regardless of > domain path entries. > > >> I'm happy to rework the code to support it if there's a valid use > >> case, but so far my design goal has been to have a path only appear in > >> one domain, and to then perform the appropriate action based upon that > >> domain. > > What is then the purpose of metadata keyword? > > Mainly as a hint that a given domain uses a specific type of metadata. > > > My initial plan was to create a default configuration for a specific metadata, where user specifies actions > > but without paths letting metadata handler to use default ones. > > In your description, I can see that the path are required. > > Yes. We already have a default action for all paths: incremental. This > is the same as how things work today without any new support. And when > you combine incremental with the AUTO keyword in mdadm.conf, you can > control which devices are auto assembled on a metadata by metadata basis > without the use of DOMAINs. > The only purpose of a domain then is to > specify an action other than incremental for devices plugged into a > given domain. I like this statement. It is simple and to the point and seems to capture the key ideas. The question is: is it true? :-) It is suggested that 'domain' also involved in spare-groups and could be used to warn against, or disable, a 'create' or 'add' which violated policy. So maybe: The purpose of a domain is to guide: - 'incremental' by specifying actions for hot-plug devices other than the default - 'create' and 'add' by identifying configurations that breach policy - 'monitor' by providing an alternate way of specifying spare-groups It is a lot more wordy, but still seems useful. While 'incremental' would not benefit from overlapping domains (as each hotplugged device only wants one action), the other two might. Suppose I want to configure array A to use only a certain set of drives, and array B that can use any drive at all. Then if we disallow overlapping domains, there is no domain that describes the drives that B can be made from. Does that matter? Is it too hypothetical a situation? Here is another interesting question. Suppose I have two drive chassis, each connected to the host by a fibre. When I create arrays from all these drives, I want them to be balanced across the two chassis, both for performance reasons and for redundancy reasons. Is there any way we can tell mdadm about this, possible through 'domains'. This could be an issue when building a RAID10 (alternate across the chassis is best) or when finding a spare for a RAID1 (choosing from the 'other' chassis is best). I don't really want to solve this now, but I do want to be sure that our concept of 'domain' is big enough that we will be able to fit that sort of thing into it one day. Maybe a 'domain' is simply a mechanism to add tags to devices, and possibly by implication to arrays that contain those devices. The mechanism for resolving when multiple domains add conflicting tags to the one device would be dependant on the tag. Maybe first-wins. Maybe all are combined. 
So we add an 'action' tag for --incremental, and the first wins (maybe) We add a 'sparegroup' tag for --monitor We add some other tag for balancing (share=1/2, share=2/2 ???) I'm not sure how this fits with imposing platform constraints. As platform constraints are closely tied to metadata types, it might be OK to have a metadata-specific tags (imsm=???) and leave to details to the metadata handler??? Dan: help me understand these platform constraints: what is the most complex constraint that you can think of that you might want to impose? NeilBrown ^ permalink raw reply [flat|nested] 23+ messages in thread
* Re: More Hot Unplug/Plug work 2010-04-29 1:01 ` Neil Brown @ 2010-04-29 1:19 ` Dan Williams 2010-04-29 2:37 ` Neil Brown 2010-04-30 11:14 ` John Robinson 2010-04-30 15:52 ` Doug Ledford 1 sibling, 2 replies; 23+ messages in thread From: Dan Williams @ 2010-04-29 1:19 UTC (permalink / raw) To: Neil Brown Cc: Doug Ledford, Labun, Marcin, Linux RAID Mailing List, Ciechanowski, Ed, Hawrylewicz Czarnowski, Przemyslaw Neil Brown wrote: > I'm not sure how this fits with imposing platform constraints. > As platform constraints are closely tied to metadata types, it might be OK > to have a metadata-specific tags (imsm=???) and leave to details to the > metadata handler??? > > Dan: help me understand these platform constraints: what is the most complex > constraint that you can think of that you might want to impose? At this point we really only need one constraint: prevent controller spanning. If for example I take an existing imsm member off of ahci and reattach it via a usb-to-sata enclosure mdadm needs a policy to prevent associating that drive with anything on ahci. In a pinch this policy can be disabled, but you wouldn't want to rebuild across usb or any other controller because the option-rom only talks ahci and will mark the drive missing. So something like DOMAIN spanning=imsm, to tell mdadm to follow imsm rules for this domain. Where 'spanning' is policy tag?? -- Dan ^ permalink raw reply [flat|nested] 23+ messages in thread
* Re: More Hot Unplug/Plug work 2010-04-29 1:19 ` Dan Williams @ 2010-04-29 2:37 ` Neil Brown 2010-04-29 18:22 ` Labun, Marcin ` (2 more replies) 2010-04-30 11:14 ` John Robinson 1 sibling, 3 replies; 23+ messages in thread From: Neil Brown @ 2010-04-29 2:37 UTC (permalink / raw) To: Dan Williams Cc: Doug Ledford, Labun, Marcin, Linux RAID Mailing List, Ciechanowski, Ed, Hawrylewicz Czarnowski, Przemyslaw On Wed, 28 Apr 2010 18:19:45 -0700 Dan Williams <dan.j.williams@intel.com> wrote: > Neil Brown wrote: > > I'm not sure how this fits with imposing platform constraints. > > As platform constraints are closely tied to metadata types, it might be OK > > to have a metadata-specific tags (imsm=???) and leave to details to the > > metadata handler??? > > > > Dan: help me understand these platform constraints: what is the most complex > > constraint that you can think of that you might want to impose? > > At this point we really only need one constraint: prevent controller > spanning. If for example I take an existing imsm member off of ahci and > reattach it via a usb-to-sata enclosure mdadm needs a policy to prevent > associating that drive with anything on ahci. > > In a pinch this policy can be disabled, but you wouldn't want to rebuild > across usb or any other controller because the option-rom only talks > ahci and will mark the drive missing. > > So something like DOMAIN spanning=imsm, to tell mdadm to follow imsm > rules for this domain. Where 'spanning' is policy tag?? > Thanks. So we have two different ideas here. 1/ A given set of devices (paths) are all attached to the one controller. 2/ A given array is not allowed to span controllers The first statement is somewhat similar to a statement about sparegroups. It groups devices together is some way. The second is a policy statement, and is metadata specific to some extent. If I create a native-metadata array using the controller, then adding other devices from a different controller is a non-issue. It is only when an IMSM array is created that it is an issue (and then - only if the array is to be used for boot for for multi-boot). So the ARRAY line could have "exclusive-group=foo" where 'foo' is a group name similar to those used for 'spare-group=' But that isn't much fun for auto-detect and auto-assembly. Maybe we want to extend the 'auto' line. It gives policy on a per-metadata basis. Maybe: POLICY auto-assemble +1.x -all POLICY same-group imsm where 'same-group' means that all the devices in an array must be in the same spare-group. The 'domain' lines assign spare-groups to devices. Maybe "same-group" could be "same-$tag" so we could have different tags for different metadatas.... Is this working for anyone else, or have I lost the plot?? NeilBrown ^ permalink raw reply [flat|nested] 23+ messages in thread
* RE: More Hot Unplug/Plug work 2010-04-29 2:37 ` Neil Brown @ 2010-04-29 18:22 ` Labun, Marcin 2010-04-29 21:55 ` Dan Williams 2010-04-30 16:13 ` Doug Ledford 2 siblings, 0 replies; 23+ messages in thread From: Labun, Marcin @ 2010-04-29 18:22 UTC (permalink / raw) To: Neil Brown, Williams, Dan J Cc: Doug Ledford, Linux RAID Mailing List, Ciechanowski, Ed, Hawrylewicz Czarnowski, Przemyslaw > Maybe we want to extend the 'auto' line. It gives policy on a per- > metadata > basis. Maybe: > > POLICY auto-assemble +1.x -all > POLICY same-group imsm So, the policy would be global for all arrays of some metadata type. For sure, we want "controller spanning disable" to be default policy for imsm, so this line would be used rather in situation when user would like to change it to a non-default one. Where for instance native metadata would have default policy "spanning enabled". Another metadata internal domain example is a new field "pool id" that creates spare sharing domain and stores it in metadata. ^ permalink raw reply [flat|nested] 23+ messages in thread
* Re: More Hot Unplug/Plug work 2010-04-29 2:37 ` Neil Brown 2010-04-29 18:22 ` Labun, Marcin @ 2010-04-29 21:55 ` Dan Williams 2010-05-03 5:58 ` Neil Brown 2010-04-30 16:13 ` Doug Ledford 2 siblings, 1 reply; 23+ messages in thread From: Dan Williams @ 2010-04-29 21:55 UTC (permalink / raw) To: Neil Brown Cc: Doug Ledford, Labun, Marcin, Linux RAID Mailing List, Ciechanowski, Ed, Hawrylewicz Czarnowski, Przemyslaw On Wed, Apr 28, 2010 at 7:37 PM, Neil Brown <neilb@suse.de> wrote: >> So something like DOMAIN spanning=imsm, to tell mdadm to follow imsm >> rules for this domain. Where 'spanning' is policy tag?? >> > > Thanks. > > So we have two different ideas here. > > 1/ A given set of devices (paths) are all attached to the one controller. > 2/ A given array is not allowed to span controllers > > The first statement is somewhat similar to a statement about sparegroups. > It groups devices together is some way. > > The second is a policy statement, and is metadata specific to some extent. > If I create a native-metadata array using the controller, then adding other > devices from a different controller is a non-issue. It is only when an > IMSM array is created that it is an issue (and then - only if the array is to > be used for boot for for multi-boot). > > So the ARRAY line could have "exclusive-group=foo" where 'foo' is a group > name similar to those used for 'spare-group=' > But that isn't much fun for auto-detect and auto-assembly. > > Maybe we want to extend the 'auto' line. It gives policy on a per-metadata > basis. Maybe: > > POLICY auto-assemble +1.x -all > POLICY same-group imsm > > where 'same-group' means that all the devices in an array must be in the > same spare-group. The 'domain' lines assign spare-groups to devices. > > Maybe "same-group" could be "same-$tag" so we could have different tags > for different metadatas.... > > Is this working for anyone else, or have I lost the plot?? > I am not grokking the separate POLICY line, especially for defining the spare-migration border because that is already what DOMAIN is specifying. Here is what I think we need to allow for simple honoring of platform constraints but without needing to expose all the nuances of those constraints in config-file syntax... yet. 1/ Allow path= to take a metadata name this allows the handler to identify its known controller ports, alleviating the user from needing to track which ports are allowed, especially as it may change over time. If someone really wants to see which ports a metadata handler cares about we could have a DOMAIN line dumped by --detail-platform --brief -e imsm. However for simplicity I would rather just dump: DOMAIN path=imsm action=spare-same-port spare-migration=imsm 2/ I think we should always block configurations that cross domain boundaries. One can always append more path= lines to override this. 3/ The metadata handler may want to restrict/control where spares are placed in a domain. To enable interaction with CIM we are looking to add a storage-pool id to the metadata. The primary usage of this will be to essentially encode a spare-group number in the metadata. This seems to require a spare-migration= option to the DOMAIN line. By default it is 'all' but it can be set to a metadata-name to let the handler apply its internal migration policy. That should cover everything I would like to expose Comments? 
-- Dan -- To unsubscribe from this list: send the line "unsubscribe linux-raid" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html ^ permalink raw reply [flat|nested] 23+ messages in thread
* Re: More Hot Unplug/Plug work 2010-04-29 21:55 ` Dan Williams @ 2010-05-03 5:58 ` Neil Brown 2010-05-08 1:06 ` Dan Williams 0 siblings, 1 reply; 23+ messages in thread From: Neil Brown @ 2010-05-03 5:58 UTC (permalink / raw) To: Dan Williams Cc: Doug Ledford, Labun, Marcin, Linux RAID Mailing List, Ciechanowski, Ed, Hawrylewicz Czarnowski, Przemyslaw On Thu, 29 Apr 2010 14:55:23 -0700 Dan Williams <dan.j.williams@intel.com> wrote: > On Wed, Apr 28, 2010 at 7:37 PM, Neil Brown <neilb@suse.de> wrote: > >> So something like DOMAIN spanning=imsm, to tell mdadm to follow imsm > >> rules for this domain. Where 'spanning' is policy tag?? > >> > > > > Thanks. > > > > So we have two different ideas here. > > > > 1/ A given set of devices (paths) are all attached to the one controller. > > 2/ A given array is not allowed to span controllers > > > > The first statement is somewhat similar to a statement about sparegroups. > > It groups devices together is some way. > > > > The second is a policy statement, and is metadata specific to some extent. > > If I create a native-metadata array using the controller, then adding other > > devices from a different controller is a non-issue. It is only when an > > IMSM array is created that it is an issue (and then - only if the array is to > > be used for boot for for multi-boot). > > > > So the ARRAY line could have "exclusive-group=foo" where 'foo' is a group > > name similar to those used for 'spare-group=' > > But that isn't much fun for auto-detect and auto-assembly. > > > > Maybe we want to extend the 'auto' line. It gives policy on a per-metadata > > basis. Maybe: > > > > POLICY auto-assemble +1.x -all > > POLICY same-group imsm > > > > where 'same-group' means that all the devices in an array must be in the > > same spare-group. The 'domain' lines assign spare-groups to devices. > > > > Maybe "same-group" could be "same-$tag" so we could have different tags > > for different metadatas.... > > > > Is this working for anyone else, or have I lost the plot?? > > > > I am not grokking the separate POLICY line, especially for defining > the spare-migration border because that is already what DOMAIN is > specifying. Is it? This is what I'm not yet 100% convinced about. We seem to be saying: - A DOMAIN is a set of devices that are handled the same way for hotplug - A DOMAIN is a set of devices that define a boundary on spare migration and I'm not sure those sets are necessarily isomorphic - though I agree that they will often be the same. Does each DOMAIN line define a separate migration boundary so that devices cannot migrate 'across domains'?? If we were to require that, I would probably want multiple 'path=' words allowed for a single domain so we can create a union. > > Here is what I think we need to allow for simple honoring of platform > constraints but without needing to expose all the nuances of those > constraints in config-file syntax... yet. > > 1/ Allow path= to take a metadata name this allows the handler to > identify its known controller ports, alleviating the user from needing > to track which ports are allowed, especially as it may change over > time. If someone really wants to see which ports a metadata handler > cares about we could have a DOMAIN line dumped by --detail-platform > --brief -e imsm. However for simplicity I would rather just dump: > > DOMAIN path=imsm action=spare-same-port spare-migration=imsm > So "path=imsm" means "all devices which are attached to a controller which seems to understand IMSM natively". 
What if a system had two such controllers - one on-board and one on a plug-in card. This might not be possibly for IMSM but would be for DDF. I presume the default would be that the controllers are separate domains - would you agree? So the above DOMAIN line would potentially create multiple 'domains' at least for spare-migration. > 2/ I think we should always block configurations that cross domain > boundaries. One can always append more path= lines to override this. I think we all agree on this. Require --force to create an array, or add devices to an array, where that would cross an established spare-group... The details are still a bit vague for me but the principle is good. > > 3/ The metadata handler may want to restrict/control where spares are > placed in a domain. To enable interaction with CIM we are looking to > add a storage-pool id to the metadata. The primary usage of this will > be to essentially encode a spare-group number in the metadata. This > seems to require a spare-migration= option to the DOMAIN line. By > default it is 'all' but it can be set to a metadata-name to let the > handler apply its internal migration policy. I'm not following you. Are you talking about subsets of a domain? Subdomains? Do the storage-pools follow hardware port locations, or dynamic configuration of individual devices (hence being recorded in metadata). This is how I think spare migration should work: Spare migration is controlled entirely by the 'spare-group' attribute. A spare-group is an attribute of a device. A device may have multiple spare-group attributes (it might be in multiple groups). There are two ways a device can be assigned a spare-group. 1/ If an array is tagged with a spare-group= in mdadm.conf then any device in that array gets that spare-group attribute 2/ If a DOMAIN is tagged with a spare-group attribute then any device in that domain gets that spare-group attribute When mdadm --monitor needs to find a hot spare for an array or container which is degraded, it collects a list of spare-group attributes for all devices in the array, then finds any device (of suitable size) that has a spare-group attribute matching any of those. Possibly a weighting should prefer spare-groups that are more prevalent in the array, so that if you add a foreign device in an emergency, mdadm won't feel too free to add other foreign devices (but is still allowed to). You seem to be suggesting that the spare-group tag could also be specified by the metadata. I think I'm happy with that. A DOMAIN line without an explicit spare-group= tag might imply an implicit spare-group= tag where the spare-group name is some generated string that is unique to that DOMAIN line. So all devices in a DOMAIN line are effectively interchangeable, but it is easy to stretch the migration barrier around multiple domains by giving them all a matching spare-group tag. When you create an array, every pair of devices much share a spare-group, or else one of them must not be in an spare-group. Is that right? NeilBrown > > That should cover everything I would like to expose Comments? 
> > -- > Dan > -- > To unsubscribe from this list: send the line "unsubscribe linux-raid" in > the body of a message to majordomo@vger.kernel.org > More majordomo info at http://vger.kernel.org/majordomo-info.html -- To unsubscribe from this list: send the line "unsubscribe linux-raid" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html ^ permalink raw reply [flat|nested] 23+ messages in thread
* Re: More Hot Unplug/Plug work 2010-05-03 5:58 ` Neil Brown @ 2010-05-08 1:06 ` Dan Williams 0 siblings, 0 replies; 23+ messages in thread From: Dan Williams @ 2010-05-08 1:06 UTC (permalink / raw) To: Neil Brown Cc: Doug Ledford, Labun, Marcin, Linux RAID Mailing List, Ciechanowski, Ed, Hawrylewicz Czarnowski, Przemyslaw On Sun, May 2, 2010 at 10:58 PM, Neil Brown <neilb@suse.de> wrote: > On Thu, 29 Apr 2010 14:55:23 -0700 > Dan Williams <dan.j.williams@intel.com> wrote: >> I am not grokking the separate POLICY line, especially for defining >> the spare-migration border because that is already what DOMAIN is >> specifying. > > Is it? This is what I'm not yet 100% convinced about. > We seem to be saying: > - A DOMAIN is a set of devices that are handled the same way for > hotplug > - A DOMAIN is a set of devices that define a boundary on spare > migration The definition I have been carrying around is slightly more nuanced. The DOMAIN defines a maximal boundary, but there might be metadata specific modifiers that further restrict the possible actions. For example a DOMAIN path=ddf domain would handle all hotplug events on "ddf" ports the same way with the caveat that the ddf handler would know about controller spanning rules in the multi-controller case. Otherwise if you define path=<pci-device-path+partitions> then wysiwyg, i.e. no arrays assembling across these boundaries. > > and I'm not sure those sets are necessarily isomorphic - though I agree that > they will often be the same. > > Does each DOMAIN line define a separate migration boundary so that devices > cannot migrate 'across domains'?? > If we were to require that, I would probably want multiple 'path=' words > allowed for a single domain so we can create a union. Yes, we should do that regardless because it would be hard to write a glob that covers disparate controllers otherwise. >> >> Here is what I think we need to allow for simple honoring of platform >> constraints but without needing to expose all the nuances of those >> constraints in config-file syntax... yet. >> >> 1/ Allow path= to take a metadata name this allows the handler to >> identify its known controller ports, alleviating the user from needing >> to track which ports are allowed, especially as it may change over >> time. If someone really wants to see which ports a metadata handler >> cares about we could have a DOMAIN line dumped by --detail-platform >> --brief -e imsm. However for simplicity I would rather just dump: >> >> DOMAIN path=imsm action=spare-same-port spare-migration=imsm >> > > So "path=imsm" means "all devices which are attached to a controller which > seems to understand IMSM natively". > What if a system had two such controllers - one on-board and one on a plug-in > card. This might not be possibly for IMSM but would be for DDF. > I presume the default would be that the controllers are separate domains - > would you agree? The controllers may restrict spare migration but I would still see this as one ddf DOMAIN where the paths and spare migration constraints are internally determined by the handler, but the hotplug policy is global for the "ddf-DOMAIN". > So the above DOMAIN line would potentially create multiple > 'domains' at least for spare-migration. Yes. > >> 2/ I think we should always block configurations that cross domain >> boundaries. One can always append more path= lines to override this. > > I think we all agree on this. 
Require --force to create an array, or add > devices to an array, where that would cross an established spare-group... > The details are still a bit vague for me but the principle is good. > >> >> 3/ The metadata handler may want to restrict/control where spares are >> placed in a domain. To enable interaction with CIM we are looking to >> add a storage-pool id to the metadata. The primary usage of this will >> be to essentially encode a spare-group number in the metadata. This >> seems to require a spare-migration= option to the DOMAIN line. By >> default it is 'all' but it can be set to a metadata-name to let the >> handler apply its internal migration policy. > > I'm not following you. Are you talking about subsets of a domain? Subdomains? > Do the storage-pools follow hardware port locations, or dynamic configuration > of individual devices (hence being recorded in metadata). Dynamic configuration, but I would still call this the imsm-DOMAIN with metadata specific spare-migration-boundaries. > > This is how I think spare migration should work: > Spare migration is controlled entirely by the 'spare-group' attribute. > A spare-group is an attribute of a device. A device may have multiple > spare-group attributes (it might be in multiple groups). > There are two ways a device can be assigned a spare-group. > 1/ If an array is tagged with a spare-group= in mdadm.conf then any device > in that array gets that spare-group attribute > 2/ If a DOMAIN is tagged with a spare-group attribute then any device > in that domain gets that spare-group attribute > > When mdadm --monitor needs to find a hot spare for an array or container > which is degraded, it collects a list of spare-group attributes > for all devices in the array, then finds any device (of suitable size) > that has a spare-group attribute matching any of those. > Possibly a weighting should prefer spare-groups that are more prevalent in > the array, so that if you add a foreign device in an emergency, mdadm won't > feel too free to add other foreign devices (but is still allowed to). > > You seem to be suggesting that the spare-group tag could also be specified > by the metadata. I think I'm happy with that. Yeah, metadata implied spare-groups that sub-divide the domain. > > A DOMAIN line without an explicit spare-group= tag might imply an implicit > spare-group= tag where the spare-group name is some generated string that > is unique to that DOMAIN line. > So all devices in a DOMAIN line are effectively interchangeable, but it is > easy to stretch the migration barrier around multiple domains by giving > them all a matching spare-group tag. > > When you create an array, every pair of devices much share a spare-group, or > else one of them must not be in an spare-group. Is that right? ...once you allow for $metadata-DOMAINs I am having trouble conceptualizing the use case for allowing spares to migrate across the explicit union of path= boundaries? Unless you are trying to codify what the metadata handlers would be doing internally. In which case I would expect to replace a single spare-group= identifier with multiple mutually exclusive spare-path= lines to subdivide a DOMAIN into spare migration sub-domains with the same hot-plug policy. ...or am I still misunderstanding your spare-group= vs DOMAIN distinction? 
-- Dan -- To unsubscribe from this list: send the line "unsubscribe linux-raid" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html ^ permalink raw reply [flat|nested] 23+ messages in thread
* Re: More Hot Unplug/Plug work 2010-04-29 2:37 ` Neil Brown 2010-04-29 18:22 ` Labun, Marcin 2010-04-29 21:55 ` Dan Williams @ 2010-04-30 16:13 ` Doug Ledford 2 siblings, 0 replies; 23+ messages in thread From: Doug Ledford @ 2010-04-30 16:13 UTC (permalink / raw) To: Neil Brown Cc: Dan Williams, Labun, Marcin, Linux RAID Mailing List, Ciechanowski, Ed, Hawrylewicz Czarnowski, Przemyslaw [-- Attachment #1: Type: text/plain, Size: 2989 bytes --] On 04/28/2010 10:37 PM, Neil Brown wrote: > On Wed, 28 Apr 2010 18:19:45 -0700 > Dan Williams <dan.j.williams@intel.com> wrote: > >> Neil Brown wrote: >>> I'm not sure how this fits with imposing platform constraints. >>> As platform constraints are closely tied to metadata types, it might be OK >>> to have a metadata-specific tags (imsm=???) and leave to details to the >>> metadata handler??? >>> >>> Dan: help me understand these platform constraints: what is the most complex >>> constraint that you can think of that you might want to impose? >> >> At this point we really only need one constraint: prevent controller >> spanning. If for example I take an existing imsm member off of ahci and >> reattach it via a usb-to-sata enclosure mdadm needs a policy to prevent >> associating that drive with anything on ahci. >> >> In a pinch this policy can be disabled, but you wouldn't want to rebuild >> across usb or any other controller because the option-rom only talks >> ahci and will mark the drive missing. >> >> So something like DOMAIN spanning=imsm, to tell mdadm to follow imsm >> rules for this domain. Where 'spanning' is policy tag?? >> > > Thanks. > > So we have two different ideas here. > > 1/ A given set of devices (paths) are all attached to the one controller. > 2/ A given array is not allowed to span controllers > > The first statement is somewhat similar to a statement about sparegroups. > It groups devices together is some way. > > The second is a policy statement, and is metadata specific to some extent. > If I create a native-metadata array using the controller, then adding other > devices from a different controller is a non-issue. It is only when an > IMSM array is created that it is an issue (and then - only if the array is to > be used for boot for for multi-boot). > > So the ARRAY line could have "exclusive-group=foo" where 'foo' is a group > name similar to those used for 'spare-group=' > But that isn't much fun for auto-detect and auto-assembly. > > Maybe we want to extend the 'auto' line. It gives policy on a per-metadata > basis. Maybe: > > POLICY auto-assemble +1.x -all > POLICY same-group imsm > > where 'same-group' means that all the devices in an array must be in the > same spare-group. The 'domain' lines assign spare-groups to devices. > > Maybe "same-group" could be "same-$tag" so we could have different tags > for different metadatas.... > > Is this working for anyone else, or have I lost the plot?? > > NeilBrown I keep going back to the idea of just implement the no-spanning policy for imsm/ddf as the default with a force-override flag and don't bother putting it into the config anywhere, it simply is. -- Doug Ledford <dledford@redhat.com> GPG KeyID: CFBFF194 http://people.redhat.com/dledford Infiniband specific RPMs available at http://people.redhat.com/dledford/Infiniband [-- Attachment #2: OpenPGP digital signature --] [-- Type: application/pgp-signature, Size: 198 bytes --] ^ permalink raw reply [flat|nested] 23+ messages in thread
* Re: More Hot Unplug/Plug work 2010-04-29 1:19 ` Dan Williams 2010-04-29 2:37 ` Neil Brown @ 2010-04-30 11:14 ` John Robinson 1 sibling, 0 replies; 23+ messages in thread From: John Robinson @ 2010-04-30 11:14 UTC (permalink / raw) To: Dan Williams Cc: Neil Brown, Doug Ledford, Labun, Marcin, Linux RAID Mailing List, Ciechanowski, Ed, Hawrylewicz Czarnowski, Przemyslaw On 29/04/2010 02:19, Dan Williams wrote: > Neil Brown wrote: >> I'm not sure how this fits with imposing platform constraints. >> As platform constraints are closely tied to metadata types, it might >> be OK >> to have a metadata-specific tags (imsm=???) and leave to details to the >> metadata handler??? >> >> Dan: help me understand these platform constraints: what is the most >> complex >> constraint that you can think of that you might want to impose? > > At this point we really only need one constraint: prevent controller > spanning. If for example I take an existing imsm member off of ahci and > reattach it via a usb-to-sata enclosure mdadm needs a policy to prevent > associating that drive with anything on ahci. > > In a pinch this policy can be disabled, but you wouldn't want to rebuild > across usb or any other controller because the option-rom only talks > ahci and will mark the drive missing. > > So something like DOMAIN spanning=imsm, to tell mdadm to follow imsm > rules for this domain. Where 'spanning' is policy tag?? Why isn't DOMAIN path=pci-0000:00:1f.2-scsi-[012345]* enough? Can't arrays span multiple Intel/imsm controllers? And if I start off my array on one Intel/imsm controller, and I add another JBOD controller, shouldn't I be allowed to grow my array to span the two controllers without rebuilding the array with new metadata? I know I can't expect the option ROM to cope with or boot off this array but would the option ROM in some manner damage such an array? Cheers, John. ^ permalink raw reply [flat|nested] 23+ messages in thread
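For readers following along without such hardware: the path= globs are matched against the names under /dev/disk/by-path, so a per-controller glob like the one John quotes already keeps a USB-attached enclosure out of the domain; what it cannot express on its own is a create/add-time "never span controllers" rule. A hypothetical session (the exact names vary with kernel, udev version and topology):

$ ls /dev/disk/by-path/
pci-0000:00:1f.2-scsi-0:0:0:0         pci-0000:00:1f.2-scsi-1:0:0:0
pci-0000:00:1f.2-scsi-0:0:0:0-part1   pci-0000:00:1f.2-scsi-1:0:0:0-part1
pci-0000:04:00.0-usb-0:1.2:1.0-scsi-0:0:0:0
$ echo /dev/disk/by-path/pci-0000:00:1f.2-scsi-[012345]*
/dev/disk/by-path/pci-0000:00:1f.2-scsi-0:0:0:0 /dev/disk/by-path/pci-0000:00:1f.2-scsi-0:0:0:0-part1 /dev/disk/by-path/pci-0000:00:1f.2-scsi-1:0:0:0 /dev/disk/by-path/pci-0000:00:1f.2-scsi-1:0:0:0-part1
$ # the usb-to-sata enclosure never matches, so it never falls into this domain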
* Re: More Hot Unplug/Plug work 2010-04-29 1:01 ` Neil Brown 2010-04-29 1:19 ` Dan Williams @ 2010-04-30 15:52 ` Doug Ledford 1 sibling, 0 replies; 23+ messages in thread From: Doug Ledford @ 2010-04-30 15:52 UTC (permalink / raw) To: Neil Brown Cc: Labun, Marcin, Linux RAID Mailing List, Williams, Dan J, Ciechanowski, Ed, Hawrylewicz Czarnowski, Przemyslaw [-- Attachment #1: Type: text/plain, Size: 19716 bytes --] On 04/28/2010 09:01 PM, Neil Brown wrote: > On Wed, 28 Apr 2010 17:05:58 -0400 > Doug Ledford <dledford@redhat.com> wrote: > >> On 04/28/2010 02:34 PM, Labun, Marcin wrote: >>>>> Going further, thus causes that a new disk can be potentially grabbed >>>> by more than one container (because of shared path). >>>>> For example: >>>>> DOMAIN1: path=a path=b path=c >>>>> DOMAIN2: path=a path=d >>>>> DOMAIN3: path=d path=c >>>>> In this example disks from path c can appear in DOMAIN 1 and DOMAIN >>>> 3, but not in DOMAIN 2. >>>> >>>> What exactly is the use case for overlapping paths in different >>>> domains? >>> >>> OK, makes sense. >>> But if they are overlapped, will the config functions assign path are requested by configuration file >>> or treat it as misconfiguration? >> >> For now it merely means that the first match found is the only one that >> will ever get used. I'm not entirely sure how feasible it is to detect >> matching paths unless we are just talking about identical strings in the >> path= statement. But since the path= statement is passed to fnmatch(), >> which treats it as a file glob, it would be possible to construct two >> path statements that don't match but match the same set of files. I >> don't think we can reasonably detect this situation, so it may be that >> the answer is "the first match found is used" and have that be the >> official stance. > > I think we do need an "official stance" here. > glob is good for lots of things, but it is hard to say "everything except". > The best way to do that is to have a clear ordering with more general globs > later in the order. > path=abcd action=foo > path=abc* action=bar > path=* action=baz > > So the last line doesn't really mean "do baz on everything" but rather > "do baz on everything else". > > You could impose ordering explicitly with a priority number or a > "this domain takes precedence over that domain" tag, but I suspect > simple ordering in the config file is easiest and so best. > > An important question to ask here though is whether people will want to > generate the "domain" lines automatically and if so, how we can make it hard > for people to get that wrong. > Inserting a line in the middle of a file is probably more of a challenge than > inserting a line with a specific priority or depends-on tag. > > So before we get too much further down this path, I think it would be good to > have some concrete scenarios about how this functionality will actually be > put into effect. I'd love to just expect people to always edit mdadm.conf to > meet their specific needs, but experience shows that is naive - people will > write scripts based on imperfect understanding, then share those scripts > with others.... OK, so here's some scenarios that I've been working from that display how I envision this being used: 1) IMSM arrays: a single domain entry with a path= that specifies all (or some) ports on the controller in question and encompasses all the containers on that controller. 
The action would be spare or grow, and which container we add any new drives to would depend on the various containers' conditions and types.

2) IMSM arrays + native arrays: similar to above but split the ports between IMSM use and native use. No overlapping paths, some paths to one, some paths to the other.

3) native arrays: one or more domains, no overlapping paths, actions dependent on domain.

As an example of this type of setup, let me detail how I commonly configure my machines when I have multiple drives I want in multiple arrays. Let's assume a machine with 6 drives (like the one I'm using right now). I use a raid5 for my / partition, so I can tolerate at most 1 drive failure on my machine before it's unusable. So, my standard partitioning method is to create a 1 gig partition on sda and sdb and make those into a raid1 /boot partition. Then I do 1 gig on all remaining drives and make that into a raid5 swap partition. Then I do the remaining space on all drives as a raid5 root partition. I don't put any more than two drives in the raid1 /boot partition because if I ever lose two drives I can't use the machine anyway, so more than that in the /boot partition is a waste.

So, in my machine's case, here are the domain entries I would create:

DOMAIN path=blah[123456] action=force-partition table=/etc/mdadm.table program=sfdisk
DOMAIN path=blah[12]-part1 action=force-spare
DOMAIN path=blah[3456]-part1 action=force-grow
DOMAIN path=blah*-part2 action=force-grow

Assuming that blah in the above is the path to my PCI sata controller, the first entry would tell mdadm that if a bare disk is inserted into slots 1 through 6, then force the disk to have the correct partition table for my usage (Dan, I think this should clear up the confusion about the partition action you had in another email, but I'll address it there too...partition is really only for native array types, IMSM will never use it).

The second entry says that if it's sda or sdb and it's the first partition (so sda1 or sdb1), then force it to be added as a spare to any arrays in the domain. Because of how the arrays_in_domain function works, this will only ever match the raid1 /boot array, so we know for a fact that it will always get added to the raid1 /boot array. And because that array only exists on sda1 and sdb1 anyway, we know that if we ever plug a drive into either of those slots, then the array will already be degraded, and this spare will be used to bring the array back into good condition.

The third entry says that on the remaining ports the first partition is used to grow (if possible, or spare if the array is degraded) any existing array. This means that my raid5 swap partition will either get repaired or grown, depending on the situation. The final entry makes it so that the second partition on any disk inserted is used to grow (or spare, if degraded) the / partition.

One of the things that the current code relies upon is something that we talked about earlier. For native array types, we only allow identical partition tables. We don't try to do things like add /dev/sdd4 to an array comprised of entries such as /dev/sdc3. Finding a suitable partition when partition tables are not identical is beyond the initial version of this code. Because of this requirement, the arrays_in_domain function makes use of this to narrow down arrays that might match a domain based upon partition number. So if the current pathname includes part? in its path, the function only returns arrays with the same part in their path. 
That considerably eases the matching process. > >> >>> So, do you plan to make changes similar to incremental in assembly to serve DOMAIN? >> >> I had not planned on it, no. The reason being that assembly isn't used >> for hotplug. I guess I could see a use case for this though in that if >> you called mdadm -As then maybe we should consult the DOMAIN entries to >> see if there are free drives inside of a DOMAIN listed as spare or grow >> and whether or not we have any degraded arrays while assembling that >> could use the drives. Dunno if we want to do that though. However, I >> think I would prefer to get the incremental side of things working >> first, then go there. >> >>> Should an array be split (not assembled) if a domain paths are dividing array into two separate DOMAIN? >> >> I don't think so. Amongst other things, this would make it possible to >> render a machine unbootable if you had a type in a domain path. I think >> I would prefer to allow established arrays to assemble regardless of >> domain path entries. >> >>>> I'm happy to rework the code to support it if there's a valid use >>>> case, but so far my design goal has been to have a path only appear in >>>> one domain, and to then perform the appropriate action based upon that >>>> domain. >>> What is then the purpose of metadata keyword? >> >> Mainly as a hint that a given domain uses a specific type of metadata. I want to address this in a bit more detail. One of the conceptual problems I've been wrestling with in my mind if not on paper yet is the problem of telling a drive that is intended to be wiped out and reused from a drive that is part of your desired working set. Let's think about my above example for native arrays, where there are three arrays, a /boot, a swap, and a / array. Much of this talk has centered around "what do we do when we get a hotplug event for a drive and array <blah> is degraded". That's the easy case. The hard case is "what do we do if array <blah> is degraded and the user shuts the machine down, puts in a new-to-this-machine drive (possibly with existing md raid superblocks), and then boots the machine back up and expects us to do the right thing". For anyone that doesn't have true hotplug hardware, this is going to be the common case. If the drive is installed in the last place in the system and it's the last drive we detect, then we have a chance of doing the right thing. But if it's installed to replace /dev/sda, we are *screwed*. It will be the first drive we detect. And we won't know *what* to do with it. And if it has a superblock on it, we won't even know that it's not supposed to be here. We will happily attempt incremental assembly on this drive, possibly starting arrays that have never existed on this machine before. So, I'm actually finding the metadata keyword less useful than possibly adding a UUID keyword and allowing a domain to be restricted to one or more UUIDs. Then if we find an errant UUID in the domain, we know not to assemble it and in fact if the force-spare or force-grow keywords are present we know to wipe it out and use it for our own purposes. However, that doesn't solve the whole problem that if it's /dev/sda then we won't have any other arrays assembled yet, so the second thing we are going to have to do is defer our use of the drive until a later time. Specifically I'm thinking we might have to write a map entry for the drive into the mapfile, then when we run mdadm -IRs (because all distros do this after scsi_wait_scan has completed...right?) 
we can revisit what to do with the drive. The other option is to add the drive to the mapfile, then when mdadm --monitor mode is started have it process the drive because all of our arrays should be up and running by the time we start the monitor process. Those are the only two solutions I have to this issue at the moment. Thoughts welcome. >>> My initial plan was to create a default configuration for a specific metadata, where user specifies actions >>> but without paths letting metadata handler to use default ones. >>> In your description, I can see that the path are required. >> >> Yes. We already have a default action for all paths: incremental. This >> is the same as how things work today without any new support. And when >> you combine incremental with the AUTO keyword in mdadm.conf, you can >> control which devices are auto assembled on a metadata by metadata basis >> without the use of DOMAINs. > > >> The only purpose of a domain then is to >> specify an action other than incremental for devices plugged into a >> given domain. > > I like this statement. It is simple and to the point and seems to capture > the key ideas. > > The question is: is it true? :-) Well, for the initial implementation I would say it's true ;-) Certainly all the other things you bring up here make my brain hurt. > It is suggested that 'domain' also involved in spare-groups and could be used > to warn against, or disable, a 'create' or 'add' which violated policy. > > So maybe: > The purpose of a domain is to guide: > - 'incremental' by specifying actions for hot-plug devices other than the > default Yes. > - 'create' and 'add' by identifying configurations that breach policy We don't really need domains for this. The only things that have hard policy requirements are BIOS based arrays, and that's metadata/platform specific. We could simply test for and warn on create/add operations that violate platform capability without regard to domains. > - 'monitor' by providing an alternate way of specifying spare-groups Although this can be done, it need not be done. I'm still not entirely convinced of the value of the spare-group tag on domain lines. > It is a lot more wordy, but still seems useful. > > While 'incremental' would not benefit from overlapping domains (as each > hotplugged device only wants one action), the other two might. > > Suppose I want to configure array A to use only a certain set of drives, > and array B that can use any drive at all. Then if we disallow overlapping > domains, there is no domain that describes the drives that B can be made from. > > Does that matter? Is it too hypothetical a situation? Let's see if we can construct such a situation. Let's assume that we are talking about IMSM based arrays. Let's assume we have a SAS controller and we have more than 6 ports available (may or may not be possible, I don't know, but for the sake of argument we need it). Let's next assume we have a 3 disk raid5 on ports 0, 1, and 2. And let's assume we have a 3 disk raid5 on ports 4, 5, and 6. Let's then assume we only want the first raid5 to be allowed to use ports 0 through 4, and that the second raid5 is allowed to use ports 0 through 7. To create that config, we create the two following DOMAIN lines: DOMAIN path=blah[01234] action=grow DOMAIN path=blah[01234567] action=grow Now let's assume that we plug a disk into port 3. What happens? Currently, conf_get_domain() will return one, and only one, domain for a given device. 
And it doesn't search for best match (which would be very difficult to do as we use fnmatch to test the glob match, which means that really the path= statement is more or less opaque to us, we don't process it ourselves and don't evaluate it ourselves, we just pass it off to fnmatch and let it tell us if things matched), it just finds the first match and returns it. So, right now anyway, we will match the first domain and the first domain only. That means we will then return that domain, then later when we call arrays_in_domain we will pass in our device path plus our matched domain and as a result we will search mdstat and we will find both raid5 arrays in our requested domain (the current search returns any array with at least one member in the domain, maybe that should be any array where all members are in the domain). Now, at this point, if one or the other array is degraded, then what to do is obvious. However, if both arrays are degraded or neither array is degraded, then our choice is not obvious. I'm having a hard time coming up with a good answer to that issue. It's not clear which array we should grow if both are clean, nor which one takes priority if both are degraded. We would have to add a new tag, maybe priority=, to the ARRAY lines in order to make this decision obvious. Short of that, the proper course of action is probably to do nothing and let the user sort it out. Now let's assume that we plug a disk into port 7. We search and find the second domain. Then we call arrays_in_domain() and we get both raid5 arrays again because both of them have members in the domain. Regardless of anything else, it's clear that this situation did *not* do what you wanted. It did not specify that array 1 can only be on the first 5 ports, and it did not specify that array 2 can use all 8 ports. If we changed the second domain path to be blah[567] then it would work, but I don't think that this combination of domains and the resulting actions is all that clear to understand from a user's perspective. I think right now trying to do what you are suggesting is confusing from a domain line. Maybe we need to add something to array lines for this. Maybe the array line needs an allowed_path entry that could be used to limit what paths an array will accept devices from. But this then assumes we will create an array line for all arrays (or for ones where we want to limit their paths) and I'm not sure people will do (or want to do) that. So, while I can see a possible scenario that matches your hypothetical, I'm finding that the domain construct is a very clunky way to try and implement the constraints of your hypothetical. > Here is another interesting question. Suppose I have two drive chassis, each > connected to the host by a fibre. When I create arrays from all these drives, > I want them to be balanced across the two chassis, both for performance > reasons and for redundancy reasons. > Is there any way we can tell mdadm about this, possible through 'domains'. This is actually the first thing that makes me see the use of spare-group on a domain line. We could construct two different domains, one for each chassis, but with the same spare-group tag. This would imply that both domains are available as spares to the same arrays, but allows us to then add a policy to mdadm for how to select spares from domains. We could add a priority tag to the domain lines. 
If two domains share the same spare-group tag, and the domains have the same priority, then we could round-robin allocate from domains (what you are asking about), but if they have different priorities then we could allocate solely from the higher (or lower, implementation defined) priority domain until there is nothing left to allocate from it and then switch to the other domain. I could actually also see adding a write_mostly flag to an entire domain in case the chassis that domain represents is remote via wan. > This could be an issue when building a RAID10 (alternate across the chassis > is best) or when finding a spare for a RAID1 (choosing from the 'other' > chassis is best). > > I don't really want to solve this now, but I do want to be sure that our > concept of 'domain' is big enough that we will be able to fit that sort of > thing into it one day. > > Maybe a 'domain' is simply a mechanism to add tags to devices, and possibly > by implication to arrays that contain those devices. > The mechanism for resolving when multiple domains add conflicting tags to > the one device would be dependant on the tag. Maybe first-wins. Maybe > all are combined. > > So we add an 'action' tag for --incremental, and the first wins (maybe) > We add a 'sparegroup' tag for --monitor > We add some other tag for balancing (share=1/2, share=2/2 ???) > > I'm not sure how this fits with imposing platform constraints. > As platform constraints are closely tied to metadata types, it might be OK > to have a metadata-specific tags (imsm=???) and leave to details to the > metadata handler??? I'm more and more of the mind that we need to leave platform constraints out of the domain issue and instead just implement proper platform constraint checks and overrides in the various parts of mdadm that need it regardless of domains. > Dan: help me understand these platform constraints: what is the most complex > constraint that you can think of that you might want to impose? > > NeilBrown -- Doug Ledford <dledford@redhat.com> GPG KeyID: CFBFF194 http://people.redhat.com/dledford Infiniband specific RPMs available at http://people.redhat.com/dledford/Infiniband [-- Attachment #2: OpenPGP digital signature --] [-- Type: application/pgp-signature, Size: 198 bytes --] ^ permalink raw reply [flat|nested] 23+ messages in thread
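The table= file in the force-partition line earlier in this message is deliberately opaque to mdadm; its format belongs to the chosen handler. Assuming the sfdisk handler ends up feeding it to the classic sfdisk input parser with megabyte units (an assumption, since that handler is still being written), a table matching the layout described here, a 1 gig first partition and the rest of the disk as a second partition, both tagged for raid, could be as small as:

# /etc/mdadm.table -- hypothetical old-style sfdisk input, one line per
# primary partition: start,size,type.  Empty start means next free, empty
# size means rest of disk, "fd" is Linux raid autodetect.
,1024,fd
,,fd
;
;

Running "sfdisk -uM /dev/sdX < /etc/mdadm.table" by hand would apply such a table; the point of the handler is to validate it and apply it identically to every bare disk that lands in the domain.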
* Re: More Hot Unplug/Plug work
2010-04-28 17:47 ` Doug Ledford
2010-04-28 18:34 ` Labun, Marcin
@ 2010-04-28 20:59 ` Luca Berra
2010-04-28 21:16 ` Doug Ledford
1 sibling, 1 reply; 23+ messages in thread
From: Luca Berra @ 2010-04-28 20:59 UTC (permalink / raw)
To: Linux RAID Mailing List

On Wed, Apr 28, 2010 at 01:47:55PM -0400, Doug Ledford wrote:
>DOMAIN path=pci-0000:00:1f.2-scsi-[2345]:0:0:0 action=partition
> table=/etc/mdadm.table program=sfdisk

I admit I did not take the time to pull from your git, so tell me to
rtfc if needed.

It seems you are assuming program will take table as stdin.
Wouldn't it be better to use something like
action=initialize command="sfdisk %d < /etc/mdadm.table" ?
where command is invoked via a shell and %d is replaced with the device
node. (More escapes could also be useful, e.g. the sysfs node.)

Besides that, is there any provision to check that the device really is
empty before running the action?

Regards,
L.

--
Luca Berra -- bluca@comedia.it
Communication Media & Services S.r.l.

^ permalink raw reply	[flat|nested] 23+ messages in thread
* Re: More Hot Unplug/Plug work 2010-04-28 20:59 ` Luca Berra @ 2010-04-28 21:16 ` Doug Ledford 0 siblings, 0 replies; 23+ messages in thread From: Doug Ledford @ 2010-04-28 21:16 UTC (permalink / raw) To: Linux RAID Mailing List [-- Attachment #1: Type: text/plain, Size: 2496 bytes --] On 04/28/2010 04:59 PM, Luca Berra wrote: > On Wed, Apr 28, 2010 at 01:47:55PM -0400, Doug Ledford wrote: >> DOMAIN path=pci-0000:00:1f.2-scsi-[2345]:0:0:0 action=partition >> table=/etc/mdadm.table program=sfdisk > i admit i did not take the time to pull from your git, so tell me to > rtfc if needed. rtfc ;-) > it seems you are assuming program will take table as stdin. No, table is program specific. In this case, for sfdisk, it would be something taken as stdin. However, in the code, there is a specific handler for the sfdisk program type. That handler provides a validate routine to check the contents of the table= entry and make sure it's valid, a check routine to check the table on a given disk and see if it matches what it's supposed to be, and a write routine to update the disks table to what it should be. How it goes about doing these things is particular to the sfdisk handler. I do have plans to add a more generic simple script handler that would allow you to pass things in as you suggest, but I have not yet implemented it. And part of the reason is that I'm extremely leary of the security implications of allowing a text file to spell out a program to be called by a root invoked system daemon. I can see a million different ways to compromise a system when a daemon with raw disk access reads a command from a text file. > would not it be better to use somethink like > action=initialize command="sfdisk %d < /etc/mdadm.table" ? > where command is invoked via a shell and %d is replaced with the device > node. (more escapes could also be useful, e.g. the sysfs node) This is precisely what the sfdisk handler will be doing, only it won't be reading the command from the text file, it will have the knowledge of how to invoke sfdisk safely compiled into the program where compromise is much more difficult. > besides that is there any provisioning to check that the device really > is empty before running action? Yes. In the code that tries to take new disks, it requires either the force-partition option or that the device be declared clean, which per Neil's suggestion is that both the first 4k and last 4k of the device is comprised entirely of one of three patterns: 0x00, 0x5a, 0xff. -- Doug Ledford <dledford@redhat.com> GPG KeyID: CFBFF194 http://people.redhat.com/dledford Infiniband specific RPMs available at http://people.redhat.com/dledford/Infiniband [-- Attachment #2: OpenPGP digital signature --] [-- Type: application/pgp-signature, Size: 198 bytes --] ^ permalink raw reply [flat|nested] 23+ messages in thread
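The clean test Doug describes is easy to restate outside of mdadm. A small sketch follows, assuming a GNU userland (dd, od, blockdev) and reading the rule as "each of the first and last 4 KiB must be uniformly 0x00, 0x5a or 0xff"; mdadm performs this check internally in C, so this is only an illustration of the rule, not its implementation.

#!/bin/sh
# Report whether a block device looks "clean" per the rule described above.
dev=$1
sectors=$(blockdev --getsz "$dev")        # device size in 512-byte sectors

distinct_bytes() {                        # unique byte values in a 4 KiB block
    dd if="$dev" bs=512 count=8 skip="$1" 2>/dev/null \
        | od -An -v -tx1 | tr -s ' ' '\n' | grep . | sort -u
}

is_pattern() {                            # exactly one value, from the set
    case "$1" in 00|5a|ff) return 0 ;; *) return 1 ;; esac
}

head_bytes=$(distinct_bytes 0)
tail_bytes=$(distinct_bytes $((sectors - 8)))

if is_pattern "$head_bytes" && is_pattern "$tail_bytes"; then
    echo "$dev looks clean"
else
    echo "$dev is NOT clean" >&2
    exit 1
fi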
* Re: More Hot Unplug/Plug work 2010-04-27 16:45 More Hot Unplug/Plug work Doug Ledford 2010-04-27 19:41 ` Christian Gatzemeier 2010-04-28 16:08 ` Labun, Marcin @ 2010-04-29 20:32 ` Dan Williams 2010-04-29 21:22 ` Dan Williams 3 siblings, 0 replies; 23+ messages in thread From: Dan Williams @ 2010-04-29 20:32 UTC (permalink / raw) To: Doug Ledford; +Cc: Linux RAID Mailing List, Neil Brown, Labun, Marcin Doug Ledford wrote: > So, that's where things stand right now. I'm going to keep working on > this as it's incomplete and doesn't actually do any work at the moment > (it's all sanity checks, config file parsing, and infrastructure, the > actual actions are not yet implemented), but I wanted to get out what I > have currently for people to see. So, you can check it out here: > > git://git.fedorapeople.org/~dledford/mdadm.git hotunplug > > Comments welcome. > Quick friendly request... may I ask that you add subject lines to your commits? The git shortlog of your changes is a tad garish. ...now to actually review it. Regards, Dan ^ permalink raw reply [flat|nested] 23+ messages in thread
* Re: More Hot Unplug/Plug work 2010-04-27 16:45 More Hot Unplug/Plug work Doug Ledford ` (2 preceding siblings ...) 2010-04-29 20:32 ` Dan Williams @ 2010-04-29 21:22 ` Dan Williams 2010-04-30 16:26 ` Doug Ledford 3 siblings, 1 reply; 23+ messages in thread From: Dan Williams @ 2010-04-29 21:22 UTC (permalink / raw) To: Doug Ledford; +Cc: Linux RAID Mailing List, Neil Brown, Labun, Marcin On Tue, Apr 27, 2010 at 9:45 AM, Doug Ledford <dledford@redhat.com> wrote: > So I pulled down Neil's git repo and started working from his hotunplug > branch, which was his version of my hotunplug patch. I had to do a > couple minor fixes to it to make it work. I then simply continued on > from there. I have a branch in my git repo that tracks his hotunplug > branch and is also called hotunplug. That's where my current work is at. > > What I've done since then: > > 1) I've implemented a new config file line type: DOMAIN > a) Each DOMAIN line must have at least one valid path= entry, but may > have more than one path= entry. path= entries are file globs and > must match something in /dev/disk/by-path > b) Each DOMAIN line must have one and only one action= entry. Valid > action items are: ignore, incremental, spare, grow, partition. > In addition, a word me be prefixed with force- to indicate that > we should skip certain safety checks and use the device even if it > isn't clean. Just to clarify that we are on the same page with these actions: * incremental is the default action that "does the right thing" if the drive already has metadata. I assume we need checks here to reject disks with ambiguous (multiple valid metadata records) * spare: implies incremental, but if it is a 'bare' device write a spare record * grow: implies incremental but if it is a 'bare' device write a spare record, if there is a degraded array in the domain rebuild it otherwise grow an(y?) array in the domain * partition: if the device has a partition that matches the specified table then add the partitions incrementally A few comments: 1/ Does 'partition' need to be split to 'partition-spare' and 'partition-grow' to imply the action post partitioning? 2/ One of the safety checks for hot-inserting a spare is that it occurs on a port that was recently unplugged. Should that be a default policy or do we need a different flavor spare action like 'spare-same-port'. > c) Each DOMAIN line may have a metadata entry, and may have a > spare-group entry. What is the purpose of the spare group? I thought we were assuming that all DOMAIN members were automatically in the same spare group. Is this to augment the policy to allow spares to float between DOMAINs? Something like the following where the different domains allow spares to cross boundaries? DOMAIN path=A spare-group=B action=grow DOMAIN path=B spare-group=A action=spare > d) For the partition action, a DOMAIN line must have a program= and > a table= entry. Currently, the program= entry must be an item > out of a list of known partition programs (I'm working on getting > sfdisk up and running, but for arches other than x86, other > methods would be needed, and I'm planning on adding a method > that allows us to call out to a user supplied script/program > instead of a known internal method). The table= entry points to > a file that contains a method specific table indicating the > necessary partition layout. As mentioned in previous mails, we > only support identical partition tables at this point. That > may never change, who knows. 
> > 2) Created a new udev rules file that gets installed as > 05-md-early.rules. This rule file, combined with our existing rules > file, is a key element to how this domain support works. In particular, > udev rules allow us to separate out devices that already have some sort > of raid superblock from devices that don't. We then add a new flag to > our incremental mode to indicate that a device currently does not belong > to us, and we perform a series of checks to see if it should, and if so, > we "grab" it (I would have preferred a better name, but the short > options for better names were already taken). When called with the > "grab" flag, we follow a different code path where we check the domain > of the device against our DOMAIN entries and if we have a match, we > perform the specified action. There will need to be some additional > work to catch certain corner cases, such as the case where we have > force-partition and we insert a disk that currently has a raid > superblock on the bare drive. We will currently miss that situation and > not grab the device. So, this is a work in progress and not yet complete. > I notice this rules file grabs all events. Did you see, or disagree, with the suggestion to have a mdadm --activate-domains command to generate udev rules for the paths we care about? -- Dan -- To unsubscribe from this list: send the line "unsubscribe linux-raid" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html ^ permalink raw reply [flat|nested] 23+ messages in thread
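For reference while reading this exchange, the superblock/bare split that the quoted text describes can be pictured as a rule file along the following lines. This is a sketch, not the actual 05-md-early.rules from the branch; it assumes blkid has already populated ID_FS_TYPE for the device, and the --grab long option is a placeholder for whatever spelling the branch settles on.

# illustrative udev rules only, not the shipped 05-md-early.rules
SUBSYSTEM!="block", GOTO="md_early_end"
ACTION!="add", GOTO="md_early_end"
# devices that already carry raid metadata: ordinary incremental assembly
ENV{ID_FS_TYPE}=="linux_raid_member|isw_raid_member|ddf_raid_member", RUN+="/sbin/mdadm -I $env{DEVNAME}", GOTO="md_early_end"
# everything else is offered to the DOMAIN logic (the "grab" path)
RUN+="/sbin/mdadm -I --grab $env{DEVNAME}"
LABEL="md_early_end"

Dan's --activate-domains suggestion would instead generate the path matches into the rules themselves, so udev never invokes mdadm for devices outside any DOMAIN; Doug's reply below argues for keeping the rules generic and making the no-op case cheap.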
* Re: More Hot Unplug/Plug work 2010-04-29 21:22 ` Dan Williams @ 2010-04-30 16:26 ` Doug Ledford 0 siblings, 0 replies; 23+ messages in thread From: Doug Ledford @ 2010-04-30 16:26 UTC (permalink / raw) To: Dan Williams; +Cc: Linux RAID Mailing List, Neil Brown, Labun, Marcin [-- Attachment #1: Type: text/plain, Size: 8001 bytes --] On 04/29/2010 05:22 PM, Dan Williams wrote: > On Tue, Apr 27, 2010 at 9:45 AM, Doug Ledford <dledford@redhat.com> wrote: >> So I pulled down Neil's git repo and started working from his hotunplug >> branch, which was his version of my hotunplug patch. I had to do a >> couple minor fixes to it to make it work. I then simply continued on >> from there. I have a branch in my git repo that tracks his hotunplug >> branch and is also called hotunplug. That's where my current work is at. >> >> What I've done since then: >> >> 1) I've implemented a new config file line type: DOMAIN >> a) Each DOMAIN line must have at least one valid path= entry, but may >> have more than one path= entry. path= entries are file globs and >> must match something in /dev/disk/by-path >> b) Each DOMAIN line must have one and only one action= entry. Valid >> action items are: ignore, incremental, spare, grow, partition. >> In addition, a word me be prefixed with force- to indicate that >> we should skip certain safety checks and use the device even if it >> isn't clean. > > Just to clarify that we are on the same page with these actions: > * incremental is the default action that "does the right thing" if the > drive already has metadata. I assume we need checks here to reject > disks with ambiguous (multiple valid metadata records) > * spare: implies incremental, but if it is a 'bare' device write a spare record > * grow: implies incremental but if it is a 'bare' device write a spare > record, if there is a degraded array in the domain rebuild it > otherwise grow an(y?) array in the domain > * partition: if the device has a partition that matches the specified > table then add the partitions incrementally No, partition is an action, so a partition domain (which is limited to being a whole disk device) causes us to write out a partition table on the device. This is only useful for native array types, not for imsm arrays. > A few comments: > 1/ Does 'partition' need to be split to 'partition-spare' and > 'partition-grow' to imply the action post partitioning? No, because once you write the partition table out and cause the kernel to reread the partition table, you will get separate incremental events for the partitions themselves and they will match different domains (you would have one domain line for the partition domain and as many domain lines as you need for the actual partitions themselves). > 2/ One of the safety checks for hot-inserting a spare is that it > occurs on a port that was recently unplugged. Should that be a > default policy or do we need a different flavor spare action like > 'spare-same-port'. No, I canned this aspect. The more I thought about it the more I disliked it. I suppose it could be added in for paranoia's sake, but here's why I dropped it: 1) We don't know that the user will necessarily plug the new spare device into the same port. Maybe it was the port that went bad and not the drive and they are using a new port as a result. 2) We specifically talked about this setup acting like a hardware raid chassis and in that situation the hardware chassis grabs a new drive regardless of whether it goes into the same slot as an old drive. 
3) What happens if the technician removes the dead drive and then gets a page they must answer before inserting the new drive and we time things out. Then the technician is left wondering why the drive didn't get used like it should. 4) Maybe they have only one drive carrier and once they remove the old drive they must unmount it from the carrier and mount the new drive to the carrier before inserting the new drive and we time things out. 5) Maybe they are leaving the defunct drive in place and putting this drive into an empty slot and want it to be used for rebuild regardless. Really, the whole concept of a same-port action with a timeout is a nice way to cover our ass and not much more. But our asses are already covered by the fact that we require a clean drive or the use of the force- option on the action. So I just didn't see much real benefit or use for the same port stuff. >> c) Each DOMAIN line may have a metadata entry, and may have a >> spare-group entry. > > What is the purpose of the spare group? I thought we were assuming > that all DOMAIN members were automatically in the same spare group. > Is this to augment the policy to allow spares to float between > DOMAINs? Something like the following where the different domains > allow spares to cross boundaries? > DOMAIN path=A spare-group=B action=grow > DOMAIN path=B spare-group=A action=spare The above is possible, but also the use of different domains in the same spare group with different priorities as outlined in a previous mail would be useful too. >> d) For the partition action, a DOMAIN line must have a program= and >> a table= entry. Currently, the program= entry must be an item >> out of a list of known partition programs (I'm working on getting >> sfdisk up and running, but for arches other than x86, other >> methods would be needed, and I'm planning on adding a method >> that allows us to call out to a user supplied script/program >> instead of a known internal method). The table= entry points to >> a file that contains a method specific table indicating the >> necessary partition layout. As mentioned in previous mails, we >> only support identical partition tables at this point. That >> may never change, who knows. >> >> 2) Created a new udev rules file that gets installed as >> 05-md-early.rules. This rule file, combined with our existing rules >> file, is a key element to how this domain support works. In particular, >> udev rules allow us to separate out devices that already have some sort >> of raid superblock from devices that don't. We then add a new flag to >> our incremental mode to indicate that a device currently does not belong >> to us, and we perform a series of checks to see if it should, and if so, >> we "grab" it (I would have preferred a better name, but the short >> options for better names were already taken). When called with the >> "grab" flag, we follow a different code path where we check the domain >> of the device against our DOMAIN entries and if we have a match, we >> perform the specified action. There will need to be some additional >> work to catch certain corner cases, such as the case where we have >> force-partition and we insert a disk that currently has a raid >> superblock on the bare drive. We will currently miss that situation and >> not grab the device. So, this is a work in progress and not yet complete. >> > > I notice this rules file grabs all events. 
Did you see, or disagree, > with the suggestion to have a mdadm --activate-domains command to > generate udev rules for the paths we care about? I saw it, and did it this way for the same list of reasons I listed above in regards to same-port and timeouts. In addition, --activate-domains means that changes to the config file would not be immediately active, and that would likely violate the principle of least surprise. However, I am actively working on trying to make the checks we perform fast so that essentially the cost is a fork/exec of code most likely already in page cache and if there is nothing to do we want to exit quickly and with minimal touching of any physical media. Considering that udev already touches the physical media to populate the database for the device, our cost is incrementally negligible unless we pass all of our simple checks and end up needing to go to media. -- Doug Ledford <dledford@redhat.com> GPG KeyID: CFBFF194 http://people.redhat.com/dledford Infiniband specific RPMs available at http://people.redhat.com/dledford/Infiniband [-- Attachment #2: OpenPGP digital signature --] [-- Type: application/pgp-signature, Size: 198 bytes --] ^ permalink raw reply [flat|nested] 23+ messages in thread