RFC - device names and mdadm with some reference to udev.

linux-raid.vger.kernel.org archive mirror

* RFC - device names and mdadm with some reference to udev.
@ 2008-10-26 22:56 Neil Brown
  2008-10-27  8:22 ` martin f krafft
                   ` (2 more replies)
  0 siblings, 3 replies; 51+ messages in thread
From: Neil Brown @ 2008-10-26 22:56 UTC (permalink / raw)
  To: linux-raid; +Cc: Doug Ledford, martin f. krafft, Michal Marek, Kay Sievers


Greetings.
 This is a Request For Comments....

 Device naming in mdadm is a bit of a mess.
 We have partitioned devices (mdp) and non-partitioned (md)
 We have names in /dev/md/ (/dev/md/d0) and directly in /dev
    (/dev/md_d0).
 We have support for user-friendly names (/dev/md/home) and for
    "kernel-internal" names (/dev/md0).

 All this can produce extra confusion when udev is brought into the
 picture.  And it can leave lots of litter lying around in /dev if we
 aren't careful (which we aren't).

 I hope to release mdadm-3.0 this year, and maybe that gives me a
 chance to get it "right".  I don't want to break backwards
 compatibility in a big way, but I think I am happy to introduce
 little changes if it means a more consistent model.

 In 2.6.28, partitioned devices (mdp) won't be needed any more as md
 will make use of the "extended partition" functionality recently
 added.  All md devices can be partitioned.  The device number for the
 partitions will be very different to that of the whole device, but
 udev should hide all of that.  So we don't have to worry too much
 about mdp devices.

 So I think the following is how I want things to work.  I am very
 open to comments and suggestions.  Particularly I want to know what
 (if anything) this will break.

 1/ The only device nodes created will be /dev/mdX and /dev/md_dX
    along with partitions /dev/mdXpY and /dev/md_dXpY as appropriate.
    These will be created by mdadm in accordance with the "--auto"
    flag unless something in mdadm.conf says to leave it to udev.
    In that case, mdadm will create a temporary node
    (/dev/.mdadm.whatever) and remove it once udev has created the
    real thing.
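
The node naming in point 1 can be illustrated with a minimal Python sketch. The function name and its parameters are hypothetical and are not part of mdadm; it only spells out the /dev/mdX, /dev/md_dX and pY naming pattern described above.

```python
def md_node_names(minor, partitioned=False, partitions=0):
    """Return the device node names for one array (illustrative only).

    minor       -- the X in /dev/mdX or /dev/md_dX
    partitioned -- True for the older 'mdp' style (/dev/md_dX)
    partitions  -- how many partition nodes (pY) to list
    """
    base = f"/dev/md_d{minor}" if partitioned else f"/dev/md{minor}"
    # Partition nodes append "p" and a 1-based partition number.
    return [base] + [f"{base}p{n}" for n in range(1, partitions + 1)]
```

For example, md_node_names(1, partitioned=True, partitions=2) lists /dev/md_d1 followed by its two partition nodes.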

 2/ There will be various symlinks to these devices.
    a/ if "symlinks=yes" is given in mdadm.conf, symlinks from
         /dev/md/X or /dev/md/dX will be created.
    b/ if udev is configured like on Debian,
              /dev/disk/by-id/md-name-XXXX
        and   /dev/disk/by-id/md-uuid-UUUU
       will be created (by udev).
    c/ If there is a 'name' associated with the array then
        /dev/md/name will be created as a link.
    d/ if an explicit device name of /dev/name was given,
        either on a -A, -B, or -C command or in mdadm.conf,
        then the 'name' must match the name of the array,
        and /dev/name will be used as well as /dev/md/name.

 3/ For a 'NAME' to be used, either as md-name-NAME or /dev/md/NAME,
    we need a high degree of confidence that the array was intended
    for "this" host, or otherwise is not going to conflict with
    an array that is meant for "this" host.
    We get this confidence in a number of ways:
    a/ If the name is listed in /etc/mdadm.conf
       e.g.  ARRAY /dev/md/home UUID=XXXX.....
    b/ If the name was given on the command line
    c/ If the name is stored in the metadata of an array which is
       explicitly identified in mdadm.conf or by the command line.
    d/ If the name is of the form  "host:name" and "host" matches
       this host.  We then use just "name".
    e/ If the name is of the form "host:name" and "host" does not
       match this host, we can still assume that "host:name" is
       unique and use that.
    f/ For 0.90 metadata, if the uuid has the host name encoded in it
       then it was intended for 'this' host.
    Thus unsafe names are names extracted from the metadata of arrays
    which are auto-detected, where there is no hint in the metadata
    that the array is built for 'this' host.

    If the NAME is not known to be safe, we can still assemble the
    array, but we use a "random" high minor number, and allow it
    to be found primarily by the by-id/md-uuid-UUUUU... link or some
    other link created based on array content: e.g. disk/by-label/
    Also the array will be assembled "auto-readonly" so no resync etc
    will happen until the array is actually used.
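
The "safe name" rules in point 3 can be read as a small decision function. The following Python sketch is only an interpretation of the list above: the Candidate fields, the function name and the order in which the checks are applied are assumptions, not mdadm code.

```python
from dataclasses import dataclass


@dataclass
class Candidate:
    """Facts about one array found during assembly (hypothetical model)."""
    name: str                      # name from the metadata, possibly "host:name"
    in_conf: bool = False          # name is listed in mdadm.conf
    on_cmdline: bool = False       # name was given on the command line
    array_explicit: bool = False   # array explicitly identified in conf or command line
    uuid_host_match: bool = False  # 0.90 uuid encodes this host's name


def resolve_name(c, this_host):
    """Return the name to use for links, or None if the name is unsafe."""
    host, sep, short = c.name.partition(":")
    if sep:
        # "host:name": use just "name" for our own host, else the full
        # "host:name", which is assumed to be unique.
        return short if host == this_host else c.name
    if c.in_conf or c.on_cmdline or c.array_explicit or c.uuid_host_match:
        return c.name
    # No hint that the array belongs to this host: the caller would fall
    # back to a "random" high minor number and the by-id/md-uuid link.
    return None
```

A return value of None corresponds to the unsafe case, where the array is still assembled but not under its stored name.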

    mdadm-3.0 will be able to support "containers" such as a set of
    devices with DDF metadata.  These can then contain a number of
    different arrays.  If the 'container' is known to be local to
    'this' host, then we assume that all contained arrays are too.

    I'm contemplating creating a link based on the metadata type with
    a sequential number. e.g. /dev/md/ddf1 or /dev/md/imsm2.
    I'm not sure if these should be in /dev/md/ or directly in /dev/.
    I'm also not sure if I should leave the creation to udev, and
    whether I should use a small sequential number, or just whatever
    number was allocated as the minor number of the device.

 4/ When we stop an array, mdadm will remove anything from /dev that
    it probably created.
    In particular, it will remove the device node as described in 1,
    any partitions, and any symlinks in /dev or /dev/md which point to
    any of those.  I need to be certain that this won't confuse udev.
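
One way to picture the cleanup in point 4 is a scan for symlinks that resolve to the nodes being removed. This Python sketch makes assumptions: the helper name is hypothetical, it only inspects one directory level, and it says nothing about how udev would react to the removal.

```python
import os


def links_pointing_to(directory, targets):
    """Return symlinks directly inside `directory` that resolve to one of `targets`."""
    wanted = {os.path.realpath(t) for t in targets}
    found = []
    for entry in sorted(os.listdir(directory)):
        path = os.path.join(directory, entry)
        # Only symlinks are candidates; real nodes are handled separately.
        if os.path.islink(path) and os.path.realpath(path) in wanted:
            found.append(path)
    return found
```

Calling it once for /dev and once for /dev/md with the device node and its partitions as targets would give the list of links to remove; dangling or unrelated links are left alone.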

 5/ I want to enable assembly without having to give
    an explicit device name, thus requiring mdadm to automatically
    assign one just as it would for auto-assembly.
    In particular, the "ARRAY" line in mdadm.conf will no longer
    require an array name.  That would mean that "-Es" wouldn't need
    to produce an array name (which is not always easy).
    So:
        mdadm -Es > /tmp/mdadm.conf
        mdadm -Asc/tmp/mdadm.conf
    would leave the choice of device name to the "-A" stage which is
    the only time that unique non-predictable names can be chosen.

 6/ I'm thinking that if the array name given to --create or
    --assemble looks as though it identifies a metadata type, by
    having the name of a metadata type followed by some digits,
    e.g. /dev/ddf0 or /dev/md/imsm3
    then we insist that the array have that metadata type.
    That could mean that a future metadata type might conflict with
    a previously valid usage, which would be a bore.
    Maybe if there are trailing digits, then it *must* identify a
    metadata type, or be "mdNN".
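
The check contemplated in point 6 could look roughly like this Python sketch. The list of metadata type names is illustrative (only ddf and imsm are mentioned in this mail), and the function name is hypothetical.

```python
import re

# Illustrative only: the metadata type names mentioned in this mail.
KNOWN_METADATA = ("ddf", "imsm")


def implied_metadata(device_name):
    """Return the metadata type a device name implies, or None.

    A name implies a type when its last path component is a known
    metadata type name followed by one or more digits.
    """
    base = device_name.rsplit("/", 1)[-1]
    m = re.fullmatch(r"([A-Za-z_]+)(\d+)", base)
    if m and m.group(1) in KNOWN_METADATA:
        return m.group(1)
    return None
```

Under this reading /dev/ddf0 and /dev/md/imsm3 would insist on DDF and IMSM metadata respectively, while /dev/md0 and /dev/md/home imply nothing.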

Some issues that all of this needs to address:

 1/ People want auto-assembly.  I've always fought against it (we
    don't auto-mount all filesystems do we?).  But it is a losing
    battle.  And on a modern desktop, when you plug in a new drive the
    filesystem is automatically mounted.  So my argument is falling
    apart.

 2/ Auto-assembly of new arrays must not conflict with auto-assembly
    of previously existing arrays, even if the devices comprising the
    new arrays are discovered earlier.  This is what the 'homehost'
    concept is for.  Your array will only get assembled with a
    predictable name if it is known to be attached to 'this' host.

 3/ Auto-assembly needs to handle incremental arrival of devices
    correctly.  There are no easy solutions to this, particularly when
    e.g. ext3 can write to the device even when mounted read-only (for
    journal replay).
    I think the best that I can do for now is assemble things
    'read-auto' to delay any writes as long as possible in the hope
    that all available devices will be connected by then.
    Adding in-memory bitmaps for all degraded arrays to accelerate
    rebuild would help but won't be in 2.6.28.

 4/ auto-assembly needs to do the right thing on a SAN where multiple
    hosts can each see multiple arrays.  Clearly only one host should
    write to any one array at one time (until I get some
    cluster-awareness going, which I had hoped to work on this year,
    but it doesn't look like I will).
    In this case, I don't think read-auto is enough.  We either need
    to not assemble arrays that aren't known to belong to us, or we
    need to assemble them read-only and require an explicit
    read-write setting.

    So we need some way to know which devices could be visible to
    other hosts.
    I could have a global flag in mdadm.conf "Options SAN"
    I could have a SAN-DEVICES to match "DEVICES", but as just about
    everything is "/dev/sd*" these days, I don't know if that would
    work.

    Any suggestions concerning this would be welcome.

I'm also wondering if I should include a udev 'rules' file for md in
the mdadm distribution.  Obviously it would be no more than a
recommendation, but it might give me a voice in guiding how udev
interacted with mdadm.

Any thoughts on any of this would be most welcome.

Thanks,
NeilBrown




* Re: RFC - device names and mdadm with some reference to udev.
  2008-10-26 22:56 RFC - device names and mdadm with some reference to udev Neil Brown
@ 2008-10-27  8:22 ` martin f krafft
  2008-10-27 15:13   ` Doug Ledford
                     ` (2 more replies)
  2008-10-27 12:41 ` Kay Sievers
  2008-10-30 17:18 ` RFC - device names and mdadm with some reference to udev Doug Ledford
  2 siblings, 3 replies; 51+ messages in thread
From: martin f krafft @ 2008-10-27  8:22 UTC (permalink / raw)
  To: Neil Brown; +Cc: linux-raid, Doug Ledford, Michal Marek, Kay Sievers

[-- Attachment #1: Type: text/plain, Size: 1548 bytes --]

also sprach Neil Brown <neilb@suse.de> [2008.10.26.2356 +0100]:
> Greeting.
>  This is a Request For Comments....

Good morning!

[...]
> I'm also wondering if I should include a udev 'rules' file for md
> in the mdadm distribution.  Obviously it would be no more than
> a recommendation, but it might give me a voice in guiding how udev
> interacted with mdadm.

I would really like to have a clear separation of competencies.
Ideally, mdadm never creates any devices but leaves it all to udev,
and all configuration about alternate names ("symlinks") is done in
the udev rules file.

I know mdadm needs the devices for the ioctls(). However, much of
what it does with ioctl should already be possible with /sys. Thus,
in my ideal world, I imagine mdadm to be a manipulator of /sys,
instructing the kernel to do stuff with components and arrays, and
have udev create and remove corresponding devices in response to
kernel events.

I realise this would require a revamp of mdadm, and might actually
be better done in a new software designed to eventually replace
mdadm. But is this a way forward with which you could befriend
yourself?

-- 
 .''`.   martin f. krafft <madduck@debian.org>
: :'  :  proud Debian developer, author, administrator, and user
`. `'`   http://people.debian.org/~madduck - http://debiansystem.info
  `-  Debian - when you have better things to do than fixing systems

"a compliment is like a kiss through a veil."
                                                        -- victor hugo

[-- Attachment #2: Digital signature (see http://martin-krafft.net/gpg/) --]
[-- Type: application/pgp-signature, Size: 197 bytes --]


* Re: RFC - device names and mdadm with some reference to udev.
  2008-10-27  8:22 ` martin f krafft
@ 2008-10-27 15:13   ` Doug Ledford
  2008-10-27 16:10     ` Andre Noll
  2008-10-27 16:13     ` Kay Sievers
  2008-10-27 22:37   ` Neil Brown
  2008-10-28  6:17   ` Luca Berra
  2 siblings, 2 replies; 51+ messages in thread
From: Doug Ledford @ 2008-10-27 15:13 UTC (permalink / raw)
  To: martin f krafft; +Cc: Neil Brown, linux-raid, Michal Marek, Kay Sievers

[-- Attachment #1: Type: text/plain, Size: 1660 bytes --]

On Mon, 2008-10-27 at 09:22 +0100, martin f krafft wrote:
> also sprach Neil Brown <neilb@suse.de> [2008.10.26.2356 +0100]:
> > Greeting.
> >  This is a Request For Comments....
> 
> Good morning!
> 
> [...]
> > I'm also wondering if I should include a udev 'rules' file for md
> > in the mdadm distribution.  Obviously it would be no more than
> > a recommendation, but it might give me a voice in guiding how udev
> > interacted with mdadm.
> 
> I would really like to have a clear separation of competencies.
> Ideally, mdadm never creates any devices but leaves it all to udev,
> and all configuration about alternate names ("symlinks") is done in
> the udev rules file.

This would then require that we have a working udev in our initrd
images.  It would greatly increase the complexity of early booting as a
result.

> I know mdadm needs the devices for the ioctls(). However, much of
> what it does with ioctl should already be possible with /sys. Thus,
> in my ideal world, I imagine mdadm to be a manipulator of /sys,
> instructing the kernel to do stuff with components and arrays, and
> have udev create and remove corresponding devices in response to
> kernel events.
> 
> I realise this would require a revamp of mdadm, and might actually
> be better done in a new software designed to eventually replace
> mdadm. But is this a way forward with which you could befriend
> yourself?
> 
-- 
Doug Ledford <dledford@redhat.com>
              GPG KeyID: CFBFF194
              http://people.redhat.com/dledford

Infiniband specific RPMs available at
              http://people.redhat.com/dledford/Infiniband


[-- Attachment #2: This is a digitally signed message part --]
[-- Type: application/pgp-signature, Size: 197 bytes --]


* Re: RFC - device names and mdadm with some reference to udev.
  2008-10-27 15:13   ` Doug Ledford
@ 2008-10-27 16:10     ` Andre Noll
  2008-10-27 16:37       ` Kay Sievers
  2008-10-27 16:13     ` Kay Sievers
  1 sibling, 1 reply; 51+ messages in thread
From: Andre Noll @ 2008-10-27 16:10 UTC (permalink / raw)
  To: Doug Ledford
  Cc: martin f krafft, Neil Brown, linux-raid, Michal Marek,
	Kay Sievers

[-- Attachment #1: Type: text/plain, Size: 757 bytes --]

On 11:13, Doug Ledford wrote:

> > I would really like to have a clear separation of competencies.
> > Ideally, mdadm never creates any devices but leaves it all to udev,
> > and all configuration about alternate names ("symlinks") is done in
> > the udev rules file.
> 
> This would then require that we have a working udev in our initrd
> images.  It would greatly increase the complexity of early booting as a
> result.

Given that the initramfs usually contains busybox, one can also use
mdev. It's much simpler than udev and it's good enough if the only
thing you want to do is mounting the root partition that resides on
a software raid array.

Andre
-- 
The only person who always got his work done by Friday was Robinson Crusoe

[-- Attachment #2: Digital signature --]
[-- Type: application/pgp-signature, Size: 189 bytes --]


* Re: RFC - device names and mdadm with some reference to udev.
  2008-10-27 16:10     ` Andre Noll
@ 2008-10-27 16:37       ` Kay Sievers
  2008-10-27 16:59         ` martin f krafft
                           ` (2 more replies)
  0 siblings, 3 replies; 51+ messages in thread
From: Kay Sievers @ 2008-10-27 16:37 UTC (permalink / raw)
  To: Andre Noll
  Cc: Doug Ledford, martin f krafft, Neil Brown, linux-raid,
	Michal Marek

On Mon, Oct 27, 2008 at 17:10, Andre Noll <maan@systemlinux.org> wrote:
> On 11:13, Doug Ledford wrote:
>
>> > I would really like to have a clear separation of competencies.
>> > Ideally, mdadm never creates any devices but leaves it all to udev,
>> > and all configuration about alternate names ("symlinks") is done in
>> > the udev rules file.
>>
>> This would then require that we have a working udev in our initrd
>> images.  It would greatly increase the complexity of early booting as a
>> result.
>
> Given that the initramfs usually contains busybox, one can also using
> mdev. It's much simpler than udev and it's good enough if the only
> thing you want to do is mounting the root partition that resides on
> a software raid array.

Depends on your definition of "usual". Debian, Fedora, openSUSE,
Ubuntu, Gentoo (as far as Gentoo counts as a distro with a default
setup) none of them uses any busybox/mdev setup, and all use udev in
initramfs.

It's very simple to set up and follows the same logic as udev running
in the rootfs. There is absolutely no "increase of complexity"
involved if you use udev in the real root anyway, you just copy the
binaries and the rules, and on bootup you wait for /dev/root to show
up, mount it and start /sbin/init. Custom busybox stuff does not
support any non-trivial feature a "general purpose" distro needs to
support today.

Kay


* Re: RFC - device names and mdadm with some reference to udev.
  2008-10-27 16:37       ` Kay Sievers
@ 2008-10-27 16:59         ` martin f krafft
  2008-10-27 18:31           ` Kay Sievers
  2008-10-27 17:24         ` Doug Ledford
  2008-10-27 17:30         ` Andre Noll
  2 siblings, 1 reply; 51+ messages in thread
From: martin f krafft @ 2008-10-27 16:59 UTC (permalink / raw)
  To: Kay Sievers
  Cc: Andre Noll, Doug Ledford, Neil Brown, linux-raid, Michal Marek

[-- Attachment #1: Type: text/plain, Size: 1164 bytes --]

also sprach Kay Sievers <kay.sievers@vrfy.org> [2008.10.27.1737 +0100]:
> It's very simple to setup and follows the same logic as udev running
> in the rootfs, There is absolutely no "increase of complexity"
> involved if you use udev in the real root anyway, you just copy the
> binaries and the rules, and on bootup you wait for /dev/root to show
> up, mount it and start /sbin/init. Custom busybox stuff does not
> support any non-trivial feature a "general purpose" distro needs to
> support today.

I would love to see some explicit instructions, then I could carry
them into Debian, which is currently using full-blown udev with
initramfs.

-- 
 .''`.   martin f. krafft <madduck@debian.org>
: :'  :  proud Debian developer, author, administrator, and user
`. `'`   http://people.debian.org/~madduck - http://debiansystem.info
  `-  Debian - when you have better things to do than fixing systems
 
"if ever somethin' don't feel right to you, remember what pancho said
 to the cisco kid...  `let's win, before we are dancing at the end of
 a rope, without music.'"
                                                             -- sailor

[-- Attachment #2: Digital signature (see http://martin-krafft.net/gpg/) --]
[-- Type: application/pgp-signature, Size: 197 bytes --]


* Re: RFC - device names and mdadm with some reference to udev.
  2008-10-27 16:59         ` martin f krafft
@ 2008-10-27 18:31           ` Kay Sievers
  2008-10-28  6:21             ` Luca Berra
  0 siblings, 1 reply; 51+ messages in thread
From: Kay Sievers @ 2008-10-27 18:31 UTC (permalink / raw)
  To: Kay Sievers, Andre Noll, Doug Ledford, Neil Brown, linux-raid,
	Michal 

On Mon, Oct 27, 2008 at 17:59, martin f krafft <madduck@debian.org> wrote:
> also sprach Kay Sievers <kay.sievers@vrfy.org> [2008.10.27.1737 +0100]:
>> It's very simple to setup and follows the same logic as udev running
>> in the rootfs, There is absolutely no "increase of complexity"
>> involved if you use udev in the real root anyway, you just copy the
>> binaries and the rules, and on bootup you wait for /dev/root to show
>> up, mount it and start /sbin/init. Custom busybox stuff does not
>> support any non-trivial feature a "general purpose" distro needs to
>> support today.
>
> I would love to see some explicit instructions, then I could carry
> them into Debian,

What do you miss from the Ubuntu setup?

> which is currently using full-blown udev with
> initramfs.

Which is the right thing to do, yes.

Kay


* Re: RFC - device names and mdadm with some reference to udev.
  2008-10-27 18:31           ` Kay Sievers
@ 2008-10-28  6:21             ` Luca Berra
  0 siblings, 0 replies; 51+ messages in thread
From: Luca Berra @ 2008-10-28  6:21 UTC (permalink / raw)
  To: Kay Sievers
  Cc: Andre Noll, Doug Ledford, Neil Brown, linux-raid, Michal Marek

On Mon, Oct 27, 2008 at 07:31:57PM +0100, Kay Sievers wrote:
>On Mon, Oct 27, 2008 at 17:59, martin f krafft <madduck@debian.org> wrote:
>> also sprach Kay Sievers <kay.sievers@vrfy.org> [2008.10.27.1737 +0100]:
>>> It's very simple to setup and follows the same logic as udev running
>>> in the rootfs, There is absolutely no "increase of complexity"
>>> involved if you use udev in the real root anyway, you just copy the
>>> binaries and the rules, and on bootup you wait for /dev/root to show
>>> up, mount it and start /sbin/init. Custom busybox stuff does not
>>> support any non-trivial feature a "general purpose" distro needs to
>>> support today.
>>
>> I would love to see some explicit instructions, then I could carry
>> them into Debian,
>
>What do you miss from the Ubuntu setup?
>
>> which is currently using full-blown udev with
>> initramfs.
>
>Which is the right thing to do, yes.
>
i believe it is overkill
initramfs should have the sole purpose of finding and mounting the root
filesystem, there is no need to pack it with unneeded junk.

L.

-- 
Luca Berra -- bluca@comedia.it
        Communication Media & Services S.r.l.
 /"\
 \ /     ASCII RIBBON CAMPAIGN
  X        AGAINST HTML MAIL
 / \


* Re: RFC - device names and mdadm with some reference to udev.
  2008-10-27 16:37       ` Kay Sievers
  2008-10-27 16:59         ` martin f krafft
@ 2008-10-27 17:24         ` Doug Ledford
  2008-10-27 23:36           ` Neil Brown
                             ` (2 more replies)
  2008-10-27 17:30         ` Andre Noll
  2 siblings, 3 replies; 51+ messages in thread
From: Doug Ledford @ 2008-10-27 17:24 UTC (permalink / raw)
  To: Kay Sievers
  Cc: Andre Noll, martin f krafft, Neil Brown, linux-raid, Michal Marek

[-- Attachment #1: Type: text/plain, Size: 3580 bytes --]

On Mon, 2008-10-27 at 17:37 +0100, Kay Sievers wrote:
> On Mon, Oct 27, 2008 at 17:10, Andre Noll <maan@systemlinux.org> wrote:
> > On 11:13, Doug Ledford wrote:
> >
> >> > I would really like to have a clear separation of competencies.
> >> > Ideally, mdadm never creates any devices but leaves it all to udev,
> >> > and all configuration about alternate names ("symlinks") is done in
> >> > the udev rules file.
> >>
> >> This would then require that we have a working udev in our initrd
> >> images.  It would greatly increase the complexity of early booting as a
> >> result.
> >
> > Given that the initramfs usually contains busybox, one can also using
> > mdev. It's much simpler than udev and it's good enough if the only
> > thing you want to do is mounting the root partition that resides on
> > a software raid array.
> 
> Depends on your definition of "usual". Debian, Fedora, openSUSE,
> Ubuntu, Gentoo (as far as Gentoo counts as a distro with a default
> setup) none of them uses any busybox/mdev setup, and all use udev in
> initramfs.

Not a complete udev implementation IIRC.  It doesn't have all the rules
that a running system has.  And at least Fedora still starts md devices
via a specific call to mdadm in the initrd script, not via udev rules.

> It's very simple to setup and follows the same logic as udev running
> in the rootfs, There is absolutely no "increase of complexity"
> involved if you use udev in the real root anyway, you just copy the
> binaries and the rules, and on bootup you wait for /dev/root to show
> up, mount it and start /sbin/init. Custom busybox stuff does not
> support any non-trivial feature a "general purpose" distro needs to
> support today.

I've found the udev rules method of starting md devices to be
problematic (at best).

Here's the issue (in Fedora at least).  Starting devices via udev means
starting them as soon as they are capable and not waiting until all
devices are up and running.  You have to do this in case the device is
in a degraded state and you aren't going to get all the devices.
However, we don't create a bitmap on devices by default in the installer
(a user can add one themselves, but it isn't there by default).  Without
the bitmap, if the device is written to before all devices are added, it
triggers a full resync of the device.  As it turns out, for certain
installations, this happens on *every* single reboot.  It's painful, to
say the least.   So, I wanted to change the udev rule to work slightly
differently.  I wanted the invocation of mdadm --incremental that
happened to be the one that took the array from an unrunnable state to a
runnable but degraded state to sleep for, say, 2 to 5 seconds and then,
if subsequent udev rule invocations had still not brought the array
fully up, start the array in a degraded state.  This,
however, breaks udevsettle.  So, the current setup (for the upcoming
Fedora 10) is done such that the udev rule won't start any degraded
arrays, and instead we have both a specific mdadm invocation in the
initrd and another in rc.sysinit that will start any degraded arrays
that are also listed in the mdadm.conf file.  This makes sure that known
arrays are assembled and started if at all possible, but we only start
unknown arrays if they are complete.
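
The resulting policy can be summarised as a small predicate. This Python sketch is a paraphrase of the paragraph above, not the Fedora scripts or mdadm code; the function name and the 'udev'/'boot' stage labels are made up for illustration.

```python
def should_start(complete, listed_in_conf, stage):
    """Decide whether to start an array under the policy described above.

    complete       -- all member devices are present
    listed_in_conf -- the array is listed in mdadm.conf ("known")
    stage          -- 'udev' for the incremental udev rule, 'boot' for the
                      explicit mdadm call in the initrd or rc.sysinit
    """
    if complete:
        # Complete arrays are started whether known or unknown.
        return True
    # Degraded arrays are never started by the udev rule; the explicit
    # boot-time invocation starts them only if they are known.
    return stage == "boot" and listed_in_conf
```

So a degraded array that is listed in mdadm.conf waits for the explicit boot-time call, and a degraded unknown array is not started at all.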

-- 
Doug Ledford <dledford@redhat.com>
              GPG KeyID: CFBFF194
              http://people.redhat.com/dledford

Infiniband specific RPMs available at
              http://people.redhat.com/dledford/Infiniband

[-- Attachment #2: This is a digitally signed message part --]
[-- Type: application/pgp-signature, Size: 197 bytes --]


* Re: RFC - device names and mdadm with some reference to udev.
  2008-10-27 17:24         ` Doug Ledford
@ 2008-10-27 23:36           ` Neil Brown
  2008-10-29 18:49             ` Doug Ledford
  2008-10-28  6:32           ` Luca Berra
  2008-10-28  9:42           ` occasional bitmap was " David Greaves
  2 siblings, 1 reply; 51+ messages in thread
From: Neil Brown @ 2008-10-27 23:36 UTC (permalink / raw)
  To: Doug Ledford
  Cc: Kay Sievers, Andre Noll, martin f krafft, linux-raid,
	Michal Marek

On Monday October 27, dledford@redhat.com wrote:
> 
> I've found the udev rules method of starting md devices to be
> problematic (at best).
> 
> Here's the issue (in Fedora at least).  Starting devices via udev means
> starting them as soon as they are capable and not waiting until all
> devices are up and running.  You have to do this in case the device is
> in a degraded state and you aren't going to get all the devices.
> However, we don't create a bitmap on devices by default in the installer
> (a user can add one themselves, but it isn't there by default).  Without
> the bitmap, if the device is written to before all devices are added, it
> triggers a full resync of the device.  As it turns out, for certain
> installations, this happens on *every* single reboot.  It's painful, to
> say the least.   So, I wanted to change the udev rule to work slightly
> differently.  I wanted the invocation of mdadm --incremental that
> happened to be the one that took the array from an unrunable state to a
> runable but degraded state to sleep for say 2 to 5 seconds, and then if
> the array is still not up and running due to subsequent udev rule
> invocations, it would start the array in a degraded state.  This,
> however, breaks udevsettle.  So, the current setup (for the upcoming
> fedora 10) is done such that the udev rule won't start any degraded
> arrays, and instead we have both a specific mdadm invocation in the
> initrd and another in rc.sysinit that will start any degraded arrays
> that are also listed in the mdadm.conf file.  This makes sure that known
> arrays are assembled and started if at all possible, but we only start
> unknown arrays if they are complete.
> 

This is using udev to start md devices, which is not quite the focus
of the previous discussion.  That was more about using udev to create
the entries in /dev when someone else started the arrays.

However this is still a real issue that I would like to handle as best
we can.

I would like to get the md code to always have at least an in-memory
bitmap to allow quick resync after a "re-add".

However even this isn't a perfect solution as there is a window when a
single device failure can kill an array.

Your solution sounds good, but I'd be happy to hear other thoughts on
the issue.

Thanks,
NeilBrown


* Re: RFC - device names and mdadm with some reference to udev.
  2008-10-27 23:36           ` Neil Brown
@ 2008-10-29 18:49             ` Doug Ledford
  0 siblings, 0 replies; 51+ messages in thread
From: Doug Ledford @ 2008-10-29 18:49 UTC (permalink / raw)
  To: Neil Brown; +Cc: linux-raid

[-- Attachment #1: Type: text/plain, Size: 3457 bytes --]

On Tue, 2008-10-28 at 10:36 +1100, Neil Brown wrote:
> On Monday October 27, dledford@redhat.com wrote:
> > 
> > I've found the udev rules method of starting md devices to be
> > problematic (at best).
> > 
> > Here's the issue (in Fedora at least).  Starting devices via udev means
> > starting them as soon as they are capable and not waiting until all
> > devices are up and running.  You have to do this in case the device is
> > in a degraded state and you aren't going to get all the devices.
> > However, we don't create a bitmap on devices by default in the installer
> > (a user can add one themselves, but it isn't there by default).  Without
> > the bitmap, if the device is written to before all devices are added, it
> > triggers a full resync of the device.  As it turns out, for certain
> > installations, this happens on *every* single reboot.  It's painful, to
> > say the least.   So, I wanted to change the udev rule to work slightly
> > differently.  I wanted the invocation of mdadm --incremental that
> > happened to be the one that took the array from an unrunable state to a
> > runable but degraded state to sleep for say 2 to 5 seconds, and then if
> > the array is still not up and running due to subsequent udev rule
> > invocations, it would start the array in a degraded state.  This,
> > however, breaks udevsettle.  So, the current setup (for the upcoming
> > fedora 10) is done such that the udev rule won't start any degraded
> > arrays, and instead we have both a specific mdadm invocation in the
> > initrd and another in rc.sysinit that will start any degraded arrays
> > that are also listed in the mdadm.conf file.  This makes sure that known
> > arrays are assembled and started if at all possible, but we only start
> > unknown arrays if they are complete.
> > 
> 
> This is using udev to start md devices, which is not quite the focus
> of the previous discussion.  That was more about using udev to create
> the entries in /dev when someone else started the arrays.

True enough, although I think they are related: it's udev rules on
block devices that trigger the mdadm -I invocations, which in turn
bring up the new md devices.  So the issue of creating devices from
udev rules is at least mildly related to how mdadm gets called in the
first place, especially for hot plugging, which you brought up in your
mail and which is specifically a case of udev kicking mdadm off.

> However this is still a real issue that I would like to handle as best
> we can.
> 
> I would like to get the md code to always have at least an in-memory
> bitmap to allow quite resync after a "re-add".
> 
> However even this isn't a perfect solution as there is a window when a
> single device failure can kill an array.
> 
> Your solution sounds good, but I'd be happy to hear other thoughts on
> the issue.

I ended up coding this up.  It took quite a bit more touchup in the
incremental code than I expected.  In general, mdadm-2.6.7.1 doesn't do
a very good job of honoring information in /etc/mdadm.conf when doing
incremental assembly.  So, momentarily I'll send you a patch series/pull
request with the changes.
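
For reference, the start policy described above (udev starts only
complete arrays; the initrd and rc.sysinit pass also starts degraded
arrays, but only ones listed in mdadm.conf) can be written down as a
small decision function.  This is only a sketch of the policy, not code
from the patch series; the function name and its arguments are made up
for illustration.

```shell
#!/bin/sh
# Sketch of the start policy only; should_start and its arguments are
# hypothetical and do not correspond to anything in mdadm itself.
#   $1 = stage:    "udev" (incremental rule) or "boot" (initrd / rc.sysinit)
#   $2 = complete: "yes" if every member device is present
#   $3 = known:    "yes" if the array is listed in mdadm.conf
should_start() {
    stage=$1 complete=$2 known=$3
    # A complete array is started no matter who noticed it.
    [ "$complete" = yes ] && return 0
    # A degraded array is only started by the boot scripts, and only
    # when it is a known array.
    [ "$stage" = boot ] && [ "$known" = yes ] && return 0
    return 1
}
```

So "should_start udev no yes" fails (udev never starts a degraded
array), while "should_start boot no yes" succeeds.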

-- 
Doug Ledford <dledford@redhat.com>
              GPG KeyID: CFBFF194
              http://people.redhat.com/dledford

Infiniband specific RPMs available at
              http://people.redhat.com/dledford/Infiniband




* Re: RFC - device names and mdadm with some reference to udev.
  2008-10-27 17:24         ` Doug Ledford
  2008-10-27 23:36           ` Neil Brown
@ 2008-10-28  6:32           ` Luca Berra
  2008-10-28  9:42           ` occasional bitmap was " David Greaves
  2 siblings, 0 replies; 51+ messages in thread
From: Luca Berra @ 2008-10-28  6:32 UTC (permalink / raw)
  To: linux-raid

On Mon, Oct 27, 2008 at 01:24:30PM -0400, Doug Ledford wrote:
>that are also listed in the mdadm.conf file.  This makes sure that known
>arrays are assembled and started if at all possible, but we only start
>unknown arrays if they are complete.

Seems interesting.  Do you also have any provision for preventing it from
starting unknown arrays at all, if I don't want it to?

For example, when the array is on shared storage and another node
currently has it running.

L.



-- 
Luca Berra -- bluca@comedia.it
        Communication Media & Services S.r.l.
 /"\
 \ /     ASCII RIBBON CAMPAIGN
  X        AGAINST HTML MAIL
 / \


* occasional bitmap was Re: RFC - device names and mdadm with some reference to udev.
  2008-10-27 17:24         ` Doug Ledford
  2008-10-27 23:36           ` Neil Brown
  2008-10-28  6:32           ` Luca Berra
@ 2008-10-28  9:42           ` David Greaves
  2 siblings, 0 replies; 51+ messages in thread
From: David Greaves @ 2008-10-28  9:42 UTC (permalink / raw)
  To: Doug Ledford
  Cc: Kay Sievers, Andre Noll, martin f krafft, Neil Brown, linux-raid,
	Michal Marek

Doug Ledford wrote:
> Here's the issue (in Fedora at least).  Starting devices via udev means
> starting them as soon as they are capable and not waiting until all
> devices are up and running.  You have to do this in case the device is
> in a degraded state and you aren't going to get all the devices.

> However, we don't create a bitmap on devices by default in the installer
> (a user can add one themselves, but it isn't there by default).  Without
> the bitmap, if the device is written to before all devices are added, it
> triggers a full resync of the device.

What about creating a bitmap and only writing to it whilst the array is degraded?

That way you don't get the performance hit of a permanent bitmap (which I live
with) but you do get high speed incremental assembly.

The bitmap area may have to record the starting event count alongside it.
Then, as devices are added, if the starting event count matches, the bitmap
is used; if not, a full sync happens.
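
The comparison being proposed is small enough to sketch.  This is only an
illustration of the rule, with a made-up function name and arguments; the
real decision would have to be made inside the md driver.

```shell
#!/bin/sh
# Sketch only: resync_mode and its arguments are hypothetical; the real
# decision would live in the md driver, not in a shell script.
#   $1 = event count recorded in the bitmap when the array went degraded
#   $2 = event count found on the device being added back
resync_mode() {
    if [ "$1" -eq "$2" ]; then
        echo bitmap   # only the regions dirtied while degraded are copied
    else
        echo full     # histories diverged, so resync the whole device
    fi
}
```

A device whose event count equals the recorded starting count gets the
cheap bitmap-based catch-up; anything else falls back to a full sync.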

I don't think this needs anything special on shutdown.

David


-- 
"Don't worry, you'll be fine; I saw it work in a cartoon once..."


* Re: RFC - device names and mdadm with some reference to udev.
  2008-10-27 16:37       ` Kay Sievers
  2008-10-27 16:59         ` martin f krafft
  2008-10-27 17:24         ` Doug Ledford
@ 2008-10-27 17:30         ` Andre Noll
  2 siblings, 0 replies; 51+ messages in thread
From: Andre Noll @ 2008-10-27 17:30 UTC (permalink / raw)
  To: Kay Sievers
  Cc: Doug Ledford, martin f krafft, Neil Brown, linux-raid,
	Michal Marek


On 17:37, Kay Sievers wrote:
> >> This would then require that we have a working udev in our initrd
> >> images.  It would greatly increase the complexity of early booting as a
> >> result.
> >
> > Given that the initramfs usually contains busybox, one can also using
> > mdev. It's much simpler than udev and it's good enough if the only
> > thing you want to do is mounting the root partition that resides on
> > a software raid array.
> 
> Depends on your definition of "usual". Debian, Fedora, openSUSE,
> Ubuntu, Gentoo (as far as Gentoo counts as a distro with a default
> setup) none of them uses any busybox/mdev setup, and all use udev in
> initramfs.

At least Ubuntu's initramfs contains busybox and starts up a shell
during initramfs startup if the root partition could not be mounted.

Anyway, my point is that it's currently possible to mount a root
partition that resides on an md device without using udev.  Even plain
mknod instead of mdev is enough if you know exactly which device node
you have to create.
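
As an illustration of the mknod route, here is a sketch that prints the
command for a non-partitioned array.  The helper name is made up; the
only fact it relies on is that md uses block major 9 with the array
number as the minor.

```shell
#!/bin/sh
# Prints (does not run) the mknod command for a non-partitioned md array.
# md_mknod_cmd is a hypothetical helper name; block major 9 is the md
# driver's static major number.
md_mknod_cmd() {
    case $1 in
        ''|*[!0-9]*) echo "usage: md_mknod_cmd N" >&2; return 1 ;;
    esac
    echo "mknod /dev/md$1 b 9 $1"
}
```

In a rescue shell one would run the printed command as root and then
point mdadm at the resulting node.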

Of course you might want to do that only if your system doesn't boot
anymore. But I think it's important that in this rescue situation
also the new mdadm can bring up the system without support from udev.

In short: It's good to be able to do the plumbing manually in case
the porcelain isn't flushing.

Andre
-- 
The only person who always got his work done by Friday was Robinson Crusoe



* Re: RFC - device names and mdadm with some reference to udev.
  2008-10-27 15:13   ` Doug Ledford
  2008-10-27 16:10     ` Andre Noll
@ 2008-10-27 16:13     ` Kay Sievers
  1 sibling, 0 replies; 51+ messages in thread
From: Kay Sievers @ 2008-10-27 16:13 UTC (permalink / raw)
  To: Doug Ledford; +Cc: martin f krafft, Neil Brown, linux-raid, Michal Marek

On Mon, Oct 27, 2008 at 16:13, Doug Ledford <dledford@redhat.com> wrote:
> On Mon, 2008-10-27 at 09:22 +0100, martin f krafft wrote:
>> also sprach Neil Brown <neilb@suse.de> [2008.10.26.2356 +0100]:

>> > I'm also wondering if I should include a udev 'rules' file for md
>> > in the mdadm distribution.  Obviously it would be no more than
>> > a recommendation, but it might give me a voice in guiding how udev
>> > interacted with mdadm.
>>
>> I would really like to have a clear separation of competencies.
>> Ideally, mdadm never creates any devices but leaves it all to udev,
>> and all configuration about alternate names ("symlinks") is done in
>> the udev rules file.
>
> This would then require that we have a working udev in our initrd
> images.  It would greatly increase the complexity of early booting as a
> result.

There is no common distro that lacks udev there.

Kay


* Re: RFC - device names and mdadm with some reference to udev.
  2008-10-27  8:22 ` martin f krafft
  2008-10-27 15:13   ` Doug Ledford
@ 2008-10-27 22:37   ` Neil Brown
  2008-10-27 22:51     ` Kay Sievers
  2008-10-28  6:17   ` Luca Berra
  2 siblings, 1 reply; 51+ messages in thread
From: Neil Brown @ 2008-10-27 22:37 UTC (permalink / raw)
  To: martin f krafft; +Cc: linux-raid, Doug Ledford, Michal Marek, Kay Sievers

On Monday October 27, madduck@debian.org wrote:
> also sprach Neil Brown <neilb@suse.de> [2008.10.26.2356 +0100]:
> > Greeting.
> >  This is a Request For Comments....
> 
> Good morning!
> 
> [...]
> > I'm also wondering if I should include a udev 'rules' file for md
> > in the mdadm distribution.  Obviously it would be no more than
> > a recommendation, but it might give me a voice in guiding how udev
> > interacted with mdadm.
> 
> I would really like to have a clear separation of competencies.
> Ideally, mdadm never creates any devices but leaves it all to udev,
> and all configuration about alternate names ("symlinks") is done in
> the udev rules file.

Yes, I am moving towards this.  And it seems to be an idea with
resounding support judging by the follow-ups.  So I will probably go
even further than I was planning.
 - if mdadm detects that udev is active (how do I do that???) mdadm
   won't create anything in /dev, or remove anything from /dev, except
   for temporary devices with names that start with '.'.

 - if mdadm does not detect udev, it will still create the device
   and maybe some links.  And remove anything it might have created.

> 
> I know mdadm needs the devices for the ioctls(). However, much of
> what it does with ioctl should already be possible with /sys. Thus,
> in my ideal world, I imagine mdadm to be a manipulator of /sys,
> instructing the kernel to do stuff with components and arrays, and
> have udev create and remove corresponding devices in response to
> kernel events.

Most things can now be done via sysfs, but not everything (bitmaps is
the most obvious hole in the sysfs support).  And I still need to
support older kernels that don't have as many sysfs attributes.

However I do want to move towards using sysfs preferentially,
particularly for "mdadm --monitor".  I would rather that daemon didn't
ever open the device, as that can interfere with e.g. stopping the
array.  The "mdadm -D" calls from udev also need to not open the device.

> 
> I realise this would require a revamp of mdadm, and might actually
> be better done in a new software designed to eventually replace
> mdadm. But is this a way forward with which you could befriend
> yourself?

One issue that looms in my mind as I consider this is the usage of
mdadm when e.g. creating an array

   mdadm -C /dev/md5 -l5 -n3 /dev/sd[bcd]

I need to give the name of an array device (/dev/md5) that may not
exist but that mdadm doesn't now want to create.

Once I have created the array I might want to look at the details with

   mdadm --detail /dev/md5

The important role that the string "/dev/md5" is serving here is
providing a connection between the two commands.  Whatever I created in
the first is what I access in the second.

I could have mdadm accept a simple name

   mdadm -C foo -l5 -n3 /dev/sd[bcd]

but we would need to be clear on the semantics of that name.
For a v1.x metadata array, or a member of a DDF set, I could store
the name in the metadata and it could be a persistent name.
For v0.90, only numbers can be persistent names.
For DDF containers and IMSM(*) there is nowhere to store a name.
I could store it in /var/run/mdadm/map so that the "mdadm --detail"
could find it.  But in that case it wouldn't be permanent.

A related question is the creation and naming of arrays with other
metadata formats.  e.g. DDF or imsm.
Currently I do e.g.
   mdadm -C /dev/ddf1 --metadata=ddf ......
having the string 'ddf' twice annoys me.
Maybe I could allow

   mdadm -C ddf1  -n5 /dev/sd[abcde]

and have mdadm recognise the metadata format name in "ddf1" and use
that metadata type.

The thing I want to get right now is to put strict limits on names
that are allowed to be given as the array device name in Assemble,
Build, Create.  I can then add new ideas by allowing names to be given
that were illegal before.
So in the first instance, I'm thinking the array name can be:

  /dev/mdN
  /dev/md/N
  /dev/md_dN
  /dev/md/dN

  /dev/md/name-with-no-trailing-digit
           The metadata must store a name which matches the given name
  /dev/md/metadataname-with-trailing-digit-string
           The array must have the named metadata.

But I'm not completely sure of this.  And it might break existing
setups (but I've decided to live with that).
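
To make the list above concrete, here is a rough sketch of a checker for
those six forms.  Everything in it is illustrative: the function names
are made up, the output words are arbitrary, and the set of metadata
names (ddf, imsm) is an assumption rather than a settled list.

```shell
#!/bin/sh
# Sketch of the naming rules listed above. classify_array_name is a
# hypothetical helper, and the metadata list (ddf, imsm) is an assumption.

# Succeeds only for a non-empty string of decimal digits.
is_number() {
    case $1 in
        ''|*[!0-9]*) return 1 ;;
    esac
    return 0
}

# Prints "md N", "mdp N", "name NAME" or "metadata TYPE";
# prints "invalid" and returns 1 for anything else.
classify_array_name() {
    name=$1
    # /dev/md_dN and /dev/md/dN: partitionable array number N.
    for prefix in /dev/md_d /dev/md/d; do
        rest=${name#"$prefix"}
        if [ "$rest" != "$name" ] && is_number "$rest"; then
            echo "mdp $rest"; return 0
        fi
    done
    # /dev/md/N and /dev/mdN: plain array number N.
    for prefix in /dev/md/ /dev/md; do
        rest=${name#"$prefix"}
        if [ "$rest" != "$name" ] && is_number "$rest"; then
            echo "md $rest"; return 0
        fi
    done
    # Everything else must be a single path component under /dev/md/.
    base=${name#/dev/md/}
    [ "$base" != "$name" ] || { echo invalid; return 1; }
    case $base in
        ''|*/*) echo invalid; return 1 ;;
        *[0-9]) ;;                          # trailing digits: checked below
        *) echo "name $base"; return 0 ;;   # must match the stored name
    esac
    # Strip the trailing digit string and require a metadata name.
    stem=$base
    while case $stem in *[0-9]) true ;; *) false ;; esac; do
        stem=${stem%?}
    done
    case $stem in
        ddf|imsm) echo "metadata $stem"; return 0 ;;
    esac
    echo invalid; return 1
}
```

With these rules /dev/md/home is a stored name, /dev/md/ddf1 selects DDF
metadata, and /dev/md/home2 is rejected, which leaves room to give it a
meaning later.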

Thanks,
NeilBrown

* DDF = Disk Data Format.  An SNIA 'standard'
* IMSM = Intel Matrix Storage Manager
* SNIA = Storage Networking Industry Association.


* Re: RFC - device names and mdadm with some reference to udev.
  2008-10-27 22:37   ` Neil Brown
@ 2008-10-27 22:51     ` Kay Sievers
  2008-10-27 23:56       ` Neil Brown
  0 siblings, 1 reply; 51+ messages in thread
From: Kay Sievers @ 2008-10-27 22:51 UTC (permalink / raw)
  To: Neil Brown; +Cc: martin f krafft, linux-raid, Doug Ledford, Michal Marek

On Mon, Oct 27, 2008 at 23:37, Neil Brown <neilb@suse.de> wrote:
> On Monday October 27, madduck@debian.org wrote:
>> also sprach Neil Brown <neilb@suse.de> [2008.10.26.2356 +0100]:

>> I would really like to have a clear separation of competencies.
>> Ideally, mdadm never creates any devices but leaves it all to udev,
>> and all configuration about alternate names ("symlinks") is done in
>> the udev rules file.
>
> Yes, I am moving towards this.  And it seems to be an idea with
> resounding support judging by the follow-ups.  So I will probably go
> even further than I was planning.
>  - if mdadm detects that udev is active (how do I do that???)

Most tools just check if /dev/.udev/ exists.

>   mdadm
>   won't create anything in /dev, or remove anything from /dev, except
>   for temporary devices with names that start with '.'.

Fine.

>  - if mdadm does not detect udev, it will still create the device
>   and maybe some links.  And remove anything it might have created.

Sounds good.

>> I know mdadm needs the devices for the ioctls(). However, much of
>> what it does with ioctl should already be possible with /sys. Thus,
>> in my ideal world, I imagine mdadm to be a manipulator of /sys,
>> instructing the kernel to do stuff with components and arrays, and
>> have udev create and remove corresponding devices in response to
>> kernel events.
>
> Most things can now be done via sysfs, but not everything (bitmaps is
> the most obvious hole in the sysfs support).  And I still need to
> support older kernels that don't have as many sysfs attributes.
>
> However I do want to move towards using sysfs preferentially,
> particularly for "mdadm --monitor".  I would rather that daemon didn't
> ever open the device, as that can interfere with e.g. stopping the
> array.  The "mdadm -D" calls from udev also need to not open the device.

Sounds fine, but remember that udev will always look at every device's
content unless explicitly told not to. It will look for filesystem
signatures and other metadata at the beginning and the end of the
device.

> One issue that looms in my mind as I consider this is the Usage of
> mdadm when e.g. creating an array
>
>   mdadm -C /dev/md5 -l5 -n3 /dev/sd[bcd]
>
> I need to give the name of an array device (/dev/md5) that may not
> exist but that mdadm doesn't now want to create.
>
> Once I have created the array I might want to look at the details with
>
>   mdadm --detail /dev/md5
>
> The important role that the string "/dev/md5" is serving here is
> providing a connection between the two command.  Whatever I created in
> the first is what I access in the second.

Can't you just use the major/minor if there is no other meaningful
name? The devnum can not change on any system, and is always valid as
long as the kernel device exists.
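
A sketch of how a script could get at that number: the kernel exports it
as "major:minor" in the "dev" attribute of each block device in sysfs.
The helper name and the optional root argument are made up for
illustration.

```shell
#!/bin/sh
# md_devnum is a hypothetical helper. It prints the "major:minor" pair the
# kernel exports for a block device, e.g. "md_devnum md5".
# The optional second argument replaces /sys so the function can be tried
# against a scratch directory.
md_devnum() {
    devfile="${2:-/sys}/block/$1/dev"
    # No attribute means the kernel device does not exist (yet).
    [ -r "$devfile" ] || return 1
    cat "$devfile"
}
```

The lookup fails as soon as the kernel device is gone, which matches the
"valid as long as the kernel device exists" property.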

Kay


* Re: RFC - device names and mdadm with some reference to udev.
  2008-10-27 22:51     ` Kay Sievers
@ 2008-10-27 23:56       ` Neil Brown
  2008-10-28  0:20         ` Kay Sievers
  0 siblings, 1 reply; 51+ messages in thread
From: Neil Brown @ 2008-10-27 23:56 UTC (permalink / raw)
  To: Kay Sievers; +Cc: martin f krafft, linux-raid, Doug Ledford, Michal Marek

On Monday October 27, kay.sievers@vrfy.org wrote:
> On Mon, Oct 27, 2008 at 23:37, Neil Brown <neilb@suse.de> wrote:
> > On Monday October 27, madduck@debian.org wrote:
> >> also sprach Neil Brown <neilb@suse.de> [2008.10.26.2356 +0100]:
> 
> >> I would really like to have a clear separation of competencies.
> >> Ideally, mdadm never creates any devices but leaves it all to udev,
> >> and all configuration about alternate names ("symlinks") is done in
> >> the udev rules file.
> >
> > Yes, I am moving towards this.  And it seems to be an idea with
> > resounding support judging by the follow-ups.  So I will probably go
> > even further than I was planning.
> >  - if mdadm detects that udev is active (how do I do that???)
> 
> Most tools just check if /dev/.udev/ exists.

So we are checking if udev is configured rather than if it is
running.  I guess that is what we really want to check.  OK - thanks.
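
As a sketch, the check amounts to a single directory test.  The helper
name is made up, and the path is simply the one named in this thread;
other udev versions may keep their state elsewhere, hence the override.

```shell
#!/bin/sh
# udev_active is a hypothetical helper name. It only tests for the
# directory udev keeps its state in, so it answers "is udev set up on
# this /dev", not "is udevd running right now".
# The optional argument overrides the path, since the location is an
# assumption about the installed udev.
udev_active() {
    [ -d "${1:-/dev/.udev}" ]
}
```

A caller would then do something like "if udev_active; then leave /dev
alone; else create the nodes itself; fi".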

> > However I do want to move towards using sysfs preferentially,
> > particularly for "mdadm --monitor".  I would rather that daemon didn't
> > ever open the device, as that can interfere with e.g. stopping the
> > array.  The "mdadm -D" calls from udev also need to not open the device.
> 
> Sounds fine, but remember that udev will always look at every device's
> content if not explicitly told no to do. It will look for filesystem
> signatures and other metadata at the beginning and the end of the
> device.

Ahh yes, of course.
So if I want to be able to stop an array immediately after starting
(or changing) it (as I often do in test scripts, but may not need to
in real life) I need to wait for udev to settle.
So if I just need this in a script I can add
    udevadm settle
somewhere between the 'start' and the 'stop'.

I wonder if I ever want mdadm to call that directly?  I suspect that
if it got called from a udev rule it would deadlock, so I'd need to be
careful of that.
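
For the test-script case, a sketch of what that looks like.  The function
name is made up, and the MDADM/UDEVADM variables exist only so the
sequence can be dry-run with "echo"; the 30 second timeout is an
arbitrary choice.

```shell
#!/bin/sh
# Sketch of a test-script helper; start_settle_stop is a hypothetical name.
# MDADM and UDEVADM can be overridden (e.g. with "echo mdadm") for a dry run.
# Not meant to be called from a udev rule, where settle is suspected to
# end up waiting on the very event that invoked it.
MDADM=${MDADM:-mdadm}
UDEVADM=${UDEVADM:-udevadm}

start_settle_stop() {
    array=$1; shift
    $MDADM --assemble "$array" "$@" || return 1
    # Let udev finish probing the new device before we try to stop it.
    $UDEVADM settle --timeout=30 || return 1
    $MDADM --stop "$array"
}
```

Usage would be along the lines of
"start_settle_stop /dev/md5 /dev/sdb /dev/sdc /dev/sdd".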

> 
> > One issue that looms in my mind as I consider this is the Usage of
> > mdadm when e.g. creating an array
> >
> >   mdadm -C /dev/md5 -l5 -n3 /dev/sd[bcd]
> >
> > I need to give the name of an array device (/dev/md5) that may not
> > exist but that mdadm doesn't now want to create.
> >
> > Once I have created the array I might want to look at the details with
> >
> >   mdadm --detail /dev/md5
> >
> > The important role that the string "/dev/md5" is serving here is
> > providing a connection between the two command.  Whatever I created in
> > the first is what I access in the second.
> 
> Can't you just use the major/minor is there is no other meaningful
> name? The devnum can not change on any system, and is always valid as
> long as the kernel device exists.

Maybe.  Though in general I would rather that the user didn't allocate
the minor number.
I could get "mdadm --create" to report
   mdadm: created array as /dev/md42
or
   mdadm: created md array 42

and then you could simply
   mdadm --detail 42

however that would be awkward for scripts.

I don't need to require that a name be given, but I want to allow it.
And I need to stay at least a little bit compatible with current mdadm
usage and practices.

I'm definitely considering allowing

  mdadm --create md0 ....

(i.e. drop the '/dev/').  That isn't a big step in functionality, but
it might be an important step in perceptions.

Thanks, the picture is slowly becoming clearer.
NeilBrown


* Re: RFC - device names and mdadm with some reference to udev.
  2008-10-27 23:56       ` Neil Brown
@ 2008-10-28  0:20         ` Kay Sievers
  0 siblings, 0 replies; 51+ messages in thread
From: Kay Sievers @ 2008-10-28  0:20 UTC (permalink / raw)
  To: Neil Brown; +Cc: martin f krafft, linux-raid, Doug Ledford, Michal Marek

On Tue, Oct 28, 2008 at 00:56, Neil Brown <neilb@suse.de> wrote:
> On Monday October 27, kay.sievers@vrfy.org wrote:
>> On Mon, Oct 27, 2008 at 23:37, Neil Brown <neilb@suse.de> wrote:
>> > On Monday October 27, madduck@debian.org wrote:
>> >> also sprach Neil Brown <neilb@suse.de> [2008.10.26.2356 +0100]:
>>
>> >> I would really like to have a clear separation of competencies.
>> >> Ideally, mdadm never creates any devices but leaves it all to udev,
>> >> and all configuration about alternate names ("symlinks") is done in
>> >> the udev rules file.
>> >
>> > Yes, I am moving towards this.  And it seems to be an idea with
>> > resounding support judging by the follow-ups.  So I will probably go
>> > even further than I was planning.
>> >  - if mdadm detects that udev is active (how do I do that???)
>>
>> Most tools just check if /dev/.udev/ exists.
>
> So we are checking if udev is configured rather than if it is
> running.  I guess that is what we really want to check.  OK - thanks.

Yes, that's what Debian is checking for. Most other distros just don't
work at all without it because too much stuff depends on it today. We
have many dynamic minor numbers already in the kernel, and also the
extended dynamic block minors will make it really hard to run a system
without it.

>> > However I do want to move towards using sysfs preferentially,
>> > particularly for "mdadm --monitor".  I would rather that daemon didn't
>> > ever open the device, as that can interfere with e.g. stopping the
>> > array.  The "mdadm -D" calls from udev also need to not open the device.
>>
>> Sounds fine, but remember that udev will always look at every device's
>> content if not explicitly told no to do. It will look for filesystem
>> signatures and other metadata at the beginning and the end of the
>> device.
>
> Ahh yes, of course.
> So if I want to be able to stop an array immediately after starting
> (or changing) it (as I often do in test scripts, but may not need to
> in real life) I need to wait for udev to settle.
> So if I just need this in a script I can add
>    udevadm settle
> somewhere between the 'start' and the 'stop'.

You could just wait for the specific events you have created, by
watching the udev queue until that sequence of events has finished
(device-mapper does that).

Or we could add support to the kernel to return you the event seqnum
in the action you requested from the kernel, and your ioctl or
whatever else passes this number down to mdadm which will wait for
that specific event number to be finished.

> I wonder if I ever want mdadm to call that directly?

Don't know. Maybe not, especially not in the "dumb" default mode which
waits for _all_ events to finish, also new ones happening after your
action.

> I suspect that
> if it got called from a udev rule it would deadlock, so I'd need to be
> careful of that.

Yeah, the "dumb" wait-for-all would deadlock for 3 minutes and then
be killed by the udev event process. :)

>> > One issue that looms in my mind as I consider this is the Usage of
>> > mdadm when e.g. creating an array
>> >
>> >   mdadm -C /dev/md5 -l5 -n3 /dev/sd[bcd]
>> >
>> > I need to give the name of an array device (/dev/md5) that may not
>> > exist but that mdadm doesn't now want to create.
>> >
>> > Once I have created the array I might want to look at the details with
>> >
>> >   mdadm --detail /dev/md5
>> >
>> > The important role that the string "/dev/md5" is serving here is
>> > providing a connection between the two command.  Whatever I created in
>> > the first is what I access in the second.
>>
>> Can't you just use the major/minor is there is no other meaningful
>> name? The devnum can not change on any system, and is always valid as
>> long as the kernel device exists.
>
> Maybe.  Though in general I would rather that the user didn't allocate
> the minor number.
> I could get "mdadm --create" to report
>   mdadm: created array as /dev/md42
> or
>   mdadm: created md array 42
>
> and then you could simply
>   mdadm --detail 42
>
> however that would be awkward for scripts.
>
> I don't need to require that a name be given, but I want to allow it.
> An I need to stay at least a little bit compatible with current mdadm
> usage and practices.
>
> I'm definitely considering allowing
>
>  mdadm --create md0 ....
>
> (i.e. drop the '/dev/').  That isn't a big step in functionality, but
> it might be an important step in perceptions.

Sounds reasonable, yes.

Thanks,
Kay


* Re: RFC - device names and mdadm with some reference to udev.
  2008-10-27  8:22 ` martin f krafft
  2008-10-27 15:13   ` Doug Ledford
  2008-10-27 22:37   ` Neil Brown
@ 2008-10-28  6:17   ` Luca Berra
  2 siblings, 0 replies; 51+ messages in thread
From: Luca Berra @ 2008-10-28  6:17 UTC (permalink / raw)
  To: Neil Brown, linux-raid, Doug Ledford, Michal Marek, Kay Sievers

On Mon, Oct 27, 2008 at 09:22:57AM +0100, martin f krafft wrote:
>also sprach Neil Brown <neilb@suse.de> [2008.10.26.2356 +0100]:
>> Greeting.
>>  This is a Request For Comments....
>
>Good morning!
>
>[...]
>> I'm also wondering if I should include a udev 'rules' file for md
>> in the mdadm distribution.  Obviously it would be no more than
>> a recommendation, but it might give me a voice in guiding how udev
>> interacted with mdadm.
>
>I would really like to have a clear separation of competencies.
>Ideally, mdadm never creates any devices but leaves it all to udev,
>and all configuration about alternate names ("symlinks") is done in
>the udev rules file.
I would not.  mdadm should still be able to create the base device name
if needed.  Aliases could be left to udev.
>I know mdadm needs the devices for the ioctls(). However, much of
>what it does with ioctl should already be possible with /sys. Thus,
>in my ideal world, I imagine mdadm to be a manipulator of /sys,
/sys has the same issue as /dev:
there is no md sysfs node before the device is created.
Apart from this small issue, mdadm should be able to control md via
sysfs as well (i.e. sync_action).
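
A sketch of what driving sync_action from a script looks like.  The
helper name and the optional root argument are made up; the attribute
path and the action keywords are the ones the md driver exposes.  Note
how the existence check runs into exactly the issue above.

```shell
#!/bin/sh
# md_sync_action is a hypothetical helper. It writes one of the action
# keywords to the array's sync_action attribute, e.g.
#   md_sync_action md0 check
# The optional third argument replaces /sys for trying it on a scratch tree.
md_sync_action() {
    attr="${3:-/sys}/block/$1/md/sync_action"
    case $2 in
        idle|check|repair|resync|recover) ;;
        *) echo "unknown action: $2" >&2; return 1 ;;
    esac
    # The attribute only exists once the array device has been created.
    [ -e "$attr" ] || { echo "no such array: $1" >&2; return 1; }
    echo "$2" > "$attr"
}
```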

>I realise this would require a revamp of mdadm, and might actually
>be better done in a new software designed to eventually replace
>mdadm. But is this a way forward with which you could befriend
>yourself?
Please, NO.  It took ages to get people and some distributions to stop
using raidtools in favor of mdadm; don't make this mess again.

L.



-- 
Luca Berra -- bluca@comedia.it
        Communication Media & Services S.r.l.
 /"\
 \ /     ASCII RIBBON CAMPAIGN
  X        AGAINST HTML MAIL
 / \


* Re: RFC - device names and mdadm with some reference to udev.
  2008-10-26 22:56 RFC - device names and mdadm with some reference to udev Neil Brown
  2008-10-27  8:22 ` martin f krafft
@ 2008-10-27 12:41 ` Kay Sievers
  2008-10-27 13:23   ` David Lethe
                     ` (2 more replies)
  2008-10-30 17:18 ` RFC - device names and mdadm with some reference to udev Doug Ledford
  2 siblings, 3 replies; 51+ messages in thread
From: Kay Sievers @ 2008-10-27 12:41 UTC (permalink / raw)
  To: Neil Brown; +Cc: linux-raid, Doug Ledford, martin f. krafft, Michal Marek

On Sun, Oct 26, 2008 at 23:56, Neil Brown <neilb@suse.de> wrote:
>  Device naming in mdadm is a bit of a mess.

>  In 2.6.28, partitioned devices (mdp) wont be needed any more as md
>  will make use of the "extended partition" functionality recently
>  added.

You mean the extended minor space, right? Or the extended partitions,
which are a format in a msdos table?

>  1/ The only device nodes created will be /dev/mdX and /dev/md_dX
>    along with partitions /dev/mdXpY and /dev/md_dXpY as appropriate.
>    These will be created by mdadm in accordance with the "--auto"
>    flag unless something in mdadm.conf says to leave it to udev.
>    In that case, mdadm will create a temporary node
>    (/dev/.mdadm.whatever) and remove it once udev has created the
>    real thing.

Sounds fine, if mdadm needs a device node. It could also wait for udev
to have the node created, but having a temporary node sounds fine, as
long as it will not clash with anything udev is creating.

>  2/ There will be various symlinks to these devices.
>    a/ if "symlinks=yes" is given in mdadm.conf, symlinks from
>         /dev/md/X or /dev/md/dX will be created.
>    b/ if udev is configured like on Debian,
>              /dev/disk/by-id/md-name-XXXX
>        and   /dev/disk/by-id/md-uuid-UUUU
>       will be created (by udev).

Yes, almost all distros have that.

>    I'm contemplating creating a link based on the metadata type with
>    a sequential number. e.g. /dev/md/ddf1 or /dev/md/imsm2.
>    I'm not sure if there should be in /dev/md/ or directly in /dev/.
>    I'm also not sure if I should leave the creation to udev, and
>    whether I should use a small sequential number, or just whatever
>    number was allocated as the minor number of the device.

There is intentionally no support for enumeration in udev, it will
just not work and such numbers/links are not reproducible in hotplug
environments, and therefore totally useless, and do much more harm
than good.

Nothing must ever depend on enumeration, or minor numbers, if these
properties cannot be made persistent and attached to the device itself, so
that it will always show up with the same number forever. Better not to
even start down that path, and leave the kernel name as the primary
"random" number, instead of creating new randomness on top.

>  4/ When we stop an array, mdadm will remove anything from /dev that
>    it probably created.

Sure, but only if mdadm has created it.

>    In particular, it will remove the device node as described in 1,
>    any partitions, and any symlinks in /dev or /dev/md which point to
>    any of those.  I need to be certain that this won't confuse udev.

You must never touch anything that udev has created. It must be driven
by kernel "add/remove/change" events.
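
One way to keep to "remove only what you created" is for the creating
tool to record each node it makes and later remove exactly those.  The
sketch below is only an illustration: the helper names and the manifest
path are assumptions, not anything mdadm does today.

```shell
#!/bin/sh
# Hypothetical helpers; the manifest path is an assumption, not an mdadm file.
MANIFEST=${MANIFEST:-/var/run/mdadm/created-nodes}

# Remember a path that this tool (not udev) created.
record_created() {
    echo "$1" >> "$MANIFEST"
}

# Remove only the recorded paths, then forget them.
remove_created() {
    [ -f "$MANIFEST" ] || return 0
    while IFS= read -r path; do
        rm -f -- "$path"
    done < "$MANIFEST"
    rm -f -- "$MANIFEST"
}
```

Anything udev put in /dev never appears in the manifest, so it is never
touched.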

>  1/ People want auto-assembly.  I've always fought against it (we
>    don't auto-mount all filesystems do we?).

Some systems do automount all devices. Most systems automount only
hotplugged devices that are not listed in /etc/fstab. Expect that in the
future there will always be auto-assembly and also auto-mounting to some
degree. All the newer storage buses, like iSCSI, will always need
auto-mounting on device discovery, and will not work with any
bootup-script logic.

>    But it is a loosing
>    battle.  And on a modern desktop, when you plug in a new drive the
>    filesystem is automatically mounted.  So my argument is falling
>    apart.

Yes, we will need to support that as a common setup.

> I'm also wondering if I should include a udev 'rules' file for md in
> the mdadm distribution.  Obviously it would be no more than a
> recommendation, but it might give me a voice in guiding how udev
> interacted with mdadm.

Definitely, it should carry a udev rules file which instructs udev to
create all intended symlinks and also supports the raid auto-assembly
setup. It should not mount anything by default though.

I'm happy to see you working on next-generation mdadm. I'd like to see
a better integration with udev, and especially, if mdadm detects a
running udev, not to mess around in /dev in any way, but leave the
names in /dev to instructions in udev rules. Temporary nodes are fine,
as long as they don't conflict with anything else, and get removed
after they are not needed anymore. All updates to symlinks and such
should be done by "change" events from the kernel, which instructs
udev to update all the links, and not by touching anything in /dev
from mdadm.

Do you think mdadm will stay a program only, called by udev/the user,
or will a port of its functionality live in a daemon?

Thanks,
Kay


* RE: RFC - device names and mdadm with some reference to udev.
  2008-10-27 12:41 ` Kay Sievers
@ 2008-10-27 13:23   ` David Lethe
  2008-10-27 23:27     ` Neil Brown
  2008-10-27 13:24   ` Andre Noll
  2008-10-27 23:23   ` Neil Brown
  2 siblings, 1 reply; 51+ messages in thread
From: David Lethe @ 2008-10-27 13:23 UTC (permalink / raw)
  To: Kay Sievers, Neil Brown
  Cc: linux-raid, Doug Ledford, martin f. krafft, Michal Marek

> -----Original Message-----
> From: linux-raid-owner@vger.kernel.org [mailto:linux-raid-
> owner@vger.kernel.org] On Behalf Of Kay Sievers
> Sent: Monday, October 27, 2008 7:42 AM
> To: Neil Brown
> Cc: linux-raid@vger.kernel.org; Doug Ledford; martin f. krafft; Michal
> Marek
> Subject: Re: RFC - device names and mdadm with some reference to udev.
> 
> On Sun, Oct 26, 2008 at 23:56, Neil Brown <neilb@suse.de> wrote:
> >  Device naming in mdadm is a bit of a mess.
> 
> >  In 2.6.28, partitioned devices (mdp) wont be needed any more as md
> >  will make use of the "extended partition" functionality recently
> >  added.
> 
> You mean the extended minor space, right? Or the extended partitions,
> which are a format in a msdos table?
> 
> >  1/ The only device nodes created will be /dev/mdX and /dev/md_dX
> >    along with partitions /dev/mdXpY and /dev/md_dXpY as appropriate.
> >    These will be created by mdadm in accordance with the "--auto"
> >    flag unless something in mdadm.conf says to leave it to udev.
> >    In that case, mdadm will create a temporary node
> >    (/dev/.mdadm.whatever) and remove it once udev has created the
> >    real thing.
> 
> Sounds fine, if mdadm needs a device node. It could also wait for udev
> to have the node created, but having a temporary node sounds fine, as
> long as it will not clash with anything udev is creating.
> 
> >  2/ There will be various symlinks to these devices.
> >    a/ if "symlinks=yes" is given in mdadm.conf, symlinks from
> >         /dev/md/X or /dev/md/dX will be created.
> >    b/ if udev is configured like on Debian,
> >              /dev/disk/by-id/md-name-XXXX
> >        and   /dev/disk/by-id/md-uuid-UUUU
> >       will be created (by udev).
> 
> Yes, almost all distros have that.
> 
> >    I'm contemplating creating a link based on the metadata type with
> >    a sequential number. e.g. /dev/md/ddf1 or /dev/md/imsm2.
> >    I'm not sure if there should be in /dev/md/ or directly in /dev/.
> >    I'm also not sure if I should leave the creation to udev, and
> >    whether I should use a small sequential number, or just whatever
> >    number was allocated as the minor number of the device.
> 
> There is intentionally no support for enumeration in udev, it will
> just not work and such numbers/links are not reproducible in hotplug
> environments, and therefore totally useless, and do much more harm
> than good.
> 
> Nothing must ever depend on enumeration, or minor numbers, if these
> properties can not made persistent, attached  to the device itself, so
> that it will always show up with the same number forever. Better do
> not even start such an idea, and leave the kernel name as the primary
> "random" number, instead of creating new randomness on top.
> 
> >  4/ When we stop an array, mdadm will remove anything from /dev that
> >    it probably created.
> 
> Sure, but only if mdadm has it created.
> 
> >    In particular, it will remove the device node as described in 1,
> >    any partitions, and any symlinks in /dev or /dev/md which point to
> >    any of those.  I need to be certain that this won't confuse udev.
> 
> You must never touch anything that udev has created. It must be driven
> by kernel "add/remove/change" events.
> 
> >  1/ People want auto-assembly.  I've always fought against it (we
> >    don't auto-mount all filesystems do we?).
> 
> Some systems do automount all devices. Most systems do only hotplug
> devices which are not listed in /etc/fstab. Expect in the future that
> there will always be auto-assembly and also auto-mounting to some
> degree. All the newer storage buses, like iSCSI and such will always
> need  auto-mounting on device discovery, and not work with any
> bootup-script logic.
> 
> >    But it is a loosing
> >    battle.  And on a modern desktop, when you plug in a new drive the
> >    filesystem is automatically mounted.  So my argument is falling
> >    apart.
> 
> Yes, we will need to support that as a common setup.
> 
> > I'm also wondering if I should include a udev 'rules' file for md in
> > the mdadm distribution.  Obviously it would be no more than a
> > recommendation, but it might give me a voice in guiding how udev
> > interacted with mdadm.
> 
> Definitely, it should carry a udev rules file which instructs udev to
> create all intended symlinks and also supports the raid auto-assembly
> setup. It should not mount anything by default though.
> 
> I'm happy, to see you working on next-generation mdadm. I like to see
> a better integration with udev, and especially, if mdadm detects a
> running udev, not to mess around in /dev in any way, but leave the
> names in /dev to instructions in udev rules. Temporary nodes are fine,
> as long as they don't conflict with anything else, and get removed
> after they are not needed anymore. All updates to symlinks and such
> should be done by "change" events from the kernel, which instructs
> udev to update all the links, and not by touching anything in /dev
> from mdadm.
> 
> Do you think mdadm will stay a program only, called by udev/the user,
> or will a port of its functionality live in a daemon?
> 
> Thanks,
> Kay
> --

I am with Kay here, never force automount.
I put that right up there with the bonehead MSFT rule of trying to write
signatures on disk drives once they appear.

Furthermore, don't just delete /dev/md names.  That would be an even
greater mistake.  LINUX today has storage on SANs, clustering,
multi-tasking, multi-pathing, and SAN-management/monitoring software that
will be using the device paths that you want to delete.

I can't think of a simple fix, but can think of a complicated fix to make
this play nice in such environments, when things are good .. and when
things go bad.  My outside-the-box suggestion is to present md target
devices as a SCSI RAID controller or processor device where you use
ANSI-defined sense keys/ASC values to allow apps that are running remotely
or even locally to query immediate state.  If the md device is broken,
then report the same sense information (not ready, spun down, whatever ...)
that a physical disk would report for the various partitions.
More importantly, use EVPD Inquiry and log pages to query configuration
information of both the /dev/md device AND all of the partitions, along
with health and anything else.  Enterprise management software wouldn't
have to log into the LINUX host and run custom scripts to see what is
going on.  Use mode sense to send control/configuration change requests.

ANSI provides a mechanism and options for defining a unique naming
convention, and you can even add a UUID in the format you want as a
vendor-specific layout.  There is already a foundation for such work due
to the iSCSI logic, but obviously much more work is required.

Yes, this is not a simple & easy fix, but if you want to future-proof
everything and make LINUX storage easy to integrate into heterogeneous
environments, then let ANSI be your guide.
David


^ permalink raw reply	[flat|nested] 51+ messages in thread

* RE: RFC - device names and mdadm with some reference to udev.
  2008-10-27 13:23   ` David Lethe
@ 2008-10-27 23:27     ` Neil Brown
  2008-10-27 23:48       ` David Lethe
  0 siblings, 1 reply; 51+ messages in thread
From: Neil Brown @ 2008-10-27 23:27 UTC (permalink / raw)
  To: David Lethe
  Cc: Kay Sievers, linux-raid, Doug Ledford, martin f. krafft,
	Michal Marek

On Monday October 27, david@santools.com wrote:
> 
> I am with Kay here, never force automount.  
> I put that right up there with the bonehead MSFT rule of trying to write
> signatures on disk drives once they appear.  

But you wouldn't mind if something that looked like it might have once
been a raid1 started a resync as soon as you plugged it in?

> 
> Furthermore, don't just delete /dev/md names.   That would be even a greater
> mistake.  LINUX today has storage on SANs, clustering, multi-tasking, multi-pathing,
> SAN-management/monitoring software that will be using device paths that you want to
> delete.  

I don't understand.  If the md array has been explicitly stopped, why
not remove the names from /dev?  They have no meaning any more.  And
nothing can have them open.


> 
> I can't think of a simple fix, but can think of a complicated fix to make this play
> nice in such environments, when things are good .. and when things go bad.   My outside-
> the-box suggestion is to present md target devices as a SCSI RAID controller or
> processor device where you use ANSI-defined sense keys/ASC values to allow 
> apps that are running remotely or even
> locally to query immediate state.   If the md device is broken, then report the same sense
> information not ready, spun down, whatever ... that a physical disk would report for 
> various partition.
> More importantly, use EVPD Inquiry and log pages to query configuration information, of
> both the /dev/md device, AND all of the partitions, along with health and anything else.
> Enterprise management software wouldn't have to log into the LINUX host and run custom
> scripts to see what is going on. Use mode sense to send control/configuration change requests.
> 
> The ANSI provides a mechanism and options for defining a unique naming convention, and you can
> even add a UUID in the format you want as a Vendor-specific layout.   There is already a foundation
> for such work due to the iSCSI logic, but obviously much more work is required.
> 
> Yes, this not a simple & easy fix, but if you want to future-proof everything and make LINUX storage
> easy to integrate into heterogeneous environments, then let ANSI be your guide.

What I think you are suggesting is that md raid be exportable via
iSCSI (or FCOE or AOE or flavor-of-the-month) in such a way that
status-query commands 'do the right thing'.  Sounds like we want a
plug-in for iscsid (or whatever it is that supports the iSCSI service).

Is that what you mean?

Thanks,
NeilBrown

^ permalink raw reply	[flat|nested] 51+ messages in thread

* RE: RFC - device names and mdadm with some reference to udev.
  2008-10-27 23:27     ` Neil Brown
@ 2008-10-27 23:48       ` David Lethe
  0 siblings, 0 replies; 51+ messages in thread
From: David Lethe @ 2008-10-27 23:48 UTC (permalink / raw)
  To: Neil Brown
  Cc: Kay Sievers, linux-raid, Doug Ledford, martin f. krafft,
	Michal Marek



> -----Original Message-----
> From: Neil Brown [mailto:neilb@suse.de]
> Sent: Monday, October 27, 2008 6:28 PM
> To: David Lethe
> Cc: Kay Sievers; linux-raid@vger.kernel.org; Doug Ledford; martin f.
> krafft; Michal Marek
> Subject: RE: RFC - device names and mdadm with some reference to udev.
> 
> On Monday October 27, david@santools.com wrote:
> >
> > I am with Kay here, never force automount.
> > I put that right up there with the bonehead MSFT rule of trying to
> write
> > signatures on disk drives once they appear.
> 
> But you wouldn't mind if something that looked like it might have once
> been a raid1 started a resync as soon as you plugged it in?
> 

This could be addressed via EVPD pages.  An application that cares what
a raid1 is at any instant in time, or what it might have been, can ask it.
Furthermore, the benefit of doing this is that those applications don't
need access to the LINUX machine or O/S in any way to get the info, and
would be O/S-, shell-, and hardware-agnostic.

> >
> > Furthermore, don't just delete /dev/md names.   That would be even a
> greater
> > mistake.  LINUX today has storage on SANs, clustering,
multi-tasking,
> multi-pathing,
> > SAN-management/monitoring software that will be using device paths
> that you want to
> > delete.
> 
> I don't understand.  If the md array has been explicitly stopped, why
> not remove the names from /dev.  They have no meaning any more.  And
> nothing can have them open.
> 

They have no meaning to mdadm and the programs that explicitly know that
md was stopped.  What if /dev/mdX was opened by another app and is in use?
I do not know for sure, but I bet there are corner cases where you
wouldn't be able to remove device names even if you wanted to.  This is
just an educated guess, but I suggest that people experienced with
clustering, infiniband-connected nodes and such get the opportunity to
give their opinion on this one.


> >
> > I can't think of a simple fix, but can think of a complicated fix to
> make this play
> > nice in such environments, when things are good .. and when things
go
> bad.   My outside-
> > the-box suggestion is to present md target devices as a SCSI RAID
> controller or
> > processor device where you use ANSI-defined sense keys/ASC values to
> allow
> > apps that are running remotely or even
> > locally to query immediate state.   If the md device is broken, then
> report the same sense
> > information not ready, spun down, whatever ... that a physical disk
> would report for
> > various partition.
> > More importantly, use EVPD Inquiry and log pages to query
> configuration information, of
> > both the /dev/md device, AND all of the partitions, along with
health
> and anything else.
> > Enterprise management software wouldn't have to log into the LINUX


> host and run custom
> > scripts to see what is going on. Use mode sense to send
> control/configuration change requests.
> >
> > The ANSI provides a mechanism and options for defining a unique
> naming convention, and you can
> > even add a UUID in the format you want as a Vendor-specific layout.
> There is already a foundation
> > for such work due to the iSCSI logic, but obviously much more work
is
> required.
> >
> > Yes, this not a simple & easy fix, but if you want to future-proof
> everything and make LINUX storage
> > easy to integrate into heterogeneous environments, then let ANSI be
> your guide.
> 
> What I think you are suggesting is that md raid be exportable via
> iSCSI (or FCOE or AOE or flavor-of-the-month) in such a way that
> status-query commands 'do the right thing'.  Sounds like we want a
> plug-in for iscsid (or whatever it is that support iscsi service).
> 
> Is that what you mean?
> 
> Thanks,
> NeilBrown

iSCSI would be easier, but as long as you asked for suggestions ... I
would prefer that somebody magically write all of the code that would
allow one to add true physical-port target devices, so that if you had a
SAS, SCSI, or FC card you could hook up multiple hosts, or switches.
Then you would have a foundation for multiple concurrent host
connectivity on md-based volumes in addition to individual disks.
Instant SAN.
But iSCSI would effectively solve the problem of having an appropriate
method for communicating health and state across a SAN or WAN, and you
wouldn't even have to write code to export the md device as a SCSI
device type 0 (i.e., disk drive).

(Hey, you asked for suggestions ... so consider this my letter to Santa)




^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: RFC - device names and mdadm with some reference to udev.
  2008-10-27 12:41 ` Kay Sievers
  2008-10-27 13:23   ` David Lethe
@ 2008-10-27 13:24   ` Andre Noll
  2008-10-27 14:20     ` Kay Sievers
  2008-10-27 23:23   ` Neil Brown
  2 siblings, 1 reply; 51+ messages in thread
From: Andre Noll @ 2008-10-27 13:24 UTC (permalink / raw)
  To: Kay Sievers
  Cc: Neil Brown, linux-raid, Doug Ledford, martin f. krafft,
	Michal Marek

[-- Attachment #1: Type: text/plain, Size: 3275 bytes --]

On 13:41, Kay Sievers wrote:
> >  1/ The only device nodes created will be /dev/mdX and /dev/md_dX
> >    along with partitions /dev/mdXpY and /dev/md_dXpY as appropriate.
> >    These will be created by mdadm in accordance with the "--auto"
> >    flag unless something in mdadm.conf says to leave it to udev.
> >    In that case, mdadm will create a temporary node
> >    (/dev/.mdadm.whatever) and remove it once udev has created the
> >    real thing.
> 
> Sounds fine, if mdadm needs a device node. It could also wait for udev
> to have the node created, but having a temporary node sounds fine, as
> long as it will not clash with anything udev is creating.

IMHO we should try to avoid creating device nodes from mdadm whenever
possible. OTOH it should be possible to assemble a raid array manually
i.e. without udev, for example when used from a rescue system.

> >    In particular, it will remove the device node as described in 1,
> >    any partitions, and any symlinks in /dev or /dev/md which point to
> >    any of those.  I need to be certain that this won't confuse udev.
> 
> You must never touch anything that udev has created. It must be driven
> by kernel "add/remove/change" events.

I think it's no problem to let mdadm generate such events rather than
messing with device nodes itself. BTW: What's the preferred way to
wait for the generation of the device node after an appropriate event
has been generated? Polling?

> Some systems do automount all devices. Most systems do only hotplug
> devices which are not listed in /etc/fstab. Expect in the future that
> there will always be auto-assembly and also auto-mounting to some
> degree. All the newer storage buses, like iSCSI and such will always
> need  auto-mounting on device discovery, and not work with any
> bootup-script logic.

The nice thing is that this kind of auto-mounting can be handled from
user space. So we _might_ get rid of the in-kernel raid autodetect
code eventually.

> > I'm also wondering if I should include a udev 'rules' file for md in
> > the mdadm distribution.  Obviously it would be no more than a
> > recommendation, but it might give me a voice in guiding how udev
> > interacted with mdadm.
> 
> Definitely, it should carry a udev rules file which instructs udev to
> create all intended symlinks and also supports the raid auto-assembly
> setup. It should not mount anything by default though.

How about distributing such a rules file together with udev? As far as
I understand it, you (Kay) are currently trying to unify the different
udev configurations that exist in the wild. If the udev source code
contained a rules file for md, this would already help.

Moreover, changes to udev that require modifications of the md rules
file might happen more frequently than changes to md that require
such modifications ;)

> Do you think mdadm will stay a program only, called by udev/the user,
> or will a port of its functionality live in a daemon?

mdadm already has a daemon mode. It's currently used only as a
monitoring tool (to send alert mails on disk failures) but this could
be extended if necessary.

Andre
-- 
The only person who always got his work done by Friday was Robinson Crusoe

[-- Attachment #2: Digital signature --]
[-- Type: application/pgp-signature, Size: 189 bytes --]

^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: RFC - device names and mdadm with some reference to udev.
  2008-10-27 13:24   ` Andre Noll
@ 2008-10-27 14:20     ` Kay Sievers
  0 siblings, 0 replies; 51+ messages in thread
From: Kay Sievers @ 2008-10-27 14:20 UTC (permalink / raw)
  To: Andre Noll
  Cc: Neil Brown, linux-raid, Doug Ledford, martin f. krafft,
	Michal Marek

On Mon, Oct 27, 2008 at 14:24, Andre Noll <maan@systemlinux.org> wrote:
> On 13:41, Kay Sievers wrote:
>> >  1/ The only device nodes created will be /dev/mdX and /dev/md_dX
>> >    along with partitions /dev/mdXpY and /dev/md_dXpY as appropriate.
>> >    These will be created by mdadm in accordance with the "--auto"
>> >    flag unless something in mdadm.conf says to leave it to udev.
>> >    In that case, mdadm will create a temporary node
>> >    (/dev/.mdadm.whatever) and remove it once udev has created the
>> >    real thing.
>>
>> Sounds fine, if mdadm needs a device node. It could also wait for udev
>> to have the node created, but having a temporary node sounds fine, as
>> long as it will not clash with anything udev is creating.
>
> IMHO we should try to avoid creating device nodes from mdadm whenever
> possible. OTOH it should be possible to assemble a raid array manually
> i.e. without udev, for example when used from a rescue system.

Sounds good, yes.

>> >    In particular, it will remove the device node as described in 1,
>> >    any partitions, and any symlinks in /dev or /dev/md which point to
>> >    any of those.  I need to be certain that this won't confuse udev.
>>
>> You must never touch anything that udev has created. It must be driven
>> by kernel "add/remove/change" events.
>
> I think it's no problem to let mdadm generate such events rather than
> messing with device nodes itself. BTW: What's the preferred way to
> wait for the generation of the device node after an appropriate event
> has been generated? Polling?

There are several ways, but none of them is really simple because of
the completely asynchronous behavior of udev.

A daemon could listen for events from udev, the same way "udevadm
monitor" does: when it prints "UDEV[] add /.../block/md0", we can be sure
the node exists, and the event is fully handled. The event environment
also contains the device name and all the symlink names.

Or we need to loop until the event sequence number is handled (device
mapper is doing that), or loop until the device node/link is there.
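
The last option (loop until the device node/link is there) could be
sketched roughly as below. This is only an illustration, not how mdadm or
device mapper do it: the function name wait_for_node and its timeout and
interval parameters are made up for this example, and a caller would pass
whatever node or symlink it expects udev to create.

```python
import os
import time


def wait_for_node(path, timeout=5.0, interval=0.05):
    """Poll until `path` exists or `timeout` seconds have passed.

    Returns True if the node (or symlink target) showed up in time,
    False otherwise. Hypothetical helper, for illustration only.
    """
    deadline = time.monotonic() + timeout
    while True:
        if os.path.exists(path):
            return True
        if time.monotonic() >= deadline:
            return False
        time.sleep(interval)
```

Polling is the simplest of the three approaches but also the crudest: it
says nothing about whether udev has finished handling the event, only that
the name is now present.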

>> Some systems do automount all devices. Most systems do only hotplug
>> devices which are not listed in /etc/fstab. Expect in the future that
>> there will always be auto-assembly and also auto-mounting to some
>> degree. All the newer storage buses, like iSCSI and such will always
>> need  auto-mounting on device discovery, and not work with any
>> bootup-script logic.

>> > I'm also wondering if I should include a udev 'rules' file for md in
>> > the mdadm distribution.  Obviously it would be no more than a
>> > recommendation, but it might give me a voice in guiding how udev
>> > interacted with mdadm.
>>
>> Definitely, it should carry a udev rules file which instructs udev to
>> create all intended symlinks and also supports the raid auto-assembly
>> setup. It should not mount anything by default though.
>
> How about distributing such a rules file together with udev? As far as
> I understand it, you (Kay) are currently trying to unify the different
> udev configurations that exist in the wild. If the udev source code
> contained a rules file for md, this would already help.
>
> Moreover, changes to udev that require modifications of the md rules
> file might happen more frequently than changes to md that require
> such modifications ;)

Works both ways. Doesn't really matter. We have many packages today
installing their own rules files, which is good if systems do not use
a specific piece of software, so that they don't needlessly match udev
rules with every event, for something that will never exist.

Kay

^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: RFC - device names and mdadm with some reference to udev.
  2008-10-27 12:41 ` Kay Sievers
  2008-10-27 13:23   ` David Lethe
  2008-10-27 13:24   ` Andre Noll
@ 2008-10-27 23:23   ` Neil Brown
  2008-10-28  0:03     ` Kay Sievers
                       ` (2 more replies)
  2 siblings, 3 replies; 51+ messages in thread
From: Neil Brown @ 2008-10-27 23:23 UTC (permalink / raw)
  To: Kay Sievers; +Cc: linux-raid, Doug Ledford, martin f. krafft, Michal Marek

On Monday October 27, kay.sievers@vrfy.org wrote:
> On Sun, Oct 26, 2008 at 23:56, Neil Brown <neilb@suse.de> wrote:
> >  Device naming in mdadm is a bit of a mess.
> 
> >  In 2.6.28, partitioned devices (mdp) wont be needed any more as md
> >  will make use of the "extended partition" functionality recently
> >  added.
> 
> You mean the extended minor space, right? Or the extended partitions,
> which are a format in a msdos table?
> 

Yes, the extended minor space.  Too many extensions here :-)

> >  1/ The only device nodes created will be /dev/mdX and /dev/md_dX
> >    along with partitions /dev/mdXpY and /dev/md_dXpY as appropriate.
> >    These will be created by mdadm in accordance with the "--auto"
> >    flag unless something in mdadm.conf says to leave it to udev.
> >    In that case, mdadm will create a temporary node
> >    (/dev/.mdadm.whatever) and remove it once udev has created the
> >    real thing.
> 
> Sounds fine, if mdadm needs a device node. It could also wait for udev
> to have the node created, but having a temporary node sounds fine, as
> long as it will not clash with anything udev is creating.

mdadm definitely does need a device node.  Currently opening a
block-special-device is the only way to create an md array.  I have
contemplated some approaches using sysfs, but I could never see that
they actually gained me anything.
It should not need to wait for anything though.  It can just keep
using the temporary node it created.

> 
> >  2/ There will be various symlinks to these devices.
> >    a/ if "symlinks=yes" is given in mdadm.conf, symlinks from
> >         /dev/md/X or /dev/md/dX will be created.
> >    b/ if udev is configured like on Debian,
> >              /dev/disk/by-id/md-name-XXXX
> >        and   /dev/disk/by-id/md-uuid-UUUU
> >       will be created (by udev).
> 
> Yes, almost all distros have that.

But in different places. 
Debian has /etc/udev/rules
openSUSE has /lib/udev/rules

I love standards.  There are so many to choose from. :-)

Is there anywhere else I should get 'make install' to check?

Debian doesn't get 'mdp' devices right.
openSUSE is already ready for 2.6.28 in which all md devices can be
partitioned!

> 
> >    I'm contemplating creating a link based on the metadata type with
> >    a sequential number. e.g. /dev/md/ddf1 or /dev/md/imsm2.
> >    I'm not sure if there should be in /dev/md/ or directly in /dev/.
> >    I'm also not sure if I should leave the creation to udev, and
> >    whether I should use a small sequential number, or just whatever
> >    number was allocated as the minor number of the device.
> 
> There is intentionally no support for enumeration in udev, it will
> just not work and such numbers/links are not reproducible in hotplug
> environments, and therefore totally useless, and do much more harm
> than good.

I'm not entirely following your logic here.
The 'a' 'b' 'c' at the end of e.g. /dev/sda are not reproducible in
hotplug environments, but they are not totally useless.  I know they
will remain stable as long as the device is present, so once I found
out which /dev/sdX is my USB thingo I just plugged in, I can
repeatedly use that nice simple name to cfdisk, mkfs, mount, whatever
the device.

> 
> Nothing must ever depend on enumeration, or minor numbers, if these
> properties can not made persistent, attached  to the device itself, so
> that it will always show up with the same number forever. Better do
> not even start such an idea, and leave the kernel name as the primary
> "random" number, instead of creating new randomness on top.

I'm not entirely convinced.  However I can see a real difficulty in
introducing a new sequence number.  Thus if a 'ddf' array gets created
as /dev/md15, then if I want to create a name containing the string
'ddf', it should be 'ddf15'.  Maybe /dev/ddf15.  Maybe /dev/md/ddf15.
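
A minimal sketch of that naming idea, reusing the kernel's number instead
of inventing a new sequence. The helper name metadata_link_name is
hypothetical, and it only handles plain mdN kernel names (not md_dN):

```python
import re


def metadata_link_name(kernel_name, metadata):
    """Build a metadata-based name from the kernel name, e.g. md15 -> ddf15.

    Hypothetical helper: the number is taken from the kernel name rather
    than allocated separately, so no second enumeration is introduced.
    """
    match = re.fullmatch(r"md(\d+)", kernel_name)
    if match is None:
        raise ValueError("not a plain md kernel name: %r" % kernel_name)
    return metadata + match.group(1)
```

Whether the result should live in /dev or /dev/md is the open question
above; the sketch only produces the final path component.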

> 
> >  4/ When we stop an array, mdadm will remove anything from /dev that
> >    it probably created.
> 
> Sure, but only if mdadm has it created.

Why would it matter?

Hmmm... Does udev ever delete things from /dev?  I notice that 'md'
devices don't seem to disappear.  Maybe that is because /sys/block/mdX
never disappears (last time I tried it was too racy).
Would there be any way to get udev to delete devices when 
  /sys/block/mdX/md/array_state 
becomes 'clear' (presumably on a CHANGE event) ??
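
For illustration, the check itself (outside of any udev rule) might look
like the sketch below. The function names and the sysfs_root parameter are
assumptions made for this example; only the array_state path and the
'clear' value come from the text above.

```python
import os


def array_state(kernel_name, sysfs_root="/sys"):
    """Return the contents of block/<name>/md/array_state, or None if unreadable."""
    path = os.path.join(sysfs_root, "block", kernel_name, "md", "array_state")
    try:
        with open(path) as state_file:
            return state_file.read().strip()
    except OSError:
        return None


def is_stopped(kernel_name, sysfs_root="/sys"):
    """True only when array_state reads 'clear' (hypothetical helper)."""
    return array_state(kernel_name, sysfs_root) == "clear"
```

Something would still have to run this on each CHANGE event; how to hook
that into udev is exactly what is being asked.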

> 
> >    In particular, it will remove the device node as described in 1,
> >    any partitions, and any symlinks in /dev or /dev/md which point to
> >    any of those.  I need to be certain that this won't confuse udev.
> 
> You must never touch anything that udev has created. It must be driven
> by kernel "add/remove/change" events.

Again - why?  I notice that if I do remove the device nodes when the
array is stopped, they still get nicely recreated when I restart the
array.
However I seem to have decided to make a clear distinction between
when udev is running or not, so I'll not remove anything if I think
udev is running.

> 
> > I'm also wondering if I should include a udev 'rules' file for md in
> > the mdadm distribution.  Obviously it would be no more than a
> > recommendation, but it might give me a voice in guiding how udev
> > interacted with mdadm.
> 
> Definitely, it should carry a udev rules file which instructs udev to
> create all intended symlinks and also supports the raid auto-assembly
> setup. It should not mount anything by default though.

But why is 'mounting' so much different to 'assembling' in people's
eyes?   Certainly mounting readonly should be OK if assembling is
seen as OK.

However I have no intention of automounting anything.

> 
> I'm happy, to see you working on next-generation mdadm. I like to see
> a better integration with udev, and especially, if mdadm detects a
> running udev, not to mess around in /dev in any way, but leave the
> names in /dev to instructions in udev rules. Temporary nodes are fine,
> as long as they don't conflict with anything else, and get removed
> after they are not needed anymore. All updates to symlinks and such
> should be done by "change" events from the kernel, which instructs
> udev to update all the links, and not by touching anything in /dev
> from mdadm.

Better late than never :-)

I asked it in another email, but for completeness:
  What is the best way to detect if udev is running?  And will it
  work the same in all distros (having discovered the /lib/udev vs
  /etc/udev difference, I'm a little worried).

> 
> Do you think mdadm will stay a program only, called by udev/the user,
> or will a port of its functionality live in a daemon?

What are you thinking here?

  mdadm --monitor runs as a daemon, emails interesting events, and
  sometimes moves spares between arrays.

  mdmon (a new program in mdadm-3.0) is a daemon that monitors a
  particular array (or more accurately, a set of related metadata,
  there might be several arrays) and updates the metadata
  accordingly.  This isn't used for v0.90 or v1.x metadata.

  mdadm -D and mdadm -E can be run from udev to help with hot plug
  events. 

These are all daemons or daemon-like functionality.  What else were
you thinking of?  A stand-alone daemon that supports HTTP and allows
arrays to be built and configured from a browser?  No.  I wouldn't
do that. ;-)

Thanks,
NeilBrown

^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: RFC - device names and mdadm with some reference to udev.
  2008-10-27 23:23   ` Neil Brown
@ 2008-10-28  0:03     ` Kay Sievers
  2008-10-28  0:43       ` Neil Brown
                         ` (2 more replies)
  2008-10-29  8:56     ` RFC - device names and mdadm with some reference to udev Gabor Gombas
  2008-10-31 20:49     ` mdp devices on Debian (was: RFC - device names and mdadm with some reference to udev.) martin f krafft
  2 siblings, 3 replies; 51+ messages in thread
From: Kay Sievers @ 2008-10-28  0:03 UTC (permalink / raw)
  To: Neil Brown; +Cc: linux-raid, Doug Ledford, martin f. krafft, Michal Marek

On Tue, Oct 28, 2008 at 00:23, Neil Brown <neilb@suse.de> wrote:
> On Monday October 27, kay.sievers@vrfy.org wrote:
>> On Sun, Oct 26, 2008 at 23:56, Neil Brown <neilb@suse.de> wrote:
>> >  Device naming in mdadm is a bit of a mess.
>> >  2/ There will be various symlinks to these devices.
>> >    a/ if "symlinks=yes" is given in mdadm.conf, symlinks from
>> >         /dev/md/X or /dev/md/dX will be created.
>> >    b/ if udev is configured like on Debian,
>> >              /dev/disk/by-id/md-name-XXXX
>> >        and   /dev/disk/by-id/md-uuid-UUUU
>> >       will be created (by udev).
>>
>> Yes, almost all distros have that.
>
> But in different places.
> Debian has /etc/udev/rules
> openSUSE has /lib/udev/rules
>
> I love standards.  There are so many to choose from. :-)

They are all valid and needed. You should install in
/lib/udev/rules.d/ if the rule is not supposed to be edited by the
user.

All stuff in /lib/udev/rules.d/ is not marked as "config" in the
package and will be overwritten with a udev update, regardless if the
content has been edited or not. We moved the "default" rules there
because people edited the files in /etc and wondered why stuff broke
in weird ways on updates. /etc/udev/rules.d/ is for "user rules" or
on-the-fly created system specific ones, like persistent net names and
cdrom rules. In an ideal setup you would be able to do rm -rf
/etc/udev/rules.d/*, reboot, and start device configuration from
scratch.
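
To see how the two directories fit together on a given system, something
like the sketch below can list which rules files would be in effect. Two
things here are assumptions not stated in this mail: that a file in
/etc/udev/rules.d/ shadows a same-named file in /lib/udev/rules.d/, and
that files are then considered in lexical order of their names. The
function name effective_rules is made up for this example.

```python
import os


def effective_rules(dirs=("/lib/udev/rules.d", "/etc/udev/rules.d")):
    """List rules files, letting later directories shadow earlier ones.

    Assumption: same-named files in a later directory replace those from
    an earlier one, and the result is ordered by file name.
    """
    chosen = {}
    for directory in dirs:
        try:
            names = os.listdir(directory)
        except OSError:
            continue  # a missing directory simply contributes nothing
        for name in names:
            if name.endswith(".rules"):
                chosen[name] = os.path.join(directory, name)
    return [chosen[name] for name in sorted(chosen)]
```

With the packaged defaults in /lib, an empty /etc/udev/rules.d/ leaves
only the packaged files in the list, which matches the "rm -rf and start
from scratch" idea.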

Debian hasn't caught up over the last few months; they use an older
version of udev, and have always had their very own idea of rules, which
didn't match the udev default.

> Is there anywhere else I should get 'make install' to check?

No, just put any udev rule in /lib/udev/rules.d/.

> Debian doesn't get 'mdp' devices right.
> openSUSE is already ready for 2.6.28 in which all md devices can be
> partitioned!
>
>>
>> >    I'm contemplating creating a link based on the metadata type with
>> >    a sequential number. e.g. /dev/md/ddf1 or /dev/md/imsm2.
>> >    I'm not sure if there should be in /dev/md/ or directly in /dev/.
>> >    I'm also not sure if I should leave the creation to udev, and
>> >    whether I should use a small sequential number, or just whatever
>> >    number was allocated as the minor number of the device.
>>
>> There is intentionally no support for enumeration in udev, it will
>> just not work and such numbers/links are not reproducible in hotplug
>> environments, and therefore totally useless, and do much more harm
>> than good.
>
> I'm not entirely following your logic here.
> The 'a' 'b' 'c' at the end of e.g. /dev/sda are not reproducible in
> hotplug environments, but they are not totally useless.  I know they
> will remain stable as long as the device is present, so once I found
> out which /dev/sdX is my USB thingo I just plugged in, I can
> repeatedly use that nice simple name to cfdisk, mkfs, mount, whatever
> the device.

Oh, so you need "your enumeration" only to be valid during the
existence of your device? That sounds fine, sure. I had read that as
you thinking about giving devices names which are meaningful across
reboots.

>> Nothing must ever depend on enumeration, or minor numbers, if these
>> properties cannot be made persistent, attached to the device itself, so
>> that it will always show up with the same number forever. Better do
>> not even start such an idea, and leave the kernel name as the primary
>> "random" number, instead of creating new randomness on top.
>
> I'm not entirely convinced.  However I can see a real difficulty in
> introducing a new sequence number.  Thus if a 'ddf' array gets created
> as /dev/md15, then if I want to create a name containing the string
> 'ddf', it should be 'ddf15'.  Maybe /dev/ddf15.  Maybe /dev/md/ddf15.

Seems like a misunderstanding. If you need these names only during the
uptime of the device and will not need to remember that name at the
next boot, it's fine, sure.

>> >  4/ When we stop an array, mdadm will remove anything from /dev that
>> >    it probably created.
>>
>> Sure, but only if mdadm has created it.
>
> Why would it matter?

Because you are not supposed to remove stuff udev has created. It will
likely create dangling symlinks at least. Also udev maintains a stack
of symlink names, for devices which claim the same symlink name, like
it happens for label and uuid links. If the device goes away, the
device with the next highest priority gets its symlink restored.
Messing around in udev-managed device files will just ask for
trouble.
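
To make the symlink stack concrete, here is a minimal sketch in C. It is
not udev's actual code: struct claim, struct link_claims, claim_add,
claim_remove, claim_owner and the fixed table size are all names and
choices assumed for illustration. Each symlink name keeps a list of the
devices claiming it, and the link points at the claimant with the
highest priority.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define MAX_CLAIMS 8   /* arbitrary limit for this sketch */

/* One device that wants a given symlink name (hypothetical type). */
struct claim {
    const char *devnode;   /* e.g. "/dev/md0" */
    int priority;          /* higher value wins */
};

/* All current claimants for a single symlink name. */
struct link_claims {
    struct claim claims[MAX_CLAIMS];
    size_t count;
};

/* Register a claimant; returns 0 on success, -1 if the table is full. */
int claim_add(struct link_claims *lc, const char *devnode, int priority)
{
    assert(lc != NULL && devnode != NULL);
    if (lc->count == MAX_CLAIMS)
        return -1;
    lc->claims[lc->count].devnode = devnode;
    lc->claims[lc->count].priority = priority;
    lc->count++;
    return 0;
}

/* Drop a claimant when its device goes away; 0 if found, -1 otherwise. */
int claim_remove(struct link_claims *lc, const char *devnode)
{
    size_t i;

    assert(lc != NULL && devnode != NULL);
    for (i = 0; i < lc->count; i++) {
        if (strcmp(lc->claims[i].devnode, devnode) == 0) {
            /* Fill the hole with the last entry; order is not significant. */
            lc->claims[i] = lc->claims[lc->count - 1];
            lc->count--;
            return 0;
        }
    }
    return -1;
}

/* The device the symlink should point to now, or NULL if nobody claims it. */
const char *claim_owner(const struct link_claims *lc)
{
    const struct claim *best = NULL;
    size_t i;

    assert(lc != NULL);
    for (i = 0; i < lc->count; i++) {
        if (best == NULL || lc->claims[i].priority > best->priority)
            best = &lc->claims[i];
    }
    return best ? best->devnode : NULL;
}
```

When the current owner is removed, claim_owner() returns the claimant
with the next highest priority, which is the "restore" behaviour
described above. Deleting a link by hand would bypass bookkeeping of
this kind.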

> Hmmm... Does udev ever delete things from /dev?

Sure, try with your USB stick, or any other device.

> I notice that 'md'
> devices don't seem to disappear.  Maybe that is because /sys/block/mdX
> never disappears (last time I tried it was too racy).

It stays because the md kernel device lifetime rules are kind of
broken regarding hotplug setups. It is a similar issue to why md needs
all the static nodes in /dev to create a device.

> Would there be any way to get udev to delete devices when
>  /sys/block/mdX/md/array_state
> becomes 'clear' (presumably on a CHANGE event) ??

What would be the reason to leave the kernel block device around?
Can't you just remove it like any other subsystem in the kernel does?
That would just remove the node and all links, and update userspace to
reflect the change.

There is currently no "change" event that could tell udev to remove a
device node in /dev while we still have a kernel device around. And
you would need to convince me that this is really needed, and why md
is so special here. :)

>> >    In particular, it will remove the device node as described in 1,
>> >    any partitions, and any symlinks in /dev or /dev/md which point to
>> >    any of those.  I need to be certain that this won't confuse udev.
>>
>> You must never touch anything that udev has created. It must be driven
>> by kernel "add/remove/change" events.
>
> Again - why?  I notice that if I do remove the device nodes when the
> array is stopped, they are still nicely recreated when I restart the
> array.
> However I seem to have decided to make a clear distinction between
> when udev is running or not, so I'll not remove anything if I think
> udev is running.

As said, I think the block device in the kernel should go, if md wants
to integrate without special casing in the usual hotplug setup. /dev
is just a mirror of the kernel devices; it is not there to hide stuff
that exists in the kernel from users. :)

>> > I'm also wondering if I should include a udev 'rules' file for md in
>> > the mdadm distribution.  Obviously it would be no more than a
>> > recommendation, but it might give me a voice in guiding how udev
>> > interacted with mdadm.
>>
>> Definitely, it should carry a udev rules file which instructs udev to
>> create all intended symlinks and also supports the raid auto-assembly
>> setup. It should not mount anything by default though.
>
> But why is 'mounting' so much different to 'assembling' in people's
> eyes?   Certainly mounting readonly should be OK if assembling is
> seen as OK.
>
> However I have no intention of automounting anything.

I guess mounting makes stuff visible and makes data vulnerable, and is
definitely more of a "policy decision" at the userspace level than
array assembly is. The question of where to mount is, on its own, not
easy to answer, and it is definitely a more difficult policy than
creating a simple block device.

>> I'm happy, to see you working on next-generation mdadm. I like to see
>> a better integration with udev, and especially, if mdadm detects a
>> running udev, not to mess around in /dev in any way, but leave the
>> names in /dev to instructions in udev rules. Temporary nodes are fine,
>> as long as they don't conflict with anything else, and get removed
>> after they are not needed anymore. All updates to symlinks and such
>> should be done by "change" events from the kernel, which instructs
>> udev to update all the links, and not by touching anything in /dev
>> from mdadm.
>
> Better late than never :-)
>
> I asked it in another email, but for completeness:
>  What is the best way to detect if udev is running?  And will it
>  work the same in all distros (having discovered the /lib/udev vs
>  /etc/udev difference, I'm a little worried).

All recent udev versions support /lib/udev/rules.d/ and
/etc/udev/rules.d/, and stuff that is not meant to be tweaked by the
user should not be in /etc. I wouldn't worry too much about it in your
source package. Packagers with special requirements will take care of
non-default setups in their package.

>> Do you think mdadm will stay a program only, called by udev/the user,
>> or will a port of its functionality live in a daemon?
>
> What are you thinking here?
>
>  mdadm --monitor runs as a daemon, emails interesting events, and
>  sometimes moves spares between arrays.
>
>  mdmon (a new program in mdadm-3.0) is a daemon that monitors a
>  particular array (or more accurately, a set of related metadata,
>  there might be several arrays) and updates the metadata
>  accordingly.  This isn't used for v0.90 or v1.x metadata.
>
>  mdadm -D and mdadm -E can be run from udev to help with hot plug
>  events.
>
> These are all daemons or daemon-like functionality.  What else were
> you thinking of?  A stand-alone daemon that supports HTTP and allows
> arrays to be built and configured from a browser?  No.  I wouldn't
> do that. ;-)

I was just checking possibilities for mdadm to watch for events from
udev, which wouldn't work with an invoked program, because it would
need to listen on a socket permanently. No special idea or plan here,
just checking what you have in mind.

Kay

* Re: RFC - device names and mdadm with some reference to udev.
  2008-10-28  0:03     ` Kay Sievers
@ 2008-10-28  0:43       ` Neil Brown
  2008-10-28  1:16         ` Kay Sievers
  2008-10-28  1:44       ` Neil Brown
  2008-10-31 20:54       ` Debian and udev (was: RFC - device names and mdadm with some reference to udev.) martin f krafft
  2 siblings, 1 reply; 51+ messages in thread
From: Neil Brown @ 2008-10-28  0:43 UTC (permalink / raw)
  To: Kay Sievers; +Cc: linux-raid, Doug Ledford, martin f. krafft, Michal Marek

On Tuesday October 28, kay.sievers@vrfy.org wrote:
> > But in different places.
> > Debian has /etc/udev/rules
> > openSUSE has /lib/udev/rules
> >
> > I love standards.  There are so many to choose from. :-)
> 
> They are all valid and needed. You should install in
> /lib/udev/rules.d/ if the rule is not supposed to be edited by the
> user.

Clearer now.  Thanks.

> > I notice that 'md'
> > devices don't seem to disappear.  Maybe that is because /sys/block/mdX
> > never disappears (last time I tried it was too racy).
> 
> It stays because the md kernel device lifetime rules are kind of
> broken regarding hotplug setups. Similar issue why md needs all the
> static nodes in /dev too to create a device.
> 
> > Would there be any way to get udev to delete devices when
> >  /sys/block/mdX/md/array_state
> > becomes 'clear' (presumably on a CHANGE event) ??
> 
> What would be the reason to leave the kernel block device around?
> Can't you just remove it like any other subsystem in the kernel does?
> That would just remove the node, all links and update userspace to
> reflect the change.

I tried some time ago.  It was hard.

md devices magically appear when you try to open the device-special
file.  I need some sort of locking to prevent that creation while I'm
destroying the old device.  But when I was trying this (quite some
months ago) the locking around do_open was fairly difficult to
follow.  I don't remember the exact issues, but I gave up.

What would happen was that when the md device disappeared, udev would
try to open it (I think) and make it reappear again.  Sometimes with
an oops.  I think I avoided some of that by sending the DELETE event
well before the device was actually deleted ... or something.  But it
was still far from perfect.

Maybe I should try again.

> 
> There is currently no "change" event that could tell to remove a
> device node in /dev while we still have a kernel device around. And
> you would need to convince me that this is really needed, and why md
> is so special here. :)

md is a bit 'special', but not quite unique.  I think 'loop' now works
the same way as md in terms of devices magically appearing on open.
Maybe I can see how it was made to work for that case.

Thanks Kay,

NeilBrown

* Re: RFC - device names and mdadm with some reference to udev.
  2008-10-28  0:43       ` Neil Brown
@ 2008-10-28  1:16         ` Kay Sievers
  0 siblings, 0 replies; 51+ messages in thread
From: Kay Sievers @ 2008-10-28  1:16 UTC (permalink / raw)
  To: Neil Brown; +Cc: linux-raid, Doug Ledford, martin f. krafft, Michal Marek

On Tue, Oct 28, 2008 at 01:43, Neil Brown <neilb@suse.de> wrote:
> On Tuesday October 28, kay.sievers@vrfy.org wrote:
>> > But in different places.
>> > Debian has /etc/udev/rules
>> > openSUSE has /lib/udev/rules
>> >
>> > I love standards.  There are so many to choose from. :-)
>>
>> They are all valid and needed. You should install in
>> /lib/udev/rules.d/ if the rule is not supposed to be edited by the
>> user.
>
> Clearer now.  Thanks.
>
>> > I notice that 'md'
>> > devices don't seem to disappear.  Maybe that is because /sys/block/mdX
>> > never disappears (last time I tried it was too racy).
>>
>> It stays because the md kernel device lifetime rules are kind of
>> broken regarding hotplug setups. Similar issue why md needs all the
>> static nodes in /dev too to create a device.
>>
>> > Would there be any way to get udev to delete devices when
>> >  /sys/block/mdX/md/array_state
>> > becomes 'clear' (presumably on a CHANGE event) ??
>>
>> What would be the reason to leave the kernel block device around?
>> Can't you just remove it like any other subsystem in the kernel does?
>> That would just remove the node, all links and update userspace to
>> reflect the change.
>
> I tried some time ago.  It was hard.
>
> md devices magically appear when you try to open the device-special
> file.  I need some sort of locking to prevent that creation while I'm
> destroying the old device.  But when I was trying this (quite some
> months ago) the locking around do_open was fairly difficult to
> follow.  I don't remember the exact issues, but I gave up.
>
> What would happen was that when the md device disappeared, udev would
> try to open it (I think) and make it reappear again.  Sometimes with
> an oops.  I think I avoided some of that by sending the DELETE event
> well before the device was actually deleted ... or something.  But it
> was still far from perfect.

Hmm, that would be a bug, if udev looks at a volume at "remove".

> Maybe I should try again.

I would love to see /sys reflecting the actual state and not carrying
all the "dead" devices. Also, some logic other than open() to
instantiate a new device would be nice. Other subsystems use control
device nodes to request new devices; sysfs might work too, if that's
the preferred method.

Create-on-open is just a chicken/egg problem that is not easy to solve
today, and it is not really supported. Can't we have mdadm create new
devices through a control device instead of relying on the open() logic
of a pre-existing device node?

>> There is currently no "change" event that could tell to remove a
>> device node in /dev while we still have a kernel device around. And
>> you would need to convince me that this is really needed, and why md
>> is so special here. :)
>
> md is a bit 'special', but not quite unique.  I think 'loop' now works
> the same way as md in terms of devices magically appearing on open.

Yeah, that's fine for some special use cases, they also have "destruct
on last close" now which sounds useful for some setups. But as
mentioned the whole create-at-open does not really integrate with a
dynamic /dev.

> Maybe I can see how it was made to work for that case.

That would be great.

Thanks,
Kay

* Re: RFC - device names and mdadm with some reference to udev.
  2008-10-28  0:03     ` Kay Sievers
  2008-10-28  0:43       ` Neil Brown
@ 2008-10-28  1:44       ` Neil Brown
  2008-10-28  1:52         ` Kay Sievers
  2008-10-31 20:54       ` Debian and udev (was: RFC - device names and mdadm with some reference to udev.) martin f krafft
  2 siblings, 1 reply; 51+ messages in thread
From: Neil Brown @ 2008-10-28  1:44 UTC (permalink / raw)
  To: Kay Sievers; +Cc: linux-raid, Doug Ledford, martin f. krafft, Michal Marek

On Tuesday October 28, kay.sievers@vrfy.org wrote:
> On Tue, Oct 28, 2008 at 00:23, Neil Brown <neilb@suse.de> wrote:
> > I notice that 'md'
> > devices don't seem to disappear.  Maybe that is because /sys/block/mdX
> > never disappears (last time I tried it was too racy).
> 
> It stays because the md kernel device lifetime rules are kind of
> broken regarding hotplug setups. Similar issue why md needs all the
> static nodes in /dev too to create a device.
> 
> > Would there be any way to get udev to delete devices when
> >  /sys/block/mdX/md/array_state
> > becomes 'clear' (presumably on a CHANGE event) ??
> 
> What would be the reason to leave the kernel block device around?
> Can't you just remove it like any other subsystem in the kernel does?
> That would just remove the node, all links and update userspace to
> reflect the change.
> 
> There is currently no "change" event that could tell to remove a
> device node in /dev while we still have a kernel device around. And
> you would need to convince me that this is really needed, and why md
> is so special here. :)

I've just done a bit of experimentation...

If I create an array /dev/md0, we get a symlink
  /dev/disk/by-id/md-uuid-XXXXX -> ../../md0
I stop the array and the symlink stays there.
I create a new array as /dev/md0 (hence new uuid)
and I get a new symlink
  /dev/disk/by-id/md-uuid-YYYY -> ../../md0
but also, the first symlink goes away.

So somehow that first symlink is being removed even though the device
isn't being stopped.
I guess I need a 
   kobject_uevent(...., KOBJ_CHANGE)
when the array is stopped.... [tries that].

Better.  But it looks like I need to get rid of the partitions too...

Yes.  Putting

		bdev = bdget_disk(mddev->gendisk, 0);
		if (bdev) {
			blkdev_ioctl(bdev, 0, BLKRRPART, 0);
			bdput(bdev);
		}
		kobject_uevent(&disk_to_dev(mddev->gendisk)->kobj, KOBJ_CHANGE);

at a suitable place in do_md_stop causes all the symlinks created by
udev to disappear when I stop the array; only /dev/mdX remains.
That should do for 2.6.28.  Something better maybe for .29.

NeilBrown

* Re: RFC - device names and mdadm with some reference to udev.
  2008-10-28  1:44       ` Neil Brown
@ 2008-10-28  1:52         ` Kay Sievers
  2008-10-28  1:54           ` Kay Sievers
  0 siblings, 1 reply; 51+ messages in thread
From: Kay Sievers @ 2008-10-28  1:52 UTC (permalink / raw)
  To: Neil Brown; +Cc: linux-raid, Doug Ledford, martin f. krafft, Michal Marek

On Tue, Oct 28, 2008 at 02:44, Neil Brown <neilb@suse.de> wrote:
> On Tuesday October 28, kay.sievers@vrfy.org wrote:
>> On Tue, Oct 28, 2008 at 00:23, Neil Brown <neilb@suse.de> wrote:
>> > I notice that 'md'
>> > devices don't seem to disappear.  Maybe that is because /sys/block/mdX
>> > never disappears (last time I tried it was too racy).
>>
>> It stays because the md kernel device lifetime rules are kind of
>> broken regarding hotplug setups. Similar issue why md needs all the
>> static nodes in /dev too to create a device.
>>
>> > Would there be any way to get udev to delete devices when
>> >  /sys/block/mdX/md/array_state
>> > becomes 'clear' (presumably on a CHANGE event) ??
>>
>> What would be the reason to leave the kernel block device around?
>> Can't you just remove it like any other subsystem in the kernel does?
>> That would just remove the node, all links and update userspace to
>> reflect the change.
>>
>> There is currently no "change" event that could tell to remove a
>> device node in /dev while we still have a kernel device around. And
>> you would need to convince me that this is really needed, and why md
>> is so special here. :)
>
> I've just done a bit of experimentation...
>
> If I create an array /dev/md0, we get a symlink
>  /dev/disk/by-id/md-uuid-XXXXX -> ../../md0
> I stop the array and the symlink stays there.
> I create a new array as /dev/md0 (hence new uuid)
> and I get a new symlink
>  /dev/disk/by-id/md-uuid-YYYY -> ../../md0
> but also, the first symlink goes away.
>
> So somehow that first symlink is being removed even though the device
> isn't being stopped.

I guess mdadm --export didn't give the old name again when the
"change" event was sent. If udev gets an "add" or "change" event for
an existing device, it will look up the links currently belonging to
that device in its database. It computes the new links, deletes all
the links that are no longer valid, keeps the still valid ones, and
creates the new ones. On "remove" it will remove all links and possibly
restore stuff this device has overwritten.
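
The update step can be sketched roughly as follows, in C. This is an
illustration only, not udev's implementation: struct link_plan,
plan_link_update, list_contains and MAX_LINKS are hypothetical names,
and link names are treated as plain strings compared with strcmp().

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define MAX_LINKS 16   /* arbitrary limit for this sketch */

/* Result of comparing the links stored in the database with the links
 * computed for the current event (hypothetical type). */
struct link_plan {
    const char *remove[MAX_LINKS]; size_t n_remove;  /* no longer valid */
    const char *keep[MAX_LINKS];   size_t n_keep;    /* still valid     */
    const char *create[MAX_LINKS]; size_t n_create;  /* newly valid     */
};

/* Return 1 if name occurs in list[0..n-1], else 0. */
int list_contains(const char *const *list, size_t n, const char *name)
{
    size_t i;

    for (i = 0; i < n; i++)
        if (strcmp(list[i], name) == 0)
            return 1;
    return 0;
}

/* Sort old and new link names into remove/keep/create sets. */
void plan_link_update(const char *const *old_links, size_t n_old,
                      const char *const *new_links, size_t n_new,
                      struct link_plan *plan)
{
    size_t i;

    assert(plan != NULL && n_old <= MAX_LINKS && n_new <= MAX_LINKS);
    plan->n_remove = plan->n_keep = plan->n_create = 0;

    /* Old links either survive (keep) or have become invalid (remove). */
    for (i = 0; i < n_old; i++) {
        if (list_contains(new_links, n_new, old_links[i]))
            plan->keep[plan->n_keep++] = old_links[i];
        else
            plan->remove[plan->n_remove++] = old_links[i];
    }

    /* New links that were not present before must be created. */
    for (i = 0; i < n_new; i++) {
        if (!list_contains(old_links, n_old, new_links[i]))
            plan->create[plan->n_create++] = new_links[i];
    }
}
```

With an empty new list every old link lands in the remove set, which
corresponds to the case where nothing is exported for the device any
more.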

> I guess I need a
>   kobject_uevent(...., KOBJ_CHANGE)
> when the array is stopped.... [tries that].

If mdadm does not export that stuff anymore, the links will be removed that way, yes.

> Better.  But it looks like I need to get rid of the partitions too...
>
> Yes.  Putting
>
>                bdev = bdget_disk(mddev->gendisk, 0);
>                if (bdev) {
>                        blkdev_ioctl(bdev, 0, BLKRRPART, 0);
>                        bdput(bdev);
>                }
>                kobject_uevent(&disk_to_dev(mddev->gendisk)->kobj, KOBJ_CHANGE);
>
> at a suitable place in do_md_stop causes all the symlinks created by
> udev to disappear when I stop the array, only the /dev/mdX remains.
> That should do for 2.6.28.  Something better maybe for .29.

Yes, BLKRRPART does already send change events for the disk and all partitions.

Kay

* Re: RFC - device names and mdadm with some reference to udev.
  2008-10-28  1:52         ` Kay Sievers
@ 2008-10-28  1:54           ` Kay Sievers
  0 siblings, 0 replies; 51+ messages in thread
From: Kay Sievers @ 2008-10-28  1:54 UTC (permalink / raw)
  To: Neil Brown; +Cc: linux-raid, Doug Ledford, martin f. krafft, Michal Marek

On Tue, Oct 28, 2008 at 02:52, Kay Sievers <kay.sievers@vrfy.org> wrote:
> On Tue, Oct 28, 2008 at 02:44, Neil Brown <neilb@suse.de> wrote:
>> On Tuesday October 28, kay.sievers@vrfy.org wrote:
>>> On Tue, Oct 28, 2008 at 00:23, Neil Brown <neilb@suse.de> wrote:
>>> > I notice that 'md'
>>> > devices don't seem to disappear.  Maybe that is because /sys/block/mdX
>>> > never disappears (last time I tried it was too racy).
>>>
>>> It stays because the md kernel device lifetime rules are kind of
>>> broken regarding hotplug setups. Similar issue why md needs all the
>>> static nodes in /dev too to create a device.
>>>
>>> > Would there be any way to get udev to delete devices when
>>> >  /sys/block/mdX/md/array_state
>>> > becomes 'clear' (presumably on a CHANGE event) ??
>>>
>>> What would be the reason to leave the kernel block device around?
>>> Can't you just remove it like any other subsystem in the kernel does?
>>> That would just remove the node, all links and update userspace to
>>> reflect the change.
>>>
>>> There is currently no "change" event that could tell to remove a
>>> device node in /dev while we still have a kernel device around. And
>>> you would need to convince me that this is really needed, and why md
>>> is so special here. :)
>>
>> I've just done a bit of experimentation...
>>
>> If I create an array /dev/md0, we get a symlink
>>  /dev/disk/by-id/md-uuid-XXXXX -> ../../md0
>> I stop the array and the symlink stays there.
>> I create a new array as /dev/md0 (hence new uuid)
>> and I get a new symlink
>>  /dev/disk/by-id/md-uuid-YYYY -> ../../md0
>> but also, the first symlink goes away.
>>
>> So somehow that first symlink is being removed even though the device
>> isn't being stopped.
>
> I guess mdadm --export didn't give the old name again when the
> "change" event was sent. If udev gets an "add" or "change" event for
> an existing device, it will look up the links currently belonging to
> that device in its database. It computes the new links, deletes all
> the no longer valid links, keeps the still valid ones, and creates the
> new ones. On "remove" it will remove all links and possibly restore
> stuff this device has overwritten.
>
>> I guess I need a
>>   kobject_uevent(...., KOBJ_CHANGE)
>> when the array is stopped.... [tries that].
>
> If mdadm will not export stuff anymore, the links will be removed that way, yes.
>
>> Better.  But it looks like I need to get rid of the partitions too...
>>
>> Yes.  Putting
>>
>>                bdev = bdget_disk(mddev->gendisk, 0);
>>                if (bdev) {
>>                        blkdev_ioctl(bdev, 0, BLKRRPART, 0);
>>                        bdput(bdev);
>>                }
>>                kobject_uevent(&disk_to_dev(mddev->gendisk)->kobj, KOBJ_CHANGE);
>>
>> at a suitable place in do_md_stop causes all the symlinks created by
>> udev to disappear when I stop the array, only the /dev/mdX remains.
>> That should do for 2.6.28.  Something better maybe for .29.
>
> Yes, BLKRRPART does already send change events for the disk and all partitions.

Just in case you don't already do that, you can run "udevadm monitor
--udev --env" to see the events udev has processed, along with their
properties. DEVLINKS will tell you the currently valid links that will
be created, or have been removed.
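
If you capture that output in a program, the DEVLINKS value is a
space-separated list of paths. A small sketch in C of splitting it
follows; split_devlinks is a hypothetical helper, not part of udev or
mdadm, and it modifies the buffer it is given in place.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Split a DEVLINKS value such as
 *   "/dev/disk/by-id/md-uuid-YYYY /dev/md/home"
 * into individual paths.  The separators in value are overwritten with
 * NUL bytes and links[] is filled with pointers into value.
 * Returns the number of paths stored (at most max_links). */
size_t split_devlinks(char *value, char *links[], size_t max_links)
{
    size_t n = 0;
    char *p = value;

    assert(value != NULL && links != NULL);
    while (n < max_links) {
        p += strspn(p, " ");      /* skip separators */
        if (*p == '\0')
            break;
        links[n++] = p;           /* start of one link path */
        p += strcspn(p, " ");     /* advance to the end of the path */
        if (*p == '\0')
            break;
        *p++ = '\0';              /* terminate this path */
    }
    return n;
}
```

Comparing the list from one event with the list from the next shows
which links were added or dropped for the device.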

Kay

* Debian and udev (was: RFC - device names and mdadm with some reference to udev.)
  2008-10-28  0:03     ` Kay Sievers
  2008-10-28  0:43       ` Neil Brown
  2008-10-28  1:44       ` Neil Brown
@ 2008-10-31 20:54       ` martin f krafft
  2008-10-31 23:08         ` Bernd Schubert
  2 siblings, 1 reply; 51+ messages in thread
From: martin f krafft @ 2008-10-31 20:54 UTC (permalink / raw)
  To: linux-raid

also sprach Kay Sievers <kay.sievers@vrfy.org> [2008.10.28.0103 +0100]:
> All stuff in /lib/udev/rules.d/ is not marked as "config" in the
> package and will be overwritten with a udev update, regardless if the
> content has been edited or not. We moved the "default" rules there
> because people edited the files in /etc and wondered why stuff broke
> in weird ways on updates. /etc/udev/rules.d/ is for "user rules" or
> on-the-fly created system specific ones, like persistent net names and
> cdrom rules. In an ideal setup you would be able to do rm -rf
> /etc/udev/rules.d/*, reboot, and start device configuration from
> scratch.
> 
> Debian didn't catch up the last months, they use an older version of
> udev, and have always had their very own idea of rules, that didn't
> match the udev default.

Debian is nearing a release; we have other things to worry about.

But to clarify Kay's statement:
yes, we cannot follow the udev default if we accept that users might
want to edit udev rules, even if they risk breaking stuff. We very
specifically discourage the administrator from writing anywhere but
/usr/local and /etc, for good reasons. Thus, the rules have to go to
/etc/udev.

-- 
 .''`.   martin f. krafft <madduck@debian.org>
: :'  :  proud Debian developer, author, administrator, and user
`. `'`   http://people.debian.org/~madduck - http://debiansystem.info
  `-  Debian - when you have better things to do than fixing systems
 
/.ing an issue is like asking an infinite number of monkeys for advice
                                                   -- in #debian-devel

* Re: Debian and udev (was: RFC - device names and mdadm with some reference to udev.)
  2008-10-31 20:54       ` Debian and udev (was: RFC - device names and mdadm with some reference to udev.) martin f krafft
@ 2008-10-31 23:08         ` Bernd Schubert
  0 siblings, 0 replies; 51+ messages in thread
From: Bernd Schubert @ 2008-10-31 23:08 UTC (permalink / raw)
  To: linux-raid

On Fri, Oct 31, 2008 at 09:54:34PM +0100, martin f krafft wrote:
> also sprach Kay Sievers <kay.sievers@vrfy.org> [2008.10.28.0103 +0100]:
> > All stuff in /lib/udev/rules.d/ is not marked as "config" in the
> > package and will be overwritten with a udev update, regardless if the
> > content has been edited or not. We moved the "default" rules there
> > because people edited the files in /etc and wondered why stuff broke
> > in weird ways on updates. /etc/udev/rules.d/ is for "user rules" or
> > on-the-fly created system specific ones, like persistent net names and
> > cdrom rules. In an ideal setup you would be able to do rm -rf
> > /etc/udev/rules.d/*, reboot, and start device configuration from
> > scratch.
> > 
> > Debian didn't catch up the last months, they use an older version of
> > udev, and have always had their very own idea of rules, that didn't
> > match the udev default.
> 
> Debian is nearing a release, we have other things to worry about.
> 
> But to clarify Kay's statement:
> yes, we cannot follow the udev default if we accept that users might
> want to edit udev rules, even if they risk breaking stuff. We very
> specifically discourage the administrator to write to anywhere but
> /usr/local and /etc for good reasons. Thus, the rules have to go to
> /etc/udev.

Rules written by the administrator have to go to /etc/udev, but system
default rules certainly do not. And IMHO it is plain wrong when
Debian maintainers think they know better than the rest of the world.
(And yes, I'm a Debian user myself and also a package maintainer).

Cheers,
Bernd

* Re: RFC - device names and mdadm with some reference to udev.
  2008-10-27 23:23   ` Neil Brown
  2008-10-28  0:03     ` Kay Sievers
@ 2008-10-29  8:56     ` Gabor Gombas
  2008-10-31 20:49     ` mdp devices on Debian (was: RFC - device names and mdadm with some reference to udev.) martin f krafft
  2 siblings, 0 replies; 51+ messages in thread
From: Gabor Gombas @ 2008-10-29  8:56 UTC (permalink / raw)
  To: Neil Brown
  Cc: Kay Sievers, linux-raid, Doug Ledford, martin f. krafft,
	Michal Marek

On Tue, Oct 28, 2008 at 10:23:38AM +1100, Neil Brown wrote:

> I'm not entirely convinced.  However I can see a real difficulty in
> introducing a new sequence number.  Thus if a 'ddf' array gets created
> as /dev/md15, then if I want to create a name containing the string
> 'ddf', it should be 'ddf15'.  Maybe /dev/ddf15.  Maybe /dev/md/ddf15.

Or, if you want to follow the logic of /dev/disk/by-*, something like
/dev/md/by-container/ddf-md15 (or ddf-<UUID>).

Gabor

-- 
     ---------------------------------------------------------
     MTA SZTAKI Computer and Automation Research Institute
                Hungarian Academy of Sciences
     ---------------------------------------------------------

* mdp devices on Debian (was: RFC - device names and mdadm with some reference to udev.)
  2008-10-27 23:23   ` Neil Brown
  2008-10-28  0:03     ` Kay Sievers
  2008-10-29  8:56     ` RFC - device names and mdadm with some reference to udev Gabor Gombas
@ 2008-10-31 20:49     ` martin f krafft
  2 siblings, 0 replies; 51+ messages in thread
From: martin f krafft @ 2008-10-31 20:49 UTC (permalink / raw)
  To: Neil Brown; +Cc: linux-raid

also sprach Neil Brown <neilb@suse.de> [2008.10.28.0023 +0100]:
> Debian doesn't get 'mdp' devices right.

Correction: it does do everything it needs to, it can even boot from
them, but our installer won't let you create them. I wanted to add
that functionality at some point and then learnt that plain md
devices would eventually become partitionable, so I dropped it.

-- 
 .''`.   martin f. krafft <madduck@debian.org>
: :'  :  proud Debian developer, author, administrator, and user
`. `'`   http://people.debian.org/~madduck - http://debiansystem.info
  `-  Debian - when you have better things to do than fixing systems
 
windows 2000: designed for the internet.
the internet: designed for unix.

* Re: RFC - device names and mdadm with some reference to udev.
  2008-10-26 22:56 RFC - device names and mdadm with some reference to udev Neil Brown
  2008-10-27  8:22 ` martin f krafft
  2008-10-27 12:41 ` Kay Sievers
@ 2008-10-30 17:18 ` Doug Ledford
  2008-10-31  9:45   ` Neil Brown
  2008-11-02 13:47   ` Luca Berra
  2 siblings, 2 replies; 51+ messages in thread
From: Doug Ledford @ 2008-10-30 17:18 UTC (permalink / raw)
  To: Neil Brown; +Cc: linux-raid, martin f. krafft, Michal Marek, Kay Sievers

On Mon, 2008-10-27 at 09:56 +1100, Neil Brown wrote:
> Greeting.
>  This is a Request For Comments....

OK, I've taken my time responding to this at least partially because I
wanted to get you my current changes first.

>  Device naming in mdadm is a bit of a mess.
>  We have partitioned devices (mdp) and non-partitioned (md)
>  We have names in /dev/md/ (/dev/md/d0) and directly in /dev
>     (/dev/md_d0).
>  We have support for user-friendly names (/dev/md/home) and for
>     "kernel-internal" names (/dev/md0).
> 
>  All this can produce extra confusion when udev is brought into the
>  picture.  And it can leave lots of litter lying around in /dev if we
>  aren't careful (which we aren't).
> 
>  I hope to release mdadm-3.0 this year, and maybe that gives me a
>  chance to get it "right".  I don't want to break backwards
>  compatibility in a big way, but I think I am happy to introduce
>  little changes if it means a more consistent model.
> 
>  In 2.6.28, partitioned devices (mdp) won't be needed any more as md
>  will make use of the "extended partition" functionality recently
>  added.  All md devices can be partitioned.  The device number for the
>  partitions will be very different to that of the whole device, but
>  udev should hide all of that.  So we don't have to worry too much
>  about mdp devices.

Back compatibility, and the ability to use current mdadm on older
kernels may mean that we need to deal with mdp devices regardless.

>  So I think the following is how I want things to work.  I am very
>  open to comments and suggestions.  Particularly I want to know what
>  (if anything) this will break.
> 
>  1/ The only device nodes created will be /dev/mdX and /dev/md_dX
>     along with partitions /dev/mdXpY and /dev/md_dXpY as appropriate.
>     These will be created by mdadm in accordance with the "--auto"
>     flag unless something in mdadm.conf says to leave it to udev.
>     In that case, mdadm will create a temporary node
>     (/dev/.mdadm.whatever) and remove it once udev has created the
>     real thing.

One thing I noticed in my work on the incremental stuff, is that the
user friendly device naming method still wants to create
these /dev/md_dX{pY} array names.  I'm actually in favor of doing away
with the notion that an array needs to be numbered and exist in a
numbered format in the /dev/ namespace.  If you have a user friendly
name, such as /dev/md/root and /dev/md/boot, or /dev/md/root_p1
and /dev/md/root_p2, I see no need to add additional numbered devices.
Instead, just allow the device number of the named devices to be random.

>  2/ There will be various symlinks to these devices.
>     a/ if "symlinks=yes" is given in mdadm.conf, symlinks from
>          /dev/md/X or /dev/md/dX will be created.
>     b/ if udev is configured like on Debian,
>               /dev/disk/by-id/md-name-XXXX
> 	and   /dev/disk/by-id/md-uuid-UUUU
>        will be created (by udev).
>     c/ If there is a 'name' associated with the array then
>         /dev/md/name will be created as a link.
>     d/ if an explicit device name of /dev/name was given,
>         either on a -A, -B, -C, command or in mdadm.conf,
> 	then the 'name' must match the name of the array,
> 	and /dev/name will be used as well as /dev/md/name.

I think all these symlinks are problematic.  We have a naming
consistency problem, and creating all these links just perpetuates that
problem.  I would be in favor of standardizing the namespace location
and semantics and doing away with all the symlinks.  Do that, and within
one release cycle all the confusion will be gone.

>  3/ For a 'NAME' to be used, with as md-name-NAME or /dev/md/NAME,
>     we need a high degree of confidence that the array was intended
>     for "this" host, or otherwise is not going to conflict with
>     an array that is meant for "this" host.
>     We get this confidence in a number of ways:
>     a/ If the name is listed in /etc/mdadm.conf 
>        e.g.  ARRAY /dev/md/home UUID=XXXX.....
>     b/ If the name was given on the command line
>     b/ If the name is stored in the metadata of an array which is
>        explicitly identified in mdadm.conf or by the command line.
>     c/ If the name is of the form  "host:name" and "host" matches
>        this host.  We then use just "name".
>     d/ If the name is of the form "host:name" and "host" does not
>        match this host, we can still assume that "host:name" is
>        unique and use that.
>     e/ For 0.90 metadata, if the uuid has the host name encoded in it
>        then it was intended for 'this' host.
> 
>     Thus unsafe names are names extracted from the metadata of arrays
>     which are auto-detected, where there is no hint in the metadata
>     that the array is built for 'this' host.
> 
>     If the NAME is not known to be safe, we can still assemble the
>     array, but we use a "random" high minor number, and allow it
>     to be found primarily by the by-id/md-uuid-UUUUU... link or some
>     other link created based on array content: e.g. disk/by-label/
>     Also the array will be assembled "auto-readonly" so no resync etc
>     will happen until the array is actually used.

There's no need to make autostarted arrays that we can't identify as
being intended solely for this host hard to find.  It's a little tricky
if there's no homehost in the array, so let's skip that for a second.
If there *is* a homehost, and we don't list the array in mdadm.conf or
it doesn't match our homehost, then I think the answer is to just start
the array auto-readonly with the name /dev/md/homehost:name.  Since we
are assuming that homehost:name is unique even if it isn't our device,
then that means it's sufficient for naming the device uniquely in
our /dev/md space.  Now, if we don't have a homehost on the array, then
I would do as you suggest and use a random high device number and have
udev create any appropriate links.  Of course, udev would make those
same links on devices with a homehost, so the final difference is just
that you create a homehost:name device when possible, skip it when not.
All the rest is the same.
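
A minimal sketch of the naming rule described above, written in Python for illustration only (it is not mdadm code). The function name and its parameters are hypothetical; the assumption is that the array name, the homehost recorded in the metadata (if any), and whether the array is listed in mdadm.conf are already known.

```python
from typing import Optional


def md_link_name(array_name: str,
                 array_homehost: Optional[str],
                 this_host: str,
                 listed_in_conf: bool) -> Optional[str]:
    """Pick the /dev/md entry for an array, or None if only udev links apply.

    Hypothetical helper: the cases mirror the proposal in the text, not
    any existing mdadm function.
    """
    # Array is known to belong to this host: use the plain name.
    if listed_in_conf or array_homehost == this_host:
        return f"/dev/md/{array_name}"
    # Foreign array with a homehost: "homehost:name" is assumed unique.
    # Such an array would be started auto-readonly.
    if array_homehost:
        return f"/dev/md/{array_homehost}:{array_name}"
    # No homehost at all: random high minor, found via udev links only.
    return None
```

The only difference between the last two cases is whether a homehost:name entry is created; the udev links would be the same in both.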

>     mdadm-3.0 will be able to support "containers" such as a set of
>     devices with DDF metadata.  These can then contain a number of
>     different arrays.  If the 'container' is known to be local to
>     'this' host, then we assume that all contained arrays are too.
> 
>     I'm contemplating creating a link based on the metadata type with
>     a sequential number. e.g. /dev/md/ddf1 or /dev/md/imsm2.
>     I'm not sure if there should be in /dev/md/ or directly in /dev/.
>     I'm also not sure if I should leave the creation to udev, and
>     whether I should use a small sequential number, or just whatever
>     number was allocated as the minor number of the device.
> 
>  4/ When we stop an array, mdadm will remove anything from /dev that
>     it probably created.
>     In particular, it will remove the device node as described in 1,
>     any partitions, and any symlinks in /dev or /dev/md which point to
>     any of those.  I need to be certain that this won't confuse udev.
> 
>  5/ I want to enable assembly without having to give
>     an explicit device name, thus requiring mdadm to automatically
>     assign one just as it would for auto-assembly.
>     In particular, the "ARRAY" line in mdadm.conf will no longer
>     require an array name.  That would mean that "-Es" wouldn't need
>     to produce an array name (which is not always easy).
>     So:
>         mdadm -Es > /tmp/mdadm.conf
> 	mdadm -Asc/tmp/mdadm.conf
>     would leave the choice of device name to the "-A" stage which is
>     the only time that unique non-predictable names can be chosen.
> 
>  6/ I'm thinking that if the array name given to --create or
>     --assemble looks as though it identifies a metadata type, by
>     having the name of a metadata type followed by some digits,
>     e.g. /dev/ddf0 or /dev/md/imsm3
>     then we insist that the array have that metadata type.
>     That could mean that a future metadata type might conflict with
>     a previously valid usage, which would be a bore.
>     Maybe if there are trailing digits, then it *must* identify a
>     metadata type, or be "mdNN".
> 
> Some issues that all of this needs to address:
> 
>  1/ People want auto-assembly.  I've always fought against it (we
>     don't auto-mount all filesystems do we?).  But it is a losing
>     battle.  And on a modern desktop, when you plug in a new drive the
>     filesystem is automatically mounted.  So my argument is falling
>     apart.
> 
>  2/ Auto-assembly of new arrays must not conflict with auto-assembly
>     of previously existing arrays, even if the devices comprising the
>     new arrays are discovered earlier.  This is what the 'homehost'
>     concept is for.  Your array will only get assembled with a
>     predictable name if it is known to be attached to 'this' host.

Really, with the advent of mount-by-label filesystem usage, this
argument has become less legitimate.  That's not to say that using
homehost intelligently isn't desirable, but even if there is a name
conflict, and the wrong array gets assembled first, it really doesn't
matter since the upper layers will detect the proper filesystem by
filesystem label or uuid and use whatever device contains the filesystem
they want.  So, I would treat homehost as a convenience and a hint, but
I wouldn't allow lack of homehost or wrong homehost to prevent assembly.

>  3/ Auto-assembly needs to handle incremental arrival of devices
>     correctly.  There are no easy solutions to this, particularly when
>     e.g. ext3 can write to the device even when mounted read-only (for
>     journal replay).
>     I think the best that I can do for now is assemble things
>     'read-auto' to delay any writes as long as possible in the hope
>     that all available devices will be connected by then.
>     Adding in-memory bitmaps for all degraded arrays to accelerate
>     rebuild would help but won't be in 2.6.28.

My solution to this was to only auto-assemble unknown devices if they
are complete.  I think there is an argument to be made that if an array
isn't a normal array on a machine, then degraded assembly is not a
given.
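
As a rough sketch of that policy (Python, illustration only): the function and parameter names are hypothetical, and it assumes the number of member devices present and the number the array expects are both known from the metadata.

```python
def should_auto_assemble(is_local: bool, present: int, expected: int) -> bool:
    """Decide whether an array may be auto-assembled right now.

    Hypothetical policy helper: a complete array is always assembled,
    a degraded one only if it is known to belong to this host.
    """
    if present >= expected:
        return True
    # Degraded: acceptable for our own arrays, not a given for foreign ones.
    return is_local
```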

>  4/ auto-assembly needs to do the right thing on a SAN where multiple
>     hosts can each see multiple arrays.  Clearly only one host should
>     write to any one array at one time (until I get some
>     cluster-awareness going, which I had hoped to work on this year,
>     but it doesn't look like I will).
>     In this case, I don't think read-auto is enough.  We either need
>     to not assemble arrays that aren't known to belong to us, or we
>     need to assemble them read-only and require an explicit
>     read-write setting.
> 
>     So we need some way to know which devices could be visible to
>     other hosts.
>     I could have a global flag in mdadm.conf "Options SAN"
>     I could have a SAN-DEVICES to match "DEVICES", but as just about
>     everything is "/dev/sd*" these days, I don't know if that would
>     work.
> 
>     Any suggestions concerning this would be welcome.

The scariest suggestion, but probably the most complete and automated,
would be to have mdadm do a search on any constituent devices to find
out what the eventual low level driver is.  If it's a fiber channel
driver, or iSCSI, then don't auto assemble.  If it's sata/e-sata, or
local SAS, then it's more likely auto assemble is fine.  But, that level
of mucking around in /sys for each device would probably be quite ugly.
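
To give an idea of what that search might look like, here is a sketch in Python (illustration only, not a proposal for mdadm itself). It assumes that resolving /sys/block/<disk> yields the full device path, and that Fibre Channel devices show an "rport-" component and iSCSI devices a "session" component in that path. Both markers are assumptions that would need checking against the kernels in question.

```python
import os

# Path fragments assumed to indicate a transport that other hosts can
# also see (Fibre Channel remote port, iSCSI session).  Assumptions.
SAN_MARKERS = ("/rport-", "/session")


def looks_shared(sysfs_device_path: str) -> bool:
    """Return True if the resolved sysfs path hints at a SAN transport."""
    return any(marker in sysfs_device_path for marker in SAN_MARKERS)


def resolve_sysfs_path(disk: str) -> str:
    """Resolve /sys/block/<disk> to its full device path (e.g. for 'sda')."""
    return os.path.realpath(os.path.join("/sys/block", disk))
```

A caller would run looks_shared(resolve_sysfs_path("sdc")) for each constituent device and skip auto-assembly if any of them looks shared. A path that matches neither marker only means "probably local", not "certainly local".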

> I'm also wondering if I should include a udev 'rules' file for md in
> the mdadm distribution.  Obviously it would be no more than a
> recommendation, but it might give me a voice in guiding how udev
> interacted with mdadm.

Actually, this would probably be very helpful.  For instance, that udev
rules file is probably the way you decide whether mdadm or udev
creates/deletes all the links.  The actions of mdadm and udev have to be
synchronized in order to avoid confusion about responsibilities.

> Any thoughts of any of this would be most welcome.
> 
> Thanks,
> NeilBrown
> 
-- 
Doug Ledford <dledford@redhat.com>
              GPG KeyID: CFBFF194
              http://people.redhat.com/dledford

Infiniband specific RPMs available at
              http://people.redhat.com/dledford/Infiniband



* Re: RFC - device names and mdadm with some reference to udev.
  2008-10-30 17:18 ` RFC - device names and mdadm with some reference to udev Doug Ledford
@ 2008-10-31  9:45   ` Neil Brown
  2008-11-03  9:29     ` Gabor Gombas
  2008-11-03 14:34     ` Doug Ledford
  2008-11-02 13:47   ` Luca Berra
  1 sibling, 2 replies; 51+ messages in thread
From: Neil Brown @ 2008-10-31  9:45 UTC (permalink / raw)
  To: Doug Ledford; +Cc: linux-raid, martin f. krafft, Michal Marek, Kay Sievers

On Thursday October 30, dledford@redhat.com wrote:
> On Mon, 2008-10-27 at 09:56 +1100, Neil Brown wrote:
> > Greeting.
> >  This is a Request For Comments....
> 
> OK, I've taken my time responding to this at least partially because I
> wanted to get you my current changes first.

You mean it isn't a law of the Internet that every email must be
replied to in less than 2 hours!  I never knew!!

Thanks for taking the time when the time was right.
> > 
> >  In 2.6.28, partitioned devices (mdp) won't be needed any more as md
> >  will make use of the "extended partition" functionality recently
> >  added.  All md devices can be partitioned.  The device number for the
> >  partitions will be very different to that of the whole device, but
> >  udev should hide all of that.  So we don't have to worry too much
> >  about mdp devices.
> 
> Back compatibility, and the ability to use current mdadm on older
> kernels may mean that we need to deal with mdp devices regardless.

True.  My thoughts were that the needs of mdp should not drive design
any more.  Certainly we keep back compatibility where practical,
fix bugs (thanks!), and support mdp on at least an equal
level with md.  But any extra concerns don't need to drive design.

> 
> >  So I think the following is how I want things to work.  I am very
> >  open to comments and suggestions.  Particularly I want to know what
> >  (if anything) this will break.
> > 
> >  1/ The only device nodes created will be /dev/mdX and /dev/md_dX
> >     along with partitions /dev/mdXpY and /dev/md_dXpY as appropriate.
> >     These will be created by mdadm in accordance with the "--auto"
> >     flag unless something in mdadm.conf says to leave it to udev.
> >     In that case, mdadm will create a temporary node
> >     (/dev/.mdadm.whatever) and remove it once udev has created the
> >     real thing.
> 
> One thing I noticed in my work on the incremental stuff, is that the
> user friendly device naming method still wants to create
> these /dev/md_dX{pY} array names.  I'm actually in favor of doing away
> with the notion that an array needs to be numbered and exist in a
> numbered format in the /dev/ namespace.  If you have a user friendly
> name, such as /dev/md/root and /dev/md/boot, or /dev/md/root_p1
> and /dev/md/root_p2, I see no need to add additional numbered devices.
> Instead, just allow the device number of the named devices to be random.

I have considered dropping the "/dev/mdXX" names altogether, and I
think mdadm.2 sometimes does that.  But I've decided against it.
My reasons are:

 1/ udev is going to create them anyway, so there is no point trying
    to hide them.
 2/ those names appear in /proc/mdstat and despite all the rhetoric
    about naming policy not belonging in the kernel, the kernel does
    set some naming policy, "mdX" etc are part of that, and we cannot
    avoid it.
    Joe Sysadmin will see a name in /proc/mdstat and might want to
    access that device.  Having it easily available in /dev is good.

My current thought is that /dev/md/ provides human friendly names.
/dev/disk/by-id/md-whatever provides script-friendly names.  And /dev
directly contains kernel-friendly names.

> 
> >  2/ There will be various symlinks to these devices.
> >     a/ if "symlinks=yes" is given in mdadm.conf, symlinks from
> >          /dev/md/X or /dev/md/dX will be created.
> >     b/ if udev is configured like on Debian,
> >               /dev/disk/by-id/md-name-XXXX
> > 	and   /dev/disk/by-id/md-uuid-UUUU
> >        will be created (by udev).
> >     c/ If there is a 'name' associated with the array then
> >         /dev/md/name will be created as a link.
> >     d/ if an explicit device name of /dev/name was given,
> >         either on a -A, -B, -C, command or in mdadm.conf,
> > 	then the 'name' must match the name of the array,
> > 	and /dev/name will be used as well as /dev/md/name.
> 
> I think all these symlinks are problematic.  We have a naming
> consistency problem, and creating all these links just perpetuates that
> problem.  I would be in favor of standardizing the namespace location
> and semantics and doing away with all the symlinks.  Do that, and within
> one release cycle all the confusion will be gone.

Your last sentence is very pragmatic and sensible.  If confusion
exists, we really want to move firmly away from it, and people will
cope, particularly if things become clearer (even if they are
different to what they are used to).

I am dropping support for the "--symlinks" option and matching
mdadm.conf entry. 
/dev/mdXXX will always be the device node.  There will always be (at
most) one entry in /dev/md/ which points to it.  It might be e.g.
/dev/md/0, but only if no better name is available.

Hopefully this will be clear if documented well.
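
A small sketch of the resulting layout, in Python for illustration only. The function name is hypothetical; the path patterns are the ones mentioned in this thread (/dev/mdX, one entry in /dev/md/, and the by-id links that udev creates on Debian-style setups).

```python
from typing import Dict, Optional


def names_for_array(minor: int, uuid: str,
                    name: Optional[str] = None) -> Dict[str, object]:
    """Return the kernel, human and script friendly names for one array."""
    # Kernel-friendly: the device node itself, as seen in /proc/mdstat.
    kernel = f"/dev/md{minor}"
    # Human-friendly: exactly one entry in /dev/md/, numeric only as fallback.
    human = f"/dev/md/{name}" if name else f"/dev/md/{minor}"
    # Script-friendly: links created by udev under /dev/disk/by-id/.
    script = [f"/dev/disk/by-id/md-uuid-{uuid}"]
    if name:
        script.insert(0, f"/dev/disk/by-id/md-name-{name}")
    return {"kernel": kernel, "human": human, "script": script}
```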

I think having a large number of symlinks from different places in
/dev is inevitable.   But if we come up with clear definitions of
meaning, purpose, and behaviour, we should be safe.

> 
> There's no need to make autostarted arrays that we can't identify as
> being intended solely for this host hard to find.  It's a little tricky
> if there's no homehost in the array, so let's skip that for a second.
> If there *is* a homehost, and we don't list the array in mdadm.conf or
> it doesn't match our homehost, then I think the answer is to just start
> the array auto-readonly with the name /dev/md/homehost:name.  Since we
> are assuming that homehost:name is unique even if it isn't our device,
> then that means it's sufficient for naming the device uniquely in
> our /dev/md space.  Now, if we don't have a homehost on the array, then
> I would do as you suggest and use a random high device number and have
> udev create any appropriate links.  Of course, udev would make those
> same links on devices with a homehost, so the final difference is just
> that you create a homehost:name device when possible, skip it when not.
> All the rest is the same.

I agree.  I think I have implemented some of this, but not all.  In
particular the idea of not starting unexpectedly-degraded arrays which
are foreign is not implemented I don't think.  I will do that.
I also now create e.g. /dev/md/homehost:name when that might be
appropriate.  However it isn't always the case that the name of the
homehost is known.  For 0.90, I can tell if a particular homehost
matches, but I cannot tell the correct homehost name.

But yes, we can still assemble the array and provide some sort of
meaningful name in /dev/md.

> > 
> >  2/ Auto-assembly of new arrays must not conflict with auto-assembly
> >     of previously existing arrays, even if the devices comprising the
> >     new arrays are discovered earlier.  This is what the 'homehost'
> >     concept is for.  Your array will only get assembled with a
> >     predictable name if it is known to be attached to 'this' host.
> 
> Really, with the advent of mount-by-label filesystem usage, this
> argument has become less legitimate.  That's not to say that using
> homehost intelligently isn't desirable, but even if there is a name
> conflict, and the wrong array gets assembled first, it really doesn't
> matter since the upper layers will detect the proper filesystem by
> filesystem label or uuid and use whatever device contains the filesystem
> they want.  So, I would treat homehost as a convenience and a hint, but
> I wouldn't allow lack of homehost or wrong homehost to prevent assembly.

Agreed.  auto-read-only and not starting unexpectedly-degraded foreign
arrays make me more comfortable about this.

> >  4/ auto-assembly needs to do the right thing on a SAN where multiple
> >     hosts can each see multiple arrays.  Clearly only one host should
> >     write to any one array at one time (until I get some
> >     cluster-awareness going, which I had hoped to work on this year,
> >     but it doesn't look like I will).
> >     In this case, I don't think read-auto is enough.  We either need
> >     to not assemble arrays that aren't known to belong to us, or we
> >     need to assemble them read-only and require an explicit
> >     read-write setting.
> > 
> >     So we need some way to know which devices could be visible to
> >     other hosts.
> >     I could have a global flag in mdadm.conf "Options SAN"
> >     I could have a SAN-DEVICES to match "DEVICES", but as just about
> >     everything is "/dev/sd*" these days, I don't know if that would
> >     work.
> > 
> >     Any suggestions concerning this would be welcome.
> 
> The scariest suggestion, but probably the most complete and automated,
> would be to have mdadm do a search on any constituent devices to find
> out what the eventual low level driver is.  If it's a fiber channel
> driver, or iSCSI, then don't auto assemble.  If it's sata/e-sata, or
> local SAS, then it's more likely auto assemble is fine.  But, that level
> of mucking around in /sys for each device would probably be quite ugly.


Quite.  And I'd almost certainly get it wrong.  One day someone might
come up with a solution that can be automated.  For now I think I'll
stick with configuration in mdadm.conf.

> 
> > I'm also wondering if I should include a udev 'rules' file for md in
> > the mdadm distribution.  Obviously it would be no more than a
> > recommendation, but it might give me a voice in guiding how udev
> > interacted with mdadm.
> 
> Actually, this would probably be very helpful.  For instance, that udev
> rules file is probably the way you decide whether mdadm or udev
> creates/deletes all the links.  The actions of mdadm and udev have to be
> synchronized in order to avoid confusion about responsibilities.

Good.  I'm feeling quite positive about the idea of distributing an
mdadm.rules file.  I'm now even starting to understand udev rules
files!


Thanks for your thoughtful contributions.

NeilBrown

* Re: RFC - device names and mdadm with some reference to udev.
  2008-10-31  9:45   ` Neil Brown
@ 2008-11-03  9:29     ` Gabor Gombas
  2008-11-03 10:33       ` Kay Sievers
  2008-11-03 14:34     ` Doug Ledford
  1 sibling, 1 reply; 51+ messages in thread
From: Gabor Gombas @ 2008-11-03  9:29 UTC (permalink / raw)
  To: Neil Brown
  Cc: Doug Ledford, linux-raid, martin f. krafft, Michal Marek,
	Kay Sievers

On Fri, Oct 31, 2008 at 08:45:35PM +1100, Neil Brown wrote:

> I have considered dropping the "/dev/mdXX" names altogether, and I
> think mdadm.2 sometimes does that.  But I've decided against it.
> My reasons are:
> 
>  1/ udev is going to create them anyway, so there is no point trying
>     to hide them.
>  2/ those names appear in /proc/mdstat and despite all the rhetoric
>     about naming policy not belonging in the kernel, the kernel does
>     set some naming policy, "mdX" etc are part of that, and we cannot
>     avoid it.
>     Joe Sysadmin will see a name in /proc/mdstat and might want to
>     access that device.  Having it easily available in /dev is good.

Network devices can be renamed and the new name appears under
/proc/net/dev and in /sys. I think the best solution would be to enable
such renaming for block devices too. I don't know how hard it would be
to implement... Naming policy belongs to userspace, but it would be very
nice to tell the kernel "I want to call this device FOO from now. Please
use the string FOO whenever you refer to this device, and don't use any
other names for it".

Gabor

-- 
     ---------------------------------------------------------
     MTA SZTAKI Computer and Automation Research Institute
                Hungarian Academy of Sciences
     ---------------------------------------------------------

* Re: RFC - device names and mdadm with some reference to udev.
  2008-11-03  9:29     ` Gabor Gombas
@ 2008-11-03 10:33       ` Kay Sievers
  2008-11-03 11:58         ` Gabor Gombas
  0 siblings, 1 reply; 51+ messages in thread
From: Kay Sievers @ 2008-11-03 10:33 UTC (permalink / raw)
  To: Gabor Gombas
  Cc: Neil Brown, Doug Ledford, linux-raid, martin f. krafft,
	Michal Marek

On Mon, Nov 3, 2008 at 10:29, Gabor Gombas <gombasg@sztaki.hu> wrote:
> On Fri, Oct 31, 2008 at 08:45:35PM +1100, Neil Brown wrote:
>
>> I have considered dropping the "/dev/mdXX" names altogether, and I
>> think mdadm.2 sometimes does that.  But I've decided against it.
>> My reasons are:
>>
>>  1/ udev is going to create them anyway, so there is no point trying
>>     to hide them.
>>  2/ those names appear in /proc/mdstat and despite all the rhetoric
>>     about naming policy not belonging in the kernel, the kernel does
>>     set some naming policy, "mdX" etc are part of that, and we cannot
>>     avoid it.
>>     Joe Sysadmin will see a name in /proc/mdstat and might want to
>>     access that device.  Having it easily available in /dev is good.
>
> Network devices can be renamed and the new name appears under
> /proc/net/dev and in /sys. I think the best solution would be to enable
> such renaming for block devices too. I don't know how hard it would be
> to implement... Naming policy belongs to userspace, but it would be very
> nice to tell the kernel "I want to call this device FOO from now. Please
> use the string FOO whenever you refer to this device, and don't use any
> other names for it".

Network devices have only one entry point, if you insist you can think
of the index number as another value, but there is always only one
single name that matters. And we need to rename them because we have
no real concept  of symlinks for network interfaces.

That is not true at all for block devices, you can identify them in
many ways, by name, by physical location, by hardware ID from the
stuff behind the devices, by filesystem metadata, by properties of the
specific subsystem, ... Renaming block devices just does not make much
sense, because there is no primary name to use, it all depends on the
actual setup and personal preference.

Kay

* Re: RFC - device names and mdadm with some reference to udev.
  2008-11-03 10:33       ` Kay Sievers
@ 2008-11-03 11:58         ` Gabor Gombas
  2008-11-03 12:11           ` Kay Sievers
  0 siblings, 1 reply; 51+ messages in thread
From: Gabor Gombas @ 2008-11-03 11:58 UTC (permalink / raw)
  To: Kay Sievers
  Cc: Neil Brown, Doug Ledford, linux-raid, martin f. krafft,
	Michal Marek

On Mon, Nov 03, 2008 at 11:33:36AM +0100, Kay Sievers wrote:

> Network devices have only one entry point, if you insist you can think
> of the index number as another value, but there is always only one
> single name that matters.

And you have only one (major, minor) pair for block devices. No
difference.

> And we need to rename them because we have
> no real concept  of symlinks for network interfaces.
> 
> That is not true at all for block devices, you can identify them in
> many ways, by name, by physical location, by hardware ID from the
> stuff behind the devices, by filesystem metadata, by properties of the
> specific subsystem,

And you can identify network devices by name, physical location,
hardware address, stuff behind the devices (aka. network
autoconfiguration). Again no difference.


> ... Renaming block devices just does not make much
> sense, because there is no primary name to use, it all depends on the
> actual setup and personal preference.

Exactly. And that's why _I_ want to choose the name. The current
udev-based partial solution is not good enough, as even if I give a
meaningful name to a node in /dev the kernel still only tells me "there
is a bad sector on /dev/sdk" and I have to spend precious time to figure
out which device /dev/sdk is. OTOH if I could rename sdk to eg.
self1slot3 then the error message would contain _all_ the information I
need.

Gabor

-- 
     ---------------------------------------------------------
     MTA SZTAKI Computer and Automation Research Institute
                Hungarian Academy of Sciences,
     Laboratory of Parallel and Distributed Systems
     Address   : H-1132 Budapest Victor Hugo u. 18-22. Hungary
     Phone/Fax : +36 1 329-78-64 (secretary)
     W3        : http://www.lpds.sztaki.hu
     ---------------------------------------------------------

* Re: RFC - device names and mdadm with some reference to udev.
  2008-11-03 11:58         ` Gabor Gombas
@ 2008-11-03 12:11           ` Kay Sievers
  0 siblings, 0 replies; 51+ messages in thread
From: Kay Sievers @ 2008-11-03 12:11 UTC (permalink / raw)
  To: Gabor Gombas
  Cc: Neil Brown, Doug Ledford, linux-raid, martin f. krafft,
	Michal Marek

On Mon, Nov 3, 2008 at 12:58, Gabor Gombas <gombasg@sztaki.hu> wrote:
> On Mon, Nov 03, 2008 at 11:33:36AM +0100, Kay Sievers wrote:
>
>> Network devices have only one entry point, if you insist you can think
>> of the index number as another value, but there is always only one
>> single name that matters.
>
> And you have only one (major, minor) pair for block devices. No
> difference.

It's different. You can't have multiple names to access an interface,
like you can have symlinks, that's all it is about.

>> And we need to rename them because we have
>> no real concept  of symlinks for network interfaces.
>>
>> That is not true at all for block devices, you can identify them in
>> many ways, by name, by physical location, by hardware ID from the
>> stuff behind the devices, by filesystem metadata, by properties of the
>> specific subsystem,
>
> And you can identify network devices by name, physical location,
> hardware address, stuff behind the devices (aka. network
> autoconfiguration). Again no difference.

Sure, you can, but there is nothing that gives you any name to access
the device. All access needs to be manually translated to the real
name before you interface with the kernel. It is very different.

>> ... Renaming block devices just does not make much
>> sense, because there is no primary name to use, it all depends on the
>> actual setup and personal preference.
>
> Exactly. And that's why _I_ want to choose the name. The current
> udev-based partial solution is not good enough, as even if I give a
> meaningful name to a node in /dev the kernel still only tells me "there
> is a bad sector on /dev/sdk" and I have to spend precious time to figure
> out which device /dev/sdk is. OTOH if I could rename sdk to eg.
> self1slot3 then the error message would contain _all_ the information I
> need.

If you care, log the symlinks to syslog, and you will always have the
relation of all created names to the device for any time the kernel
device name existed.
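
A sketch of how such a mapping could be collected, in Python for illustration only. The function name is hypothetical; it simply walks a directory tree (normally /dev), resolves every symlink, and groups the links by the kernel name they point to. Writing the result to syslog is left to the caller.

```python
import os
from collections import defaultdict
from typing import Dict, List


def links_by_device(dev_root: str) -> Dict[str, List[str]]:
    """Map each kernel device name to the symlinks that resolve to it."""
    mapping = defaultdict(list)
    # os.walk does not follow symlinks, so each link is seen exactly once.
    for dirpath, dirnames, filenames in os.walk(dev_root):
        for entry in dirnames + filenames:
            path = os.path.join(dirpath, entry)
            if os.path.islink(path):
                target = os.path.realpath(path)
                mapping[os.path.basename(target)].append(path)
    return {name: sorted(links) for name, links in mapping.items()}
```

With that in the log, a kernel message about "sdk" can be matched to whatever meaningful names were given to the device at the time.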

There is no point in opening a can of worms, and debating over what primary
names block devices should have, and how you handle conflicting names,
duplicates like you have with multipath, race-free renaming, and so
on. We have symlinks for that, which work well enough.

Kay

* Re: RFC - device names and mdadm with some reference to udev.
  2008-10-31  9:45   ` Neil Brown
  2008-11-03  9:29     ` Gabor Gombas
@ 2008-11-03 14:34     ` Doug Ledford
  2008-11-03 15:20       ` Dan Williams
  2008-11-07  6:13       ` Neil Brown
  1 sibling, 2 replies; 51+ messages in thread
From: Doug Ledford @ 2008-11-03 14:34 UTC (permalink / raw)
  To: Neil Brown; +Cc: linux-raid, martin f. krafft, Michal Marek, Kay Sievers


On Fri, 2008-10-31 at 20:45 +1100, Neil Brown wrote:
> > >  1/ The only device nodes created will be /dev/mdX and /dev/md_dX
> > >     along with partitions /dev/mdXpY and /dev/md_dXpY as appropriate.
> > >     These will be created by mdadm in accordance with the "--auto"
> > >     flag unless something in mdadm.conf says to leave it to udev.
> > >     In that case, mdadm will create a temporary node
> > >     (/dev/.mdadm.whatever) and remove it once udev has created the
> > >     real thing.
> > 
> > One thing I noticed in my work on the incremental stuff, is that the
> > user friendly device naming method still wants to create
> > these /dev/md_dX{pY} array names.  I'm actually in favor of doing away
> > with the notion that an array needs to be numbered and exist in a
> > numbered format in the /dev/ namespace.  If you have a user friendly
> > name, such as /dev/md/root and /dev/md/boot, or /dev/md/root_p1
> > and /dev/md/root_p2, I see no need to add additional numbered devices.
> > Instead, just allow the device number of the named devices to be random.
> 
> I have considered dropping the "/dev/mdXX" names altogether, and I
> think mdadm.2 sometimes does that.  But I've decided against it.
> My reasons are:
> 
>  1/ udev is going to create them anyway, so there is no point trying
>     to hide them.

I don't think this is accurate.

>  2/ those names appear in /proc/mdstat and despite all the rhetoric
>     about naming policy not belonging in the kernel, the kernel does
>     set some naming policy, "mdX" etc are part of that, and we cannot
>     avoid it.
>     Joe Sysadmin will see a name in /proc/mdstat and might want to
>     access that device.  Having it easily available in /dev is good.
> 
> My current thought is that /dev/md/ provides human friendly names.
> /dev/disk/by-id/md-whatever provides script-friendly names.  And /dev
> directly contains kernel-friendly names.

The in-kernel names are set by the kernel md code.  Right now, it has a
simplistic test that checks if the device is partitionable, then sets
the kobject name to either md%d or md_d%d.  The key point being that the
md code gets to set the kobject name, and it's the kobject name that is
used by udev.  Don't get me wrong, I know changing this setup now would
break udev horribly, because udev currently does
subsystem==block,kernel=="md*" to match all md devices.  In order to
break from this, we would need to do something like
subsystem==block,subtype==md and skip any name check tests.  Then we
could in fact use arbitrary names and udev and the rest of the system
would be fine.  So, I'm not saying it would work today, but that doesn't
mean it couldn't be designed for and then implemented with a coordinated
change to the kernel and udev.
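
To illustrate the naming test described above, here is a minimal shell
sketch that sorts a kernel name into the md%d or md_d%d form. The function
name md_kind is hypothetical, and this is neither the kernel code nor a udev
rule; it only mirrors the two name patterns mentioned in this message.

```shell
#!/bin/sh
# Hypothetical helper: classify a kernel block device name.
#   md<number>   -> plain
#   md_d<number> -> partitionable
#   anything else (including partitions such as md0p1) -> other
md_kind() {
    name=$1
    case "$name" in
        md_d*) num=${name#md_d}; kind=partitionable ;;
        md*)   num=${name#md};   kind=plain ;;
        *)     echo other; return ;;
    esac
    # Accept the name only if the remainder is a non-empty run of digits.
    case "$num" in
        ''|*[!0-9]*) echo other ;;
        *)           echo "$kind" ;;
    esac
}

# Example: md_kind md_d3
```

An arbitrary name would fall into the "other" branch here, which is why
matching on a subtype rather than on the name would be needed.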

> > 
> > >  2/ There will be various symlinks to these devices.
> > >     a/ if "symlinks=yes" is given in mdadm.conf, symlinks from
> > >          /dev/md/X or /dev/md/dX will be created.
> > >     b/ if udev is configured like on Debian,
> > >               /dev/disk/by-id/md-name-XXXX
> > > 	and   /dev/disk/by-id/md-uuid-UUUU
> > >        will be created (by udev).
> > >     c/ If there is a 'name' associated with the array then
> > >         /dev/md/name will be created as a link.
> > >     d/ if an explicit device name of /dev/name was given,
> > >         either on a -A, -B, -C, command or in mdadm.conf,
> > > 	then the 'name' must match the name of the array,
> > > 	and /dev/name will be used as well as /dev/md/name.
> > 
> > I think all these symlinks are problematic.  We have a naming
> > consistency problem, and creating all these links just perpetuates that
> > problem.  I would be in favor of standardizing the namespace location
> > and semantics and doing away with all the symlinks.  Do that, and within
> > one release cycle all the confusion will be gone.
> 
> Your last sentence is very pragmatic and sensible.  If confusion
> exists, we really want to move firmly away from it, and people will
> cope, particularly if things become clearer (even if they are
> different to what they are used to).
> 
> I am dropping support for the "--symlinks" option and matching
> mdadm.conf entry. 
> /dev/mdXXX will always be the device node.  There will always be (at
> most) one entry in /dev/md/ which points to it.  It might be e.g.
> /dev/md/0, but only if no better name is available.

Excellent.  I found the symlinks to create all sorts of cruft that
didn't need to be there.

> > The scariest suggestion, but probably the most complete and automated,
> > would be to have mdadm do a search on any constituent devices to find
> > out what the eventual low level driver is.  If it's a fiber channel
> > driver, or iSCSI, then don't auto assemble.  If it's sata/e-sata, or
> > local SAS, then it's more likely auto assemble is fine.  But, that level
> > of mucking around in /sys for each device would probably be quite ugly.
> 
> 
> Quite.  And I'd almost certainly get it wrong.  One day someone might
> come up with a solution that can be automated.  For now I think I'll
> stick with configuration in mdadm.conf.

I'll keep this in mind as a spare time project...

-- 
Doug Ledford <dledford@redhat.com>
              GPG KeyID: CFBFF194
              http://people.redhat.com/dledford

Infiniband specific RPMs available at
              http://people.redhat.com/dledford/Infiniband




* Re: RFC - device names and mdadm with some reference to udev.
  2008-11-03 14:34     ` Doug Ledford
@ 2008-11-03 15:20       ` Dan Williams
  2008-11-07  6:13       ` Neil Brown
  1 sibling, 0 replies; 51+ messages in thread
From: Dan Williams @ 2008-11-03 15:20 UTC (permalink / raw)
  To: Doug Ledford
  Cc: Neil Brown, linux-raid, martin f. krafft, Michal Marek,
	Kay Sievers

On Mon, Nov 3, 2008 at 7:34 AM, Doug Ledford <dledford@redhat.com> wrote:
> On Fri, 2008-10-31 at 20:45 +1100, Neil Brown wrote:
>> > The scariest suggestion, but probably the most complete and automated,
>> > would be to have mdadm do a search on any constituent devices to find
>> > out what the eventual low level driver is.  If it's a fiber channel
>> > driver, or iSCSI, then don't auto assemble.  If it's sata/e-sata, or
>> > local SAS, then it's more likely auto assemble is fine.  But, that level
>> > of mucking around in /sys for each device would probably be quite ugly.
>>
>>
>> Quite.  And I'd almost certainly get it wrong.  One day someone might
>> come up with a solution that can be automated.  For now I think I'll
>> stick with configuration in mdadm.conf.
>
> I'll keep this in mind as a spare time project...
>

I don't think the mucking around to determine the underlying device
would be that ugly.  Another use for this information is to set a
policy on which device ports are raid ports to enable policies like
auto-rebuild, or detect when a component member is found outside the
"raid domain".
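
A rough sketch of what that lookup could look like, assuming the sysfs
device path of a Fibre Channel disk contains an "rport-" component and that
of an iSCSI disk a "sessionN" component. The function name guess_transport
is hypothetical and the path heuristic is an assumption; shared SCSI or SAS
storage would still be reported as "other", so this alone cannot decide
whether auto-assembly is safe.

```shell
#!/bin/sh
# Hypothetical helper: guess the transport behind block device $1 by
# resolving its sysfs "device" link.  $2 is the sysfs root (default /sys).
guess_transport() {
    sysroot=${2:-/sys}
    link=$sysroot/block/$1/device
    [ -e "$link" ] || return 1
    path=$(readlink -f "$link") || return 1
    case "$path" in
        */rport-*)        echo fc ;;     # Fibre Channel remote port
        */session[0-9]*)  echo iscsi ;;  # iSCSI session
        *)                echo other ;;  # SATA, SAS, USB, ...
    esac
}

# Example: guess_transport sdk
```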

--
Dan


* Re: RFC - device names and mdadm with some reference to udev.
  2008-11-03 14:34     ` Doug Ledford
  2008-11-03 15:20       ` Dan Williams
@ 2008-11-07  6:13       ` Neil Brown
  1 sibling, 0 replies; 51+ messages in thread
From: Neil Brown @ 2008-11-07  6:13 UTC (permalink / raw)
  To: Doug Ledford; +Cc: linux-raid, martin f. krafft, Michal Marek, Kay Sievers

On Monday November 3, dledford@redhat.com wrote:
> 
> The in-kernel names are set by the kernel md code.  Right now, it has a
> simplistic test that checks if the device is partitionable, then sets
> the kobject name to either md%d or md_d%d.  The key point being that the
> md code gets to set the kobject name, and it's the kobject name that is
> used by udev.  Don't get me wrong, I know changing this setup now would
> break udev horribly, because udev currently does
> subsystem==block,kernel=="md*" to match all md devices.  In order to
> break from this, we would need to do something like
> subsystem==block,subtype==md and skip any name check tests.  Then we
> could in fact use arbitrary names and udev and the rest of the system
> would be fine.  So, I'm not saying it would work today, but that doesn't
> mean it couldn't be designed for and then implemented with a coordinated
> change to the kernel and udev.

That's an interesting idea.

   echo md_fred > /sys/module/md_mod/parameters/new_array
   # /sys/block/md_fred magically appears with some random minor number
   # udev ignores it because md/array_state is 'clear'
   cd /sys/block/md_fred/md
   echo raid5 > level
   .....
   echo active > array_state
   # udev gets a CHANGE event and creates /dev/md_fred, and/or
   #  /dev/md/fred plus any partitions that are found.

I think that we need to have the "md_" prefix.  I don't want someone
ever to be able to create an md array called "sda"!
And I think this would work with udev today, though mdadm wouldn't
make use of it, and could get confused if some other code did.

Any code which parses /proc/mdstat and expects to see "md%d" there
will also get confused.
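
A parser that keys on the line layout rather than on the "md%d" pattern
would avoid that. The following is a minimal sketch: the function name
mdstat_arrays is hypothetical, and it assumes array lines keep the
"name : state ..." form.

```shell
#!/bin/sh
# Hypothetical helper: list array names from mdstat-formatted input on
# stdin, whatever the names look like.  Array lines have ":" as their
# second field; the "Personalities :" header shares that shape and is
# skipped explicitly.
mdstat_arrays() {
    awk '$2 == ":" && $1 != "Personalities" { print $1 }'
}

# Example: mdstat_arrays < /proc/mdstat
```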

But this is the first idea I have seen that really gives some value in
an alternate way for creating md arrays (alternate to just opening the
block-special device).

Thanks for the suggestion.

NeilBrown


* Re: RFC - device names and mdadm with some reference to udev.
  2008-10-30 17:18 ` RFC - device names and mdadm with some reference to udev Doug Ledford
  2008-10-31  9:45   ` Neil Brown
@ 2008-11-02 13:47   ` Luca Berra
  1 sibling, 0 replies; 51+ messages in thread
From: Luca Berra @ 2008-11-02 13:47 UTC (permalink / raw)
  To: linux-raid

On Thu, Oct 30, 2008 at 01:18:47PM -0400, Doug Ledford wrote:
>>  4/ auto-assembly needs to do the right thing on a SAN where multiple
>>     hosts can each see multiple arrays.  Clearly only one host should
>>     write to any one array at one time (until I get some
>>     cluster-awareness going, which I had hoped to work on this year,
>>     but it doesn't look like I will).
>>     In this case, I don't think read-auto is enough.  We either need
>>     to not assemble arrays that aren't known to belong to us, or we
>>     need to assemble them read-only and require an explicit
>>     read-write setting.
>> 
>>     So we need some way to know which devices could be visible to
>>     other hosts.
>>     I could have a global flag in mdadm.conf "Options SAN"
>>     I could have a SAN-DEVICES to match "DEVICES", but as just about
>>     everything is "/dev/sd*" these days, I don't know if that would
>>     work.
>> 
>>     Any suggestions concerning this would be welcome.
>
>The scariest suggestion, but probably the most complete and automated,
>would be to have mdadm do a search on any constituent devices to find
>out what the eventual low level driver is.  If it's a fiber channel
>driver, or iSCSI, then don't auto assemble.  If it's sata/e-sata, or
>local SAS, then it's more likely auto assemble is fine.  But, that level
>of mucking around in /sys for each device would probably be quite ugly.
>

Unfortunately this will not work out correctly:
1) it is quite possible for a host to boot from fiber-channel, and to
run md over it (it is a fairly common setup here).
2) SCSI supports shared storage, and I believe SAS does too.

L.




-- 
Luca Berra -- bluca@comedia.it
         Communication Media & Services S.r.l.
  /"\
  \ /     ASCII RIBBON CAMPAIGN
   X        AGAINST HTML MAIL
  / \


[parent not found: <dledford@redhat.com>]

* Re: RFC - device names and mdadm with some reference to udev.
@ 2008-10-31  1:02 ` greg
  2008-10-31  9:18   ` Neil Brown
  0 siblings, 1 reply; 51+ messages in thread
From: greg @ 2008-10-31  1:02 UTC (permalink / raw)
  To: Doug Ledford, martin f krafft
  Cc: Neil Brown, linux-raid, Michal Marek, Kay Sievers

On Oct 27, 11:13am, Doug Ledford wrote:
} Subject: Re: RFC - device names and mdadm with some reference to udev.

Good evening to everyone, hope the week has gone well.

> > I would really like to have a clear separation of competencies.
> > Ideally, mdadm never creates any devices but leaves it all to udev,
> > and all configuration about alternate names ("symlinks") is done in
> > the udev rules file.

> This would then require that we have a working udev in our initrd
> images.  It would greatly increase the complexity of early booting
> as a result.

Whatever we do please do not make use of mdadm or startup of arrays
dependent on udev.  I do SAN's for a living and have had far too many
phone calls and have spent too much time trying to get boxes messed up
by udev back on the fabric to want to add any more complication to the
mix.

The notion of udev certainly has its place but not on a server which
only cares about four device nodes for its entire operational life.

Neil your mdadm is a great tool and your contributions via the MD
stuff are beyond peer, keep up the good work.  But this stuff has to
get simpler rather than more complex.

Best wishes for a pleasant weekend to everyone.

}-- End of excerpt from Doug Ledford

As always,
Dr. G.W. Wettstein, Ph.D.   Enjellic Systems Development, LLC.
4206 N. 19th Ave.           Specializing in information infra-structure
Fargo, ND  58102            development.
PH: 701-281-1686
FAX: 701-281-3949           EMAIL: greg@enjellic.com
------------------------------------------------------------------------------
"C++ is designed to allow you to express ideas, but if you don't have
 any ideas or don't have any clue about how to express them, C++
 doesn't offer much help."
                                -- Bjarne Stroustrup
                                   Technology Review


* Re: RFC - device names and mdadm with some reference to udev.
  2008-10-31  1:02 ` greg
@ 2008-10-31  9:18   ` Neil Brown
  2008-11-02 13:52     ` Luca Berra
  0 siblings, 1 reply; 51+ messages in thread
From: Neil Brown @ 2008-10-31  9:18 UTC (permalink / raw)
  To: greg; +Cc: Doug Ledford, martin f krafft, linux-raid, Michal Marek,
	Kay Sievers

On Thursday October 30, greg@enjellic.com wrote:
> On Oct 27, 11:13am, Doug Ledford wrote:
> } Subject: Re: RFC - device names and mdadm with some reference to udev.
> 
> Good evening to everyone, hope the week has gone well.
> 
> > > I would really like to have a clear separation of competencies.
> > > Ideally, mdadm never creates any devices but leaves it all to udev,
> > > and all configuration about alternate names ("symlinks") is done in
> > > the udev rules file.
> 
> > This would then require that we have a working udev in our initrd
> > images.  It would greatly increase the complexity of early booting
> > as a result.
> 
> Whatever we do please do not make use of mdadm or startup of arrays
> dependent on udev.  I do SAN's for a living and have had far too many
> phone calls and have spent too much time trying to get boxes messed up
> by udev back on the fabric to want to add any more complication to the
> mix.

I had intended to continue to support the no-udev installations, but
thank you for the encouragement that it really is needed and will be used.

Just a clarification:  are you envisaging an installation without udev
at all, or one with udev installed and active, but you don't want
mdadm to depend on it?  That latter option may be more awkward (I
currently support an environment variable which says "just create the
devices, even if udev appears to be installed").

> 
> The notion of udev certainly has its place but not on a server which
> only cares about four device nodes for its entire operational life.
> 
> Neil your mdadm is a great tool and your contributions via the MD
> stuff are beyond peer, keep up the good work.  But this stuff has to
> get simpler rather than more complex.

Thanks :-)

NeilBrown


* Re: RFC - device names and mdadm with some reference to udev.
  2008-10-31  9:18   ` Neil Brown
@ 2008-11-02 13:52     ` Luca Berra
  0 siblings, 0 replies; 51+ messages in thread
From: Luca Berra @ 2008-11-02 13:52 UTC (permalink / raw)
  To: linux-raid

On Fri, Oct 31, 2008 at 08:18:16PM +1100, Neil Brown wrote:
>On Thursday October 30, greg@enjellic.com wrote:
>> On Oct 27, 11:13am, Doug Ledford wrote:
>> } Subject: Re: RFC - device names and mdadm with some reference to udev.
>> 
>> Good evening to everyone, hope the week has gone well.
>> 
>> > > I would really like to have a clear separation of competencies.
>> > > Ideally, mdadm never creates any devices but leaves it all to udev,
>> > > and all configuration about alternate names ("symlinks") is done in
>> > > the udev rules file.
>> 
>> > This would then require that we have a working udev in our initrd
>> > images.  It would greatly increase the complexity of early booting
>> > as a result.
>> 
>> Whatever we do please do not make use of mdadm or startup of arrays
>> dependent on udev.  I do SAN's for a living and have had far too many
>> phone calls and have spent too much time trying to get boxes messed up
>> by udev back on the fabric to want to add any more complication to the
>> mix.

+1

>I had intended to continue to support the no-udev installations, but
>thank you for the encouragement that it really is needed and will be used.
>
>Just a clarification:  are you envisaging an installation without udev
>at all, or one with udev installed and active, but you don't want
>mdadm to depend on it?  That latter option may be more awkward (I
>currently support an environment variable which says "just create the
>devices, even if udev appears to be installed").

I have no objection to letting udev create my device files.
What I dislike is when udev sees a device appearing, and decides it
knows better than me, so it starts using it right away, no matter if it
is a USB key or multi-TB shared storage.
I could never have thought of something so stupid as the incremental
assembly of md arrays:
0 advantages gained for a lot of trouble.

L.

-- 
Luca Berra -- bluca@comedia.it
         Communication Media & Services S.r.l.
  /"\
  \ /     ASCII RIBBON CAMPAIGN
   X        AGAINST HTML MAIL
  / \


* Re: RFC - device names and mdadm with some reference to udev.
@ 2008-11-04 15:36 greg
  0 siblings, 0 replies; 51+ messages in thread
From: greg @ 2008-11-04 15:36 UTC (permalink / raw)
  To: Neil Brown, greg
  Cc: Doug Ledford, martin f krafft, linux-raid, Michal Marek,
	Kay Sievers

On Oct 31,  8:18pm, Neil Brown wrote:
} Subject: Re: RFC - device names and mdadm with some reference to udev.

Hi Neil et al., hope your day has started well.

> On Thursday October 30, greg@enjellic.com wrote:
> > Whatever we do please do not make use of mdadm or startup of arrays
> > dependent on udev.  I do SAN's for a living and have had far too many
> > phone calls and have spent too much time trying to get boxes messed up
> > by udev back on the fabric to want to add any more complication to the
> > mix.

> I had intended to continue to support the no-udev installations, but
> thank you for the encouragement that it really is needed and will be
> used.
>
> Just a clarification: are you envisaging an installation without
> udev at all, or one with udev installed and active, but you don't
> want mdadm to depend on it?  That latter option may be more awkward
> (I currently support an environment variable which says "just create
> the devices, even if udev appears to be installed").

On the really critical systems I supervise there is no presence of
udev at all.  We need mdadm to run in that type of environment.

I guess if mdadm finds udev active and running it should feel free to
cooperate with it.  If there is an option to tell mdadm to create the
devices itself or use what has been defined that would be very helpful
as well and something we would use.

We find real problems with udev race issues in wide area SAN
implementations.  We had an incident a couple of weeks ago which
caused filesystem problems and a significant outage period secondary
to non-deterministic device setup in a udev based environment.

Our primary goals are simple, uncomplicated and reliable.

> > The notion of udev certainly has its place but not on a server which
> > only cares about four device nodes for its entire operational life.
> > 
> > Neil your mdadm is a great tool and your contributions via the MD
> > stuff are beyond peer, keep up the good work.  But this stuff has to
> > get simpler rather than more complex.
> 
> Thanks :-)
> 
> NeilBrown

Keep up the good work, best wishes for a productive week.

}-- End of excerpt from Neil Brown

As always,
Dr. G.W. Wettstein, Ph.D.   Enjellic Systems Development, LLC.
4206 N. 19th Ave.           Specializing in information infra-structure
Fargo, ND  58102            development.
PH: 701-281-1686
FAX: 701-281-3949           EMAIL: greg@enjellic.com
------------------------------------------------------------------------------
"More people are killed every year by pigs than by sharks, which shows
 you how good we are at evaluating risk."
                                -- Bruce Schneier
                                   Beyond Fear


end of thread, other threads:[~2008-11-07  6:13 UTC | newest]

Thread overview: 51+ messages
2008-10-26 22:56 RFC - device names and mdadm with some reference to udev Neil Brown
2008-10-27  8:22 ` martin f krafft
2008-10-27 15:13   ` Doug Ledford
2008-10-27 16:10     ` Andre Noll
2008-10-27 16:37       ` Kay Sievers
2008-10-27 16:59         ` martin f krafft
2008-10-27 18:31           ` Kay Sievers
2008-10-28  6:21             ` Luca Berra
2008-10-27 17:24         ` Doug Ledford
2008-10-27 23:36           ` Neil Brown
2008-10-29 18:49             ` Doug Ledford
2008-10-28  6:32           ` Luca Berra
2008-10-28  9:42           ` occasional bitmap was " David Greaves
2008-10-27 17:30         ` Andre Noll
2008-10-27 16:13     ` Kay Sievers
2008-10-27 22:37   ` Neil Brown
2008-10-27 22:51     ` Kay Sievers
2008-10-27 23:56       ` Neil Brown
2008-10-28  0:20         ` Kay Sievers
2008-10-28  6:17   ` Luca Berra
2008-10-27 12:41 ` Kay Sievers
2008-10-27 13:23   ` David Lethe
2008-10-27 23:27     ` Neil Brown
2008-10-27 23:48       ` David Lethe
2008-10-27 13:24   ` Andre Noll
2008-10-27 14:20     ` Kay Sievers
2008-10-27 23:23   ` Neil Brown
2008-10-28  0:03     ` Kay Sievers
2008-10-28  0:43       ` Neil Brown
2008-10-28  1:16         ` Kay Sievers
2008-10-28  1:44       ` Neil Brown
2008-10-28  1:52         ` Kay Sievers
2008-10-28  1:54           ` Kay Sievers
2008-10-31 20:54       ` Debian and udev (was: RFC - device names and mdadm with some reference to udev.) martin f krafft
2008-10-31 23:08         ` Bernd Schubert
2008-10-29  8:56     ` RFC - device names and mdadm with some reference to udev Gabor Gombas
2008-10-31 20:49     ` mdp devices on Debian (was: RFC - device names and mdadm with some reference to udev.) martin f krafft
2008-10-30 17:18 ` RFC - device names and mdadm with some reference to udev Doug Ledford
2008-10-31  9:45   ` Neil Brown
2008-11-03  9:29     ` Gabor Gombas
2008-11-03 10:33       ` Kay Sievers
2008-11-03 11:58         ` Gabor Gombas
2008-11-03 12:11           ` Kay Sievers
2008-11-03 14:34     ` Doug Ledford
2008-11-03 15:20       ` Dan Williams
2008-11-07  6:13       ` Neil Brown
2008-11-02 13:47   ` Luca Berra
     [not found] <dledford@redhat.com>
2008-10-31  1:02 ` greg
2008-10-31  9:18   ` Neil Brown
2008-11-02 13:52     ` Luca Berra
  -- strict thread matches above, loose matches on Subject: below --
2008-11-04 15:36 greg
