RFC: mdadm and bringing up raid sets from initrd (dracut)

* RFC: mdadm and bringing up raid sets from initrd (dracut)
@ 2009-07-14  9:57 Hans de Goede
  2009-07-14 13:39 ` Doug Ledford
       [not found] ` <4A5C6501.3080607-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
  0 siblings, 2 replies; 12+ messages in thread
From: Hans de Goede @ 2009-07-14  9:57 UTC (permalink / raw)
  To: initramfs; +Cc: linux-hotplug, Danecki, Jacek, Harald Hoyer, Doug Ledford

Hi,

As you probably know I'm working on making Fedora 12 use mdraid
instead of dmraid for Intel BIOS-RAID setups.

The installer (anaconda) part is mostly done (needs more testing)
and now I'm looking at implementing support for this in dracut
(the new mkinitrd for Fedora 12).

So I've been testing how this works for both imsm mdraid sets
and native mdraid metadata sets, in both cases using a 2 disk
mirror, so that the set can also be brought up in degraded mode.

Currently the udev rules use incremental assembly like this:
mdadm -I /dev/mdraid-member

There are 2 problems with this:
1) When doing this for native mdraid metadata arrays, if only
    one disk is present the set never gets activated
2) When doing this for imsm metadata arrays, as soon as the
    first disk is incrementally added, the set gets activated
    in degraded mode and stays that way, the second disk
    will get added to the container, but not to the actual
    sets in the container

And these 2 problems have 2 different solutions:
1) An incomplete, but potentially activatable in degraded mode
    set can be activated using mdadm --run /dev/md#
2) One can stop this problem by using:
    mdadm -I --no-degraded /dev/mdraid-member
    instead (this does not change anything for
    native mdraid metadata format sets)
    But if that is done, the sets in the container never get
    activated, this can be fixed by running
    mdadm -I /dev/md# on the container device

So my proposed solution for this is when udev is done scanning
(when the event queue is empty, detected using the same mechanism as
dracut is using for dmraid), do the following:

For each /dev/md#
   run mdadm --export --detail, and get the MD_LEVEL
   if MD_LEVEL == "container":
     mdadm -I /dev/md#
   else
     mdadm --run /dev/md#

This will:
1) Bring up raid sets inside containers (such as imsm raidsets)
2) Bring up incomplete raid sets in degraded mode where possible
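
The pseudocode above could be sketched in POSIX shell roughly as
follows. This is only an illustration under stated assumptions: the
function names and the MDADM override variable are made up for the
example, and it relies on mdadm --detail --export printing an
MD_LEVEL=... line as described above. Unlike the pseudocode, it skips
any device for which no MD_LEVEL is reported.

```shell
# MDADM is a hypothetical override so the sketch can be tried with a stub.
MDADM=${MDADM:-mdadm}

# Print the MD_LEVEL that "mdadm --detail --export" reports for one
# md device; prints nothing if the device cannot be queried.
md_level() {
    "$MDADM" --detail --export "$1" 2>/dev/null | sed -n 's/^MD_LEVEL=//p'
}

# Finish bring-up of a single md device once udev has settled.
md_finish_one() {
    level=$(md_level "$1")
    case "$level" in
        container) "$MDADM" -I "$1" ;;      # start the sets inside the container
        "")        return 0 ;;              # nothing reported: skip (assumption)
        *)         "$MDADM" --run "$1" ;;   # start the set, degraded if necessary
    esac
}

# Apply md_finish_one to every device named on the command line.
md_finish_all() {
    for dev in "$@"; do
        [ -e "$dev" ] || continue           # unmatched glob or vanished node
        md_finish_one "$dev"
    done
}
```

It would be called as md_finish_all /dev/md[0-9]* from whatever hook
runs once the udev event queue is empty.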

I'll post a patch implementing this later today.

Regards,

Hans

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: RFC: mdadm and bringing up raid sets from initrd (dracut)
  2009-07-14  9:57 RFC: mdadm and bringing up raid sets from initrd (dracut) Hans de Goede
@ 2009-07-14 13:39 ` Doug Ledford
       [not found]   ` <1955210A-EF27-479F-8C58-BA4FA9018A56-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
       [not found] ` <4A5C6501.3080607-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
  1 sibling, 1 reply; 12+ messages in thread
From: Doug Ledford @ 2009-07-14 13:39 UTC (permalink / raw)
  To: Hans de Goede; +Cc: initramfs, linux-hotplug, Danecki, Jacek, Harald Hoyer

On Jul 14, 2009, at 6:59 AM, Hans de Goede wrote:
> Hi,
>
> As you probably know I'm working on making Fedora 12 use mdraid
> instead of dmraid for Intel BIOS-RAID setups.
>
> The installer (anaconda) part is mostly done (needs more testing)
> and now I'm looking at implementing support for this in dracut
> (the new mkinitrd for Fedora 12).
>
> So I've been testing how this works for both imsm mdraid sets
> and native mdraid metadata sets, in both cases using a 2 disk
> mirror, so that the set can also be brought up in degraded mode.
>
> Currently the udev rules use incremental assembly like this:
> mdadm -I /dev/mdraid-member

Hmmm...does dracut use udev during initramfs time?  mkinitrd didn't,  
so this would be a change.  In particular, I didn't have these  
problems with mkinitrd because I didn't use udev rules in the initrd,  
I ran mdadm -A instead.  In fact, the F11 method of bringup of raid  
devices is as such:

initrd: use mdadm -As --run <mddevice name with matching ARRAY entry
in /etc/mdadm.conf>
rc.sysinit: use mdadm -As --run (no md device name, which means all
arrays listed in mdadm.conf will get brought up, plus extra arrays not
listed in mdadm.conf but which can be found and identified by metadata)
udev: in 65-md-incremental.rules use mdadm -I <block device>, but only
if /dev/.in.rcsysinit does not exist.  So we don't run the udev
incremental rules until after the system is up and running, which
means they only apply to hot plugged devices.  In particular, we never
run the udev rule on any device that was present at boot; the previous
two calls catch those devices, and those calls will run degraded
arrays.  This allows me to safely refuse to run degraded arrays in the
udev rules file without risking a failed boot.  A degraded hot plugged
array will need minor manual intervention, but the system will be
fully up and operational no matter what.

I find this setup to be a rather safe, conservative way of handling md  
raid array hot plug.  Are we going to be totally changing this with  
dracut and F12?  This method very nicely resolves the issues you posted.
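
The three stages described above can be sketched as shell functions.
This is a rough illustration, not the actual Fedora scripts: the
function names, the MDADM override and the SYSINIT_FLAG variable are
assumptions made for the example, and the exact flag file name is a
guess.

```shell
MDADM=${MDADM:-mdadm}                            # hypothetical override for testing
SYSINIT_FLAG=${SYSINIT_FLAG:-/dev/.in_sysinit}   # flag file; name is an assumption

# initrd: assemble only the named array, which must have a matching
# ARRAY entry in /etc/mdadm.conf; --run starts it even if degraded.
initrd_assemble() {
    "$MDADM" -As --run "$1"
}

# rc.sysinit: no device name, so every array that can be found is
# assembled and started, degraded or not.
sysinit_assemble_all() {
    "$MDADM" -As --run
}

# udev hot plug: do nothing while the boot scripts are still running,
# otherwise add the device incrementally.
hotplug_add() {
    if [ -f "$SYSINIT_FLAG" ]; then
        return 0
    fi
    "$MDADM" -I "$1"
}
```

The point of the split is that only the first two stages may start a
degraded array, so a failed disk can never prevent booting, while hot
plugged arrays are handled conservatively.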

> There are 2 problems with this:
> 1) When doing this for native mdraid metadata arrays, if only
>   one disk is present the set never gets activated
> 2) When doing this for imsm metadata arrays, as soon as the
>   first disk is incrementally added, the set gets activated
>   in degraded mode and stays that way, the second disk
>   will get added to the container, but not to the actual
>   sets in the container
>
> And these 2 problems have 2 different solutions:
> 1) An incomplete, but potentially activatable in degraded mode
>   set can be activated using mdadm --run /dev/md#
> 2) One can stop this problem by using:
>   mdadm -I --no-degraded /dev/mdraid-member
>   instead (this does not change anything for
>   native mdraid metadata format sets)
>   But if that is done, the sets in the container never get
>   activated, this can be fixed by running
>   mdadm -I /dev/md# on the container device
>
> So my proposed solution for this is when udev is done scanning
> (when the event queue is empty, detected using the same mechanism as
> dracut is using for dmraid), do the following:
>
> For each /dev/md#
>  run mdadm --export --detail, and get the MD_LEVEL
>  if MD_LEVEL == "container":
>    mdadm -I /dev/md#
>  else
>    mdadm --run /dev/md#
>
> This will:
> 1) Bring up raid sets inside containers (such as imsm raidsets)
> 2) Bring up incomplete raid sets in degraded mode where possible
>
> I'll post a patch implementing this later today.
>
> Regards,
>
> Hans


--

Doug Ledford <dledford@redhat.com>

GPG KeyID: CFBFF194
http://people.redhat.com/dledford

InfiniBand Specific RPMS
http://people.redhat.com/dledford/Infiniband






* Re: RFC: mdadm and bringing up raid sets from initrd (dracut)
       [not found]   ` <1955210A-EF27-479F-8C58-BA4FA9018A56-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
@ 2009-07-14 14:01     ` Hans de Goede
  2009-07-14 14:14       ` Doug Ledford
  0 siblings, 1 reply; 12+ messages in thread
From: Hans de Goede @ 2009-07-14 14:01 UTC (permalink / raw)
  To: Doug Ledford
  Cc: initramfs, linux-hotplug-u79uwXL29TY76Z2rM5mHXA, Danecki, Jacek,
	Harald Hoyer

Hi,

On 07/14/2009 03:39 PM, Doug Ledford wrote:
> On Jul 14, 2009, at 6:59 AM, Hans de Goede wrote:
>> Hi,
>>
>> As you probably know I'm working on making Fedora 12 use mdraid
>> instead of dmraid for Intel BIOS-RAID setups.
>>
>> The installer (anaconda) part is mostly done (needs more testing)
>> and now I'm looking at implementing support for this in dracut
>> (the new mkinitrd for Fedora 12).
>>
>> So I've been testing how this works for both imsm mdraid sets
>> and native mdraid metadata sets, in both cases using a 2 disk
>> mirror, so that the set can also be brought up in degraded mode.
>>
>> Currently the udev rules use incremental assembly like this:
>> mdadm -I /dev/mdraid-member
>
> Hmmm...does dracut use udev during initramfs time?

Yes, it uses udev for everything, making discovery of / consistent
with the discovery of other storage devices.

<snip>

> Are we going to be totally changing this with
> dracut and F12? This method very nicely resolves the issues you posted.
>

Yes.

Regards,

Hans


* Re: RFC: mdadm and bringing up raid sets from initrd (dracut)
  2009-07-14 14:01     ` Hans de Goede
@ 2009-07-14 14:14       ` Doug Ledford
       [not found]         ` <D758972F-0E5A-4860-9011-6B2DA1FA771A-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
  0 siblings, 1 reply; 12+ messages in thread
From: Doug Ledford @ 2009-07-14 14:14 UTC (permalink / raw)
  To: Hans de Goede; +Cc: initramfs, linux-hotplug, Danecki, Jacek, Harald Hoyer

On Jul 14, 2009, at 11:02 AM, Hans de Goede wrote:
> Hi,
> On 07/14/2009 03:39 PM, Doug Ledford wrote:
>> On Jul 14, 2009, at 6:59 AM, Hans de Goede wrote:
>>> Hi,
>>>
>>> As you probably know I'm working on making Fedora 12 use mdraid
>>> instead of dmraid for Intel BIOS-RAID setups.
>>>
>>> The installer (anaconda) part is mostly done (needs more testing)
>>> and now I'm looking at implementing support for this in dracut
>>> (the new mkinitrd for Fedora 12).
>>>
>>> So I've been testing how this works for both imsm mdraid sets
>>> and native mdraid metadata sets, in both cases using a 2 disk
>>> mirror, so that the set can also be brought up in degraded mode.
>>>
>>> Currently the udev rules use incremental assembly like this:
>>> mdadm -I /dev/mdraid-member
>>
>> Hmmm...does dracut use udev during initramfs time?
>
> Yes, it uses udev for everything, making discovery of / consistent
> with the discovery of other storage devices.

I'm not sure I like or agree with that philosophy.  I absolutely  
*don't* want my / filesystem or raid device treated like some plug in,  
temporary, roaming raid device.  They *aren't* the same, not in terms  
of importance to the running of the machine and not in terms of  
reliability requirements.  By using mdadm -A in the mkinitrd calls, I  
was able to put in an mdadm.conf file and limit what arrays get  
started to arrays found non-ambiguously in that mdadm.conf file and  
identified by UUID.  When you switch to incremental assembly for root,  
you risk the possibility of name space collisions and non- 
deterministic bring up of your / array.

> <snip>
>
>> Are we going to be totally changing this with
>> dracut and F12? This method very nicely resolves the issues you  
>> posted.
>>
>
> Yes.
>
> Regards,
>
> Hans


--

Doug Ledford <dledford@redhat.com>

GPG KeyID: CFBFF194
http://people.redhat.com/dledford

InfiniBand Specific RPMS
http://people.redhat.com/dledford/Infiniband






* Re: RFC: mdadm and bringing up raid sets from initrd (dracut)
       [not found] ` <4A5C6501.3080607-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
@ 2009-07-14 14:30   ` David Zeuthen
       [not found]     ` <1247581847.1991.16.camel-bi+AKbBUZKY6gyzm1THtWbp2dZbC/Bob@public.gmane.org>
  0 siblings, 1 reply; 12+ messages in thread
From: David Zeuthen @ 2009-07-14 14:30 UTC (permalink / raw)
  To: Hans de Goede
  Cc: initramfs, linux-hotplug-u79uwXL29TY76Z2rM5mHXA, Danecki, Jacek,
	Harald Hoyer, Doug Ledford

On Tue, 2009-07-14 at 12:59 +0200, Hans de Goede wrote:
> Currently the udev rules use incremental assembly like this:
> mdadm -I /dev/mdraid-member
> 
> There are 2 problems with this:
> 1) When doing this for native mdraid metadata arrays, if only
>     one disk is present the set never gets activated
> 2) When doing this for imsm metadata arrays, as soon as the
>     first disk is incrementally added, the set gets activated
>     in degraded mode and stays that way, the second disk
>     will get added to the container, but not to the actual
>     sets in the container

FWIW, this incremental assembly business in mdadm is actually not a very
good idea. At least not the current implementation. I'm not sure whether
it's still a Fedora-ism or whether it's something that's in upstream
mdadm yet. I'm talking about this udev rule

 /lib/udev/rules.d/65-md-incremental.rules:
 # This file causes block devices with Linux RAID (mdadm) signatures to
 # automatically cause mdadm to be run.
 # See udev(8) for syntax

 SUBSYSTEM=="block", ACTION=="add", ENV{ID_FS_TYPE}=="linux_raid_member", \
	IMPORT{program}="/sbin/mdadm --examine --export $tempnode", \
	RUN+="/bin/bash -c '[ ! -f /dev/.in_sysinit ] && mdadm -I $env{DEVNAME}'"

For example, if the user plugs in a random old disk that happens to
contain half of a RAID1 mirror, the incremental assembly bits set up
an inert md device, and the user is left to his own devices to sort
this out when partitioning tools etc. tell him that the disk (or a
partition of it) he just plugged in is "busy" (it is claimed by the
inert md node).

I actually had to add some extra code to the GNOME Disk Utility bits to
handle such things (stop inert md devices) - makes the user experience
quite a bit worse since there's now an extra state to worry about. And
most current users don't use the UI bits yet for this so they get extra
confused when trying to use e.g. parted(8) or fdisk(8) on the device.
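
As a rough illustration of what "stop inert md devices" involves, the
sketch below lists arrays that exist but are not running. It assumes
the usual /proc/mdstat layout, in which such arrays appear as
"mdN : inactive ..."; the function name is made up for the example and
this is not the GNOME Disk Utility code.

```shell
# Read /proc/mdstat-style text on stdin and print the names of arrays
# that exist but are not running (reported as "inactive").
inactive_arrays() {
    awk '$2 == ":" && $3 == "inactive" { print $1 }'
}
```

Each name printed by inactive_arrays < /proc/mdstat could then be
released with mdadm --stop /dev/<name>, which frees the member disk
for tools such as parted or fdisk.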

FWIW, I wish people would stop playing games like this. If you want to
do auto-assembly at the system-level, at the very least don't leave the
system in a state like this. For example, one way to do auto-assembly
without such bugs would be to use libudev to enumerate all md component
devices with the same MD_UUID. Then you count the number of components
and only start the array if the number of components equals MD_DEVICES.
That's much better than incrementally adding to an md device node that
might never get used.
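
The counting approach can be sketched in shell instead of libudev.
This is a minimal sketch under stated assumptions: it takes MD_UUID and
MD_DEVICES from mdadm --examine --export (the same output the udev rule
above imports), and the function names and the MDADM override are
hypothetical.

```shell
MDADM=${MDADM:-mdadm}   # hypothetical override for testing

# Print "<device> <MD_UUID> <MD_DEVICES>" for one component device,
# based on the output of "mdadm --examine --export".
describe_component() {
    "$MDADM" --examine --export "$1" 2>/dev/null |
    awk -F= -v dev="$1" '
        $1 == "MD_UUID"    { uuid = $2 }
        $1 == "MD_DEVICES" { count = $2 }
        END { if (uuid != "" && count != "") print dev, uuid, count }'
}

# Read such lines on stdin and print each UUID whose number of visible
# components equals its MD_DEVICES value.
complete_arrays() {
    awk '{ seen[$2]++; want[$2] = $3 + 0 }
         END { for (u in seen) if (seen[u] == want[u]) print u }' | sort
}
```

Only the UUIDs printed by complete_arrays would be handed to mdadm for
assembly, so a lone half of an old mirror never claims its disk.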

I've complained to Doug about this already for Fedora but, since it's
still broken and, AFAICT, on its way to upstream mdadm, it's worth
reiterating the complaint.

Thanks,
David

[1] : And, except for booting, it's not clear to me that you want to
have policy like auto-assembling RAID arrays at the system level. I'd leave
such policy to desktop bits where the user can control it and the
software can actually interact with the user. And where it's easy to
turn off features like this.


* Re: RFC: mdadm and bringing up raid sets from initrd (dracut)
       [not found]         ` <D758972F-0E5A-4860-9011-6B2DA1FA771A-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
@ 2009-07-14 15:00           ` David Zeuthen
  2009-07-16 10:56             ` Harald Hoyer
  0 siblings, 1 reply; 12+ messages in thread
From: David Zeuthen @ 2009-07-14 15:00 UTC (permalink / raw)
  To: Doug Ledford
  Cc: Hans de Goede, initramfs, linux-hotplug-u79uwXL29TY76Z2rM5mHXA,
	Danecki, Jacek, Harald Hoyer

On Tue, 2009-07-14 at 10:14 -0400, Doug Ledford wrote:
> On Jul 14, 2009, at 11:02 AM, Hans de Goede wrote:
> > Hi,
> > On 07/14/2009 03:39 PM, Doug Ledford wrote:
> >> On Jul 14, 2009, at 6:59 AM, Hans de Goede wrote:
> >>> Hi,
> >>>
> >>> As you probably know I'm working on making Fedora 12 use mdraid
> >>> instead of dmraid for Intel BIOS-RAID setups.
> >>>
> >>> The installer (anaconda) part is mostly done (needs more testing)
> >>> and now I'm looking at implementing support for this in dracut
> >>> (the new mkinitrd for Fedora 12).
> >>>
> >>> So I've been testing how this works for both imsm mdraid sets
> >>> and native mdraid metadata sets, in both cases using a 2 disk
> >>> mirror, so that the set can also be brought up in degraded mode.
> >>>
> >>> Currently the udev rules use incremental assembly like this:
> >>> mdadm -I /dev/mdraid-member
> >>
> >> Hmmm...does dracut use udev during initramfs time?
> >
> > Yes, it uses udev for everything, making discovery of / consistent
> > with the discovery of other storage devices.
> 
> I'm not sure I like or agree with that philosophy.  I absolutely  
> *don't* want my / filesystem or raid device treated like some plug in,  
> temporary, roaming raid device.  They *aren't* the same, not in terms  
> of importance to the running of the machine and not in terms of  
> reliability requirements.  By using mdadm -A in the mkinitrd calls, I  
> was able to put in an mdadm.conf file and limit what arrays get  
> started to arrays found non-ambiguously in that mdadm.conf file and  
> identified by UUID.  When you switch to incremental assembly for root,  
> you risk the possibility of name space collisions and non- 
> deterministic bring up of your / array.

I'm concerned about this too. To be more specific, I'm concerned about
both automatically assembling things like RAID arrays / LVM logical
volumes and also automounting devices [1].

Anyway, my point with all this is that maybe we are going about things
wrong in the initramfs. My understanding is that dracut roughly works
this way (please let me know if this is wrong)

 1. when generating the initramfs image, we leave information in
    the kernel command-line about the root filesystem - typically
    the UUID - e.g. root=UUID=786263c4-5e28-4cdc-97b8-1ab6e221c344

 2. when the initramfs starts, we trigger all uevents and wait for
    things to settle

 3. Autoassembly / magic:

    - If we see e.g. md components, we activate them via udev rules
    - If we see e.g. LUKS devices, we unlock them (by interacting with
      the user asking for the passphrase) via udev rules.
    - Ditto for e.g. LVM

 4. if we see the rootfs (matching on e.g. the UUID passed on the
    kernel command line) we create the /dev/root symlink

 5. when the system has settled (e.g. no more uevents) we mount
    /dev/root and transition to non-early user space. If there
    is no /dev/root link, we bail out

Now, my beef is 3. above. I think it is way too optimistic to just
auto-assemble / unlock etc. everything. E.g. we end up doing a lot of
work not related to the rootfs that is better done in non-early user
space.

Instead, just like we specify the UUID for rootfs on the command-line,
we need to leave some instructions to the initramfs logic on _exactly_
what things should be autoassembled / unlocked / etc. in order to find
the rootfs. So the kernel command-line wouldn't really be "just" the
UUID of rootfs; it would be a whole recipe of actions to do. E.g.

 ROOTFS=UUID=1234        \ # this is the UUID of my rootfs
 MD_ASSEMBLE=UUID=4567   \ # assemble MD array with UUID 4567
 LUKS_UNLOCK=UUID=89ab     # unlock LUKS device with UUID 89ab

which would work for e.g. cases where rootfs is on a LUKS device which
is on a MD array. In other words, we'd need a whole "recipe" passed to
the initramfs (the mkinitrd tool would generate this recipe), not just
the UUID of the rootfs.
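
To make the idea concrete, here is a minimal sketch of how initramfs
logic might read such a recipe. Everything in it is hypothetical: the
key names are the ones proposed above, values are assumed to have the
form UUID=<id>, and the function names and the printed action names
are invented for the example.

```shell
# Print the value of KEY from a kernel-command-line style string;
# fails if KEY is absent.  Values may themselves contain "=".
recipe_get() {
    key=$1
    shift
    for word in $*; do
        case "$word" in
            "$key"=*) printf '%s\n' "${word#"$key"=}"; return 0 ;;
        esac
    done
    return 1
}

# Turn a recipe into an ordered list of actions, lowest layer first:
# md array, then LUKS device, then the root filesystem.
recipe_plan() {
    cmdline=$1
    md=$(recipe_get MD_ASSEMBLE "$cmdline") && echo "assemble-md $md"
    luks=$(recipe_get LUKS_UNLOCK "$cmdline") && echo "unlock-luks $luks"
    root=$(recipe_get ROOTFS "$cmdline") && echo "mount-root $root"
    return 0
}
```

Calling recipe_plan "$(cat /proc/cmdline)" would then yield only the
steps needed to reach the rootfs; anything not named in the recipe is
left for non-early user space.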

Coincidentally, if we had something like this and the format of the
"recipe" was documented somewhere, it would be easy to e.g. implement
"rescue" functionality as described here

http://www.redhat.com/archives/fedora-desktop-list/2009-July/msg00019.html

since graphical disk utilities would just find /etc/grub.conf (or
similar), read the recipe and then start assembling/unlocking bits and
mount them as appropriate in /mnt/rescue/.

Actually this is very close to what Doug is asking for when he says
(paraphrased) "just include mdadm.conf instead of this magic". The key
difference, however, is that the user _won't_ have to use mdadm.conf or
care about config files - it's all taken care of by the mkinitrd binary
when building the recipe. This is a good thing: it means one less config
file to worry about.

Thanks for considering, and sorry for the long mail,
David

[1] : As some background information, I've spent a good chunk of my
life, five years or so, dealing with end users complaining about how
plain block devices got automounted when they were plugged in. FWIW, the
complaints range from the nonsensical (irritated users: "these
desktop kids shall not decide how UNIX works") to actual bugs where the
on-disk contents were mis-detected and either something wrong got
automounted or we failed to automount at all.

If I've learned anything it's that you need to be very very careful here
- unlike Windows and other operating systems with such capabilities,
Linux is.. different.. mostly because we support so many different ways
to put a file system through things like md and dm. And you need to make
it very easy to turn things like this off.


* Re: RFC: mdadm and bringing up raid sets from initrd (dracut)
       [not found]     ` <1247581847.1991.16.camel-bi+AKbBUZKY6gyzm1THtWbp2dZbC/Bob@public.gmane.org>
@ 2009-07-15 18:47       ` Dan Williams
  2009-07-16  0:16         ` Jeremy Katz
                           ` (2 more replies)
  0 siblings, 3 replies; 12+ messages in thread
From: Dan Williams @ 2009-07-15 18:47 UTC (permalink / raw)
  To: David Zeuthen
  Cc: Hans de Goede, initramfs, linux-hotplug-u79uwXL29TY76Z2rM5mHXA,
	Danecki, Jacek, Harald Hoyer, Doug Ledford, NeilBrown

[ Cc: Neil ]

On Tue, Jul 14, 2009 at 7:30 AM, David Zeuthen<david@fubar.dk> wrote:
> On Tue, 2009-07-14 at 12:59 +0200, Hans de Goede wrote:
>> Currently the udev rules use incremental assembly like this:
>> mdadm -I /dev/mdraid-member
>>
>> There are 2 problems with this:
>> 1) When doing this for native mdraid metadata arrays, if only
>>     one disk is present the set never gets activated
>> 2) When doing this for imsm metadata arrays, as soon as the
>>     first disk is incrementally added, the set gets activated
>>     in degraded mode and stays that way, the second disk
>>     will get added to the container, but not to the actual
>>     sets in the container
>
> FWIW, this incremental assembly business in mdadm is actually not a very
> good idea. At least not the current implementation. I'm not sure whether
> it's still a Fedora-ism or whether it's something that's in upstream
> mdadm yet. I'm talking about this udev rule
>
>  /lib/udev/rules.d/65-md-incremental.rules:
>  # This file causes block devices with Linux RAID (mdadm) signatures to
>  # automatically cause mdadm to be run.
>  # See udev(8) for syntax
>
>  SUBSYSTEM=="block", ACTION=="add", ENV{ID_FS_TYPE}=="linux_raid_member", \
>        IMPORT{program}="/sbin/mdadm --examine --export $tempnode", \
>        RUN+="/bin/bash -c '[ ! -f /dev/.in_sysinit ] && mdadm -I $env{DEVNAME}'"
>
> For example, if the user plugs in a random old disk that happens to
> contain half of a RAID1 mirror, the incremental assembly bits set up
> an inert md device, and the user is left to his own devices to sort
> this out when partitioning tools etc. tell him that the disk (or a
> partition of it) he just plugged in is "busy" (it is claimed by the
> inert md node).
>
> I actually had to add some extra code to the GNOME Disk Utility bits to
> handle such things (stop inert md devices) - makes the user experience
> quite a bit worse since there's now an extra state to worry about. And
> most current users don't use the UI bits yet for this so they get extra
> confused when trying to use e.g. parted(8) or fdisk(8) on the device.
>
> FWIW, I'd wish people would stop playing games like this. If you want to
> do auto-assembly at the system-level, at the very least don't leave the
> system in a state like this. For example, one way to do auto-assembly
> without such bugs would be to use libudev to enumerate all md component
> devices with the same MD_UUID. Then you count the number of components
> and only start the array if the number of components equals MD_DEVICES.
> That's much better than incrementally adding to an md device node that
> might never get used.
>
> I've complained to Doug about this already for Fedora but, since it's
> still broken and, AFAICT, on its way to upstream mdadm, it's worth
> reiterating the complaint.
>
> Thanks,
> David
>
> [1] : And, except for booting, it's not clear to me that you want to
> have policy like auto-assembling RAID arrays at the system. I'd leave
> such policy to desktop bits where the user can control it and the
> software can actually interact with the user. And where it's easy to
> turn off features like this.
>

mdadm-3.0 has facilities to prevent assembly of certain metadata types
[1] or arrays with certain uuids [2].  I wonder if we also need a
facility to prevent auto-assembly of arrays *not* listed in
mdadm.conf?  So the mdadm.conf file installed in the initramfs would
only identify the root array and all other randomly identified md
devices would be ignored (rather than assembled with a foreign name).
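
Until mdadm itself offers such a facility, the effect can be
approximated outside mdadm. The sketch below is only an illustration
under stated assumptions: it treats an array as "listed" when the
config file has an ARRAY line carrying UUID=<uuid>, the caller is
assumed to have obtained the UUID already (for example from MD_UUID),
and the function names and the MDADM override are hypothetical.

```shell
MDADM=${MDADM:-mdadm}   # hypothetical override for testing

# Succeed if the config file $1 has an ARRAY line carrying UUID=$2
# (keyword and UUID are compared case-insensitively).
uuid_listed() {
    [ -r "$1" ] || return 1
    awk -v want="$2" '
        tolower($1) == "array" {
            for (i = 2; i <= NF; i++)
                if (tolower($i) == ("uuid=" tolower(want))) found = 1
        }
        END { exit(found ? 0 : 1) }' "$1"
}

# Incrementally add device $2 only if its array UUID $3 is listed in
# the config file $1; unlisted arrays are silently ignored.
guarded_incremental() {
    conf=$1 dev=$2 uuid=$3
    if uuid_listed "$conf" "$uuid"; then
        "$MDADM" -I "$dev"
    fi
}
```

With an mdadm.conf in the initramfs that names only the root array,
guarded_incremental /etc/mdadm.conf "$DEVNAME" "$MD_UUID" would leave
every other md component untouched.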

Thoughts?

Thanks,
Dan

[1]: http://neil.brown.name/git?p=mdadm;a=commitdiff;h1015d57
[2]: http://neil.brown.name/git?p=mdadm;a=commitdiff;h=112cace6


* Re: RFC: mdadm and bringing up raid sets from initrd (dracut)
  2009-07-15 18:47       ` Dan Williams
@ 2009-07-16  0:16         ` Jeremy Katz
       [not found]           ` <20090716001651.GB45537-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
  2009-07-16 10:56         ` Neil Brown
  2009-07-16 11:09         ` Neil Brown
  2 siblings, 1 reply; 12+ messages in thread
From: Jeremy Katz @ 2009-07-16  0:16 UTC (permalink / raw)
  To: Dan Williams
  Cc: David Zeuthen, Hans de Goede, initramfs, linux-hotplug,
	Danecki, Jacek, Harald Hoyer, Doug Ledford, NeilBrown

On Wednesday, July 15 2009, Dan Williams said:
> mdadm-3.0 has facilities to prevent assembly of certain metadata types
> [1] or arrays with certain uuids [2].  I wonder if we also need a
> facility to prevent auto-assembly of arrays *not* listed in
> mdadm.conf?  So the mdadm.conf file installed in the initramfs would
> only identify the root array and all other randomly identified md
> devices would be ignored (rather than assembled with a foreign name).
> 
> Thoughts?

There is no mdadm.conf in the initramfs -- in fact, the initramfs may
not even be generated on the system that you're booting and instead be
"generic" for the kernel in question

Jeremy


* Re: RFC: mdadm and bringing up raid sets from initrd (dracut)
       [not found]           ` <20090716001651.GB45537-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
@ 2009-07-16  7:11             ` Victor Lowther
  0 siblings, 0 replies; 12+ messages in thread
From: Victor Lowther @ 2009-07-16  7:11 UTC (permalink / raw)
  To: Jeremy Katz
  Cc: Dan Williams, David Zeuthen, Hans de Goede, initramfs,
	linux-hotplug-u79uwXL29TY76Z2rM5mHXA, Danecki, Jacek,
	Harald Hoyer, Doug Ledford, NeilBrown

On Wed, Jul 15, 2009 at 7:16 PM, Jeremy Katz<katzj@redhat.com> wrote:
> On Wednesday, July 15 2009, Dan Williams said:
>> mdadm-3.0 has facilities to prevent assembly of certain metadata types
>> [1] or arrays with certain uuids [2].  I wonder if we also need a
>> facility to prevent auto-assembly of arrays *not* listed in
>> mdadm.conf?  So the mdadm.conf file installed in the initramfs would
>> only identify the root array and all other randomly identified md
>> devices would be ignored (rather than assembled with a foreign name).
>>
>> Thoughts?
>
> There is no mdadm.conf in the initramfs -- in fact, the initramfs may
> not even be generated on the system that you're booting and instead be
> "generic" for the kernel in question

Still, it sounds like a good feature to be added for a --hostonly initramfs.

(especially on systems that are attaching to iscsi and/or fibre channel luns)

> Jeremy

* Re: RFC: mdadm and bringing up raid sets from initrd (dracut)
  2009-07-14 15:00           ` David Zeuthen
@ 2009-07-16 10:56             ` Harald Hoyer
  0 siblings, 0 replies; 12+ messages in thread
From: Harald Hoyer @ 2009-07-16 10:56 UTC (permalink / raw)
  To: David Zeuthen
  Cc: Doug Ledford, Hans de Goede, initramfs, linux-hotplug,
	Danecki, Jacek

On 07/14/2009 05:00 PM, David Zeuthen wrote:
> On Tue, 2009-07-14 at 10:14 -0400, Doug Ledford wrote:
>> On Jul 14, 2009, at 11:02 AM, Hans de Goede wrote:
>>> Hi,
>>> On 07/14/2009 03:39 PM, Doug Ledford wrote:
>>>> On Jul 14, 2009, at 6:59 AM, Hans de Goede wrote:
>>>>> Hi,
>>>>>
>>>>> As you probably know I'm working on making Fedora 12 use mdraid
>>>>> instead of dmraid for Intel BIOS-RAID setups.
>>>>>
>>>>> The installer (anaconda) part is mostly done (needs more testing)
>>>>> and now I'm looking at implementing support for this in dracut
>>>>> (the new mkinitrd for Fedora 12).
>>>>>
>>>>> So I've been testing how this works for both imsm mdraid sets
>>>>> and native mdraid metadata sets, in both cases using a 2 disk
>>>>> mirror, so that the set can also be brought up in degraded mode.
>>>>>
>>>>> Currently the udev rules use incremental assembly like this:
>>>>> mdadm -I /dev/mdraid-member
>>>> Hmmm...does dracut use udev during initramfs time?
>>> Yes, it uses udev for everything, making discovery of / consistent
>>> with the discovery of other storage devices.
>> I'm not sure I like or agree with that philosophy.  I absolutely
>> *don't* want my / filesystem or raid device treated like some plug in,
>> temporary, roaming raid device.  They *aren't* the same, not in terms
>> of importance to the running of the machine and not in terms of
>> reliability requirements.  By using mdadm -A in the mkinitrd calls, I
>> was able to put in an mdadm.conf file and limit what arrays get
>> started to arrays found non-ambiguously in that mdadm.conf file and
>> identified by UUID.  When you switch to incremental assembly for root,
>> you risk the possibility of name space collisions and non-
>> deterministic bring up of your / array.
>
> I'm concerned about this too. To be more specific, I'm concerned about
> both automatically assembling things like RAID arrays / LVM logical
> volumes and also automounting devices [1].
>
> Anyway, my point with all this is that maybe we are going about things
> wrong in the initramfs. My understanding is that dracut roughly works
> this way (please let me know if this is wrong)
>
>   1. when generating the initramfs image, we leave information in
>      the kernel command-line about the root filesystem - typically
>      the UUID - e.g. root=UUID=786263c4-5e28-4cdc-97b8-1ab6e221c344
>
>   2. when the initramfs starts, we trigger all uevents and wait for
>      things to settle
>
>   3. Autoassembly / magic:
>
>      - If we see e.g. md components, we activate them via udev rules
>      - If we see e.g. LUKS devices, we unlock them (by interacting with
>        the user asking for the passphrase) via udev rules.
>      - Ditto for e.g. LVM
>
>   4. if we see the rootfs (matching on e.g. the UUID passed on the
>      kernel command line) we create the /dev/root symlink
>
>   5. when the system has settled (e.g. no more uevents) we mount
>      /dev/root and transition to non-early user space. If there
>      is no /dev/root link, we bail out
>
> Now, my beef is 3. above. I think it is way too optimistic to just
> auto-assemble / unlock etc. everything. E.g. we end up doing a lot of
> work not related to the rootfs that is better done in non-early user
> space.
>
> Instead, just like we specify the UUID for rootfs on the command-line,
> we need to leave some instructions to the initramfs logic on _exactly_
> what things should be autoassembled / unlocked / etc. in order to find
> the rootfs. So the kernel command-line wouldn't really be "just" the
> UUID of rootfs; it would be a whole recipe of actions to do. E.g.
>
>   ROOTFS=UUID=1234          \ # this is the UUID of my rootfs
>   MD_ASSEMBLE=UUID=4567     \ # assemble MD array with UUID 4567
>   LUKS_UNLOCK=UUID=89ab       # unlock LUKS device with UUID 89ab
>
> which would work for e.g. cases where rootfs is on a LUKS device which
> is on a MD array. In other words, we'd need a whole "recipe" passed to
> the initramfs (the mkinitrd tool would generate this recipe), not just
> the UUID of the rootfs.
>
> Coincidentally, if we had something like this and the format of the
> "recipe" was documented somewhere, it would be easy to e.g. implement
> "rescue" functionality as described here
>
> http://www.redhat.com/archives/fedora-desktop-list/2009-July/msg00019.html
>
> since graphical disk utilities would just find /etc/grub.conf (or
> similar), read the recipe and then start assembling/unlocking bits and
> mount them as appropriate in /mnt/rescue/.
>
> Actually this is very close to what Doug is asking for when he says
> (paraphrased) "just include mdadm.conf instead of this magic". The key
> difference, however, is that the user _won't_ have to use mdadm.conf or
> care about config files - it's all taken care of by the mkinitrd binary
> when building the recipe. This is a good thing as having one less config
> file to worry about is good.
>
> Thanks for considering, and sorry for the long mail,
> David
>
> [1] : As some background information, I've spent a good chunk of my
> life, five years or so, dealing with end users complaining about how
> plain block devices got automounted when they were plugged in. FWIW, the
> complaints range from the nonsensical (irritated users: "these
> desktop kids shall not decide how UNIX works") to actual bugs where the
> on-disk contents were mis-detected and either something wrong got
> automounted or we failed to automount at all.
>
> If I've learned anything it's that you need to be very very careful here
> - unlike Windows and other operating systems with such capabilities,
> Linux is.. different.. mostly because we support so many different ways
> to put a file system through things like md and dm. And you need to make
> it very easy to turn things like this off.
>
>
>

David, thanks for your suggestion. As of yesterday, dracut now recognizes the
following command line parameters:

LVM
        rd_NO_LVM
               disable LVM detection

        rd_LVM_VG=<volume group name>
               only activate the volume groups with the given name

crypto LUKS
        rd_NO_LUKS
               disable crypto LUKS detection

        rd_LUKS_UUID=<luks uuid>
               only activate the LUKS partitions with the given UUID

MD
        rd_NO_MD
               disable MD RAID detection

        rd_MD_UUID=<md uuid>
               only activate the raid sets with the given UUID

DMRAID
        rd_NO_DM
               disable DM RAID detection

        rd_DM_UUID=<dmraid uuid>
               only activate the raid sets with the given UUID
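
A minimal sketch of how an initramfs script might interpret the switches listed above. It is an illustration only, not dracut's actual code: the function names `parse_rd_options` and `should_activate` are hypothetical, and two behaviours are assumptions rather than anything stated here: a parameter given several times accumulates, and a subsystem with no filter parameter activates everything it finds.

```python
# Subsystem name -> the filter parameter listed above for that subsystem.
FILTER_KEYS = {
    "LVM": "rd_LVM_VG",
    "LUKS": "rd_LUKS_UUID",
    "MD": "rd_MD_UUID",
    "DM": "rd_DM_UUID",
}


def parse_rd_options(cmdline):
    """Collect the rd_* switches from a kernel command line string."""
    policy = {name: {"disabled": False, "only": []} for name in FILTER_KEYS}
    for word in cmdline.split():
        key, _, value = word.partition("=")
        for name, filter_key in FILTER_KEYS.items():
            if key == "rd_NO_" + name:
                policy[name]["disabled"] = True
            elif key == filter_key and value:
                # Assumption: repeated parameters accumulate.
                policy[name]["only"].append(value)
    return policy


def should_activate(policy, subsystem, ident):
    """Decide whether a detected VG name or UUID should be activated."""
    entry = policy[subsystem]
    if entry["disabled"]:
        return False
    # Assumption: an empty filter list means "no restriction".
    return not entry["only"] or ident in entry["only"]
```

With a command line such as `root=UUID=1234 rd_NO_LVM rd_MD_UUID=4567`, this sketch would refuse every volume group, accept only the md array 4567, and leave LUKS and dmraid detection unrestricted.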



* Re: RFC: mdadm and bringing up raid sets from initrd (dracut)
  2009-07-15 18:47       ` Dan Williams
  2009-07-16  0:16         ` Jeremy Katz
@ 2009-07-16 10:56         ` Neil Brown
  2009-07-16 11:09         ` Neil Brown
  2 siblings, 0 replies; 12+ messages in thread
From: Neil Brown @ 2009-07-16 10:56 UTC (permalink / raw)
  To: Dan Williams
  Cc: David Zeuthen, Hans de Goede, initramfs,
	linux-hotplug-u79uwXL29TY76Z2rM5mHXA, Danecki, Jacek,
	Harald Hoyer, Doug Ledford

On Wednesday July 15, dan.j.williams@intel.com wrote:
> 
> mdadm-3.0 has facilities to prevent assembly of certain metadata types
> [1] or arrays with certain uuids [2].  I wonder if we also need a
> facility to prevent auto-assembly of arrays *not* listed in
> mdadm.conf?  So the mdadm.conf file installed in the initramfs would
> only identify the root array and all other randomly identified md
> devices would be ignored (rather than assembled with a foreign name).
> 
> Thoughts?

This is exactly the functionality that the 'AUTO' line provides.
Arrays that are listed in mdadm.conf will always be assembled when they
are found.
Arrays that are not listed may or may not be assembled depending on
their metadata type and the setting of 'AUTO'.

So
  AUTO -all

will disable all auto-assemble of arrays that are not listed in
mdadm.conf.
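
The decision described above can be modelled roughly as follows. This is a simplified sketch, not mdadm's implementation: the function names are hypothetical, metadata types are treated as opaque strings, the `homehost` keyword is ignored, and the "first match wins, no match means allowed" behaviour is an assumption based on my reading of mdadm.conf(5).

```python
def auto_allows(metadata, auto_words):
    """Apply the words of an AUTO line (e.g. ["+imsm", "-all"]) to one metadata type."""
    for word in auto_words:
        sign, name = word[0], word[1:]
        if sign not in "+-":
            continue  # keywords such as 'homehost' are not modelled in this sketch
        if name == "all" or name == metadata:
            # Assumption: the first matching word decides.
            return sign == "+"
    # Assumption: with no matching word, auto-assembly is allowed.
    return True


def should_assemble(array_uuid, metadata, listed_uuids, auto_words):
    """Arrays listed in mdadm.conf are always assembled; others depend on AUTO."""
    if array_uuid in listed_uuids:
        return True
    return auto_allows(metadata, auto_words)
```

With only the root array listed and `AUTO -all`, `should_assemble` returns True for the root array's UUID and False for every other array, whatever its metadata type.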

NeilBrown


* Re: RFC: mdadm and bringing up raid sets from initrd (dracut)
  2009-07-15 18:47       ` Dan Williams
  2009-07-16  0:16         ` Jeremy Katz
  2009-07-16 10:56         ` Neil Brown
@ 2009-07-16 11:09         ` Neil Brown
  2 siblings, 0 replies; 12+ messages in thread
From: Neil Brown @ 2009-07-16 11:09 UTC (permalink / raw)
  To: Dan Williams
  Cc: David Zeuthen, Hans de Goede, initramfs,
	linux-hotplug-u79uwXL29TY76Z2rM5mHXA, Danecki, Jacek,
	Harald Hoyer, Doug Ledford

On Wednesday July 15, dan.j.williams@intel.com wrote:
> [ Cc: Neil ]
> 
> On Tue, Jul 14, 2009 at 7:30 AM, David Zeuthen<david@fubar.dk> wrote:
> > On Tue, 2009-07-14 at 12:59 +0200, Hans de Goede wrote:
> >> Currently the udev rules use incremental assembly like this:
> >> mdadm -I /dev/mdraid-member
> >>
> >> There are 2 problems with this:
> >> 1) When doing this for native mdraid metadata arrays, if only
> >>     one disk is present the set never gets activated
> >> 2) When doing this for imsm metadata arrays, as soon as the
> >>     first disk is incrementally added, the set gets activated
> >>     in degraded mode and stays that way, the second disk
> >>     will get added to the container, but not to the actual
> >>     sets in the container
> >
> > FWIW, this incremental assembly business in mdadm is actually not a very
> > good idea. At least not the current implementation. I'm not sure whether
> > it's still a Fedora-ism or whether it's something that's in upstream
> > mdadm yet. I'm talking about this udev rule
> >
> >  /lib/udev/rules.d/65-md-incremental.rules:
> >  # This file causes block devices with Linux RAID (mdadm) signatures to
> >  # automatically cause mdadm to be run.
> >  # See udev(8) for syntax
> >
> >  SUBSYSTEM=="block", ACTION=="add", ENV{ID_FS_TYPE}=="linux_raid_member", \
> >        IMPORT{program}="/sbin/mdadm --examine --export $tempnode", \
> >        RUN+="/bin/bash -c '[ ! -f /dev/.in_sysinit ] && mdadm -I $env{DEVNAME}'"
> >
> > For example if the user plugs in a random old disk that happens to
> > contain half of a RAID1 mirror, then the incremental assembly bits set
> > up an inert md device and the user is left to his own devices to
> > sort this out when partitioning tools etc. tell him that the disk
> > (or a partition of it) he just plugged in is "busy" (it is claimed by
> > the inert md node).
> >
> > I actually had to add some extra code to the GNOME Disk Utility bits to
> > handle such things (stop inert md devices) - makes the user experience
> > quite a bit worse since there's now an extra state to worry about. And
> > most current users don't use the UI bits yet for this so they get extra
> > confused when trying to use e.g. parted(8) or fdisk(8) on the device.
> >
> > FWIW, I'd wish people would stop playing games like this. If you want to
> > do auto-assembly at the system-level, at the very least don't leave the
> > system in a state like this. For example, one way to do auto-assembly
> > without such bugs would be to use libudev to enumerate all md component
> > devices with the same MD_UUID. Then you count the number of components
> > and only start the array if the number of components equals MD_DEVICES.
> > That's much better than incrementally adding to an md device node that
> > might never get used.

Yes:  auto-assembly is hard, and easy to get wrong.

While I don't claim that the current scheme is at all perfect, I don't
think your suggestion is a clear improvement.
The whole point of RAID is to survive drive failure, and that includes
drives being missing.
So I don't think "completely ignore the array if not all expected
drives are present" is the correct answer.
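
To make the trade-off concrete, the counting policy from the quoted suggestion could be sketched like this. It is an illustration under assumptions: plain dicts stand in for the udev properties of each component device (a real implementation would enumerate them through libudev), `MD_UUID` and `MD_DEVICES` are the property names used in the quoted mail, and `arrays_ready_to_start` is a hypothetical name.

```python
from collections import defaultdict


def arrays_ready_to_start(components):
    """Split arrays into complete and incomplete by counting their components.

    components: iterable of dicts carrying "MD_UUID" and "MD_DEVICES".
    """
    seen = defaultdict(int)
    expected = {}
    for props in components:
        uuid = props["MD_UUID"]
        seen[uuid] += 1
        expected[uuid] = int(props["MD_DEVICES"])
    # Strict policy: an array counts as startable only when every member is present.
    complete = sorted(u for u in seen if seen[u] >= expected[u])
    incomplete = sorted(u for u in seen if seen[u] < expected[u])
    return complete, incomplete
```

Everything in the second list is an array that such a policy would never start, which includes a two-disk mirror that has lost one drive.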

It is very easy to remove unwanted raid metadata 
(mdadm --zero-superblock), and making that easily accessible from a
GUI would probably be a good and useful thing, and might solve some
problems for some people.

One thing that I have contemplated is for md to not claim exclusive
ownership of drives until the array is activated and switched to
read-write.  That would address the 'my drive was stolen by md'
problem, but it may well create other problems in its place.

My general goal at present is to make mdadm sufficiently flexible that
a distro can choose a suitable policy and implement it.  If someone comes
up with a policy that works convincingly well, I could then make that
the default approach that mdadm takes.
There is certainly still room for improvement and I am happy to
discuss possibilities.

NeilBrown



end of thread, other threads:[~2009-07-16 11:09 UTC | newest]

Thread overview: 12+ messages
2009-07-14  9:57 RFC: mdadm and bringing up raid sets from initrd (dracut) Hans de Goede
2009-07-14 13:39 ` Doug Ledford
     [not found]   ` <1955210A-EF27-479F-8C58-BA4FA9018A56-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
2009-07-14 14:01     ` Hans de Goede
2009-07-14 14:14       ` Doug Ledford
     [not found]         ` <D758972F-0E5A-4860-9011-6B2DA1FA771A-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
2009-07-14 15:00           ` David Zeuthen
2009-07-16 10:56             ` Harald Hoyer
     [not found] ` <4A5C6501.3080607-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
2009-07-14 14:30   ` David Zeuthen
     [not found]     ` <1247581847.1991.16.camel-bi+AKbBUZKY6gyzm1THtWbp2dZbC/Bob@public.gmane.org>
2009-07-15 18:47       ` Dan Williams
2009-07-16  0:16         ` Jeremy Katz
     [not found]           ` <20090716001651.GB45537-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
2009-07-16  7:11             ` Victor Lowther
2009-07-16 10:56         ` Neil Brown
2009-07-16 11:09         ` Neil Brown
