* RFC: mdadm and bringing up raid sets from initrd (dracut)
@ 2009-07-14 9:57 Hans de Goede
2009-07-14 13:39 ` Doug Ledford
[not found] ` <4A5C6501.3080607-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
0 siblings, 2 replies; 12+ messages in thread
From: Hans de Goede @ 2009-07-14 9:57 UTC (permalink / raw)
To: initramfs; +Cc: linux-hotplug, Danecki, Jacek, Harald Hoyer, Doug Ledford
Hi,
As you probably know I'm working on making Fedora 12 use mdraid
instead of dmraid for Intel BIOS-RAID setups.
The installer (anaconda) part is mostly done (needs more testing)
and now I'm looking at implementing support for this in dracut
(the new mkinitrd for Fedora 12).
So I've been testing how this works for both imsm mdraid sets
and native mdraid metadata sets, in both cases using a 2 disk
mirror, so that the set can also be brought up in degraded mode.
Currently the udev rules use incremental assembly like this:
  mdadm -I /dev/mdraid-member
There are 2 problems with this:
1) When doing this for native mdraid metadata arrays, if only
   one disk is present the set never gets activated.
2) When doing this for imsm metadata arrays, as soon as the
   first disk is incrementally added, the set gets activated
   in degraded mode and stays that way; the second disk
   will get added to the container, but not to the actual
   sets in the container.
And these 2 problems have 2 different solutions:
1) An incomplete set that could be activated in degraded mode
   can be started using mdadm --run /dev/md#
2) One can avoid this problem by using:
     mdadm -I --no-degraded /dev/mdraid-member
   instead (this does not change anything for
   native mdraid metadata format sets).
   But if that is done, the sets in the container never get
   activated; this can be fixed by running
     mdadm -I /dev/md# on the container device
So my proposed solution for this is when udev is done scanning
(when the event queue is empty, detected using the same mechanism as
dracut is using for dmraid), do the following:
For each /dev/md#
  run mdadm --export --detail, and get the MD_LEVEL
  if MD_LEVEL == "container":
    mdadm -I /dev/md#
  else
    mdadm --run /dev/md#
This will:
1) Bring up raid sets inside containers (such as imsm raidsets)
2) Bring up incomplete raid sets in degraded mode where possible
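
The per-device decision described above could be sketched in shell roughly as follows. This is only an illustration, not the patch mentioned in this mail: the function name md_finish_cmd is hypothetical, the sample input is made up, and the function only prints the command it would run instead of calling mdadm.

```shell
# Decide how to finish bringing up one md device once udev has settled.
# $1: device node, e.g. /dev/md0
# $2: KEY=value text as printed by `mdadm --export --detail <device>`
# Prints the command that would be run; nothing is executed.
md_finish_cmd() {
    dev=$1
    # Extract the value of the MD_LEVEL line from the exported text.
    level=$(printf '%s\n' "$2" | sed -n 's/^MD_LEVEL=//p')
    if [ "$level" = "container" ]; then
        # Container (e.g. imsm): incrementally assemble the sets inside it.
        echo "mdadm -I $dev"
    else
        # Native array: start it, degraded if necessary.
        echo "mdadm --run $dev"
    fi
}

# Dry run against canned input (hypothetical sample values).
md_finish_cmd /dev/md0 "MD_LEVEL=container"
md_finish_cmd /dev/md1 "MD_LEVEL=raid1
MD_DEVICES=2"
```

In a real hook the second argument would come from running mdadm against each /dev/md# node after the udev event queue has drained.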
I'll post a patch implementing this later today.
Regards,
Hans
^ permalink raw reply [flat|nested] 12+ messages in thread
* Re: RFC: mdadm and bringing up raid sets from initrd (dracut)
2009-07-14 9:57 RFC: mdadm and bringing up raid sets from initrd (dracut) Hans de Goede
@ 2009-07-14 13:39 ` Doug Ledford
[not found] ` <1955210A-EF27-479F-8C58-BA4FA9018A56-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
[not found] ` <4A5C6501.3080607-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
1 sibling, 1 reply; 12+ messages in thread
From: Doug Ledford @ 2009-07-14 13:39 UTC (permalink / raw)
To: Hans de Goede; +Cc: initramfs, linux-hotplug, Danecki, Jacek, Harald Hoyer
On Jul 14, 2009, at 6:59 AM, Hans de Goede wrote:
> Hi,
>
> As you probably know I'm working on making Fedora 12 use mdraid
> instead of dmraid for Intel BIOS-RAID setups.
>
> The installer (anaconda) part is mostly done (needs more testing)
> and now I'm looking at implementing support for this in dracut
> (the new mkinitrd for Fedora 12).
>
> So I've been testing how this works for both imsm mdraid sets
> and native mdraid metadata sets, in both cases using a 2 disk
> mirror, so that the set can also be brought up in degraded mode.
>
> Currently the udev rules use incremental assembly like this:
> mdadm -I /dev/mdraid-member
Hmmm...does dracut use udev during initramfs time? mkinitrd didn't,
so this would be a change. In particular, I didn't have these
problems with mkinitrd because I didn't use udev rules in the initrd,
I ran mdadm -A instead. In fact, the F11 method of bringing up raid
devices is as follows:
initrd: use mdadm -As --run <mddevice name with matching ARRAY entry
  in /etc/mdadm.conf>
rc.sysinit: use mdadm -As --run (no md device name, which means all
  arrays listed in mdadm.conf will get brought up, plus extra arrays not
  listed in mdadm.conf but which can be found and identified by metadata)
udev: in 65-md-incremental.rules use mdadm -I <block device> (but only
  if /dev/.in.rcsysinit does not exist, so we don't run udev incremental
  rules until after the system is up and running, which means they only
  apply to hot plugged devices. In particular, we will never run the udev
  rule on any device that was present on boot; instead the previous two
  calls will catch these devices, and those previous calls will run
  degraded arrays. This allows me to safely refuse to run degraded arrays
  in the udev rules file without risking failing to boot; instead a
  degraded hot plugged array will need minor manual intervention, but the
  system will be fully up and operational no matter what)
I find this setup to be a rather safe, conservative way of handling md
raid array hot plug. Are we going to be totally changing this with
dracut and F12? This method very nicely resolves the issues you posted.
> There are 2 problems with this:
> 1) When doing this for native mdraid metadata arrays, if only
> one disk is present the set never gets activated
> 2) When doing this for imsm metadata arrays, as soon as the
> first disk is incrementally added, the set gets activated
> in degraded mode and stays that way, the second disk
> will get added to the container, but not to the actual
> sets in the container
>
> And these 2 problems have 2 different solutions:
> 1) An incomplete, but potentially activatable in degraded mode
> set can be activated using mdadm --run /dev/md#
> 2) One can stop this problem by using:
> mdadm -I --no-degraded /dev/mdraid-member
> instead (this does not change anything for
> native mdraid metadata format sets)
> But if that is done, the sets in the container never get
> activated, this can be fixed by running
> mdadm -I /dev/md# on the container device
>
> So my proposed solution for this is when udev is done scanning
> (when the event queue is empty, detected using the same mechanism as
> dracut is using for dmraid), do the following:
>
> For each /dev/md#
> run mdadm --export --detail, and get the MD_LEVEL
> if MD_LEVEL == "container":
> mdadm -I /dev/md#
> else
> mdadm --run /dev/md#
>
> This will:
> 1) Bring up raid sets inside containers (such as imsm raidsets)
> 2) Bring up incomplete raid sets in degraded mode where possible
>
> I'll post a patch implementing this later today.
>
> Regards,
>
> Hans
--
Doug Ledford <dledford@redhat.com>
GPG KeyID: CFBFF194
http://people.redhat.com/dledford
InfiniBand Specific RPMS
http://people.redhat.com/dledford/Infiniband
* Re: RFC: mdadm and bringing up raid sets from initrd (dracut)
[not found] ` <1955210A-EF27-479F-8C58-BA4FA9018A56-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
@ 2009-07-14 14:01 ` Hans de Goede
2009-07-14 14:14 ` Doug Ledford
0 siblings, 1 reply; 12+ messages in thread
From: Hans de Goede @ 2009-07-14 14:01 UTC (permalink / raw)
To: Doug Ledford
Cc: initramfs, linux-hotplug, Danecki, Jacek,
Harald Hoyer
Hi,
On 07/14/2009 03:39 PM, Doug Ledford wrote:
> On Jul 14, 2009, at 6:59 AM, Hans de Goede wrote:
>> Hi,
>>
>> As you probably know I'm working on making Fedora 12 use mdraid
>> instead of dmraid for Intel BIOS-RAID setups.
>>
>> The installer (anaconda) part is mostly done (needs more testing)
>> and now I'm looking at implementing support for this in dracut
>> (the new mkinitrd for Fedora 12).
>>
>> So I've been testing how this works for both imsm mdraid sets
>> and native mdraid metadata sets, in both cases using a 2 disk
>> mirror, so that the set can also be brought up in degraded mode.
>>
>> Currently the udev rules use incremental assembly like this:
>> mdadm -I /dev/mdraid-member
>
> Hmmm...does dracut use udev during initramfs time?
Yes, it uses udev for everything, making discovery of / consistent
with the discovery of other storage devices.
<snip>
> Are we going to be totally changing this with
> dracut and F12? This method very nicely resolves the issues you posted.
>
Yes.
Regards,
Hans
* Re: RFC: mdadm and bringing up raid sets from initrd (dracut)
2009-07-14 14:01 ` Hans de Goede
@ 2009-07-14 14:14 ` Doug Ledford
[not found] ` <D758972F-0E5A-4860-9011-6B2DA1FA771A-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
0 siblings, 1 reply; 12+ messages in thread
From: Doug Ledford @ 2009-07-14 14:14 UTC (permalink / raw)
To: Hans de Goede; +Cc: initramfs, linux-hotplug, Danecki, Jacek, Harald Hoyer
On Jul 14, 2009, at 11:02 AM, Hans de Goede wrote:
> Hi,
> On 07/14/2009 03:39 PM, Doug Ledford wrote:
>> On Jul 14, 2009, at 6:59 AM, Hans de Goede wrote:
>>> Hi,
>>>
>>> As you probably know I'm working on making Fedora 12 use mdraid
>>> instead of dmraid for Intel BIOS-RAID setups.
>>>
>>> The installer (anaconda) part is mostly done (needs more testing)
>>> and now I'm looking at implementing support for this in dracut
>>> (the new mkinitrd for Fedora 12).
>>>
>>> So I've been testing how this works for both imsm mdraid sets
>>> and native mdraid metadata sets, in both cases using a 2 disk
>>> mirror, so that the set can also be brought up in degraded mode.
>>>
>>> Currently the udev rules use incremental assembly like this:
>>> mdadm -I /dev/mdraid-member
>>
>> Hmmm...does dracut use udev during initramfs time?
>
> Yes, it uses udev for everything, making discovery of / consistent
> with the discovery of other storage devices.
I'm not sure I like or agree with that philosophy. I absolutely
*don't* want my / filesystem or raid device treated like some plug in,
temporary, roaming raid device. They *aren't* the same, not in terms
of importance to the running of the machine and not in terms of
reliability requirements. By using mdadm -A in the mkinitrd calls, I
was able to put in an mdadm.conf file and limit what arrays get
started to arrays found non-ambiguously in that mdadm.conf file and
identified by UUID. When you switch to incremental assembly for root,
you risk the possibility of name space collisions and non-
deterministic bring up of your / array.
> <snip>
>
>> Are we going to be totally changing this with
>> dracut and F12? This method very nicely resolves the issues you
>> posted.
>>
>
> Yes.
>
> Regards,
>
> Hans
--
Doug Ledford <dledford@redhat.com>
GPG KeyID: CFBFF194
http://people.redhat.com/dledford
InfiniBand Specific RPMS
http://people.redhat.com/dledford/Infiniband
* Re: RFC: mdadm and bringing up raid sets from initrd (dracut)
[not found] ` <4A5C6501.3080607-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
@ 2009-07-14 14:30 ` David Zeuthen
[not found] ` <1247581847.1991.16.camel-bi+AKbBUZKY6gyzm1THtWbp2dZbC/Bob@public.gmane.org>
0 siblings, 1 reply; 12+ messages in thread
From: David Zeuthen @ 2009-07-14 14:30 UTC (permalink / raw)
To: Hans de Goede
Cc: initramfs, linux-hotplug, Danecki, Jacek,
Harald Hoyer, Doug Ledford
On Tue, 2009-07-14 at 12:59 +0200, Hans de Goede wrote:
> Currently the udev rules use incremental assembly like this:
> mdadm -I /dev/mdraid-member
>
> There are 2 problems with this:
> 1) When doing this for native mdraid metadata arrays, if only
> one disk is present the set never gets activated
> 2) When doing this for imsm metadata arrays, as soon as the
> first disk is incrementally added, the set gets activated
> in degraded mode and stays that way, the second disk
> will get added to the container, but not to the actual
> sets in the container
FWIW, this incremental assembly business in mdadm is actually not a very
good idea. At least not the current implementation. I'm not sure whether
it's still a Fedora-ism or whether it's something that's in upstream
mdadm yet. I'm talking about this udev rule
/lib/udev/rules.d/65-md-incremental.rules:
# This file causes block devices with Linux RAID (mdadm) signatures to
# automatically cause mdadm to be run.
# See udev(8) for syntax
SUBSYSTEM=="block", ACTION=="add", ENV{ID_FS_TYPE}=="linux_raid_member", \
IMPORT{program}="/sbin/mdadm --examine --export $tempnode", \
RUN+="/bin/bash -c '[ ! -f /dev/.in_sysinit ] && mdadm -I $env{DEVNAME}'"
For example, if the user plugs in a random old disk that happens to
contain half of a RAID1 mirror, then the incremental assembly bits set
up an inert md device, and the user is left to his own devices to
sort this out when he's told by partitioning tools etc. that the disk
(or a partition of it) he just plugged in is "busy" (it is claimed by
the inert md node).
I actually had to add some extra code to the GNOME Disk Utility bits to
handle such things (stop inert md devices) - makes the user experience
quite a bit worse since there's now an extra state to worry about. And
most current users don't use the UI bits yet for this so they get extra
confused when trying to use e.g. parted(8) or fdisk(8) on the device.
FWIW, I'd wish people would stop playing games like this. If you want to
do auto-assembly at the system-level, at the very least don't leave the
system in a state like this. For example, one way to do auto-assembly
without such bugs would be to use libudev to enumerate all md component
devices with the same MD_UUID. Then you count the number of components
and only start the array if the number of components equals MD_DEVICES.
That's much better than incrementally adding to an md device node that
might never get used.
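
As a rough illustration of the counting idea above, here is a shell sketch that uses plain text input instead of libudev. The function name complete_arrays and the "<device> <MD_UUID> <MD_DEVICES>" input format are assumptions made for this example; in practice the values would come from the udev properties imported by mdadm --examine --export.

```shell
# Read "<device> <MD_UUID> <MD_DEVICES>" triples on stdin, one member per
# line (hypothetical format). Print the UUID of every array for which the
# number of members seen equals MD_DEVICES, i.e. arrays that are complete.
complete_arrays() {
    awk '
        NF >= 3 { seen[$2]++; want[$2] = $3 }
        END {
            for (uuid in seen)
                if (seen[uuid] == want[uuid] + 0)
                    print uuid
        }
    ' | sort
}

# Two of two members present for "aaaa", one of two for "bbbb".
printf '%s\n' \
    "/dev/sda1 aaaa 2" \
    "/dev/sdb1 aaaa 2" \
    "/dev/sdc1 bbbb 2" | complete_arrays
# prints: aaaa
```

Only arrays printed by such a check would be started; a lone member like /dev/sdc1 would be left untouched rather than claimed by an inert md node.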
I've complained to Doug about this already for Fedora but, since it's
still broken and, AFAICT, on its way to upstream mdadm, it's worth
reiterating the complaint.
Thanks,
David
[1] : And, except for booting, it's not clear to me that you want to
have policy like auto-assembling RAID arrays at the system level. I'd leave
such policy to desktop bits where the user can control it and the
software can actually interact with the user. And where it's easy to
turn off features like this.
* Re: RFC: mdadm and bringing up raid sets from initrd (dracut)
[not found] ` <D758972F-0E5A-4860-9011-6B2DA1FA771A-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
@ 2009-07-14 15:00 ` David Zeuthen
2009-07-16 10:56 ` Harald Hoyer
0 siblings, 1 reply; 12+ messages in thread
From: David Zeuthen @ 2009-07-14 15:00 UTC (permalink / raw)
To: Doug Ledford
Cc: Hans de Goede, initramfs, linux-hotplug,
Danecki, Jacek, Harald Hoyer
On Tue, 2009-07-14 at 10:14 -0400, Doug Ledford wrote:
> On Jul 14, 2009, at 11:02 AM, Hans de Goede wrote:
> > Hi,
> > On 07/14/2009 03:39 PM, Doug Ledford wrote:
> >> On Jul 14, 2009, at 6:59 AM, Hans de Goede wrote:
> >>> Hi,
> >>>
> >>> As you probably know I'm working on making Fedora 12 use mdraid
> >>> instead of dmraid for Intel BIOS-RAID setups.
> >>>
> >>> The installer (anaconda) part is mostly done (needs more testing)
> >>> and now I'm looking at implementing support for this in dracut
> >>> (the new mkinitrd for Fedora 12).
> >>>
> >>> So I've been testing how this works for both imsm mdraid sets
> >>> and native mdraid metadata sets, in both cases using a 2 disk
> >>> mirror, so that the set can also be brought up in degraded mode.
> >>>
> >>> Currently the udev rules use incremental assembly like this:
> >>> mdadm -I /dev/mdraid-member
> >>
> >> Hmmm...does dracut use udev during initramfs time?
> >
> > Yes, it uses udev for everything, making discovery of / consistent
> > with the discovery of other storage devices.
>
> I'm not sure I like or agree with that philosophy. I absolutely
> *don't* want my / filesystem or raid device treated like some plug in,
> temporary, roaming raid device. They *aren't* the same, not in terms
> of importance to the running of the machine and not in terms of
> reliability requirements. By using mdadm -A in the mkinitrd calls, I
> was able to put in an mdadm.conf file and limit what arrays get
> started to arrays found non-ambiguously in that mdadm.conf file and
> identified by UUID. When you switch to incremental assembly for root,
> you risk the possibility of name space collisions and non-
> deterministic bring up of your / array.
I'm concerned about this too. To be more specific, I'm concerned about
both automatically assembling things like RAID arrays / LVM logical
volumes and also automounting devices [1].
Anyway, my point with all this is that maybe we are going about things
wrong in the initramfs. My understanding is that dracut roughly works
this way (please let me know if this is wrong)
1. when generating the initramfs image, we leave information in
the kernel command-line about the root filesystem - typically
the UUID - e.g. root=UUID=786263c4-5e28-4cdc-97b8-1ab6e221c344
2. when the initramfs starts, we trigger all uevents and wait for
things to settle
3. Autoassembly / magic:
- If we see e.g. md components, we activate them via udev rules
- If we see e.g. LUKS devices, we unlock them (by interacting with
the user asking for the passphrase) via udev rules.
- Ditto for e.g. LVM
4. if we see the rootfs (matching on e.g. the UUID passed on the
kernel command line) we create the /dev/root symlink
5. when the system has settled (e.g. no more uevents) we mount
/dev/root and transition to non-early user space. If there
is no /dev/root link, we bail out
Now, my beef is 3. above. I think it is way too optimistic to just
auto-assemble / unlock etc. everything. E.g. we end up doing a lot of
work not related to the rootfs that is better done in non-early user
space.
Instead, just like we specify the UUID for rootfs on the command-line,
we need to leave some instructions to the initramfs logic on _exactly_
what things should be autoassembled / unlocked / etc. in order to find
the rootfs. So the kernel command-line wouldn't really be "just" the
UUID of rootfs; it would be a whole recipe of actions to do. E.g.
ROOTFS=UUID=1234 \ # this is the UUID of my rootfs
MD_ASSEMBLE=UUID=4567 \ # assemble MD array with UUID 4567
LUKS_UNLOCK=UUID=89ab # unlock LUKS device with UUID 89ab
which would work for e.g. cases where rootfs is on a LUKS device which
is on a MD array. In other words, we'd need a whole "recipe" passed to
the initramfs (the mkinitrd tool would generate this recipe), not just
the UUID of the rootfs.
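
To make the recipe idea more concrete, here is a minimal shell sketch of how initramfs logic might read such a recipe from the kernel command line. The key names ROOTFS, MD_ASSEMBLE and LUKS_UNLOCK are the ones proposed in this mail; the function name recipe_steps and the action labels it prints are hypothetical.

```shell
# Print one line per recipe step found in a kernel command line string.
# $1: command line text, e.g. the contents of /proc/cmdline
# Steps are printed in the order they appear; unrelated words are skipped.
recipe_steps() {
    for word in $1; do
        case $word in
            MD_ASSEMBLE=UUID=*) echo "assemble-md ${word#MD_ASSEMBLE=UUID=}" ;;
            LUKS_UNLOCK=UUID=*) echo "unlock-luks ${word#LUKS_UNLOCK=UUID=}" ;;
            ROOTFS=UUID=*)      echo "mount-root ${word#ROOTFS=UUID=}" ;;
        esac
    done
}

recipe_steps "ro quiet ROOTFS=UUID=1234 MD_ASSEMBLE=UUID=4567 LUKS_UNLOCK=UUID=89ab"
```

This sketch only extracts the steps. A real implementation would also have to order them by dependency (here: assemble the MD array, then unlock LUKS, then mount the rootfs) and ignore every device not named in the recipe.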
Coincidentally, if we had something like this and the format of the
"recipe" was documented somewhere, it would be easy to e.g. implement
"rescue" functionality as described here
http://www.redhat.com/archives/fedora-desktop-list/2009-July/msg00019.html
since graphical disk utilities would just find /etc/grub.conf (or
similar), read the recipe and then start assembling/unlocking bits and
mount them as appropriate in /mnt/rescue/.
Actually this is very close to what Doug is asking for when he says
(paraphrased) "just include mdadm.conf instead of this magic". The key
difference, however, is that the user _won't_ have to use mdadm.conf or
care about config files - it's all taken care of by the mkinitrd binary
when building the recipe. This is a good thing as having one less config
file to worry about is good.
Thanks for considering, and sorry for the long mail,
David
[1] : As some background information, I've spent a good chunk of my
life, five years or so, dealing with end users complaining about how
plain block devices got automounted when they were plugged in. FWIW, the
complaints range from the nonsensical (irritated users: "these
desktop kids shall not decide how UNIX works") to actual bugs where the
on-disk contents were mis-detected and either something wrong got
automounted or we failed to automount at all.
If I've learned anything it's that you need to be very very careful here
- unlike Windows and other operating systems with such capabilities,
Linux is.. different.. mostly because we support so many different ways
to put a file system through things like md and dm. And you need to make
it very easy to turn things like this off.
* Re: RFC: mdadm and bringing up raid sets from initrd (dracut)
[not found] ` <1247581847.1991.16.camel-bi+AKbBUZKY6gyzm1THtWbp2dZbC/Bob@public.gmane.org>
@ 2009-07-15 18:47 ` Dan Williams
2009-07-16 0:16 ` Jeremy Katz
` (2 more replies)
0 siblings, 3 replies; 12+ messages in thread
From: Dan Williams @ 2009-07-15 18:47 UTC (permalink / raw)
To: David Zeuthen
Cc: Hans de Goede, initramfs, linux-hotplug,
Danecki, Jacek, Harald Hoyer, Doug Ledford, NeilBrown
[ Cc: Neil ]
On Tue, Jul 14, 2009 at 7:30 AM, David Zeuthen<david@fubar.dk> wrote:
> On Tue, 2009-07-14 at 12:59 +0200, Hans de Goede wrote:
>> Currently the udev rules use incremental assembly like this:
>> mdadm -I /dev/mdraid-member
>>
>> There are 2 problems with this:
>> 1) When doing this for native mdraid metadata arrays, if only
>> one disk is present the set never gets activated
>> 2) When doing this for imsm metadata arrays, as soon as the
>> first disk is incrementally added, the set gets activated
>> in degraded mode and stays that way, the second disk
>> will get added to the container, but not to the actual
>> sets in the container
>
> FWIW, this incremental assembly business in mdadm is actually not a very
> good idea. At least not the current implementation. I'm not sure whether
> it's still a Fedora-ism or whether it's something that's in upstream
> mdadm yet. I'm talking about this udev rule
>
> /lib/udev/rules.d/65-md-incremental.rules:
> # This file causes block devices with Linux RAID (mdadm) signatures to
> # automatically cause mdadm to be run.
> # See udev(8) for syntax
>
> SUBSYSTEM=="block", ACTION=="add", ENV{ID_FS_TYPE}=="linux_raid_member", \
> IMPORT{program}="/sbin/mdadm --examine --export $tempnode", \
> RUN+="/bin/bash -c '[ ! -f /dev/.in_sysinit ] && mdadm -I $env{DEVNAME}'"
>
> For example, if the user plugs in a random old disk that happens to
> contain half of a RAID1 mirror, then the incremental assembly bits set
> up an inert md device, and the user is left to his own devices to
> sort this out when he's told by partitioning tools etc. that the disk
> (or a partition of it) he just plugged in is "busy" (it is claimed by
> the inert md node).
>
> I actually had to add some extra code to the GNOME Disk Utility bits to
> handle such things (stop inert md devices) - makes the user experience
> quite a bit worse since there's now an extra state to worry about. And
> most current users don't use the UI bits yet for this so they get extra
> confused when trying to use e.g. parted(8) or fdisk(8) on the device.
>
> FWIW, I'd wish people would stop playing games like this. If you want to
> do auto-assembly at the system-level, at the very least don't leave the
> system in a state like this. For example, one way to do auto-assembly
> without such bugs would be to use libudev to enumerate all md component
> devices with the same MD_UUID. Then you count the number of components
> and only start the array if the number of components equals MD_DEVICES.
> That's much better than incrementally adding to an md device node that
> might never get used.
>
> I've complained to Doug about this already for Fedora but, since it's
> still broken and, AFAICT, on its way to upstream mdadm, it's worth
> reiterating the complaint.
>
> Thanks,
> David
>
> [1] : And, except for booting, it's not clear to me that you want to
> have policy like auto-assembling RAID arrays at the system level. I'd leave
> such policy to desktop bits where the user can control it and the
> software can actually interact with the user. And where it's easy to
> turn off features like this.
>
mdadm-3.0 has facilities to prevent assembly of certain metadata types
[1] or arrays with certain uuids [2]. I wonder if we also need a
facility to prevent auto-assembly of arrays *not* listed in
mdadm.conf? So the mdadm.conf file installed in the initramfs would
only identify the root array and all other randomly identified md
devices would be ignored (rather than assembled with a foreign name).
Thoughts?
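
The proposed policy (ignore any array not listed in mdadm.conf) can be illustrated with a small shell check. This is only a sketch of the idea, not an existing mdadm facility: the function name array_is_listed is hypothetical, the sample configuration and UUIDs are made up, and the match is a plain text search for ARRAY lines carrying a UUID= word.

```shell
# Succeed if the array UUID in $1 appears on an ARRAY line of the
# mdadm.conf-style text in $2; fail otherwise.
array_is_listed() {
    printf '%s\n' "$2" |
        grep -Eq "^ARRAY[[:space:]].*UUID=$1([[:space:]]|\$)"
}

# Hypothetical initramfs mdadm.conf that names only the root array.
conf="DEVICE partitions
ARRAY /dev/md0 UUID=3aaa0122:29827cfa:5331ad66:ca767371"

for uuid in 3aaa0122:29827cfa:5331ad66:ca767371 \
            deadbeef:00000000:00000000:00000001; do
    if array_is_listed "$uuid" "$conf"; then
        echo "$uuid: assemble"
    else
        echo "$uuid: ignore"
    fi
done
```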
Thanks,
Dan
[1]: http://neil.brown.name/git?p=mdadm;a=commitdiff;h1015d57
[2]: http://neil.brown.name/git?p=mdadm;a=commitdiff;h=112cace6
* Re: RFC: mdadm and bringing up raid sets from initrd (dracut)
2009-07-15 18:47 ` Dan Williams
@ 2009-07-16 0:16 ` Jeremy Katz
[not found] ` <20090716001651.GB45537-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
2009-07-16 10:56 ` Neil Brown
2009-07-16 11:09 ` Neil Brown
2 siblings, 1 reply; 12+ messages in thread
From: Jeremy Katz @ 2009-07-16 0:16 UTC (permalink / raw)
To: Dan Williams
Cc: David Zeuthen, Hans de Goede, initramfs, linux-hotplug,
Danecki, Jacek, Harald Hoyer, Doug Ledford, NeilBrown
On Wednesday, July 15 2009, Dan Williams said:
> mdadm-3.0 has facilities to prevent assembly of certain metadata types
> [1] or arrays with certain uuids [2]. I wonder if we also need a
> facility to prevent auto-assembly of arrays *not* listed in
> mdadm.conf? So the mdadm.conf file installed in the initramfs would
> only identify the root array and all other randomly identified md
> devices would be ignored (rather than assembled with a foreign name).
>
> Thoughts?
There is no mdadm.conf in the initramfs -- in fact, the initramfs may
not even be generated on the system that you're booting and instead be
"generic" for the kernel in question
Jeremy
* Re: RFC: mdadm and bringing up raid sets from initrd (dracut)
[not found] ` <20090716001651.GB45537-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
@ 2009-07-16 7:11 ` Victor Lowther
0 siblings, 0 replies; 12+ messages in thread
From: Victor Lowther @ 2009-07-16 7:11 UTC (permalink / raw)
To: Jeremy Katz
Cc: Dan Williams, David Zeuthen, Hans de Goede, initramfs,
linux-hotplug, Danecki, Jacek,
Harald Hoyer, Doug Ledford, NeilBrown
On Wed, Jul 15, 2009 at 7:16 PM, Jeremy Katz<katzj@redhat.com> wrote:
> On Wednesday, July 15 2009, Dan Williams said:
>> mdadm-3.0 has facilities to prevent assembly of certain metadata types
>> [1] or arrays with certain uuids [2]. I wonder if we also need a
>> facility to prevent auto-assembly of arrays *not* listed in
>> mdadm.conf? So the mdadm.conf file installed in the initramfs would
>> only identify the root array and all other randomly identified md
>> devices would be ignored (rather than assembled with a foreign name).
>>
>> Thoughts?
>
> There is no mdadm.conf in the initramfs -- in fact, the initramfs may
> not even be generated on the system that you're booting and instead be
> "generic" for the kernel in question
Still, it sounds like a good feature to be added for a --hostonly initramfs.
(especially on systems that are attaching to iscsi and/or fibre channel luns)
> Jeremy
* Re: RFC: mdadm and bringing up raid sets from initrd (dracut)
2009-07-14 15:00 ` David Zeuthen
@ 2009-07-16 10:56 ` Harald Hoyer
0 siblings, 0 replies; 12+ messages in thread
From: Harald Hoyer @ 2009-07-16 10:56 UTC (permalink / raw)
To: David Zeuthen
Cc: Doug Ledford, Hans de Goede, initramfs, linux-hotplug,
Danecki, Jacek
On 07/14/2009 05:00 PM, David Zeuthen wrote:
> On Tue, 2009-07-14 at 10:14 -0400, Doug Ledford wrote:
>> On Jul 14, 2009, at 11:02 AM, Hans de Goede wrote:
>>> Hi,
>>> On 07/14/2009 03:39 PM, Doug Ledford wrote:
>>>> On Jul 14, 2009, at 6:59 AM, Hans de Goede wrote:
>>>>> Hi,
>>>>>
>>>>> As you probably know I'm working on making Fedora 12 use mdraid
>>>>> instead of dmraid for Intel BIOS-RAID setups.
>>>>>
>>>>> The installer (anaconda) part is mostly done (needs more testing)
>>>>> and now I'm looking at implementing support for this in dracut
>>>>> (the new mkinitrd for Fedora 12).
>>>>>
>>>>> So I've been testing how this works for both imsm mdraid sets
>>>>> and native mdraid metadata sets, in both cases using a 2 disk
>>>>> mirror, so that the set can also be brought up in degraded mode.
>>>>>
>>>>> Currently the udev rules use incremental assembly like this:
>>>>> mdadm -I /dev/mdraid-member
>>>> Hmmm...does dracut use udev during initramfs time?
>>> Yes, it uses udev for everything, making discovery of / consistent
>>> with the discovery of other storage devices.
>> I'm not sure I like or agree with that philosophy. I absolutely
>> *don't* want my / filesystem or raid device treated like some plug in,
>> temporary, roaming raid device. They *aren't* the same, not in terms
>> of importance to the running of the machine and not in terms of
>> reliability requirements. By using mdadm -A in the mkinitrd calls, I
>> was able to put in an mdadm.conf file and limit what arrays get
>> started to arrays found non-ambiguously in that mdadm.conf file and
>> identified by UUID. When you switch to incremental assembly for root,
>> you risk the possibility of name space collisions and non-
>> deterministic bring up of your / array.
>
> I'm concerned about this too. To be more specific, I'm concerned about
> both automatically assembling things like RAID arrays / LVM logical
> volumes and also automounting devices [1].
>
> Anyway, my point with all this is that maybe we are going about things
> wrong in the initramfs. My understanding is that dracut roughly works
> this way (please let me know if this is wrong)
>
> 1. when generating the initramfs image, we leave information in
> the kernel command-line about the root filesystem - typically
> the UUID - e.g. root=UUID=786263c4-5e28-4cdc-97b8-1ab6e221c344
>
> 2. when the initramfs starts, we trigger all uevents and wait for
> things to settle
>
> 3. Autoassembly / magic:
>
> - If we see e.g. md components, we activate them via udev rules
> - If we see e.g. LUKS devices, we unlock them (by interacting with
> the user asking for the passphrase) via udev rules.
> - Ditto for e.g. LVM
>
> 4. if we see the rootfs (matching on e.g. the UUID passed on the
> kernel command line) we create the /dev/root symlink
>
> 5. when the system has settled (e.g. no more uevents) we mount
> /dev/root and transition to non-early user space. If there
> is no /dev/root link, we bail out
>
> Now, my beef is 3. above. I think it is way too optimistic to just
> auto-assemble / unlock etc. everything. E.g. we end up doing a lot of
> work not related to the rootfs that is better done in non-early user
> space.
>
> Instead, just like we specify the UUID for rootfs on the command-line,
> we need to leave some instructions to the initramfs logic on _exactly_
> what things should be autoassembled / unlocked / etc. in order to find
> the rootfs. So the kernel command-line wouldn't really be "just" the
> UUID of rootfs; it would be a whole recipe of actions to do. E.g.
>
> ROOTFS=UUID=1234 \ # this is the UUID of my rootfs
> MD_ASSEMBLE=UUID=4567 \ # assemble MD array with UUID 4567
> LUKS_UNLOCK=UUID=89ab # unlock LUKS device with UUID 89ab
>
> which would work for e.g. cases where rootfs is on a LUKS device which
> is on a MD array. In other words, we'd need a whole "recipe" passed to
> the initramfs (the mkinitrd tool would generate this recipe), not just
> the UUID of the rootfs.
>
> Coincidentally, if we had something like this and the format of the
> "recipe" was documented somewhere, it would be easy to e.g. implement
> "rescue" functionality as described here
>
> http://www.redhat.com/archives/fedora-desktop-list/2009-July/msg00019.html
>
> since graphical disk utilities would just find /etc/grub.conf (or
> similar), read the recipe and then start assembling/unlocking bits and
> mount them as appropriate in /mnt/rescue/.
>
> Actually this is very close to what Doug is asking for when he says
> (paraphrased) "just include mdadm.conf instead of this magic". The key
> difference, however, is that the user _won't_ have to use mdadm.conf or
> care about config files - it's all taken care of by the mkinitrd binary
> when building the recipe. This is a good thing, as it leaves one less
> config file to worry about.
>
> Thanks for considering, and sorry for the long mail,
> David
>
> [1] : As some background information, I've spent a good chunk of my
> life, five years or so, dealing with end users complaining about how
> plain block devices got automounted when they were plugged in. FWIW, the
> complaints range from the nonsensical (irritated users: "these
> desktop kids shall not decide how UNIX works") to actual bugs where the
> on-disk contents were mis-detected and either something wrong got
> automounted or we failed to automount at all.
>
> If I've learned anything it's that you need to be very very careful here
> - unlike Windows and other operating systems with such capabilities,
> Linux is.. different.. mostly because we support so many different ways
> to put a file system through things like md and dm. And you need to make
> it very easy to turn things like this off.
>
>
>
David, thanks for your suggestion. As of yesterday, dracut now recognizes the
following command line parameters:
LVM
    rd_NO_LVM
        disable LVM detection
    rd_LVM_VG=<volume group name>
        only activate the volume groups with the given name
crypto LUKS
    rd_NO_LUKS
        disable crypto LUKS detection
    rd_LUKS_UUID=<luks uuid>
        only activate the LUKS partitions with the given UUID
MD
    rd_NO_MD
        disable MD RAID detection
    rd_MD_UUID=<md uuid>
        only activate the raid sets with the given UUID
DMRAID
    rd_NO_DM
        disable DM RAID detection
    rd_DM_UUID=<dmraid uuid>
        only activate the raid sets with the given UUID
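
As a rough illustration of how the MD parameters above could be evaluated, here is a minimal POSIX shell sketch. It is not dracut's actual code: the function name md_allowed is hypothetical, and the sketch assumes the command line is passed in as a string and that the array UUID is already known (for example from MD_UUID).

```shell
#!/bin/sh
# Hypothetical helper (not taken from dracut): decide whether an MD array
# with the given UUID may be assembled, based on a kernel command line.
# Usage: md_allowed "<cmdline>" "<array uuid>"; returns 0 if allowed.
md_allowed() {
    cmdline=$1
    uuid=$2
    disabled=0      # set when rd_NO_MD is present
    have_filter=0   # set when at least one rd_MD_UUID= is present
    matched=0       # set when some rd_MD_UUID= equals our UUID

    # Unquoted on purpose: split the command line on whitespace.
    for arg in $cmdline; do
        case $arg in
            rd_NO_MD)
                disabled=1
                ;;
            rd_MD_UUID=*)
                have_filter=1
                if [ "${arg#rd_MD_UUID=}" = "$uuid" ]; then
                    matched=1
                fi
                ;;
        esac
    done

    # rd_NO_MD wins over everything else.
    if [ "$disabled" -eq 1 ]; then
        return 1
    fi
    # Without any rd_MD_UUID filter, every array is allowed.
    if [ "$have_filter" -eq 0 ] || [ "$matched" -eq 1 ]; then
        return 0
    fi
    return 1
}
```

In an initramfs this might be called as md_allowed "$(cat /proc/cmdline)" "$MD_UUID". Letting rd_NO_MD override a matching rd_MD_UUID is a design choice of this sketch, not something stated above.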
* Re: RFC: mdadm and bringing up raid sets from initrd (dracut)
2009-07-15 18:47 ` Dan Williams
2009-07-16 0:16 ` Jeremy Katz
@ 2009-07-16 10:56 ` Neil Brown
2009-07-16 11:09 ` Neil Brown
2 siblings, 0 replies; 12+ messages in thread
From: Neil Brown @ 2009-07-16 10:56 UTC (permalink / raw)
To: Dan Williams
Cc: David Zeuthen, Hans de Goede, initramfs,
linux-hotplug, Danecki, Jacek,
Harald Hoyer, Doug Ledford
On Wednesday July 15, dan.j.williams@intel.com wrote:
>
> mdadm-3.0 has facilities to prevent assembly of certain metadata types
> [1] or arrays with certain uuids [2]. I wonder if we also need a
> facility to prevent auto-assembly of arrays *not* listed in
> mdadm.conf? So the mdadm.conf file installed in the initramfs would
> only identify the root array and all other randomly identified md
> devices would be ignored (rather than assembled with a foreign name).
>
> Thoughts?
This is exactly the functionality that the 'AUTO' line provides.
Arrays that are listed in mdadm.conf will always be assembled when they
are found.
Arrays that are not listed may or may not be assembled depending on
their metadata type and the setting of 'AUTO'.
So
AUTO -all
will disable all auto-assemble of arrays that are not listed in
mdadm.conf.
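
To make this concrete, here is a small shell sketch that prints the kind of mdadm.conf an initramfs generator might install: the root array is listed, and everything else is left unassembled. The function name and the example device and UUID are hypothetical, and it assumes an mdadm version that supports the AUTO line.

```shell
#!/bin/sh
# Hypothetical generator (an illustration, not an existing tool): emit a
# minimal initramfs mdadm.conf that lists only the root array and disables
# auto-assembly of every array that is not listed.
gen_initramfs_mdadm_conf() {
    dev=$1    # array device name, e.g. /dev/md0 (assumed)
    uuid=$2   # array UUID, e.g. MD_UUID from "mdadm --detail --export"

    # Listed arrays are always assembled when they are found.
    printf 'ARRAY %s UUID=%s\n' "$dev" "$uuid"
    # Unlisted arrays are never auto-assembled, whatever their metadata type.
    printf 'AUTO -all\n'
}
```

For example, gen_initramfs_mdadm_conf /dev/md0 "$MD_UUID" > mdadm.conf would write the two lines to a file that the initramfs build could then include.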
NeilBrown
* Re: RFC: mdadm and bringing up raid sets from initrd (dracut)
2009-07-15 18:47 ` Dan Williams
2009-07-16 0:16 ` Jeremy Katz
2009-07-16 10:56 ` Neil Brown
@ 2009-07-16 11:09 ` Neil Brown
2 siblings, 0 replies; 12+ messages in thread
From: Neil Brown @ 2009-07-16 11:09 UTC (permalink / raw)
To: Dan Williams
Cc: David Zeuthen, Hans de Goede, initramfs,
linux-hotplug, Danecki, Jacek,
Harald Hoyer, Doug Ledford
On Wednesday July 15, dan.j.williams@intel.com wrote:
> [ Cc: Neil ]
>
> On Tue, Jul 14, 2009 at 7:30 AM, David Zeuthen<david@fubar.dk> wrote:
> > On Tue, 2009-07-14 at 12:59 +0200, Hans de Goede wrote:
> >> Currently the udev rules use incremental assembly like this:
> >> mdadm -I /dev/mdraid-member
> >>
> >> There are 2 problems with this:
> >> 1) When doing this for native mdraid metadata arrays, if only
> >> one disk is present the set never gets activated
> >> 2) When doing this for imsm metadata arrays, as soon as the
> >> first disk is incrementally added, the set gets activated
> >> in degraded mode and stays that way, the second disk
> >> will get added to the container, but not to the actual
> >> sets in the container
> >
> > FWIW, this incremental assembly business in mdadm is actually not a very
> > good idea. At least not the current implementation. I'm not sure whether
> > it's still a Fedora-ism or whether it's something that's in upstream
> > mdadm yet. I'm talking about this udev rule
> >
> > /lib/udev/rules.d/65-md-incremental.rules:
> > # This file causes block devices with Linux RAID (mdadm) signatures to
> > # automatically cause mdadm to be run.
> > # See udev(8) for syntax
> >
> > SUBSYSTEM=="block", ACTION=="add", ENV{ID_FS_TYPE}=="linux_raid_member", \
> > IMPORT{program}="/sbin/mdadm --examine --export $tempnode", \
> > RUN+="/bin/bash -c '[ ! -f /dev/.in_sysinit ] && mdadm -I $env{DEVNAME}'"
> >
> > For example if the user plugs in a random old disk that happens to
> > contain half of a RAID1 mirror, then the incremental assembly bits set
> > up an inert md-device and the user is now left to his own devices to
> > sort this out when he's told by partitioning tools etc. that the disk
> > (or partition of) he just plugged in, is "busy" (it is claimed by the
> > inert md node).
> >
> > I actually had to add some extra code to the GNOME Disk Utility bits to
> > handle such things (stop inert md devices) - makes the user experience
> > quite a bit worse since there's now an extra state to worry about. And
> > most current users don't use the UI bits yet for this so they get extra
> > confused when trying to use e.g. parted(8) or fdisk(8) on the device.
> >
> > FWIW, I wish people would stop playing games like this. If you want to
> > do auto-assembly at the system-level, at the very least don't leave the
> > system in a state like this. For example, one way to do auto-assembly
> > without such bugs would be to use libudev to enumerate all md component
> > devices with the same MD_UUID. Then you count the number of components
> > and only start the array if the number of components equals MD_DEVICES.
> > That's much better than incrementally adding to an md device node that
> > might never get used.
Yes: auto-assembly is hard, and easy to get wrong.
While I don't claim that the current scheme is at all perfect, I don't
think your suggestion is a clear improvement.
The whole point of RAID is to survive drive failure, and that includes
drives being missing.
So I don't think "completely ignore the array if not all expected
drives are present" is the correct answer.
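
The two positions can be put side by side in a small shell sketch. It is only an illustration under stated assumptions: the function name md_start_policy is hypothetical, the "settled" flag stands for "the udev event queue is empty" as in Hans's original proposal, and the sketch ignores whether the RAID level can actually run with the members that are present.

```shell
#!/bin/sh
# Hypothetical policy helper: decide what to do with an array given
#   $1 = number of member devices present
#   $2 = number of member devices expected (e.g. MD_DEVICES)
#   $3 = 1 if the udev event queue has settled, 0 otherwise
# Prints one of: start, start-degraded, wait.
md_start_policy() {
    present=$1
    expected=$2
    settled=$3

    if [ "$present" -ge "$expected" ]; then
        # All members were found: start the array normally.
        echo start
    elif [ "$settled" -eq 1 ] && [ "$present" -gt 0 ]; then
        # No more devices are expected to appear: run degraded,
        # e.g. with "mdadm --run /dev/md#".
        echo start-degraded
    else
        # Members may still show up: keep waiting.
        echo wait
    fi
}
```

Counting components alone corresponds to only ever answering "start" or "wait"; the extra "start-degraded" branch is what keeps an array with a missing drive usable.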
It is very easy to remove unwanted raid metadata
(mdadm --zero-superblock), and making that easily accessible from a
GUI would probably be a good and useful thing, and might solve some
problems for some people.
One thing that I have contemplated is for md to not claim exclusive
ownership of drives until the array is activated and switched to
read-write. That would address the 'my drive was stolen by md'
problem, but it may well create other problems in its place.
My general goal at present is to make mdadm sufficiently flexible that
a distro can choose a suitable policy and implement it. If someone comes
up with a policy that works convincingly well, I could then make that
the default approach that mdadm takes.
There is certainly still room for improvement and I am happy to
discuss possibilities.
NeilBrown
Thread overview: 12+ messages
2009-07-14 9:57 RFC: mdadm and bringing up raid sets from initrd (dracut) Hans de Goede
2009-07-14 13:39 ` Doug Ledford
[not found] ` <1955210A-EF27-479F-8C58-BA4FA9018A56-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
2009-07-14 14:01 ` Hans de Goede
2009-07-14 14:14 ` Doug Ledford
[not found] ` <D758972F-0E5A-4860-9011-6B2DA1FA771A-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
2009-07-14 15:00 ` David Zeuthen
2009-07-16 10:56 ` Harald Hoyer
[not found] ` <4A5C6501.3080607-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
2009-07-14 14:30 ` David Zeuthen
[not found] ` <1247581847.1991.16.camel-bi+AKbBUZKY6gyzm1THtWbp2dZbC/Bob@public.gmane.org>
2009-07-15 18:47 ` Dan Williams
2009-07-16 0:16 ` Jeremy Katz
[not found] ` <20090716001651.GB45537-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
2009-07-16 7:11 ` Victor Lowther
2009-07-16 10:56 ` Neil Brown
2009-07-16 11:09 ` Neil Brown