* Re: let md auto-detect 128+ raid members, fix potential race condition
  From: Neil Brown @ 2006-07-30 23:20 UTC
  To: Alexandre Oliva; Cc: Andrew Morton, linux-kernel, linux-raid

[linux-raid added to cc.
 Background: a patch was submitted to remove the current hard limit of
 127 partitions that can be auto-detected - a limit set by the
 detected_devices array in md.c.]

My first inclination is not to fix this problem.

I consider md auto-detect to be a legacy feature. I don't use it and I
recommend that other people don't use it. However I cannot justify
removing it, so it stays there. Having this limitation could be seen
as a good motivation for some more users to stop using it.

Why not use auto-detect? I have three issues with it.

1/ It just isn't "right". We don't mount filesystems from partitions
   just because they have type 'Linux'. We don't enable swap on
   partitions just because they have type 'Linux swap'. So why do we
   assemble md/raid arrays from partitions that have type 'Linux raid
   autodetect'?

2/ It can cause problems when moving devices. If you have two
   machines, both with an 'md0' array, and you move the drives from
   one to the other - say because the first lost a power supply - and
   then reboot the machine that received the drives, which array gets
   assembled as 'md0'? You might be lucky, you might not. This isn't
   purely theoretical - there have been pleas for help on linux-raid
   resulting from exactly this, though they have been few.

3/ The information redundancy can cause a problem when it gets out of
   sync, i.e. you add a partition to a raid array without setting the
   partition type to 'fd'. This works, but on the next reboot the
   partition doesn't get added back into the array and you have to add
   it back manually. This too is not purely theory - it has been
   reported slightly more often than '2'.

So my preferred solution to the problem is to tell people not to use
autodetect. Quite possibly this should be documented in the code, and
maybe we should even print a KERN_INFO message if more than 64 devices
are autodetected.

Now one doesn't always go with one's first inclination, so I should
discuss the approaches to fixing the problem, should a fix really be
the right thing to do.

The idea of having a generic notifier is, I think, missing the point -
so I'd better explain what the point is. The kernel already has the
hotplug mechanism for alerting user-space about new partitions, and
I'm sure kernel clients could hook into it somehow. But even if they
could, md wouldn't want to: md doesn't really want to know about new
partitions. It simply wants a list of all partitions which are of type
'autodetect'. Getting a list of all partitions is quite easy
(/proc/partitions does it, so it cannot be hard). But 'struct
hd_struct' doesn't record the partition type at all. It has spare
bits, so it could (->policy is currently a 'ro' flag; a little work
would allow multiple flags there). The point of the current
detected_devices array is precisely to record which partitions have
that type.

If we were to 'fix' this problem, I think the cleanest approach (which
I haven't actually coded, so it might not work...) would be to define
a new flag in hd_struct->policy to say whether the partition type
suggested auto-detect, and get partitions/check.c to set this.
Then have md iterate over all partitions looking for this flag.

That could be considered intrusive to bits of the kernel which don't
need to be intruded into, and it runs the risk of someone wanting to
expose the flag to user-space (within /proc/partitions or /sys), which
I would be against. But otherwise it would be a fairly clean approach.

The minimal (non-empty) approach of replacing the array with a linked
list, as the original patch did, would also be quite reasonable. Maybe
not as clean as a flag in hd_struct, but maybe we don't need this code
to be particularly clean(?).

So: do you *really* need to *fix* this, or can you just use 'mdadm' to
assemble your arrays instead?

NeilBrown
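For illustration, a minimal sketch of the flag-in-hd_struct approach
Neil outlines - not code from the thread; the flag names, helper
functions and call sites below are all assumptions:

    /* Sketch only: hd_struct->policy today holds just the read-only
     * bit; HD_POLICY_RAID_AUTODETECT and both helpers are made up. */
    #define HD_POLICY_RO              0x1   /* existing 'ro' meaning */
    #define HD_POLICY_RAID_AUTODETECT 0x2   /* hypothetical new flag */

    extern void md_import_autodetected(dev_t dev); /* hypothetical hook */

    /* fs/partitions/check.c (sketch): remember partition-table type
     * 0xfd ('Linux raid autodetect') as each partition is registered. */
    static void note_raid_partition(struct gendisk *disk, int part,
                                    int sys_ind)
    {
            if (sys_ind == 0xfd)
                    disk->part[part - 1]->policy |= HD_POLICY_RAID_AUTODETECT;
    }

    /* drivers/md/md.c (sketch): at autorun time, walk a disk's
     * partitions and hand the flagged ones to md, with no fixed-size
     * detected_devices array in between. */
    static void autorun_flagged(struct gendisk *disk)
    {
            int part;

            for (part = 1; part < disk->minors; part++) {
                    struct hd_struct *p = disk->part[part - 1];

                    if (p && (p->policy & HD_POLICY_RAID_AUTODETECT))
                            md_import_autodetected(MKDEV(disk->major,
                                            disk->first_minor + part));
            }
    }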
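The minimal linked-list alternative is similarly small. Roughly - the
'before' half follows md.c of this era; the node type in the 'after'
half is a guess at the shape of such a patch, not the patch as
submitted:

    /* drivers/md/md.c today: a fixed-size array caps autodetection
     * at 127 partitions. */
    static dev_t detected_devices[128];
    static int dev_cnt;

    void md_autodetect_dev(dev_t dev)
    {
            if (dev_cnt >= 0 && dev_cnt < 127)
                    detected_devices[dev_cnt++] = dev;
    }

    /* Unbounded alternative: queue each detected partition on a list. */
    struct detected_devices_node {
            struct list_head list;
            dev_t dev;
    };
    static LIST_HEAD(all_detected_devices);

    void md_autodetect_dev(dev_t dev)
    {
            struct detected_devices_node *node;

            node = kzalloc(sizeof(*node), GFP_KERNEL);
            if (node) {
                    node->dev = dev;
                    list_add_tail(&node->list, &all_detected_devices);
            }
    }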
* Re: let md auto-detect 128+ raid members, fix potential race condition
  From: Helge Hafting @ 2006-07-31 16:34 UTC
  To: Neil Brown; Cc: Alexandre Oliva, Andrew Morton, linux-kernel, linux-raid

On Mon, Jul 31, 2006 at 09:20:58AM +1000, Neil Brown wrote:
>
> My first inclination is not to fix this problem.
>
> I consider md auto-detect to be a legacy feature.
> I don't use it and I recommend that other people don't use it.
> However I cannot justify removing it, so it stays there.
> Having this limitation could be seen as a good motivation for some
> more users to stop using it.
>
> Why not use auto-detect?
[Arguments deleted]

Well, if autodetection is removed, what is then the preferred way of
booting off a raid-1 device? Kernel parameters? An initrd with mdadm
just for this?

Some people want to do even partition detection from initrd, in order
to have a smaller kernel. We aren't there yet, though.

Autodetect is nice from an administrator's viewpoint - compile it in
and it "just works". The trouble when you connect an array from some
other machine is to be expected, but that isn't exactly everyday
stuff.

Helge Hafting
* Re: let md auto-detect 128+ raid members, fix potential race condition
  From: Alexandre Oliva @ 2006-07-31 20:27 UTC
  To: Neil Brown; Cc: Andrew Morton, linux-kernel, linux-raid

On Jul 30, 2006, Neil Brown <neilb@suse.de> wrote:

> 1/ It just isn't "right". We don't mount filesystems from partitions
> just because they have type 'Linux'. We don't enable swap on
> partitions just because they have type 'Linux swap'. So why do we
> assemble md/raid from partitions that have type 'Linux raid
> autodetect'?

For a similar reason to why vgscan finds and attempts to use any
partitions that have the appropriate type/signature (the difference
being that raid auto-detect looks at the actual partition type,
whereas vgscan looks at the actual data, just like mdadm, IIRC): when
you have to bootstrap from an initrd, you don't want to be forced to
have the correct data in the initrd image, since then any
reconfiguration requires the info to be introduced into the initrd
image before the machine goes down. Sometimes, especially in case of
disk failures, you just can't do that.

> 2/ It can cause problems when moving devices.

It can, indeed, and it has caused such problems for me before, but
they're the exception, not the rule, and one should optimize for the
rule, not the exception.

> 3/ The information redundancy can cause a problem when it gets out
> of sync, i.e. you add a partition to a raid array without setting
> the partition type to 'fd'. This works, but on the next reboot the
> partition doesn't get added back into the array and you have to add
> it back manually.
> This too is not purely theory - it has been reported slightly more
> often than '2'.

This has happened to me as well, and I remember it was extremely
confusing when it first happened :-) But that's an argument to change
the behavior so as to look for the superblock instead of trusting the
partition type, not an argument to remove the auto-detection feature.

And then, the reliance on partition type has been useful at times as
well, when I explicitly did *not* want a certain raid device or raid
member to be brought up at boot.

> So my preferred solution to the problem is to tell people not to use
> autodetect. Quite possibly this should be documented in the code,
> and maybe even have a KERN_INFO message if more than 64 devices are
> autodetected.

I wouldn't have a problem with that, since then distros would probably
switch to a more recommended mechanism that works just as well, i.e.,
ideally without requiring initrd regeneration after reconfigurations
such as adding one more raid device to the logical volume group
containing the root filesystem.

> If we were to 'fix' this problem, I think the cleanest approach
> (which I haven't actually coded, so it might not work...) would be
> to define a new flag to go in hd_struct->policy to say if the
> partition type suggested auto-detect, and get partitions/check.c to
> set this. Then have md iterate all partitions looking for this flag.

AFAICT we'd still need a list or an array, since we add stuff back to
the list in various situations.

> So: Do you *really* need to *fix* this, or can you just use 'mdadm'
> to assemble your arrays instead?

I'm not sure.
I'd expect not to need it, but the limited feature currently in place
- which initrd uses to bring up the raid1 devices containing the
physical volumes that form the volume group holding my root logical
volume - also brings up various raid6 physical volumes that form an
unrelated volume group, and it does so in such a way that the last of
them, containing the 128th fd-type partition in the box, ends up being
left out. So the raid device it's a member of is brought up either
degraded or missing the spare member, neither of which is good.

I don't know that I can easily get initrd to replace nash's
raidautorun with mdadm unless mdadm has a mode to bring up any arrays
it can find, as opposed to bringing up a specific array out of a given
list of members or scanning for members. Either way, this won't fix
problem 2) that you mentioned, but requiring initrd regeneration after
extending the volume group containing the root device is another
problem that the current modes of operation of mdadm AFAIK won't
contemplate, so switching to it will trade one problem for another,
and the latter is IMHO more common than the former.

--
Alexandre Oliva         http://www.lsd.ic.unicamp.br/~oliva/
Secretary for FSF Latin America        http://www.fsfla.org/
Red Hat Compiler Engineer   aoliva@{redhat.com, gcc.gnu.org}
Free Software Evangelist  oliva@{lsd.ic.unicamp.br, gnu.org}
* Re: let md auto-detect 128+ raid members, fix potential race condition
  From: David Greaves @ 2006-07-31 21:48 UTC
  To: Alexandre Oliva; Cc: Neil Brown, Andrew Morton, linux-kernel, linux-raid

Alexandre Oliva wrote:
> On Jul 30, 2006, Neil Brown <neilb@suse.de> wrote:
>
>> 1/ It just isn't "right". We don't mount filesystems from
>> partitions just because they have type 'Linux'. We don't enable
>> swap on partitions just because they have type 'Linux swap'. So why
>> do we assemble md/raid from partitions that have type 'Linux raid
>> autodetect'?
>
> For a similar reason to why vgscan finds and attempts to use any
> partitions that have the appropriate type/signature [...]: when you
> have to bootstrap from an initrd, you don't want to be forced to
> have the correct data in the initrd image, since then any
> reconfiguration requires the info to be introduced into the initrd
> image before the machine goes down. Sometimes, especially in case of
> disk failures, you just can't do that.

This debate is not about generic autodetection - a good thing (tm) -
but about in-kernel vs userspace autodetection. Your example supports
Neil's case - the proposal is to use initrd to run mdadm, which then
(kinda) does what vgscan does.

>> So my preferred solution to the problem is to tell people not to use
(in kernel)
>> autodetect. Quite possibly this should be documented in the code,
>> and maybe even have a KERN_INFO message if more than 64 devices are
>> autodetected.
>
> I wouldn't have a problem with that, since then distros would
> probably switch to a more recommended mechanism that works just as
> well, i.e., ideally without requiring initrd regeneration after
> reconfigurations such as adding one more raid device to the logical
> volume group containing the root filesystem.

That's supported in today's mdadm: look at --uuid and --name.

>> So: Do you *really* need to *fix* this, or can you just use 'mdadm'
>> to assemble your arrays instead?
>
> I'm not sure. I'd expect not to need it, but the limited feature
> currently in place - which initrd uses to bring up the raid1 devices
> containing the physical volumes that form the volume group holding
> my root logical volume - also brings up various raid6 physical
> volumes that form an unrelated volume group, and it does so in such
> a way that the last of them, containing the 128th fd-type partition
> in the box, ends up being left out. So the raid device it's a member
> of is brought up either degraded or missing the spare member,
> neither of which is good.
>
> I don't know that I can easily get initrd to replace nash's
> raidautorun with mdadm unless mdadm has a mode to bring up any
> arrays it can find, as opposed to bringing up a specific array out
> of a given list of members or scanning for members. Either way, this
> won't fix problem 2) that you mentioned, but requiring initrd
> regeneration after extending the volume group containing the root
> device is another problem that the current modes of operation of
> mdadm AFAIK won't contemplate, so switching to it will trade one
> problem for another, and the latter is IMHO more common than the
> former.
I think you should name your raid1 (maybe "hostname-root") and use
initrd to bring it up by --name, using:

  mdadm --assemble --scan --config partitions --name hostname-root

It could also, later in the boot process, bring up "hostname-raid6" by
--name too:

  mdadm --assemble --scan --config partitions --name hostname-raid6

David
* Re: let md auto-detect 128+ raid members, fix potential race condition
  From: Alexandre Oliva @ 2006-08-01 2:20 UTC
  To: David Greaves; Cc: Neil Brown, Andrew Morton, linux-kernel, linux-raid

On Jul 31, 2006, David Greaves <david@dgreaves.com> wrote:
> Alexandre Oliva wrote:
>> in the initrd image, since then any reconfiguration requires the
>> info to be introduced into the initrd image before the machine goes
>> down. Sometimes, especially in case of disk failures, you just
>> can't do that.
>
> Your example supports Neil's case - the proposal is to use initrd to
> run mdadm, which then (kinda) does what vgscan does.

If mdadm can indeed scan all partitions to bring up all raid devices
in them, like nash's raidautorun does, great. I'll give that a try,
since Neil suggested it should already work in the version of mdadm
that I have here. I didn't get that impression while skimming through
the man page, but upon closer inspection now I see it's all there.
Oops :-)

>> I wouldn't have a problem with that, since then distros would
>> probably switch to a more recommended mechanism that works just as
>> well, i.e., ideally without requiring initrd regeneration after
>> reconfigurations such as adding one more raid device to the logical
>> volume group containing the root filesystem.
>
> That's supported in today's mdadm.
> look at --uuid and --name

--uuid and --name won't help at all. I'm talking about adding raid
physical volumes to a volume group, which means a new uuid and name,
so whatever is already in initrd won't get it. The command line Neil
posted should take care of that, though.

Even if the root device doesn't use the newly-added physical volume,
initrd's vgscan needs to find *all* physical volumes in the volume
group; otherwise the volume group will be started in `degraded' mode,
i.e., with the missing physical volumes mapped to a device-mapper node
that will produce I/O errors on access, IIRC, and everything else
read-only, without any way to switch to read-write when the remaining
devices are made available - which is arguably a missing feature in
the LVM subsystem.

--
Alexandre Oliva         http://www.lsd.ic.unicamp.br/~oliva/
Secretary for FSF Latin America        http://www.fsfla.org/
Red Hat Compiler Engineer   aoliva@{redhat.com, gcc.gnu.org}
Free Software Evangelist  oliva@{lsd.ic.unicamp.br, gnu.org}
* Re: let md auto-detect 128+ raid members, fix potential race condition
  From: Michael Tokarev @ 2006-08-01 8:28 UTC
  To: Alexandre Oliva; Cc: David Greaves, Neil Brown, Andrew Morton, linux-kernel, linux-raid

Alexandre Oliva wrote:
[]
> If mdadm can indeed scan all partitions to bring up all raid devices
> in them, like nash's raidautorun does, great. I'll give that a try,

Never, ever, try to do that (again). Mdadm (or vgscan, or whatever)
should NOT assemble ALL arrays found, but only those which it has been
told to assemble. This is it again: you bring another disk into a
system (a disk which comes from another machine), and mdadm finds
FOREIGN arrays and brings them up as /dev/md0, where YOUR root
filesystem should be. That's what the 'homehost' option is for, for
example.

If initrd has to be reconfigured after some changes (be it raid
arrays, LVM volumes, hostname, whatever) - I for one am fine with
that. Hopefully no one will argue that if you forgot to install an MBR
onto your replacement drive, it was entirely your own fault that your
system became unbootable, after all ;)

/mjt
* Re: let md auto-detect 128+ raid members, fix potential race condition
  From: Alexandre Oliva @ 2006-08-01 21:24 UTC
  To: Michael Tokarev; Cc: David Greaves, Neil Brown, Andrew Morton, linux-kernel, linux-raid

On Aug 1, 2006, Michael Tokarev <mjt@tls.msk.ru> wrote:
> Alexandre Oliva wrote:
> []
>> If mdadm can indeed scan all partitions to bring up all raid
>> devices in them, like nash's raidautorun does, great. I'll give
>> that a try,
>
> Never, ever, try to do that (again). Mdadm (or vgscan, or whatever)
> should NOT assemble ALL arrays found, but only those which it has
> been told to assemble. [...] That's what the 'homehost' option is
> for, for example.

Exactly. So make it /all/all local/, if you must. It's the same as far
as I'm concerned.

> If initrd has to be reconfigured after some changes (be it raid
> arrays, LVM volumes, hostname, whatever) - I for one am fine with
> that.

Feel free to be fine with it, as long as you also let me be free not
to be fine with it and to try to cut a better deal :-)

--
Alexandre Oliva         http://www.lsd.ic.unicamp.br/~oliva/
Secretary for FSF Latin America        http://www.fsfla.org/
Red Hat Compiler Engineer   aoliva@{redhat.com, gcc.gnu.org}
Free Software Evangelist  oliva@{lsd.ic.unicamp.br, gnu.org}
* Re: let md auto-detect 128+ raid members, fix potential race condition
  From: Neil Brown @ 2006-08-01 1:19 UTC
  To: Alexandre Oliva; Cc: Andrew Morton, linux-kernel, linux-raid

On Monday July 31, aoliva@redhat.com wrote:
> On Jul 30, 2006, Neil Brown <neilb@suse.de> wrote:
>
> > 1/ It just isn't "right". We don't mount filesystems from
> > partitions just because they have type 'Linux'. We don't enable
> > swap on partitions just because they have type 'Linux swap'. So
> > why do we assemble md/raid from partitions that have type 'Linux
> > raid autodetect'?
>
> For a similar reason to why vgscan finds and attempts to use any
> partitions that have the appropriate type/signature [...]: when you
> have to bootstrap from an initrd, you don't want to be forced to
> have the correct data in the initrd image, since then any
> reconfiguration requires the info to be introduced into the initrd
> image before the machine goes down. Sometimes, especially in case of
> disk failures, you just can't do that.

The initrd needs to 'know' how to find the root filesystem, whether by
devnum or uuid or whatever. In exactly the same way it needs to know
how to find the components of the root md array - uuid is the best.
There is no need to reconfigure this in the case of a disk failure.

Current mdadm will assemble arrays for you given only a hostname. You
still need to get the hostname into the initrd, but that is no
different from a root device number.

> > 2/ It can cause problems when moving devices.
>
> It can, indeed, and it has caused such problems for me before, but
> they're the exception, not the rule, and one should optimize for the
> rule, not the exception.

We aren't talking about optimisation. We are talking about whether it
actually works or not. A system that stops booting just because you
plugged a couple of extra drives in is a badly configured system.

> > 3/ The information redundancy can cause a problem when it gets out
> > of sync [...]
>
> This has happened to me as well, and I remember it was extremely
> confusing when it first happened :-) But that's an argument to
> change the behavior so as to look for the superblock instead of
> trusting the partition type, not an argument to remove the
> auto-detection feature.

As has been said, I don't want to remove auto-detection. I want to do
it right, and do it from userspace. It is in-kernel autodetection that
I have no interest in improving.

> And then, the reliance on partition type has been useful at times as
> well, when I explicitly did *not* want a certain raid device or raid
> member to be brought up at boot.

Well, at boot it should only bring up the raid array containing the
root filesystem. Everything else is best done by /etc/init.d scripts.
And you can stop those from running by booting with -s (or whatever it
is to get single-user).

> > So my preferred solution to the problem is to tell people not to
> > use autodetect. Quite possibly this should be documented in the
> > code, and maybe even have a KERN_INFO message if more than 64
> > devices are autodetected.
>
> I wouldn't have a problem with that, since then distros would
> probably switch to a more recommended mechanism that works just as
> well [...]

> > If we were to 'fix' this problem, I think the cleanest approach
> > (which I haven't actually coded, so it might not work...) would be
> > to define a new flag to go in hd_struct->policy to say if the
> > partition type suggested auto-detect, and get partitions/check.c
> > to set this. Then have md iterate all partitions looking for this
> > flag.
>
> AFAICT we'd still need a list or an array, since we add stuff back
> to the list in various situations.

No. We just need a list of partitions of the appropriate type. Taking
items off the list and putting them back on later is simply a
non-essential optimisation.

> > So: Do you *really* need to *fix* this, or can you just use
> > 'mdadm' to assemble your arrays instead?
>
> I'm not sure. I'd expect not to need it, but the limited feature
> currently in place [...] will trade one problem for another, and the
> latter is IMHO more common than the former.

Get mdadm 2.5.2 (or 2.5.3 if I get that out soon enough) and try

  mdadm --assemble --scan --homehost='<system>' --auto-update-homehost \
        --auto=yes --run

in your initrd, having set the hostname correctly first. It might do
exactly what you want.

NeilBrown
* Re: let md auto-detect 128+ raid members, fix potential race condition
  From: Alexandre Oliva @ 2006-08-01 2:35 UTC
  To: Neil Brown; Cc: Andrew Morton, linux-kernel, linux-raid

On Jul 31, 2006, Neil Brown <neilb@suse.de> wrote:

> The initrd needs to 'know' how to find the root filesystem, whether
> by devnum or uuid or whatever.

Yeah, the tricky bit is the `whatever' alternative, when / is a
logical volume and you need to bring up all of the physical volumes in
order for vgscan to bring up the volume group in a usable way.

> In exactly the same way it needs to know how to find the components
> of the root md array - uuid is the best. There is no need to
> reconfigure this in the case of a disk failure.

When you add physical volumes to the volume group, you'd have to
reconfigure initrd if it weren't for mdadm's ability to scan all
partitions.

> Current mdadm will assemble arrays for you given only a hostname.
> You still need to get the hostname into the initrd, but that is no
> different from a root device number.

Yep, this should work, at least until someone changes the hostname,
creates a new array with the new option, and then gets puzzled because
only that array isn't brought up. Or, worse, does all of the above and
then rebuilds initrd ``just in case'', and then ends up unable to
reboot because the root device won't be brought up. Oops :-)

>>> 2/ It can cause problems when moving devices.
>> It can, indeed [...] one should optimize for the rule, not the
>> exception.
>
> We aren't talking about optimisation. We are talking about whether
> it actually works or not.

Yes, I'm talking about getting it to work most often in the most
common case. Obviously we can't get it to work in every possible case,
since there are various corner cases involving moving disks around,
renaming hosts and creating arrays, some of which must necessarily
fail in order for others to work. It's finding the right balance
between them that is tricky, and some people will always be unhappy
because their particularly rare case failed, even without realizing
that this was in order to enable a more common case they happened to
rely on to work.

> A system that stops booting just because you plugged a couple of
> extra drives in is a badly configured system.

I tend to agree, although I used to exercise a case that wouldn't be
covered by this new policy: I used to move a pair of raid-1 external
disks between two hosts, and have them configured to be optionally
mounted at boot, depending on whether the raid devices were in place
or not. With hostname identification, this wouldn't quite work :-)

> Well, at boot it should only bring up the raid array containing the
> root filesystem.

If all you have is in a single LVM volume group, then that must be
everything :-/

> Everything else is best done by /etc/init.d scripts. And you can
> stop those from running by booting with -s (or whatever it is to get
> single-user).

Booting into single-user mode actually attempts to mount everything
that is local, after bringing up raid devices et al, so that would be
too late.
But there's always init=/bin/bash :-)

> Get mdadm 2.5.2 (or 2.5.3 if I get that out soon enough) and try
>
>   mdadm --assemble --scan --homehost='<system>' --auto-update-homehost \
>         --auto=yes --run
>
> in your initrd, having set the hostname correctly first. It might do
> exactly what you want.

Awesome, thanks, I'd missed that in the docs. It might make sense to
spell it out as an example, instead of requiring someone to figure out
all of the bits and pieces from the extensive documentation. Not
complaining about the extent of the documentation, BTW :-)

I'll give it a try some time tomorrow, since I won't turn on that
noisy box again today; my daughter is already asleep :-)

Anyhow, unless there's a good reason to keep the code the way it is,
wasting valuable bytes of memory on the fixed-size array, I guess it
would make more sense to just merge the patch in, no? :-)

--
Alexandre Oliva         http://www.lsd.ic.unicamp.br/~oliva/
Secretary for FSF Latin America        http://www.fsfla.org/
Red Hat Compiler Engineer   aoliva@{redhat.com, gcc.gnu.org}
Free Software Evangelist  oliva@{lsd.ic.unicamp.br, gnu.org}
* Re: let md auto-detect 128+ raid members, fix potential race condition
  From: Alexandre Oliva @ 2006-08-01 3:33 UTC
  To: Neil Brown; Cc: Andrew Morton, linux-kernel, linux-raid

On Jul 31, 2006, Alexandre Oliva <aoliva@redhat.com> wrote:

>> mdadm --assemble --scan --homehost='<system>' --auto-update-homehost \
>>       --auto=yes --run
>>
>> in your initrd, having set the hostname correctly first. It might
>> do exactly what you want.
>
> I'll give it a try some time tomorrow, since I won't turn on that
> noisy box again today; my daughter is already asleep :-)

But then, I could use my own desktop to test it :-)

FWIW, here's the patch for Fedora rawhide's mkinitrd that worked for
me. I found that it worked fine even without --homehost, and even
without HOMEHOST set in mdadm.conf. I hope copying mdadm.conf to
initrd won't ever hurt - can you think of any case in which it would?

[-- Attachment: mkinitrd-mdadm.patch --]

--- /sbin/mkinitrd	2006-07-26 15:43:41.000000000 -0300
+++ /tmp/mkinitrd	2006-08-01 00:06:14.000000000 -0300
@@ -1240,10 +1240,19 @@
 emitdms
 
 if [ -n "$raiddevices" ]; then
+    if test -f /sbin/mdadm.static; then
+	if test -f /etc/mdadm.conf; then
+	    inst /etc/mdadm.conf "$MNTIMAGE/etc/mdadm.conf"
+	fi
+	inst /sbin/mdadm.static "$MNTIMAGE/sbin/mdadm"
+	emit "mkdir /dev/md"
+	emit "mdadm --quiet --assemble --scan --auto-update-homehost --auto=yes --run"
+    else
     for dev in $raiddevices; do
 	cp -a /dev/${dev} $MNTIMAGE/dev
 	emit "raidautorun /dev/${dev}"
     done
+    fi
 fi
 
 if [ -n "$vg_list" ]; then

--
Alexandre Oliva         http://www.lsd.ic.unicamp.br/~oliva/
Secretary for FSF Latin America        http://www.fsfla.org/
Red Hat Compiler Engineer   aoliva@{redhat.com, gcc.gnu.org}
Free Software Evangelist  oliva@{lsd.ic.unicamp.br, gnu.org}
* Re: let md auto-detect 128+ raid members, fix potential race condition
  From: Alexandre Oliva @ 2006-08-01 20:46 UTC
  To: Neil Brown; Cc: Andrew Morton, linux-kernel, linux-raid

On Aug 1, 2006, Alexandre Oliva <aoliva@redhat.com> wrote:

>> I'll give it a try some time tomorrow, since I won't turn on that
>> noisy box again today; my daughter is already asleep :-)
>
> But then, I could use my own desktop to test it :-)

But then, I wouldn't be testing quite the same scenario. My
boot-required RAID devices were all raid 1, whereas the larger,
separate volume group was all raid 6.

Using the mkinitrd patch that I posted before, the result was that
mdadm did try to bring up all raid devices but, because the raid456
module was not loaded in initrd, the raid devices were left inactive.
Then, when rc.sysinit tried to bring them up with mdadm -A -s, that
did nothing to the inactive devices, since they didn't have to be
assembled. Adding --run didn't help.

My current work-around is to add raid456 to initrd, but that's ugly.
Scanning /proc/mdstat for inactive devices in rc.sysinit and doing
mdadm --run on them is feasible, but it looks ugly and error-prone.

Would it be reasonable to change mdadm so as to, erhm, disassemble ;-)
the raid devices it tried to bring up but that, for whatever reason,
it couldn't activate? (say, missing module, not enough members,
whatever)

--
Alexandre Oliva         http://www.lsd.ic.unicamp.br/~oliva/
Secretary for FSF Latin America        http://www.fsfla.org/
Red Hat Compiler Engineer   aoliva@{redhat.com, gcc.gnu.org}
Free Software Evangelist  oliva@{lsd.ic.unicamp.br, gnu.org}
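In principle, the "disassemble" step asked for here is a matter of
issuing md's STOP_ARRAY ioctl on each device that was assembled but
never activated. A minimal sketch of a standalone helper along those
lines - a hypothetical tool, not mdadm's actual code:

    /* stop_md.c - stop (deactivate) one md device, releasing its
     * member devices.  Usage: stop_md /dev/mdX */
    #include <stdio.h>
    #include <fcntl.h>
    #include <unistd.h>
    #include <sys/ioctl.h>
    #include <linux/raid/md_u.h>    /* defines STOP_ARRAY */

    int main(int argc, char **argv)
    {
            int fd;

            if (argc != 2) {
                    fprintf(stderr, "usage: %s /dev/mdX\n", argv[0]);
                    return 1;
            }

            fd = open(argv[1], O_RDONLY);
            if (fd < 0) {
                    perror(argv[1]);
                    return 1;
            }

            /* STOP_ARRAY tears the array down and releases its
             * components, whether or not the array was ever run. */
            if (ioctl(fd, STOP_ARRAY, NULL) < 0) {
                    perror("STOP_ARRAY");
                    close(fd);
                    return 1;
            }

            close(fd);
            return 0;
    }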
* Re: let md auto-detect 128+ raid members, fix potential race condition
  From: Luca Berra @ 2006-08-02 6:37 UTC
  To: Alexandre Oliva; Cc: Neil Brown, Andrew Morton, linux-kernel, linux-raid

On Tue, Aug 01, 2006 at 05:46:38PM -0300, Alexandre Oliva wrote:
> Using the mkinitrd patch that I posted before, the result was that
> mdadm did try to bring up all raid devices but, because the raid456
> module was not loaded in initrd, the raid devices were left inactive.

Probably your initrd is broken; it should not even have tried to bring
up an md array that was not needed to mount root.

> Then, when rc.sysinit tried to bring them up with mdadm -A -s, that
> did nothing to the inactive devices, since they didn't have to be
> assembled. Adding --run didn't help.
>
> My current work-around is to add raid456 to initrd, but that's ugly.
> Scanning /proc/mdstat for inactive devices in rc.sysinit and doing
> mdadm --run on them is feasible, but it looks ugly and error-prone.
>
> Would it be reasonable to change mdadm so as to, erhm, disassemble
> ;-) the raid devices it tried to bring up but that, for whatever
> reason, it couldn't activate? (say, missing module, not enough
> members, whatever)

This would make sense if it were an option; patches welcome :)

L.

--
Luca Berra -- bluca@comedia.it
Communication Media & Services S.r.l.
 /"\
 \ /     ASCII RIBBON CAMPAIGN
  X        AGAINST HTML MAIL
 / \
* Re: let md auto-detect 128+ raid members, fix potential race condition
  From: Bill Davidsen @ 2006-08-01 17:40 UTC
  To: Neil Brown; Cc: Alexandre Oliva, Andrew Morton, linux-kernel, linux-raid

Neil Brown wrote:
> [linux-raid added to cc.
>  Background: a patch was submitted to remove the current hard limit
>  of 127 partitions that can be auto-detected - a limit set by the
>  detected_devices array in md.c.]
>
> My first inclination is not to fix this problem.
>
> I consider md auto-detect to be a legacy feature.
> I don't use it and I recommend that other people don't use it.
> However I cannot justify removing it, so it stays there.
> Having this limitation could be seen as a good motivation for some
> more users to stop using it.
>
> Why not use auto-detect? I have three issues with it.
>
> 1/ It just isn't "right". We don't mount filesystems from partitions
>    just because they have type 'Linux'. We don't enable swap on
>    partitions just because they have type 'Linux swap'. So why do we
>    assemble md/raid arrays from partitions that have type 'Linux
>    raid autodetect'?

I rarely think you are totally wrong about anything RAID, but I do
believe you have missed the point of autodetect. It is intended to
work as it does now, building the array without depending on some
user-level functionality. The name "autodetect" clearly differentiates
this type from the others you mentioned; there is no implication that
swap or Linux partitions should do anything automatically.

This is not a case of my using a feature and defending it - I don't
use it currently, for all of the reasons you enumerate. That doesn't
mean I haven't used autodetect in the past, or that I won't in the
future, particularly with embedded systems.

> 2/ It can cause problems when moving devices. [...]
>
> 3/ The information redundancy can cause a problem when it gets out
>    of sync. [...]
>
> So my preferred solution to the problem is to tell people not to use
> autodetect. Quite possibly this should be documented in the code,
> and maybe even have a KERN_INFO message if more than 64 devices are
> autodetected.

I don't personally see the value of autodetect for putting together
the huge number of drives people configure. I see this as a way to
improve boot reliability; if someone needs 64 drives for root and
boot, they need to read a few essays on filesystem configuration.
However, I'm aware that there are some really bizarre special cases
out there. Maybe the limit should be in Kconfig, with a default of 16
or so.

--
bill davidsen <davidsen@tmr.com>
CTO TMR Associates, Inc
Doing interesting things with small computers since 1979
* Re: let md auto-detect 128+ raid members, fix potential race condition
  From: Alexandre Oliva @ 2006-08-01 21:32 UTC
  To: Bill Davidsen; Cc: Neil Brown, Andrew Morton, linux-kernel, linux-raid

On Aug 1, 2006, Bill Davidsen <davidsen@tmr.com> wrote:

> I rarely think you are totally wrong about anything RAID, but I do
> believe you have missed the point of autodetect. It is intended to
> work as it does now, building the array without depending on some
> user-level functionality.

Well, it clearly depends on at least some user-level functionality
(the ioctl that triggers autodetect). Going from that to a
full-fledged mdadm doesn't sound like such a big deal to me.

> I don't personally see the value of autodetect for putting together
> the huge number of drives people configure. I see this as a way to
> improve boot reliability; if someone needs 64 drives for root and
> boot, they need to read a few essays on filesystem configuration.
> However, I'm aware that there are some really bizarre special cases
> out there.

There's LVM. If you have to keep root out of the VG just because
people say so, you lose lots of benefits from LVM, such as being able
to grow root with the system running, take snapshots of root, etc.

Sure enough, the LVM subsystem could make things better, such that one
would not need all of the PVs in the root-containing VG in order to
mount root read-write, or at all. But if you think about it, if initrd
is set up such that you only bring up the devices that hold the actual
root device within the VG, and then you change that - say by taking a
snapshot of root, moving it around, growing it, etc. - you'd be better
off if you could still boot. So you do want all of the VG members to
be around, just in case. This is trivially accomplished for regular
disks whose drivers are loaded by initrd, but for raid devices you
need to tentatively bring up every raid member you can, just in case
some piece of root is there; otherwise you may end up unable to boot.

Yes, this is an argument against root on LVM, but there are arguments
*for* root on LVM as well, and there's no reason not to support both
behaviors equally well and let people figure out what works best for
them.

--
Alexandre Oliva         http://www.lsd.ic.unicamp.br/~oliva/
Secretary for FSF Latin America        http://www.fsfla.org/
Red Hat Compiler Engineer   aoliva@{redhat.com, gcc.gnu.org}
Free Software Evangelist  oliva@{lsd.ic.unicamp.br, gnu.org}
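The "ioctl that triggers autodetect" is md's RAID_AUTORUN; nash's
raidautorun command amounts to roughly the following - a sketch of the
interface from memory, not nash's actual source:

    /* Ask the kernel to assemble every partition it recorded as type
     * 0xfd during partition scanning - the in-kernel autodetect path. */
    #include <fcntl.h>
    #include <unistd.h>
    #include <sys/ioctl.h>
    #include <linux/raid/md_u.h>    /* defines RAID_AUTORUN */

    int raidautorun(const char *md_dev)     /* e.g. "/dev/md0" */
    {
            int fd = open(md_dev, O_RDWR);

            if (fd < 0)
                    return -1;

            /* The argument selects partitioned vs. plain md devices
             * for the assembled arrays; 0 means plain md. */
            if (ioctl(fd, RAID_AUTORUN, 0) < 0) {
                    close(fd);
                    return -1;
            }

            close(fd);
            return 0;
    }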
* Re: let md auto-detect 128+ raid members, fix potential race condition
  From: Luca Berra @ 2006-08-02 6:47 UTC
  To: Alexandre Oliva; Cc: Bill Davidsen, Neil Brown, Andrew Morton, linux-kernel, linux-raid

On Tue, Aug 01, 2006 at 06:32:33PM -0300, Alexandre Oliva wrote:
> Sure enough, the LVM subsystem could make things better, such that
> one would not need all of the PVs in the root-containing VG in order
> to mount root read-write, or at all. But if you think about it, if
> initrd

It shouldn't need all of the PVs; you just need the PVs where the
rootfs is.

> is set up such that you only bring up the devices that hold the
> actual root device within the VG, and then you change that - say by
> taking a snapshot of root, moving it around, growing it, etc. -
> you'd be better off if you could still boot. So you do want all of
> the VG members to be around, just in case.

In this case, just regenerate the initramfs after modifying the VG
that contains root. I am fairly sure that kernel upgrades are far more
frequent than the addition of PVs to the root VG.

> Yes, this is an argument against root on LVM, but there are
> arguments *for* root on LVM as well, and there's no reason not to
> support both behaviors equally well and let people figure out what
> works best for them.

No, this is just an argument against misusing root on LVM.

L.

--
Luca Berra -- bluca@comedia.it
Communication Media & Services S.r.l.
 /"\
 \ /     ASCII RIBBON CAMPAIGN
  X        AGAINST HTML MAIL
 / \
* Re: let md auto-detect 128+ raid members, fix potential race condition
  From: Bill Davidsen @ 2006-08-02 16:47 UTC
  To: Alexandre Oliva; Cc: Neil Brown, Andrew Morton, linux-kernel, linux-raid

Alexandre Oliva wrote:
> On Aug 1, 2006, Bill Davidsen <davidsen@tmr.com> wrote:
>
>> I rarely think you are totally wrong about anything RAID, but I do
>> believe you have missed the point of autodetect. It is intended to
>> work as it does now, building the array without depending on some
>> user-level functionality.
>
> Well, it clearly depends on at least some user-level functionality
> (the ioctl that triggers autodetect). Going from that to a
> full-fledged mdadm doesn't sound like such a big deal to me.
>
>> I don't personally see the value of autodetect for putting together
>> the huge number of drives people configure. [...]
>
> There's LVM. If you have to keep root out of the VG just because
> people say so, you lose lots of benefits from LVM, such as being
> able to grow root with the system running, take snapshots of root,
> etc.

But it's MY system. I don't have to do anything. More to the point,
growing root while the system is running is done a lot less often than
booting. In general the root f/s has very little in it, and that's a
good thing.

--
Bill Davidsen <davidsen@tmr.com>
  Obscure bug of 2004: BASH BUFFER OVERFLOW - if bash is being run by
a normal user and is setuid root, with the "vi" line edit mode
selected, and the character set is "big5," an off-by-one error occurs
during wildcard (glob) expansion.
Thread overview: 16+ messages
[not found] <ork65veg2y.fsf@free.oliva.athome.lsd.ic.unicamp.br>
[not found] ` <20060730124139.45861b47.akpm@osdl.org>
[not found] ` <orac6qerr4.fsf@free.oliva.athome.lsd.ic.unicamp.br>
2006-07-30 23:20 ` let md auto-detect 128+ raid members, fix potential race condition Neil Brown
2006-07-31 16:34 ` Helge Hafting
2006-07-31 20:27 ` Alexandre Oliva
2006-07-31 21:48 ` David Greaves
2006-08-01 2:20 ` Alexandre Oliva
2006-08-01 8:28 ` Michael Tokarev
2006-08-01 21:24 ` Alexandre Oliva
2006-08-01 1:19 ` Neil Brown
2006-08-01 2:35 ` Alexandre Oliva
2006-08-01 3:33 ` Alexandre Oliva
2006-08-01 20:46 ` Alexandre Oliva
2006-08-02 6:37 ` Luca Berra
2006-08-01 17:40 ` Bill Davidsen
2006-08-01 21:32 ` Alexandre Oliva
2006-08-02 6:47 ` Luca Berra
2006-08-02 16:47 ` Bill Davidsen